CN102024065A

CN102024065A - SIMD optimization-based webpage duplication elimination and concurrency method

Info

Publication number: CN102024065A
Application number: CN 201110021002
Authority: CN
Inventors: 龙军; 张祖平; 袁鑫攀; 罗跃逸
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2011-01-18
Filing date: 2011-01-18
Publication date: 2011-04-20
Anticipated expiration: 2031-01-18
Also published as: CN102024065B

Abstract

The invention discloses a SIMD (single instruction, multiple data) optimization-based parallel method for eliminating duplicate webpages, comprising the following steps: 1, web page text extraction, which extracts the effective information of each webpage; 2, Shingle extraction, which extracts webpage features and generates a Shingles set; 3, clustering, which reduces the number of comparisons and the time and space complexity; and 4, fingerprint comparison, which finds similar webpages and removes them. The method preserves precision and recall while effectively improving the speed of webpage similarity detection.

Description

Parallel method for removing duplicate webpages based on SIMD optimization

Technical field

The invention belongs to the field of computer application technology and relates to a parallel method for removing duplicate webpages based on SIMD optimization. SIMD (Single Instruction Multiple Data) is a technique in which one controller drives multiple processing elements that simultaneously perform the same operation on each element of a group of data (also called a "data vector"), thereby achieving spatial parallelism. In microprocessors, SIMD means that a single controller drives multiple parallel processing units, for example Intel's MMX and SSE or AMD's 3DNow! technology.

Technical background

With the rapid development of computer science and network technology, the network has become an important channel through which people obtain information. The greatest difficulty search engines currently face is that the returned result sets contain a large amount of duplicated information. This duplication not only wastes users' time and adds to their burden, but also occupies considerable storage space and bandwidth and lowers indexing efficiency. Classifying search engine result sets or removing duplicate webpages has therefore become an important step in improving retrieval efficiency.

Duplicate-removal algorithms based on "approximate fingerprints" map the strings of a text to a set of hash values, which turns string matching into numeric comparison; they are fast and well suited to large-scale computation. However, choosing the size and number of text blocks is difficult. The most complete text block treats the full text as a single block, but such a comparison can only detect texts that are copied character for character, so it handles only verbatim duplication. Shingle-based similarity detection segments the text into words, extracts shingle features, and computes similarity by comparing the number of shingles in common. Such an algorithm needs to consider how parameters such as the similarity threshold, the shingle sliding-window size, the shingle weight coefficients, and document attributes affect the accuracy and recall of duplicate removal, and it needs to eliminate arbitrariness in setting the similarity threshold.

Streaming SIMD Extensions SSE4.2 [3] is Intel's largest extension of the ISA instruction set since SSE2. The new SSE4.2 instructions target two areas: STTNI, new instructions for string and text processing, and ATA, accelerators targeted at specific applications. The new instructions improve performance in applications ranging from multimedia to high-performance computing, and use dedicated circuitry to accelerate particular applications. The invention uses inline SSE assembly coding, optimized for the architecture of the "Intel Core i7" processor series, so that a 128-bit fingerprint can be compared in a single operation. Analysis, experiments, and practical application show that the algorithm effectively improves the speed of document similarity detection without sacrificing precision or recall.

Summary of the invention

The object of the invention is to propose a parallel method for removing duplicate webpages based on SIMD optimization that effectively improves the speed of webpage similarity detection while maintaining high precision and high recall.

The technical solution of the invention is as follows:

A parallel method for removing duplicate webpages based on SIMD optimization, characterized in that it comprises the following steps:

Step 1: web page text information extraction: this step extracts the effective information of the webpage;

Step 2: Shingle extraction: this step extracts webpage features and generates the Shingles set;

Step 3: clustering: this step reduces the number of comparisons and the time and space complexity;

Step 4: fingerprint comparison: this step finds similar webpages and removes them.

The specific steps of step 1 are:

Files in the HTML, XHTML, and XML webpage formats are scanned, the tag information of the webpage is used to extract the title of the text, and information unrelated to the text is filtered out.

The specific steps of step 2 are:

First, the extracted webpage text is segmented by forward maximum matching to generate a set of words;

Then, a stop-word list is built and used to filter noise out of the webpage, and the Shingles set is generated using the configured window size. Noise here means meaningless words present in the set of words.

The main flow of the forward maximum matching word segmentation algorithm is as follows. Suppose the longest entry in the segmentation dictionary contains MAX Chinese characters. Take the first MAX characters of the text to be processed as the matching field and look it up in the dictionary. If the dictionary contains this MAX-character word, the match succeeds and the matching field is cut off as a word. If it does not, the match fails; the last character of the matching field is removed and the lookup is repeated until a match succeeds. This completes one round of matching and yields one word. The procedure is then repeated until all words in the text have been segmented.
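The matching loop described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `forward_max_match`, the toy dictionary, and the rule that an unmatched single character is emitted as its own token are all assumptions made for the example.

```python
def forward_max_match(text, dictionary):
    """Segment text by forward maximum matching against a set of words."""
    max_len = max(len(word) for word in dictionary)  # MAX in the description
    tokens = []
    pos = 0
    while pos < len(text):
        # Start from the longest candidate field and drop its last
        # character after every failed dictionary lookup.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            field = text[pos:pos + size]
            # Assumption: a single character is accepted even when it is
            # not in the dictionary, so the scan always advances.
            if field in dictionary or size == 1:
                tokens.append(field)
                pos += size
                break
    return tokens


# Hypothetical toy dictionary, used only for this illustration.
toy_dictionary = {"网页", "去重", "并行", "方法", "并行方法"}
print(forward_max_match("网页去重并行方法", toy_dictionary))
```

Because the longest candidate is tried first, "并行方法" is cut as one word rather than as "并行" followed by "方法".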

In terms of precision and recall, the smaller the window the better; in terms of display effect, the larger the window the better. A window of 2 to 4 is generally appropriate.

The specific steps of step 3 are:

First, for the generated Shingle set of size L, one Shingle is selected every L/n Shingles to form the sampling list Sample_Shingle_List;

Then, M different, independent random permutation hash functions are applied to Sample_Shingle_List. Each hash function converts the features of all shingles in Sample_Shingle_List into a set of 128-bit fingerprints, Sample_Finger_List, and the minimum fingerprint in each Sample_Finger_List is selected as a fingerprint of the webpage;

Finally, the fingerprints generated for the N webpages are clustered: webpages that have an identical fingerprint are placed in the same class, which yields the clustered webpage sets.
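The final clustering step can be illustrated with a small union-find sketch. The text does not spell out the grouping procedure, so the code below rests on two assumptions: fingerprints count as identical only when they agree at the same hash index, and classes are merged transitively. The function name and the sample data are hypothetical.

```python
def cluster_by_shared_fingerprint(page_fingerprints):
    """Group page IDs so that pages sharing a fingerprint land in one class.

    page_fingerprints maps a page ID to its tuple of M minimum fingerprints.
    """
    parent = {page: page for page in page_fingerprints}

    def find(page):
        # Follow parent links to the class representative (path halving).
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    first_owner = {}  # (hash index, fingerprint value) -> first page seen
    for page, fingerprints in page_fingerprints.items():
        for index, value in enumerate(fingerprints):
            owner = first_owner.setdefault((index, value), page)
            parent[find(page)] = find(owner)  # merge the two classes

    classes = {}
    for page in page_fingerprints:
        classes.setdefault(find(page), []).append(page)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical fingerprints: A and B agree at hash index 1, C shares nothing.
pages = {"A": (1, 2, 1), "B": (2, 2, 2), "C": (7, 8, 9)}
print(cluster_by_shared_fingerprint(pages))
```

Only pages inside the same class need to be compared in step 4, which is how clustering reduces the number of comparisons.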

M is an integer between 7 and 10.

Using M independent random permutation hash functions π_1, π_2, ..., π_M, the Sample_Shingle_List set S_d of any document d is converted to Sample_Finger_List:

Sample_Finger_List = (\min\{π_{1}(S_{d})\}, \min\{π_{2}(S_{d})\}, \ldots, \min\{π_{M}(S_{d})\})

Example:

Ω = {1, 2, 3, 4, 5}, S1 = {1, 2, 3}, S2 = {1, 2, 4}, where Ω denotes the universal set.

π1: {1, 2, 3, 4, 5} -> {3, 2, 1, 5, 4}, where π1 is one of the M independent random permutation hash functions.

π2: {1, 2, 3, 4, 5} -> {2, 3, 5, 4, 1}

...

πM: {1, 2, 3, 4, 5} -> {5, 3, 1, 2, 4}

π1(S1) = {3, 2, 1}; π2(S1) = {2, 3, 5}; πM(S1) = {5, 3, 1};

π1(S2) = {3, 2, 5}; π2(S2) = {2, 3, 4}; πM(S2) = {5, 3, 2};

Here π1: {1, 2, 3, 4, 5} -> {3, 2, 1, 5, 4} means 1 -> 3, 2 -> 2, 3 -> 1, 4 -> 5, 5 -> 4, so π1(S1) = π1({1, 2, 3}) = {3, 2, 1}. The other cases follow by analogy.

min(π(S1)) = Sample_Finger_List(S1) = {1, 2, 1}

min(π(S1)) denotes the collection formed by taking the minimum of each of the sets π1(S1), π2(S1), ..., πM(S1).

min(π(S2)) = Sample_Finger_List(S2) = {2, 2, 2}

S1 and S2 share the fingerprint 2, so they are placed in the same class.
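The worked example can be reproduced with a few lines of Python. This is only a sketch of the min-wise step: the helper name `min_fingerprints` and the dictionary encoding of the permutations are assumptions for illustration, and small integers stand in for the 128-bit fingerprints.

```python
def min_fingerprints(shingle_set, permutations):
    """Return (min pi_1(S), min pi_2(S), ..., min pi_M(S)) for a set S."""
    return tuple(min(pi[x] for x in shingle_set) for pi in permutations)


# The three permutations written out in the example, as element -> image maps.
pi_1 = {1: 3, 2: 2, 3: 1, 4: 5, 5: 4}
pi_2 = {1: 2, 2: 3, 3: 5, 4: 4, 5: 1}
pi_M = {1: 5, 2: 3, 3: 1, 4: 2, 5: 4}

s1 = {1, 2, 3}
s2 = {1, 2, 4}

fingers_1 = min_fingerprints(s1, [pi_1, pi_2, pi_M])
fingers_2 = min_fingerprints(s2, [pi_1, pi_2, pi_M])
print(fingers_1)  # (1, 2, 1)
print(fingers_2)  # (2, 2, 2)

# Hash indices at which both documents produced the same minimum fingerprint.
shared = [i for i, (a, b) in enumerate(zip(fingers_1, fingers_2)) if a == b]
print(shared)  # [1]
```

The single agreement, at the second permutation, is what places S1 and S2 in the same class.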

The specific steps of step 4 are:

According to the clustering result, the webpage IDs in each class are taken out. Suppose the class contains n webpages; the fingerprint set of all the webpages is

Matrix_{finger} = \begin{bmatrix} finger_{1,1} & finger_{2,1} & \cdots & finger_{n,1} \\ finger_{1,2} & finger_{2,2} & \cdots & finger_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ finger_{1,M} & finger_{2,M} & \cdots & finger_{n,M} \end{bmatrix};

Define the function

v(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases};

Consider the fingerprints in each column (the M fingerprints of one webpage). First generate a 128-dimensional vector V with every element initialized to 0. Then examine each bit of each fingerprint in the column: if the bit is 1, add 1 to the corresponding element of V; if the bit is 0, subtract 1 from it. After accumulation, positive elements of V are set to 1 and negative elements are set to 0. In the formulas below, finger_{i,j}(b) denotes the contribution (+1 or -1, per this rule) of bit b of the j-th fingerprint of webpage i;

V_{1} = [v(finger_{1,1}(1) + finger_{1,2}(1) + \cdots + finger_{1,M}(1)), v(finger_{1,1}(2) + finger_{1,2}(2) + \cdots + finger_{1,M}(2)), \ldots, v(finger_{1,1}(128) + finger_{1,2}(128) + \cdots + finger_{1,M}(128))]

V_{2} = [v(finger_{2,1}(1) + finger_{2,2}(1) + \cdots + finger_{2,M}(1)), v(finger_{2,1}(2) + finger_{2,2}(2) + \cdots + finger_{2,M}(2)), \ldots, v(finger_{2,1}(128) + finger_{2,2}(128) + \cdots + finger_{2,M}(128))];

...

V_{n} = [v(finger_{n,1}(1) + finger_{n,2}(1) + \cdots + finger_{n,M}(1)), v(finger_{n,1}(2) + finger_{n,2}(2) + \cdots + finger_{n,M}(2)), \ldots, v(finger_{n,1}(128) + finger_{n,2}(128) + \cdots + finger_{n,M}(128))]

Matrix_{finger} = \begin{bmatrix} finger_{1,1} & finger_{2,1} & \cdots & finger_{n,1} \\ finger_{1,2} & finger_{2,2} & \cdots & finger_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ finger_{1,M} & finger_{2,M} & \cdots & finger_{n,M} \end{bmatrix} = [V_{1}, V_{2}, \ldots, V_{n}];

The vectors V_i, where i = 1, 2, ..., n, represent the fingerprints of the n documents; each V_i is a 128-bit number that characterizes the fingerprint of the corresponding webpage;
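The bit-voting rule that folds the M fingerprints of one webpage into a single 128-bit value can be sketched as follows. The sketch assumes fingerprints are held as Python integers and that bit 0 is the least significant bit (the text numbers bits from 1 to 128); the function name `vote_fingerprint` is hypothetical. Ties are mapped to 1, following the definition of v(x) with x >= 0.

```python
def vote_fingerprint(fingerprints, bits=128):
    """Fold several fingerprints of one page into a single bits-wide value."""
    votes = [0] * bits  # the vector V, initialised to 0
    for fp in fingerprints:
        for b in range(bits):
            # +1 when bit b of this fingerprint is 1, -1 when it is 0.
            votes[b] += 1 if (fp >> b) & 1 else -1
    result = 0
    for b, total in enumerate(votes):
        if total >= 0:  # v(x) = 1 for x >= 0, otherwise 0
            result |= 1 << b
    return result


# 4-bit toy run: bits 1 and 3 are set in two of the three inputs.
folded = vote_fingerprint([0b1010, 0b1000, 0b0011], bits=4)
print(format(folded, "04b"))  # 1010
```

With an odd number of fingerprints no ties can occur, so the choice between `>=` and `>` only matters when M is even.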

The specific fingerprint comparison flow is as follows:

Step a: let x, y, and z denote three 128-bit xmm registers (SSE vector registers); let a denote the 128-bit fingerprint of webpage A and b the 128-bit fingerprint of webpage B; load a and b into registers x and y respectively;

Step b: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value Mask of a and b; the number of 1 bits in the XOR value is the Hamming distance between a and b;

Step c: use _mm_cvtsd_f64 to obtain the low 64 bits of Mask, Mask_low; use _mm_unpacklo_epi64 and _mm_cvtsd_f64 to obtain the high 64 bits of Mask, Mask_high;

Step d: use _popcnt64 to count the 1 bits in Mask_low and Mask_high separately, and assign their sum to Count;

Step e: when Count > Dx, webpage A and webpage B are considered dissimilar; otherwise they are considered similar. Dx is the first distance threshold;
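The logic of steps a to e can be checked without SSE hardware using plain Python integers. The sketch below is a scalar stand-in, not the intrinsic-based code the text describes: the XOR mirrors `_mm_xor_si128`, the shift and mask stand in for the low/high 64-bit extraction, and counting the "1" characters of `bin()` stands in for `_popcnt64`. The function names are hypothetical, and the default `dx=6` is taken from the threshold mentioned elsewhere in the text.

```python
MASK_64 = (1 << 64) - 1


def hamming_distance_128(a, b):
    """Hamming distance of two 128-bit fingerprints held as Python ints."""
    mask = a ^ b                         # step b: XOR of the fingerprints
    mask_low = mask & MASK_64            # step c: low 64 bits
    mask_high = (mask >> 64) & MASK_64   # step c: high 64 bits
    # Step d: count the 1 bits in each half and add the two counts.
    return bin(mask_low).count("1") + bin(mask_high).count("1")


def is_similar(a, b, dx=6):
    """Step e: pages are similar unless the distance exceeds Dx."""
    return hamming_distance_128(a, b) <= dx


page_a = (1 << 127) | 0b1011
page_b = 0b0001
print(hamming_distance_128(page_a, page_b))  # 3
print(is_similar(page_a, page_b))            # True
```

Splitting the mask into two 64-bit halves mirrors the text's use of a 64-bit population count; in Python the split is not needed for correctness and is kept only to follow the described steps.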

Method 2:

Consider the fingerprint set:

Matrix_{finger} = \begin{bmatrix} finger_{1,1} & finger_{2,1} & \cdots & finger_{n,1} \\ finger_{1,2} & finger_{2,2} & \cdots & finger_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ finger_{1,M} & finger_{2,M} & \cdots & finger_{n,M} \end{bmatrix},

For two webpages, webpage i and webpage j, their fingerprint sets are:

The fingerprints of the two webpages are compared position by position, and the distances between corresponding fingerprints are then summed. When the sum of distances is greater than the second distance threshold Dy, webpage i and webpage j are considered dissimilar; otherwise they are considered similar.

Research and experimental results generally indicate that if the first fingerprint distance Dx is greater than 6, the pages are not duplicates.

Dy is set to 0.8M.
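Method 2 can be sketched in the same scalar style. The code assumes each webpage is represented by a list of M integer fingerprints, pairs the k-th fingerprint of one page with the k-th fingerprint of the other, and uses the threshold Dy = 0.8M stated above; the function names and sample values are hypothetical.

```python
def summed_distance(fingerprints_i, fingerprints_j):
    """Sum of Hamming distances between corresponding fingerprints."""
    if len(fingerprints_i) != len(fingerprints_j):
        raise ValueError("both pages must have the same number of fingerprints")
    return sum(
        bin(a ^ b).count("1")  # Hamming distance of one fingerprint pair
        for a, b in zip(fingerprints_i, fingerprints_j)
    )


def similar_by_method_2(fingerprints_i, fingerprints_j):
    """Pages are similar unless the summed distance exceeds Dy = 0.8 * M."""
    dy = 0.8 * len(fingerprints_i)
    return summed_distance(fingerprints_i, fingerprints_j) <= dy


base = list(range(10))                # M = 10 toy fingerprints, so Dy = 8
near = [base[0] ^ 0b111] + base[1:]   # differs by 3 bits in one fingerprint
far = [x ^ 0b11 for x in base]        # differs by 2 bits in every fingerprint

print(summed_distance(base, near), similar_by_method_2(base, near))  # 3 True
print(summed_distance(base, far), similar_by_method_2(base, far))    # 20 False
```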

Beneficial effect:

The invention addresses shortcomings of existing webpage duplicate-removal algorithms, such as slow speed, poor search effectiveness, and low recall, and proposes a parallel duplicate-removal algorithm based on SIMD optimization. Applied in the development of a webpage duplicate-removal system, the algorithm effectively improves the speed of webpage similarity detection without sacrificing precision or recall.

To achieve this object, the execution time of each stage of the duplicate-removal system was analyzed. The duplicate-removal algorithm mainly comprises sub-processes such as feature extraction, fingerprint extraction, and fingerprint comparison, all of which have high inherent SIMD parallelism. Statistical analysis with the Intel(R) VTune(TM) Performance Analyzer [4] showed that the fingerprint comparison sub-process accounts for 58.6% of the total system time, so optimizing fingerprint comparison yields a substantial improvement in the running speed of the system.

Based on this analysis, shingle comparison was optimized using a SIMD-based webpage similarity algorithm. By organizing the data in SSE data types and parallelizing the main sub-processes with SIMD, an SSE-optimized duplicate-removal algorithm was designed. It overcomes the slow speed, low recall, and poor search effectiveness of general duplicate-removal algorithms, and it is fast, effective, and easy to implement.

Because SSE uses 128-bit storage units, each unit can hold four 32-bit floating-point numbers; that is, every SSE computation is carried out on four floating-point numbers at once, and this batch processing improves efficiency. In addition, the SSE4.2 instruction set adds two groups of optimized instructions: STTNI (string and text new instructions) and ATA (application-targeted accelerators). STTNI mainly optimizes XML document and data processing, making performance in this area 3.8 times that of the previous-generation product. The sub-processes of the duplicate-removal algorithm (feature extraction, fingerprint extraction, and fingerprint comparison) mainly involve XML document and data processing and string processing. In theory, therefore, using SSE can significantly improve the speed of webpage duplicate removal.

Description of drawings

Fig. 1 is the main flowchart of the invention;

Fig. 2 is the flowchart of webpage information extraction;

Fig. 3 is the flowchart of Shingle extraction;

Fig. 4 is the flowchart of clustering;

Fig. 5 is the flowchart of fingerprint comparison method 1;

Fig. 6 is the flowchart of fingerprint comparison method 2.

Embodiment

The invention is described in further detail below with reference to the drawings and specific embodiments:

Embodiment 1:

The main steps of the SIMD-based webpage duplicate-removal algorithm in this example are as follows:

1. Web page text information extraction. This process extracts the effective information of the webpage;

2. Shingle extraction. This process extracts webpage features;

3. Clustering. This process reduces the number of comparisons and the time and space complexity;

4. Fingerprint comparison. This process finds similar webpages and removes them.

Each step is detailed as follows:

1 Web page text information extraction

The text structure of a webpage comprises a physical structure and a logical structure. The physical structure is the composition of the webpage, mainly including webpage tags, the webpage title, the webpage content, article titles, advertisements, and similar information. The logical structure mainly refers to the structure between paragraphs within the webpage content, the style in which the text is organized, and the logic of its expression.

This process handles common webpage formats such as HTML, XHTML, and XML. It mainly consists of removing noise such as advertisement bars, navigation bars, site logos, and pictures, and extracting key information such as the title, body text, and author. It does so by scanning the HTML, XHTML, or XML, using the tag information of the webpage to extract the title of the text, and filtering out information unrelated to the text.

At present, both in China and abroad, HTML, XHTML, and XML are mainly parsed by resolving the webpage text data through a DOM syntax tree. SSE4.2 adds 4 new string- and character-specific instructions that can accelerate many string- and text-related applications, for example string pattern matching and string comparison. This step uses the new SSE 4.2 string instructions to accelerate existing mainstream webpage information extraction algorithms and thereby speed up the extraction of webpage information.

The webpage information extraction flow is as follows:

101 Define common HTML, XHTML, and XML tags, such as the common text tags <body>, </body>, <title>, </title>, and <div>, and use the _mm_set_epi8 instruction to preload the binary codes of these tags into 128-bit SSE variables. The main function of _mm_set_epi8 is to load 16 bytes (128 bits) into a 128-bit SSE variable.

102 Use the SSE 4.2 functions to match HTML, XHTML, and XML tags. This process uses the parallel computing capability of SSE 4.2 processors to speed up information extraction. Instructions such as _mm_cmpestri, _mm_cvtsi128_si32, and _mm_cmpestrm compare the webpage content with the preloaded tags (128-bit SSE variables). _mm_cmpestri is an instruction specific to SSE 4.2; its main function is to compare 128-bit variables, and if the two strings are identical it returns 0. A return value of 0 therefore indicates that a tag has been matched; the position of the tag is recorded, and the process is repeated until all HTML tags have been found. The _mm_cvtsi128_si32 instruction converts the low 32 bits of a 128-bit variable into a 32-bit int.

103 Information conversion and filtering. This process converts the extracted title and text information, filters out irrelevant information, irrelevant markup, and hyperlink information in the text, performs any necessary conversions, and retains structural information such as necessary placeholders, text structure, and paragraph information. Filtering works by continuously reading webpage content and matching it against regular expressions, while a learning method refines the regular expressions so that irrelevant information is filtered out.

2 Shingle extraction

Webpage features are of great importance to webpage duplicate removal; they are a key factor affecting the accuracy and recall of the whole system. For English webpages, because English words are separated by spaces, the features can be obtained simply by splitting the text on spaces. For Chinese webpages, there is no natural separator between words, so the choice of which type of text feature to extract (character, word, phrase, or sentence) is especially important during feature extraction.

The text of a webpage is regarded as a sequence of words (tokens), and a Shingle is a contiguous subsequence of that word sequence. A webpage is characterized by its sequence of shingles. For a given document D, S(D, ω) is the set of contiguous token subsequences of window size ω in D.

For example, take the sentence: { Shingle is widely applied in large-scale similar text detection technology },

After forward maximum matching segmentation, the resulting word sequence is: { "Shingle", "widely", "applied", "large-scale", "similar", "text", "detection", "technology" }

With window ω = 2, the set S(D, 2) is: { "Shingle widely", "widely applied", "applied large-scale", "large-scale similar", "similar text", "text detection", "detection technology" }
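The sliding-window construction of S(D, ω) can be sketched as below. The token list is an English rendering of the example sentence, so the exact wording of the tokens is an assumption of this sketch, as is the helper name `shingles`; the result is kept as an ordered list so that it can be compared with the example, although the text treats it as a set.

```python
def shingles(tokens, window):
    """Return the contiguous token subsequences of length `window`."""
    return [
        " ".join(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]


tokens = ["Shingle", "widely", "applied", "large-scale",
          "similar", "text", "detection", "technology"]

for shingle in shingles(tokens, 2):
    print(shingle)
```

Eight tokens yield 8 - 2 + 1 = 7 shingles for ω = 2 and 8 - 4 + 1 = 5 shingles for ω = 4, matching the n - m + 1 bound used later in the complexity discussion.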

When the window ω is 4 (it is usually set to 4 in real systems), define the shingle sets of two documents as S_1(D, ω) and S_2(D, ω), and let a = |S_1 ∩ S_2| denote the number of shingles common to S_1 and S_2. The resemblance R of S_1 and S_2 is then:

R = \frac{|S_{1} \cap S_{2}|}{|S_{1} \cup S_{2}|} = \frac{a}{f_{1} + f_{2} - a},

f_{1} = |S_{1}|, f_{2} = |S_{2}|, a = |S_{1} \cap S_{2}|

The meaning of R is the ratio of the number of shingles the two documents have in common to the total number of distinct shingles in the two documents; the closer R is to 1, the more similar the two documents are.
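The resemblance formula translates directly into code. The sketch below uses Python sets of shingle strings; the function name `resemblance`, the sample shingles, and the convention of returning 0.0 for two empty sets are assumptions made for the example.

```python
def resemblance(s1, s2):
    """R = a / (f1 + f2 - a), with a = |S1 ∩ S2|, f1 = |S1|, f2 = |S2|."""
    a = len(s1 & s2)
    f1, f2 = len(s1), len(s2)
    if f1 + f2 - a == 0:
        return 0.0  # assumption: two empty shingle sets have resemblance 0
    return a / (f1 + f2 - a)


doc_1 = {"a b", "b c", "c d"}
doc_2 = {"b c", "c d", "d e"}
print(resemblance(doc_1, doc_2))  # 0.5
```

Here a = 2 and f1 + f2 - a = 4, which is also the size of the union of the two sets.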

201 The webpage text extracted in Section 1 is segmented by forward maximum matching to generate the set of words (tokens). The main flow of the forward maximum matching word segmentation algorithm is as follows. Suppose the longest entry in the segmentation dictionary contains MAX Chinese characters. Take the first MAX characters of the text to be processed as the matching field and look it up in the dictionary. If the dictionary contains this MAX-character word, the match succeeds and the matching field is cut off as a word. If it does not, the match fails; the last character of the matching field is removed and the lookup is repeated until a match succeeds. This completes one round of matching and yields one word. The procedure is then repeated until all words in the text have been segmented.

202 Count the frequency of the words in the system, and filter out high-frequency, low-meaning function words as stop words to build the initial stop-word list. Identify new words that occur repeatedly from the statistics, and add them to the dictionary database. (The dictionary database is the data source for forward maximum matching; only by continuously refining it can segmentation become more accurate.)

203 Use the stop-word list built in the second step to filter out meaningless noise in the webpage, and generate the Shingles set with the configured window size. (In terms of precision and recall, the smaller the window the better; in terms of display effect, the larger the window the better. A window of 2 to 4 is generally appropriate.)

3 Clustering

Clustering is the process of grouping a set of concrete or abstract objects into classes of similar objects. Its basis is that similar or close objects are placed in one class and objects that differ greatly are placed in different classes. Each resulting cluster is a set of data objects that are similar to the other objects in the same cluster and different from the objects in other clusters.

In computing the Shingles of webpages, for a webpage set of N pages, each containing n words, with Shingle length m, each page has fewer than n-m+1 Shingles (some stop words having been removed). The time complexity of computing the Shingles of all webpages is therefore O(N*n). The space complexity of the features of one webpage is O(n*m), so that of all webpages is O(N*n*m); in addition, a large number of intermediate results are produced during computation, which also occupy a large amount of storage. For example, for a set of 100,000 webpages, each containing 100 words, with a window size of 2, the time complexity of the Shingle algorithm is O(10⁸) and the space complexity is O(2*10⁸). Clearly, removing duplicates by applying the Shingle algorithm directly takes not only a great deal of time but also a great deal of space.

Therefore, the set of webpage texts is divided into sets of possibly similar webpages, excluding webpages that need not be compared at all; subsequent operations are carried out only when two webpages may be similar, which greatly reduces the number of comparisons each document requires.

According to the min-wise hashing principle of Andrei Broder, for a hash function π we have

\Pr(\min(π(S_{1})) = \min(π(S_{2}))) = \frac{|S_{1} \cap S_{2}|}{|S_{1} \cup S_{2}|} = R

The practical meaning of this formula is that when hash functions map the Shingle sets S_1 and S_2 to fingerprint sets, the probability that the minimum fingerprints of the two sets are equal equals the resemblance of S_1 and S_2. That is, for webpages A and B with Shingle sets S_1 and S_2, if the hash function π is applied to S_1 and S_2 respectively, the probability that their minimum fingerprints are equal is exactly the similarity of the two webpages.
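For a small universe the identity can be checked exactly by enumerating every permutation. The sketch below does this for a five-element universe with S1 = {1, 2, 3} and S2 = {1, 2, 4}; it is an illustration of the principle, not part of the described method, and the function name `min_match_probability` is hypothetical.

```python
from itertools import permutations


def min_match_probability(universe, s1, s2):
    """Fraction of permutations under which min pi(S1) equals min pi(S2)."""
    matches = 0
    total = 0
    for image in permutations(universe):
        pi = dict(zip(universe, image))  # one permutation of the universe
        total += 1
        if min(pi[x] for x in s1) == min(pi[x] for x in s2):
            matches += 1
    return matches / total


universe = [1, 2, 3, 4, 5]
s1 = {1, 2, 3}
s2 = {1, 2, 4}

jaccard = len(s1 & s2) / len(s1 | s2)
print(min_match_probability(universe, s1, s2), jaccard)  # 0.5 0.5
```

The two minima coincide exactly when the smallest image over the union {1, 2, 3, 4} comes from an element of the intersection {1, 2}, which happens in 2 out of 4 equally likely cases.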

However, because of the nature of hashing, using only one fingerprint tends to miss many documents that may be similar. Multiple fingerprints should therefore be used, and the possibly similar documents found by each fingerprint merged, which achieves better accuracy and recall.

Define R_0 as the similarity threshold, and define accuracy and recall:

Based on the above, the invention designs a clustering method that generates fingerprints for webpages and clusters the webpages. The specific steps are as follows:

301 For the Shingle set generated in step 2, of size L, select one Shingle every L/n Shingles to form its sample Sample_Shingle_List. On a machine with good performance, n = 1 can be used; on a poorly configured machine, n can be increased appropriately. In short, the smaller n is, the better the duplicate-removal result.

302 Apply M (7 to 10) different, independent random permutation hash functions to Sample_Shingle_List. Each hash function converts the features of all shingles in Sample_Shingle_List into a set of 128-bit fingerprints, Sample_Finger_List, and the minimum fingerprint in each Sample_Finger_List is selected as a fingerprint of the webpage. With the M independent random permutation hash functions, the Sample_Shingle_List set S_d of any document d is thus converted to Sample_Finger_List: Sample_Finger_List = (min{π_1(S_d)}, min{π_2(S_d)}, ..., min{π_M(S_d)})

Example:

Ω = {1, 2, 3, 4, 5}, S1 = {1, 2, 3}, S2 = {1, 2, 4}

π1: {1, 2, 3, 4, 5} -> {3, 2, 1, 5, 4}

π2: {1, 2, 3, 4, 5} -> {2, 3, 5, 4, 1}

...

πM: {1, 2, 3, 4, 5} -> {5, 3, 1, 2, 4}

π1(S1) = {3, 2, 1}; π2(S1) = {2, 3, 5}; πM(S1) = {5, 3, 1};

π1(S2) = {3, 2, 5}; π2(S2) = {2, 3, 4}; πM(S2) = {5, 3, 2};

Set of min(π(S1)) = Sample_Finger_List(S1) = {1, 2, 1}

Set of min(π(S2)) = Sample_Finger_List(S2) = {2, 2, 2}

S1 and S2 share the fingerprint element 2, so they are placed in the same class.

For n webpages, we have

303 Cluster the fingerprints generated for the webpages: any webpages that have an identical fingerprint are placed in the same class. The specific method is as follows:

4 Fingerprint comparison

For the clustered webpage sets obtained in Section 3, each set is processed using one of the following two methods:

Method 1:

According to the clustering result, the webpage IDs in each class are taken out. Suppose the class contains k webpages; the fingerprint set for the webpage IDs is

\begin{bmatrix} finger_{1,1} & finger_{1,2} & \cdots & finger_{1,M} \\ finger_{2,1} & finger_{2,2} & \cdots & finger_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ finger_{k,1} & finger_{k,2} & \cdots & finger_{k,M} \end{bmatrix};

The fingerprint matrix is mapped to vectors, so that it is mapped into an m-dimensional vector space. Consider the fingerprints of each row. First generate a 128-dimensional vector V with every element initialized to 0. Then examine each bit of each fingerprint in the row: if the bit is 1, add 1 to the corresponding element of V; if the bit is 0, subtract 1 from it. After accumulation, positive elements of V are set to 1 and negative elements are set to 0. Converting the vector V into a 128-bit array then gives the fingerprint of the row. Taken as an m-dimensional vector space, the fingerprints are [V_1 V_2 ... V_n]^T

The fingerprint comparison idiographic flow is as follows:

410 Load the fingerprints: let x, y, z denote three 128-bit xmm registers, let a denote the 128-bit fingerprint of webpage A, and let b denote the 128-bit fingerprint of webpage B. Load a and b into registers x and y.

410 Compute the fingerprint distance: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value (Mask) of a and b; the number of 1 bits in the XOR value is exactly the Hamming distance between a and b.

410 Obtain the high 64 bits and the low 64 bits of the fingerprint mask: use _mm_cvtsd_f64 to obtain the low 64 bits, Mask_low, and use _mm_unpacklo_epi64 together with _mm_cvtsd_f64 to obtain the high 64 bits, Mask_high.

410 Count the total number of differing fingerprint bits: use __popcnt64 to count the 1 bits in Mask_low and in Mask_high separately, and assign their sum to Count.

410 Similar webpage judgment: when Count > Dx (the distance threshold), the webpages are considered dissimilar; otherwise, they are considered similar.
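The comparison flow above can be sketched in Python as follows. This is only an illustration under stated assumptions: the xmm registers are modeled as Python integers below 2^128 rather than driven through the SSE intrinsics named in the text, and the names popcount64, hamming_distance_128, is_similar and the parameter dx are hypothetical.

```python
MASK64 = (1 << 64) - 1  # selects the low 64 bits of a value


def popcount64(value):
    """Count the 1 bits in a 64-bit value (stand-in for a popcount instruction)."""
    return bin(value & MASK64).count("1")


def hamming_distance_128(a, b):
    """Hamming distance between two 128-bit fingerprints."""
    mask = a ^ b                       # XOR value (Mask)
    mask_low = mask & MASK64           # low 64 bits, Mask_low
    mask_high = (mask >> 64) & MASK64  # high 64 bits, Mask_high
    return popcount64(mask_low) + popcount64(mask_high)  # Count


def is_similar(a, b, dx):
    """Pages are dissimilar when Count > dx, and similar otherwise."""
    return hamming_distance_128(a, b) <= dx


a = (1 << 127) | 1  # differs from b in the top bit and the bottom bit
b = 0
print(hamming_distance_128(a, b), is_similar(a, b, 1), is_similar(a, b, 2))
```

Splitting the mask into two 64-bit halves mirrors the text, where the bit count is taken on 64-bit operands.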

With SIMD-based fingerprint comparison, comparing two webpages requires only four instructions: _mm_xor_si128, _mm_unpacklo_epi64, _mm_cvtsd_f64, and __popcnt64. The comparison of two webpages is completed with only 6 instruction calls in total. With conventional processing, by contrast, a webpage with a 128-bit fingerprint requires 128 comparisons, far more than the 6 instruction calls. Therefore, for massive data, using SIMD technology reduces the number of CPU operations and significantly improves system performance.
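The contrast between conventional processing and the XOR-based comparison can be illustrated with the sketch below. It only shows that the two approaches give the same distance; it does not measure performance. The function names are hypothetical, and fingerprints are modeled as Python integers below 2^128.

```python
def hamming_bitwise(a, b, bits=128):
    """Conventional processing: compare the two fingerprints one bit at a time."""
    differing = 0
    for pos in range(bits):  # 128 separate comparisons for a 128-bit fingerprint
        if ((a >> pos) & 1) != ((b >> pos) & 1):
            differing += 1
    return differing


def hamming_xor(a, b):
    """XOR the whole fingerprints once, then count the 1 bits of the result."""
    return bin(a ^ b).count("1")


a = 0xFFFF0000FFFF0000FFFF0000FFFF0000
b = 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F
print(hamming_bitwise(a, b) == hamming_xor(a, b))
```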

Method 2:

Consider the fingerprint set:

\left[\begin{matrix} {finger}_{11} & {finger}_{12} & \cdots & {finger}_{1M} \\ {finger}_{21} & {finger}_{22} & \cdots & {finger}_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ {finger}_{k1} & {finger}_{k2} & \cdots & {finger}_{kM} \end{matrix}\right],

the following second fingerprint comparison method may also be used for processing:

For two webpages i and j, their fingerprint sets are:

Fingerprint comparison is performed column by column, and the sum of the distances between the fingerprints is then calculated.

410 Initialization: define K = 0 and Cnt = 0.

410 Load the fingerprints: each webpage has M fingerprints. Let x, y, z denote three 128-bit xmm registers, and load the K-th column fingerprints of webpages i and j into registers x and y, respectively.

410 Compute the distance of each fingerprint pair: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value (Mask) of a and b; the number of 1 bits in the XOR value is exactly the Hamming distance between a and b.

410 Compute the total distance: K++ (that is, K is incremented by 1), and the fingerprint distance of the current column is added to Cnt. If K > M, go to 409; otherwise, go to 407.

410 Similar webpage judgment: when Count > Dy (the distance threshold), the webpages are considered dissimilar; otherwise, they are considered similar.
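The second method can be sketched in Python as shown below. This is an illustration under stated assumptions rather than the implementation of the invention: each webpage is represented as a list of M fingerprints modeled as Python integers below 2^128, the loop over the list stands in for the K counter, and the names total_distance, is_similar_method2 and the parameter dy are hypothetical.

```python
def total_distance(fps_i, fps_j):
    """Sum the Hamming distances of the paired fingerprints of two webpages."""
    if len(fps_i) != len(fps_j):
        raise ValueError("both webpages must have the same number M of fingerprints")
    cnt = 0  # Cnt = 0
    for a, b in zip(fps_i, fps_j):  # one pair of fingerprints per column
        mask = a ^ b                # XOR value (Mask)
        cnt += bin(mask).count("1")  # add the distance of the current column
    return cnt


def is_similar_method2(fps_i, fps_j, dy):
    """Pages are dissimilar when the total distance exceeds dy, similar otherwise."""
    return total_distance(fps_i, fps_j) <= dy


page_i = [0b1111, 0]
page_j = [0b0101, 1 << 127]
print(total_distance(page_i, page_j), is_similar_method2(page_i, page_j, 3))
```

In the example, the first pair differs in two bits and the second pair in one bit, so the accumulated distance is 3.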

Claims

1. A parallel method for removing duplicate webpages based on SIMD optimization, characterized in that it comprises the following steps:

2. The parallel method for removing duplicate webpages based on SIMD optimization according to claim 1, characterized in that the specific steps of step 1 are:

3. The parallel method for removing duplicate webpages based on SIMD optimization according to claim 1, characterized in that the specific steps of step 2 are:

4. The parallel method for removing duplicate webpages based on SIMD optimization according to claim 1, characterized in that the specific steps of step 3 are:

5. The parallel method for removing duplicate webpages based on SIMD optimization according to any one of claims 1-4, characterized in that the specific steps of step 4 are:

Either of the following two methods is adopted:

Method 1:

{Matrix}_{finger} = \left[\begin{matrix} {finger}_{11} & {finger}_{21} & \cdots & {finger}_{n1} \\ {finger}_{12} & {finger}_{22} & \cdots & {finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ {finger}_{1M} & {finger}_{2M} & \cdots & {finger}_{nM} \end{matrix}\right];

Definition

Define the function

v(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases};

V_{1} = [v({finger}_{1,1}(1) + {finger}_{1,2}(1) + \cdots + {finger}_{1,M}(1)), v({finger}_{1,1}(2) + {finger}_{1,2}(2) + \cdots + {finger}_{1,M}(2)), \ldots, v({finger}_{1,1}(128) + {finger}_{1,2}(128) + \cdots + {finger}_{1,M}(128))]

V_{2} = [v({finger}_{2,1}(1) + {finger}_{2,2}(1) + \cdots + {finger}_{2,M}(1)), v({finger}_{2,1}(2) + {finger}_{2,2}(2) + \cdots + {finger}_{2,M}(2)), \ldots, v({finger}_{2,1}(128) + {finger}_{2,2}(128) + \cdots + {finger}_{2,M}(128))];

...

V_{n} = [v({finger}_{n,1}(1) + {finger}_{n,2}(1) + \cdots + {finger}_{n,M}(1)), v({finger}_{n,1}(2) + {finger}_{n,2}(2) + \cdots + {finger}_{n,M}(2)), \ldots, v({finger}_{n,1}(128) + {finger}_{n,2}(128) + \cdots + {finger}_{n,M}(128))]

{Matrix}_{finger} = \left[\begin{matrix} {finger}_{11} & {finger}_{21} & \cdots & {finger}_{n1} \\ {finger}_{12} & {finger}_{22} & \cdots & {finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ {finger}_{1M} & {finger}_{2M} & \cdots & {finger}_{nM} \end{matrix}\right] = [V_{1}, V_{2}, \ldots, V_{n}];

The specific fingerprint comparison procedure is as follows:

Step b: use __m128i _mm_xor_si128(__m128i a, __m128i b) to compute the XOR value Mask of a and b; the number of 1 bits in the XOR value is exactly the Hamming distance between a and b;

Method 2:

Consider the fingerprint set:

{Matrix}_{finger} = \left[\begin{matrix} {finger}_{11} & {finger}_{21} & \cdots & {finger}_{n1} \\ {finger}_{12} & {finger}_{22} & \cdots & {finger}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ {finger}_{1M} & {finger}_{2M} & \cdots & {finger}_{nM} \end{matrix}\right],

For two webpages, webpage i and webpage j, their fingerprint sets are: