CN102682104A

CN102682104A - Method for searching similar texts and link bit similarity measuring algorithm

Info

Publication number: CN102682104A
Application number: CN2012101353393A
Authority: CN
Inventors: 龙军; 袁鑫攀; 罗跃逸
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2012-05-04
Filing date: 2012-05-04
Publication date: 2012-09-19

Abstract

The invention discloses a method for searching similar texts, comprising the following steps: step 1, text feature extraction, which extracts a text feature set S_shgs; step 2, link-bit fingerprint generation, which generates a link-bit fingerprint, denoted S_dn, from S_shgs; step 3, link-bit similarity measurement, which compares the link-bit fingerprints of two texts; and step 4, retrieval of the required text according to the link-bit similarity result. The invention also discloses a corresponding link-bit similarity measurement algorithm. Experimental data show that, at the cost of a slight loss of accuracy, the algorithm reduces the number of comparisons by a multiple and improves performance accordingly.

Description

A method for searching similar texts and a link-bit similarity measurement algorithm

Technical field

The present invention relates to the field of information retrieval, and in particular to a similarity estimation method. It can be applied to similarity measurement among massive document collections and is especially suited to quickly finding similar text in massive amounts of information.

Technical background

The rapid development of Internet technology has made the amount of data on the network grow exponentially, and finding useful information quickly within it has become increasingly important. The concept of text similarity measurement and its related techniques arose in response. A good text similarity measure is of great importance in research fields such as automatic question answering, intelligent retrieval, web page de-duplication and natural language processing.

Text similarity is a metric of the degree of matching between two or more texts: the higher the similarity, the more alike the texts, and vice versa. The traditional method, the vector space model (VSM), obtains the similarity of two documents by computing the inner product of the weighted term-frequency vectors of the query document and a document in the data set. It must store a large feature vocabulary, compares slowly and has low accuracy, so it cannot be applied to similarity measurement over massive data. The minwise similarity measurement algorithm converts the similarity problem into the probability of an event: it maps the set of text feature words to a set of hash values, turning string comparison into numeric comparison. It is suitable for massive data, but it must compare a large number of fingerprints and occupies a large amount of storage. In 2010, Ping Li et al. improved on the minwise algorithm and proposed the b-bit minwise similarity measurement algorithm, which estimates the similarity of two documents using only the lowest b bits of each fingerprint; however, it still needs to compare a large number of fingerprints.

Summary of the invention

The present invention proposes a new method for searching similar texts that overcomes the above deficiencies of the prior art.

The method according to the invention comprises the following steps:

Step 1, text feature extraction: this step extracts the text feature set S_shgs;

Step 2, link-bit fingerprint generation: this step generates a link-bit fingerprint, denoted S_dn, from S_shgs;

Step 3, link-bit similarity measurement: this step compares the similarity of the link-bit fingerprints of two documents;

Step 4, retrieval: the required text is obtained from the link-bit fingerprint similarity result.

The present invention also provides a link-bit similarity measurement algorithm, characterized in that it comprises the aforesaid step 1, step 2 and step 3.

Description of drawings

Fig. 1 is the main flow chart of the method according to the invention.

Fig. 2 shows the relation between link-bit similarity and variance according to an embodiment of the invention.

Fig. 3 shows the experimental results for the precision and recall of the link-bit algorithm on the XX data set according to an embodiment of the invention.

Fig. 4 compares actual efficiency on the XX data set according to an embodiment of the invention.

Embodiment

The method provided by the invention is described in detail below with reference to the accompanying drawings, and its advantages are illustrated with examples and experimental data. Experiments show that, at the cost of a very small loss of accuracy, the method reduces the number of comparisons by a multiple and improves search performance.

The method for searching similar texts proposed by the invention comprises the following steps.

Preferably, step 1 specifically comprises:

First, the text is scanned and segmented into words with a Chinese word segmentation algorithm, producing a word set. Then a stop-word list is built and used to filter noise out of the text; the word set that remains is the document's feature set S_shgs. Noise here means words that carry no meaning in the text, typically high-frequency, low-content auxiliary words and function words.
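A minimal sketch of step 1, assuming a simple regular-expression tokenizer as a stand-in for the Chinese word segmentation algorithm and a tiny hypothetical stop-word list. The names `tokenize`, `extract_features` and `STOP_WORDS` are illustrative and do not come from the patent:

```python
import re

# Hypothetical stop-word list; a real system would load a curated vocabulary.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def tokenize(text):
    """Stand-in tokenizer: lowercase runs of word characters.

    For Chinese text this would be replaced by a word segmentation algorithm.
    """
    return re.findall(r"\w+", text.lower())


def extract_features(text, stop_words=STOP_WORDS):
    """Return the feature set S_shgs: the segmented words minus stop words."""
    return {word for word in tokenize(text) if word not in stop_words}


features = extract_features("The cat sat in the hat")
print(sorted(features))
```

The result is a set, so repeated words collapse to a single feature, which matches the set-based similarity used in the later steps.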

Preferably, step 2 specifically comprises:

1) Forming the minwise fingerprint

The Rabin function is applied to the document feature set S_shgs generated in step 1, mapping each feature to a 32-bit integer; the resulting set is named S_d. Suppose the universe is Ω = {0, 1, ..., D-1} and a_0 a_1 ... a_{D-1} is an arrangement of Ω. The vector (a_0, a_1, ..., a_{D-1}) then represents the permutation of Ω:

\pi = \begin{pmatrix} 0 & 1 & \cdots & D - 1 \\ a_{0} & a_{1} & \cdots & a_{D - 1} \end{pmatrix}

If, for any data set X ⊆ Ω and any x ∈ X, a randomly chosen permutation π satisfies

\Pr (\min \{\pi (X)\} = \pi (x)) = \frac{1}{| X |}

then π is a minwise random permutation. In other words, under π every element x of X has the same probability of becoming the minimum after permutation. With a group of k independent random permutations π_1, π_2, ..., π_k, the set S_d is thus converted into the minwise feature fingerprint:

\bar{S}_{d} = (\min \{\pi_{1} (S_{d})\}, \min \{\pi_{2} (S_{d})\}, \ldots, \min \{\pi_{k} (S_{d})\}) .
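The fingerprint construction can be sketched as follows. Two substitutions are assumptions of this sketch and not part of the patent: CRC-32 from Python's `zlib` stands in for the Rabin function, and each random permutation is approximated by a linear hash x -> (a*x + b) mod p, because an explicit permutation of a 2^32-element universe is impractical to store. The names `feature_code`, `make_hashes` and `minwise_fingerprint` are hypothetical:

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any 32-bit feature code


def feature_code(word):
    """Map a feature word to a 32-bit integer (CRC-32 as a stand-in for Rabin)."""
    return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF


def make_hashes(k, seed=0):
    """Draw k (a, b) pairs; x -> (a*x + b) mod PRIME approximates a permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minwise_fingerprint(features, hashes):
    """Return the k minima, one per hash function, over a non-empty feature set."""
    codes = [feature_code(word) for word in features]
    if not codes:
        raise ValueError("feature set must not be empty")
    return [min((a * x + b) % PRIME for x in codes) for a, b in hashes]


hashes = make_hashes(k=8)
print(minwise_fingerprint({"cat", "sat", "hat"}, hashes))
```

Both documents must be fingerprinted with the same list of hash functions, otherwise the per-position minima are not comparable.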

2) Forming the b-bit minwise fingerprint

Define the function B(x, b) = x & (2^b - 1), where & is bitwise AND. B(x, b) takes the lowest b bits of x, and b is the number of bits to keep.

Taking the lowest b bits of each element of \bar{S}_d forms the b-bit minwise feature fingerprint:

B (\bar{S}_{d}, b) = (B (\min \{\pi_{1} (S_{d})\}, b), B (\min \{\pi_{2} (S_{d})\}, b), \ldots, B (\min \{\pi_{k} (S_{d})\}, b)) .

3) Forming the link-bit fingerprint

Every n consecutive b-bit fingerprints of B(\bar{S}_d, b) are concatenated, giving the link-bit feature fingerprint S_dn.
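A short sketch of sub-steps 2) and 3), assuming the minwise fingerprint is already available as a list of integers. The function names are hypothetical, and dropping a trailing group shorter than n is a choice of this sketch that the patent does not specify:

```python
def low_bits(x, b):
    """B(x, b): keep the lowest b bits of x."""
    return x & ((1 << b) - 1)


def b_bit_fingerprint(minwise, b):
    """Apply B(., b) to every element of a minwise fingerprint."""
    return [low_bits(value, b) for value in minwise]


def link_fingerprint(bbit, b, n):
    """Concatenate every n consecutive b-bit values into one (n*b)-bit integer.

    A trailing group shorter than n is dropped (an assumption of this sketch).
    """
    linked = []
    for start in range(0, len(bbit) - n + 1, n):
        value = 0
        for part in bbit[start:start + n]:
            value = (value << b) | part
        linked.append(value)
    return linked


bbit = b_bit_fingerprint([0, 2, 1, 1, 4, 1], b=1)
print(bbit, link_fingerprint(bbit, b=1, n=2))
```

Packing each group into a single integer means one equality test covers n of the original b-bit comparisons.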

The process of step 2 is illustrated below with Example 1. Note that the examples in this application are illustrative only and do not limit the invention.

Example 1, fingerprint formation: suppose the universe is Ω = {0, 1, 2, 3, 4, 5, 6, 7}, S_1 = {1, 2, 4}, S_2 = {1, 4, 3, 6} and k = 6, with the random permutations π_1, π_2, π_3, π_4, π_5, π_6 given by:

\pi_{1} = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 2 & 3 & 0 & 4 & 6 & 7 & 1 & 5 \end{pmatrix}

\pi_{2} = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 1 & 6 & 5 & 7 & 2 & 0 & 4 & 3 \end{pmatrix}

\pi_{3} = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 5 & 1 & 7 & 2 & 6 & 3 & 4 & 0 \end{pmatrix}

\pi_{4} = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 7 & 1 & 5 & 4 & 3 & 2 & 6 & 0 \end{pmatrix}

\pi_{5} = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 3 & 7 & 6 & 0 & 4 & 5 & 1 & 2 \end{pmatrix}

\pi_{6} = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 4 & 1 & 5 & 0 & 3 & 6 & 7 & 2 \end{pmatrix}

1) Forming the minwise fingerprint

Applying π_1, π_2, π_3, π_4, π_5, π_6 to S_1 gives:

π_1(S_1) = {3, 0, 6}, π_2(S_1) = {6, 5, 2}, π_3(S_1) = {1, 7, 6}, π_4(S_1) = {1, 5, 3}, π_5(S_1) = {7, 6, 4}, π_6(S_1) = {1, 5, 3};

The minwise fingerprint of document 1 is:

\bar{S}_{1} = (\min \{\pi_{1} (S_{1})\}, \min \{\pi_{2} (S_{1})\}, \ldots, \min \{\pi_{6} (S_{1})\}) = (0, 2, 1, 1, 4, 1)

Applying π_1, π_2, π_3, π_4, π_5, π_6 to S_2 gives:

π_1(S_2) = {3, 6, 4, 1}, π_2(S_2) = {6, 2, 7, 4}, π_3(S_2) = {1, 6, 2, 4}, π_4(S_2) = {1, 3, 4, 6}, π_5(S_2) = {7, 4, 0, 1}, π_6(S_2) = {1, 3, 0, 7};

The minwise fingerprint of document 2 is:

\bar{S}_{2} = (\min \{\pi_{1} (S_{2})\}, \min \{\pi_{2} (S_{2})\}, \ldots, \min \{\pi_{6} (S_{2})\}) = (1, 2, 1, 1, 0, 0)

Therefore, the minwise fingerprints generated from S_1 and S_2 under the random permutations π_1, π_2, π_3, π_4, π_5, π_6 are \bar{S}_1 and \bar{S}_2 respectively.

2) Forming the b-bit minwise fingerprint

Taking the lowest b = 1 bit of each element of \bar{S}_1 gives the b-bit minwise fingerprint:

B (\bar{S}_{1}, b) = (B (\min \{\pi_{1} (S_{1})\}, b), B (\min \{\pi_{2} (S_{1})\}, b), \ldots, B (\min \{\pi_{6} (S_{1})\}, b)) = (0, 0, 1, 1, 0, 1)

Taking the lowest b = 1 bit of each element of \bar{S}_2 gives the b-bit minwise fingerprint:

B (\bar{S}_{2}, b) = (B (\min \{\pi_{1} (S_{2})\}, b), B (\min \{\pi_{2} (S_{2})\}, b), \ldots, B (\min \{\pi_{6} (S_{2})\}, b)) = (1, 0, 1, 1, 0, 0)

3) Forming the link-bit fingerprint

Concatenating every n = 2 b-bit fingerprints of B(\bar{S}_1, b): S_1n = {0-0, 1-1, 0-1} = {00, 11, 01}

Concatenating every n = 2 b-bit fingerprints of B(\bar{S}_2, b): S_2n = {1-0, 1-1, 0-0} = {10, 11, 00}
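Example 1 can be checked mechanically with the sketch below. It uses the six permutations and the two sets exactly as given in the example; the helper names `minwise`, `b_bits` and `link` are hypothetical, and the linked values are kept as bit strings so that they read like the example:

```python
# Bottom rows of the six permutations from Example 1: PERMS[j][x] is pi_{j+1}(x).
PERMS = [
    [2, 3, 0, 4, 6, 7, 1, 5],
    [1, 6, 5, 7, 2, 0, 4, 3],
    [5, 1, 7, 2, 6, 3, 4, 0],
    [7, 1, 5, 4, 3, 2, 6, 0],
    [3, 7, 6, 0, 4, 5, 1, 2],
    [4, 1, 5, 0, 3, 6, 7, 2],
]
S1 = {1, 2, 4}
S2 = {1, 3, 4, 6}


def minwise(s, perms):
    """Minimum of the permuted set under each permutation."""
    return [min(p[x] for x in s) for p in perms]


def b_bits(fp, b):
    """Lowest b bits of every fingerprint element."""
    return [v & ((1 << b) - 1) for v in fp]


def link(fp, b, n):
    """Join every n consecutive b-bit values into one bit string."""
    return ["".join(format(v, f"0{b}b") for v in fp[i:i + n])
            for i in range(0, len(fp) - n + 1, n)]


fp1, fp2 = minwise(S1, PERMS), minwise(S2, PERMS)
print(fp1, b_bits(fp1, 1), link(b_bits(fp1, 1), 1, 2))
print(fp2, b_bits(fp2, 1), link(b_bits(fp2, 1), 1, 2))
```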

Preferably, step 3 specifically comprises:

1) Minwise similarity estimation

In the minwise similarity measurement algorithm, the unbiased estimate of the similarity R of two documents is:

\hat{R}_{M} = \frac{1}{k} \sum_{j = 1}^{k} 1 \{\min (\pi_{j} (S_{1})) = \min (\pi_{j} (S_{2}))\} .
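The estimator is simply the fraction of fingerprint positions on which the two documents agree. A minimal sketch, with the hypothetical name `estimate_minwise`:

```python
def estimate_minwise(fp1, fp2):
    """Fraction of positions where two equal-length minwise fingerprints match."""
    if len(fp1) != len(fp2) or not fp1:
        raise ValueError("fingerprints must be non-empty and of equal length")
    return sum(a == b for a, b in zip(fp1, fp2)) / len(fp1)


# Fingerprints (0,2,1,1,4,1) and (1,2,1,1,0,0) agree in 3 of 6 positions.
print(estimate_minwise([0, 2, 1, 1, 4, 1], [1, 2, 1, 1, 0, 0]))  # 0.5
```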

2) b-bit minwise similarity estimation

Define z_1 and z_2 as the minimum values obtained when a random permutation π acts on the sets S_1 and S_2:

z_1 = min{π(S_1)}, z_2 = min{π(S_2)}

Let e_{1,i} be the i-th lowest bit of z_1 and e_{2,i} the i-th lowest bit of z_2. In b-bit minwise similarity estimation, the unbiased estimate of the similarity of two documents is:

\hat{R}_{b} = \frac{\hat{E}_{b} - C_{1, b}}{1 - C_{2, b}}

where

\hat{E}_{b} = \frac{1}{k} \sum_{j = 1}^{k} 1 \left\{ \prod_{i = 1}^{b} 1 \{e_{1, i, \pi_{j}} = e_{2, i, \pi_{j}}\} = 1 \right\}

C_{1, b} = A_{1, b} \frac{r_{2}}{r_{1} + r_{2}} + A_{2, b} \frac{r_{1}}{r_{1} + r_{2}}

C_{2, b} = A_{1, b} \frac{r_{1}}{r_{1} + r_{2}} + A_{2, b} \frac{r_{2}}{r_{1} + r_{2}}

A_{1, b} = \frac{r_{1} {[1 - r_{1}]}^{2^{b} - 1}}{1 - {[1 - r_{1}]}^{2^{b}}}

A_{2, b} = \frac{r_{2} {[1 - r_{2}]}^{2^{b} - 1}}{1 - {[1 - r_{2}]}^{2^{b}}}

r_{1} = \frac{f_{1}}{D}, \quad r_{2} = \frac{f_{2}}{D}, \quad f_{1} = | S_{1} |, \quad f_{2} = | S_{2} |
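The formulas above translate directly into code. The sketch below is one possible transcription; the function names `a_term`, `c_terms` and `estimate_b_bit` are hypothetical, and the inputs in the usage line are the 1-bit fingerprints, set sizes and universe size of Example 1:

```python
def a_term(r, b):
    """A_{j,b} = r (1 - r)^(2^b - 1) / (1 - (1 - r)^(2^b))."""
    m = 2 ** b
    return r * (1 - r) ** (m - 1) / (1 - (1 - r) ** m)


def c_terms(f1, f2, d, b):
    """Return (C_{1,b}, C_{2,b}) for set sizes f1, f2 and universe size d."""
    r1, r2 = f1 / d, f2 / d
    a1, a2 = a_term(r1, b), a_term(r2, b)
    c1 = a1 * r2 / (r1 + r2) + a2 * r1 / (r1 + r2)
    c2 = a1 * r1 / (r1 + r2) + a2 * r2 / (r1 + r2)
    return c1, c2


def estimate_b_bit(bfp1, bfp2, f1, f2, d, b):
    """R_b estimate: bias-corrected fraction of matching b-bit values."""
    if len(bfp1) != len(bfp2) or not bfp1:
        raise ValueError("fingerprints must be non-empty and of equal length")
    e_hat = sum(x == y for x, y in zip(bfp1, bfp2)) / len(bfp1)
    c1, c2 = c_terms(f1, f2, d, b)
    return (e_hat - c1) / (1 - c2)


r_hat = estimate_b_bit([0, 0, 1, 1, 0, 1], [1, 0, 1, 1, 0, 0], f1=3, f2=4, d=8, b=1)
print(round(r_hat, 3))
```

With these inputs the estimate comes out at roughly 0.47. The correction terms matter most for small b, where two different minima often share the same low bits by chance.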

3) Link-bit minwise similarity estimation

Define e_{1, i, π_j} (e_{2, i, π_j}) as the i-th lowest bit of z_1 (z_2) under the permutation π_j. Define the link-bit variables x_1 and x_2 obtained by concatenating n b-bit values:

x_{1} = e_{1, 1, \pi_{1}} e_{1, 2, \pi_{1}} \cdots e_{1, b, \pi_{1}} \, e_{1, 1, \pi_{2}} e_{1, 2, \pi_{2}} \cdots e_{1, b, \pi_{2}} \cdots e_{1, 1, \pi_{n}} e_{1, 2, \pi_{n}} \cdots e_{1, b, \pi_{n}},

x_{2} = e_{2, 1, \pi_{1}} e_{2, 2, \pi_{1}} \cdots e_{2, b, \pi_{1}} \, e_{2, 1, \pi_{2}} e_{2, 2, \pi_{2}} \cdots e_{2, b, \pi_{2}} \cdots e_{2, 1, \pi_{n}} e_{2, 2, \pi_{n}} \cdots e_{2, b, \pi_{n}}

x_1 = x_2 if and only if

e_{1, i, \pi_{j}} = e_{2, i, \pi_{j}} \quad (i \in [1, b], \ j \in [1, n]).

Let G_{b,n} denote the probability that x_1 = x_2, where b is the number of bits and n is the number of linked fingerprints. Then:

G_{b, n} = E_{b}^{n},
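A brief sketch of why an n-th root is needed to recover the similarity from G_{b,n}. It assumes, as the unbiasedness of the b-bit estimator implies, that the probability of a b-bit match is E_b = C_{1,b} + (1 - C_{2,b}) R, and that the n permutations are independent:

```latex
G_{b,n} = E_b^{\,n} = \bigl(C_{1,b} + (1 - C_{2,b})\,R\bigr)^{n}
\quad\Longrightarrow\quad
R = \frac{G_{b,n}^{1/n} - C_{1,b}}{1 - C_{2,b}} .
```

Replacing G_{b,n} by its empirical value then gives an estimate of R from link-bit comparisons alone.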

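The similarity of document 1 and document 2 is then estimated as:

\hat{R}_{b, n} = \frac{\hat{G}_{b, n}^{\frac{1}{n}} - C_{1, b}}{1 - C_{2, b}}

where \hat{G}_{b,n} is the estimate of G_{b,n}, that is, the fraction of linked groups on which the link-bit fingerprints of the two documents are equal.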
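A sketch of the link-bit estimator, assuming the empirical match fraction over the linked groups is used for the estimate of G_{b,n}. The function names are hypothetical; the usage line feeds in the link-bit fingerprints, set sizes and universe size of Example 1:

```python
def a_term(r, b):
    """A_{j,b} = r (1 - r)^(2^b - 1) / (1 - (1 - r)^(2^b))."""
    m = 2 ** b
    return r * (1 - r) ** (m - 1) / (1 - (1 - r) ** m)


def c_terms(f1, f2, d, b):
    """Return (C_{1,b}, C_{2,b}) for set sizes f1, f2 and universe size d."""
    r1, r2 = f1 / d, f2 / d
    a1, a2 = a_term(r1, b), a_term(r2, b)
    c1 = a1 * r2 / (r1 + r2) + a2 * r1 / (r1 + r2)
    c2 = a1 * r1 / (r1 + r2) + a2 * r2 / (r1 + r2)
    return c1, c2


def estimate_link(linked1, linked2, f1, f2, d, b, n):
    """R_{b,n} estimate from two link-bit fingerprints of equal length."""
    if len(linked1) != len(linked2) or not linked1:
        raise ValueError("fingerprints must be non-empty and of equal length")
    # Empirical probability that a whole linked group matches.
    g_hat = sum(x == y for x, y in zip(linked1, linked2)) / len(linked1)
    c1, c2 = c_terms(f1, f2, d, b)
    return (g_hat ** (1.0 / n) - c1) / (1.0 - c2)


r_hat = estimate_link(["00", "11", "01"], ["10", "11", "00"],
                      f1=3, f2=4, d=8, b=1, n=2)
print(round(r_hat, 3))
```

With these inputs the estimate is roughly 0.333. Only len(linked1) equality tests are performed, which is k/n rather than k.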

The implementation of step 3 is illustrated below with Example 2.

Example 2, similarity estimation:

1) Minwise similarity estimation

The minwise similarity of S_1 and S_2 is estimated as \hat{R}_M = 3/6 = 0.5, since the fingerprints \bar{S}_1 and \bar{S}_2 of Example 1 agree in 3 of the 6 positions.

2) b-bit minwise similarity estimation

Take b = 1. Then f_1 = 3, f_2 = 4, D = 8,

A_{1, b} = 0.385, A_{2, b} = 0.333, C_{1, b} = 0.363, C_{2, b} = 0.355,

\hat{E}_{b} = \frac{1}{k} \sum_{j = 1}^{k} 1 \left\{ \prod_{i = 1}^{b} 1 \{e_{1, i, \pi_{j}} = e_{2, i, \pi_{j}}\} = 1 \right\} = \frac{4}{6} = 0.667,

so

\hat{R}_{b} = \frac{\hat{E}_{b} - C_{1, b}}{1 - C_{2, b}} = 0.4721 .

3) Link-bit fingerprint similarity estimation

With b = 1 and n = 2, the link-bit fingerprints of Example 1 agree in 1 of 3 groups, so \hat{G}_{b,n} = 1/3 and

\hat{R}_{b, n} = \frac{\sqrt{\hat{G}_{b, n}} - C_{1, b}}{1 - C_{2, b}} = 0.3330 .

4) Jaccard similarity

The actual Jaccard similarity of the two sets is R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = 2/5 = 0.4.

The estimates differ from the actual value because k is too small. As the variance curves in Fig. 2 show, the variance is large when k is very small; as k grows, the estimate comes closer and closer to the actual value R and becomes more accurate.
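The effect of k can be illustrated with a small simulation on the sets of Example 1. This is only a sketch: it draws explicit uniformly random permutations of the 8-element universe with a seeded generator, and the function name `minwise_estimate` is hypothetical. The printed values depend on the seed, so no particular output is claimed:

```python
import random


def minwise_estimate(s1, s2, universe_size, k, seed=0):
    """Minwise similarity estimate from k fresh random permutations."""
    rng = random.Random(seed)
    perm = list(range(universe_size))
    hits = 0
    for _ in range(k):
        rng.shuffle(perm)  # a new uniformly random permutation of the universe
        if min(perm[x] for x in s1) == min(perm[x] for x in s2):
            hits += 1
    return hits / k


S1 = {1, 2, 4}
S2 = {1, 3, 4, 6}
jaccard = len(S1 & S2) / len(S1 | S2)  # 2 / 5 = 0.4

for k in (6, 60, 600, 6000):
    print(k, round(minwise_estimate(S1, S2, 8, k), 3), "actual:", jaccard)
```

For small k the estimate can land far from 0.4, while for k in the thousands it should settle close to the actual Jaccard similarity.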

Compared with the prior art, the present invention has the following advantage: relative to the existing b-bit minwise similarity measurement algorithm, it reduces the number of comparisons by a multiple and improves performance correspondingly. This advantage is demonstrated below from three aspects.

1) Variance analysis

By giving up a very small amount of accuracy, the invention gains a multiple improvement in performance, which is of great practical significance. Fig. 2 shows the similarity (R) versus variance (Var) curves of R_{1,2} and R_{2,2} together with those for b = 1 and b = 2, for k = 1000, n = 2 and four selected values of r_1 = r_2 (from 10^-10 to 0.9). The variance of the link-bit estimator R_{2,2} is larger than that of the b = 2 estimator, so accuracy drops somewhat; but because two 2-bit fingerprints are linked, the number of comparisons is halved. In similarity detection over massive data, such as web page de-duplication, hundreds of millions of web page pairs often need similarity estimates, so trading a very small loss of accuracy for a multiple improvement in performance is highly practical.

2) Precision and recall analysis

Fig. 3 shows the experimental precision and recall of the link-bit similarity measurement algorithm for similarity R ≥ R_0. The recall curves in Fig. 3 are almost indistinguishable, whereas the precision curves differ somewhat, so the precision results are analysed from two angles.

First, at R_0 = 0.5 and a precision of 0.8, the five estimators compared require k = 100, 500, 700, 300 and 450 respectively. To reach the same precision, the link-bit estimator needs 700 samples, more than the 500 needed by the corresponding b-bit estimator. However, because two fingerprints are linked for each comparison, the link-bit estimator needs only 700/2 = 350 comparisons, whereas the b-bit estimator needs 500. Admittedly, the b-bit estimator needs 200 fewer samples and therefore less storage than the link-bit estimator.

Second, at R_0 = 0.5 and k = 600, the precision of the five estimators is 0.9, 0.88, 0.84, 0.86 and 0.79 respectively. Taking the same comparison as an example, with the same sample count k = 600 the precision of the b-bit estimator is 0.88 and that of the link-bit estimator is 0.86: the link-bit estimator is slightly less precise, but the gap is very small. Meanwhile the link-bit estimator needs only 600/2 = 300 comparisons, whereas the b-bit estimator needs 600. Because the sample count k = 600 is the same, the storage required is the same.

The analysis of precision and recall leads to the following conclusion. When k is large, the precision of the link-bit minwise similarity measurement algorithm of the invention is very close to that of the b-bit minwise similarity measurement algorithm, so estimating similarity with the link-bit algorithm reduces the number of comparisons and improves efficiency. When k is small, the link-bit algorithm and the b-bit algorithm each have their own strengths in efficiency and storage, and the choice can be made according to system requirements.
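The trade-off between comparisons and storage can be made concrete with a small cost model. It rests on two assumptions of this sketch: one comparison covers n linked values, and a fingerprint of k values at b bits each occupies k*b bits. The name `cost` is hypothetical, and b = 1 in the usage lines is chosen only for illustration:

```python
def cost(k, b, n=1):
    """Return (comparisons per document pair, stored bits per document).

    k: number of minwise samples, b: bits kept per sample,
    n: number of samples linked into one comparison (n=1 is plain b-bit minwise).
    """
    if n < 1 or k % n:
        raise ValueError("k must be a positive multiple of n")
    return k // n, k * b


# Same precision target: link-bit with k=700, n=2 versus b-bit with k=500.
print(cost(700, b=1, n=2), cost(500, b=1))
# Same sample count: link-bit with k=600, n=2 versus b-bit with k=600.
print(cost(600, b=1, n=2), cost(600, b=1))
```

In the first case the link-bit variant makes fewer comparisons but stores more; in the second the storage is equal and the comparisons are halved.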

3) Efficiency analysis

10,000 document pairs were selected at random and the CPU time consumed was measured, as shown in Fig. 4; the experiment used k = 600. Fig. 4 shows that the link-bit estimator has the shortest running time. For 1-bit fingerprints it needs only k/2 = 300 comparisons, whereas the b-bit estimator needs k = 600; for 2-bit fingerprints it needs k/2 comparisons, whereas the b-bit estimator needs k. The experimental results show that the link-bit similarity measurement algorithm consumes less CPU time, close to half that of the b-bit minwise similarity measurement algorithm. The algorithm described in the invention can therefore improve the performance of the b-bit minwise similarity measurement algorithm by a multiple.

Claims

1. A method for searching similar texts, characterized in that it comprises the following steps:

2. The method for searching similar texts according to claim 1, characterized in that step 1 specifically comprises:

First, the text is scanned and segmented into words with a Chinese word segmentation algorithm, producing a word set; then a stop-word list is built and used to filter noise out of the text, and the word set that remains is the document's feature set S_shgs.

3. The method for searching similar texts according to claim 1 or 2, characterized in that step 2 specifically comprises:

First, forming the minwise fingerprint; then, forming the b-bit minwise fingerprint; finally, forming the link-bit fingerprint.

4. The link-bit similarity measurement algorithm according to any one of claims 1 to 3, characterized in that step 3 specifically comprises:

Define z_1 and z_2 as the minimum values obtained when a random permutation π acts on the minwise fingerprint sets S_1 and S_2 of document 1 and document 2:

z_1 = min{π(S_1)}, z_2 = min{π(S_2)},

Define e_{1, i, π_j} (e_{2, i, π_j}) as the i-th lowest bit of z_1 (z_2) under π_j. Define the link-bit variables x_1 and x_2 obtained by concatenating n b-bit values.

x_1 = x_2 if and only if

e_{1, i, \pi_{j}} = e_{2, i, \pi_{j}} \quad (i \in [1, b], \ j \in [1, n]).

G_{b, n} = E_{b}^{n},

The similarity of document 1 and document 2 is then estimated as:

\hat{R}_{b, n} = \frac{\hat{G}_{b, n}^{\frac{1}{n}} - C_{1, b}}{1 - C_{2, b}}

where \hat{G}_{b,n} is the estimate of G_{b,n}.

5. A link-bit similarity measurement algorithm, characterized in that it comprises:

Step 3, link-bit similarity measurement: this step compares the similarity of the link-bit fingerprints of two documents.