
CN103176962B - Statistical method and system for text similarity

Info

Publication number: CN103176962B
Application number: CN201310074669.0A
Authority: CN
Inventors: 朱定局
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: China Southern Power Grid Internet Service Co ltd; Ourchem Information Consulting Co ltd
Priority date: 2013-03-08
Filing date: 2013-03-08
Publication date: 2015-11-04
Anticipated expiration: 2033-03-08
Also published as: CN103176962A

Abstract

The invention discloses a statistical method for text similarity, comprising: obtaining a first text and a second text whose similarity is to be determined; dividing the first and second texts into text fragments at a first division scale, and calculating the ratio of the number of text fragments identical in the first and second texts at the first division scale to the total number of text fragments of the first text; deleting the identical text fragments from the first and second texts to obtain a first remaining text and a second remaining text, respectively; dividing the first and second remaining texts into text fragments at a second division scale, and calculating the ratio of the number of text fragments identical in the first and second remaining texts at the second division scale to the total number of text fragments of the first remaining text; and calculating the comprehensive similarity of the first text and the second text. The invention reflects fairly accurately the degree of similarity between texts whose word and sentence order has been deliberately shuffled, so that similar texts with deliberately shuffled word order, sentence order, or paragraph order can be detected.

Description

Statistical method and system for text similarity

Technical field

The present invention relates to text processing, and in particular to a statistical method for text similarity and a statistical system for text similarity.

Background

In the prior art, the similarity of two texts is generally determined by segmenting both texts into words and then looking for strings of words and sentences that repeat in the same order in the two texts.

However, if the order of the words and sentences in a text has been deliberately shuffled, then even for texts that are in fact similar (for example, plagiarized), the similarity obtained by existing statistical methods is low and fails to reflect their actual degree of similarity.

Summary of the invention

In view of this, to solve the problem that conventional text similarity methods struggle to reflect accurately the degree of similarity between texts whose word and sentence order has been deliberately shuffled, it is necessary to provide a statistical method for text similarity that reflects this degree of similarity fairly accurately.

A statistical method for text similarity, comprising: obtaining a first text and a second text whose similarity is to be determined; dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments identical in the first text and the second text at the first division scale to the total number of text fragments of the first text; deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively; dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments identical in the first remaining text and the second remaining text at the second division scale to the total number of text fragments of the first remaining text, the second division scale being smaller than the first division scale; and multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, subtracting the similarity at the first division scale from one, multiplying the result by y1, and then adding the similarity at the first division scale, to calculate the comprehensive similarity of the first text and the second text.

In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.

In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into sentences, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.

In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into sentences. The method further comprises the step of deleting the identical sentences from the first remaining text and the second remaining text to obtain a text T5 and a text T6, respectively, dividing text T5 and text T6 into words, comparing all words of text T5 with all words of text T6, and calculating the ratio z1 of the number of words identical in text T5 and text T6 to the total number of words in text T5. The comprehensive similarity of the first text and the second text is then calculated by the following formula: comprehensive similarity M1 = x1*c1 + (1 - x1*c1) * [y1*c2 + (1 - y1*c2) * z1], where c1 is the weight of the paragraph scale in the comprehensive similarity and c2 is the weight of the sentence scale in the comprehensive similarity.

In one embodiment, the method further comprises the step of judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and, if so, judging that the first text is similar to the second text.

In one embodiment, the method further comprises the following steps: calculating the ratio x2 of the number of text fragments identical in the first text and the second text at the first division scale to the total number of text fragments of the second text; calculating the ratio y2 of the number of text fragments identical in the first remaining text and the second remaining text at the second division scale to the total number of text fragments of the second remaining text; multiplying x2 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, subtracting the similarity at the first division scale from one, multiplying the result by y2, and then adding the similarity at the first division scale, to calculate the comprehensive similarity of the second text and the first text; and judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and whether the comprehensive similarity of the second text and the first text is greater than the similarity threshold, and, if either of the two is greater than the similarity threshold, judging that the first text is similar to the second text.

The present invention correspondingly also provides a statistical system for text similarity.

A statistical system for text similarity, comprising: a reading module for obtaining a first text and a second text whose similarity is to be determined; a first segmentation comparison module for dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments identical in the first text and the second text at the first division scale to the total number of text fragments of the first text; a first deletion module for deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively; a second segmentation comparison module for dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments identical in the first remaining text and the second remaining text at the second division scale to the total number of text fragments of the first remaining text, the second division scale being smaller than the first division scale; and a comprehensive similarity calculation module for multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, subtracting the similarity at the first division scale from one, multiplying the result by y1, and then adding the similarity at the first division scale, to calculate the comprehensive similarity of the first text and the second text.

In one embodiment, the system further comprises a judging module for judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and, if so, judging that the first text is similar to the second text.

The above statistical method and system for text similarity take the paragraphs, sentences, and words of a text as successive scales, apply a split-compare-delete procedure to the texts, and then calculate the comprehensive similarity between them. This reflects fairly accurately the degree of similarity between texts whose word and sentence order has been deliberately shuffled, so that similar texts with deliberately shuffled word order, sentence order, or paragraph order can also be detected.

Brief description of the drawings

Fig. 1 is a flowchart of the statistical method for text similarity in Embodiment one;

Fig. 2 is a flowchart of the statistical method for text similarity in Embodiment two;

Fig. 3 is a flowchart of the statistical method for text similarity in Embodiment three.

Detailed description

To make the objects, features, and advantages of the present invention clearer, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.

Embodiment one:

Fig. 1 is a flowchart of the statistical method for text similarity in one embodiment, which comprises the following steps:

S110, obtain the text T1 and the text T2 whose similarity is to be determined.

S120, divide text T1 and text T2 into paragraphs, compare every paragraph of text T1 with every paragraph of text T2, and denote the number of identical paragraphs as k3.

In this embodiment, the number of paragraphs in text T1 is denoted k1 and the number of paragraphs in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, the i-th paragraph of text T1 is compared with the j-th paragraph of text T2 to check whether they are identical, and the number of identical paragraphs is denoted k3.
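The counting in step S120 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names `split_paragraphs` and `count_identical` are hypothetical, paragraphs are assumed to be separated by line breaks, and repeated paragraphs are assumed to be matched one-to-one (a multiset intersection), a detail the text does not specify.

```python
from collections import Counter


def split_paragraphs(text):
    # Assumption: one paragraph per non-empty line.
    return [p.strip() for p in text.split("\n") if p.strip()]


def count_identical(frags_a, frags_b):
    # Multiset intersection: each fragment counts as often as it occurs
    # in both lists, independent of position.
    common = Counter(frags_a) & Counter(frags_b)
    return sum(common.values())


t1 = "alpha\nbeta\ngamma"
t2 = "gamma\ndelta\nalpha\nepsilon"

p1, p2 = split_paragraphs(t1), split_paragraphs(t2)
k1, k2 = len(p1), len(p2)
k3 = count_identical(p1, p2)
print(k1, k2, k3)
```

With these inputs, k1 = 3, k2 = 4, and k3 = 2: "alpha" and "gamma" occur in both texts, and their order plays no role in the count.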

S130, delete the identical paragraphs from text T1 and text T2; text T1 becomes text T3 after the deletion, and text T2 becomes text T4.

Each identical paragraph found by the comparison in step S120 is deleted from text T1 and text T2, yielding text T3 and text T4, respectively. After the deletion, text T3 and text T4 have no paragraph in common.
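The deletion in step S130 might look like the sketch below. The function name `remove_identical` is hypothetical, and the one-to-one handling of repeated paragraphs is an assumption made for this illustration; the source only states that identical paragraphs are deleted from both texts.

```python
from collections import Counter


def remove_identical(frags_a, frags_b):
    """Return both fragment lists with their shared fragments removed."""
    common = Counter(frags_a) & Counter(frags_b)

    def strip(frags):
        budget = Counter(common)  # shared copies still to be dropped
        rest = []
        for frag in frags:
            if budget[frag] > 0:
                budget[frag] -= 1  # drop one matched occurrence
            else:
                rest.append(frag)  # keep unmatched fragments in order
        return rest

    return strip(frags_a), strip(frags_b)


t1_paragraphs = ["alpha", "beta", "gamma"]
t2_paragraphs = ["gamma", "delta", "alpha", "epsilon"]
t3_paragraphs, t4_paragraphs = remove_identical(t1_paragraphs, t2_paragraphs)
print(t3_paragraphs, t4_paragraphs)
```

The remaining lists share no paragraph, which matches the property stated for text T3 and text T4.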

S140, divide text T3 and text T4 into sentences, compare every sentence of text T3 with every sentence of text T4, and denote the number of identical sentences as k6.

In this embodiment, the number of sentences in text T3 is denoted k4 and the number of sentences in text T4 is denoted k5. For i from 1 to k4 and j from 1 to k5, the i-th sentence of text T3 is compared with the j-th sentence of text T4 to check whether they are identical, and the number of identical sentences is denoted k6.
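A possible sketch of the sentence-level counting in step S140 follows. The source does not say how sentences are delimited, so splitting on common Chinese and Western sentence terminators is an assumption, as are the helper names `split_sentences` and `count_identical`.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: sentences end at ., !, ? or their full-width Chinese forms.
    parts = re.split(r"[.!?。！？]+", text)
    return [s.strip() for s in parts if s.strip()]


def count_identical(frags_a, frags_b):
    # Order-independent count of fragments present in both lists.
    common = Counter(frags_a) & Counter(frags_b)
    return sum(common.values())


t3 = "The market opened late. Prices rose quickly."
t4 = "Prices rose quickly. Nobody expected it. The market opened late."

s3, s4 = split_sentences(t3), split_sentences(t4)
k4, k5 = len(s3), len(s4)
k6 = count_identical(s3, s4)
print(k4, k5, k6)
```

Because the terminators are discarded by the split, two sentences that differ only in their final punctuation would count as identical in this sketch.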

S150, delete the identical sentences from text T3 and text T4; text T3 becomes text T5 after the deletion, and text T4 becomes text T6.

Each identical sentence found by the comparison in step S140 is deleted from text T3 and text T4, yielding text T5 and text T6, respectively. After the deletion, text T5 and text T6 have no sentence in common.

S160, divide text T5 and text T6 into words, compare every word of text T5 with every word of text T6, and denote the number of identical words as k9.

The division into words can use any prior-art word segmentation algorithm. In this embodiment, the number of words in text T5 is denoted k7 and the number of words in text T6 is denoted k8. For i from 1 to k7 and j from 1 to k8, the i-th word of text T5 is compared with the j-th word of text T6 to check whether they are identical, and the number of identical words is denoted k9.

S170, calculate the comprehensive similarity of text T1 and text T2, and the comprehensive similarity of text T2 and text T1.

The comprehensive similarity M1 of text T1 and text T2 is calculated by the following formula:

M1=k3/k1*c1+(1-k3/k1*c1)*[k6/k4*c2+(1-k6/k4*c2)*k9/k7]

The comprehensive similarity M2 of text T2 and text T1 is calculated by the following formula:

M2=k3/k2*c1+(1-k3/k2*c1)*[k6/k5*c2+(1-k6/k5*c2)*k9/k8]

Here c1 is the weight of the paragraph scale in the comprehensive similarity and c2 is the weight of the sentence scale in the comprehensive similarity. Suitable empirical values may be chosen (provided that c1>0, 1-k3/k1*c1>0, 1-k3/k2*c1>0, c2>0, 1-k6/k4*c2>0, 1-k6/k5*c2>0) to adjust the proportion that each division scale contributes to the comprehensive similarity.
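The two formulas can be evaluated with one small function, sketched below. The function name `comprehensive_similarity` is hypothetical, and treating a ratio with a zero denominator as 0 (for instance when nothing remains after a deletion step) is an assumption of this sketch that the source does not address.

```python
def comprehensive_similarity(k3, kp, k6, ks, k9, kw, c1=1.0, c2=1.0):
    """Three-scale similarity: kp, ks, kw are the paragraph, sentence and
    word totals of the text used as the reference (k1/k4/k7 or k2/k5/k8)."""

    def ratio(num, den):
        # Assumption: an empty remainder contributes a ratio of 0.
        return num / den if den else 0.0

    p = ratio(k3, kp) * c1  # weighted paragraph-scale similarity
    s = ratio(k6, ks) * c2  # weighted sentence-scale similarity
    w = ratio(k9, kw)       # word-scale ratio
    return p + (1 - p) * (s + (1 - s) * w)


# Illustrative counts only, not data from the source.
k3, k6, k9 = 2, 1, 3
m1 = comprehensive_similarity(k3, 4, k6, 4, k9, 6)    # k1=4, k4=4, k7=6
m2 = comprehensive_similarity(k3, 8, k6, 8, k9, 12)   # k2=8, k5=8, k8=12
print(m1, m2)
```

For these counts M1 = 0.8125 and M2 = 0.5078125. The nesting means each finer scale only contributes to the share of the text that the coarser scale left unmatched.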

In one embodiment, c1=c2=1, and the comprehensive similarity of text T1 and text T2 is then:

M1=k3/k1+(1-k3/k1)*[k6/k4+(1-k6/k4)*k9/k7]

The comprehensive similarity of text T2 and text T1 is:

M2=k3/k2+(1-k3/k2)*[k6/k5+(1-k6/k5)*k9/k8]

The comprehensive similarity of text T1 and text T2 is not necessarily equal to the comprehensive similarity of text T2 and text T1. For example, if text T1 is one half of text T2, then all of text T1 can be found in text T2, whereas only half of text T2 can be found in text T1. In this case, the comprehensive similarity of text T1 and text T2 is clearly greater than the comprehensive similarity of text T2 and text T1.
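The asymmetry can be reproduced with a compact end-to-end sketch of steps S110 to S170. All helper names are hypothetical, and several details are assumptions made for illustration: paragraphs are split on line breaks, sentences on terminal punctuation, words on whitespace, repeated fragments are matched one-to-one, and a ratio with a zero denominator counts as 0.

```python
import re
from collections import Counter


def split_sentences(text):
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]


def match_and_remove(a, b):
    """Return (number of shared fragments, rest of a, rest of b)."""
    common = Counter(a) & Counter(b)

    def rest(frags):
        budget, out = Counter(common), []
        for f in frags:
            if budget[f] > 0:
                budget[f] -= 1
            else:
                out.append(f)
        return out

    return sum(common.values()), rest(a), rest(b)


def ratio(num, den):
    return num / den if den else 0.0  # assumption: empty remainder gives 0


def similarities(t1, t2, c1=1.0, c2=1.0):
    p1 = [p.strip() for p in t1.split("\n") if p.strip()]
    p2 = [p.strip() for p in t2.split("\n") if p.strip()]
    k3, r1, r2 = match_and_remove(p1, p2)            # S120, S130
    s1 = [s for p in r1 for s in split_sentences(p)]
    s2 = [s for p in r2 for s in split_sentences(p)]
    k6, q1, q2 = match_and_remove(s1, s2)            # S140, S150
    w1 = [w for s in q1 for w in s.split()]
    w2 = [w for s in q2 for w in s.split()]
    k9, _, _ = match_and_remove(w1, w2)              # S160

    def combine(kp, ks, kw):                         # S170
        p = ratio(k3, kp) * c1
        s = ratio(k6, ks) * c2
        return p + (1 - p) * (s + (1 - s) * ratio(k9, kw))

    return combine(len(p1), len(s1), len(w1)), combine(len(p2), len(s2), len(w2))


t2 = ("Rain fell all night.\nThe river rose by morning.\n"
      "Roads were closed.\nSchools stayed shut.")
t1 = "Rain fell all night.\nThe river rose by morning."
m1, m2 = similarities(t1, t2)
print(m1, m2)
```

Here text T1 consists of the first two of the four paragraphs of text T2, so M1 evaluates to 1.0 while M2 evaluates to 0.5.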

In another embodiment, M1 and M2 may be calculated with different weights, that is:

M1=k3/k1*c1+(1-k3/k1*c1)*[k6/k4*c2+(1-k6/k4*c2)*k9/k7]

M2=k3/k2*c3+(1-k3/k2*c3)*[k6/k5*c4+(1-k6/k5*c4)*k9/k8]

Here c1, c2, c3, and c4 are weights for which suitable empirical values may be chosen, subject to c1>0, c2>0, 1-k3/k1*c1>0, 1-k6/k4*c2>0, c3>0, c4>0, 1-k3/k2*c3>0, 1-k6/k5*c4>0.

The above statistical method for text similarity takes the paragraphs, sentences, and words of a text as successive scales, applies a split-compare-delete procedure to the texts, and then calculates the comprehensive similarity between them. This reflects fairly accurately the degree of similarity between texts whose word and sentence order has been deliberately shuffled, so that similar texts with deliberately shuffled word order, sentence order, or paragraph order can also be detected.

In this embodiment, the following step is also performed after step S170:

Judge whether the comprehensive similarity of text T1 and text T2 is greater than a similarity threshold θ and whether the comprehensive similarity of text T2 and text T1 is greater than the similarity threshold θ. If either of the two is greater than θ, text T1 is judged to be similar to text T2. The similarity threshold θ may be an empirical value, and its value depends on c1 and c2.

In other embodiments, only one comprehensive similarity may be calculated (for example, the comprehensive similarity of text T1 and text T2), and only that comprehensive similarity is compared with the similarity threshold θ. This applies, for example, when text T1 is the one of the two texts that is suspected of plagiarism.
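The two judging variants can be sketched as follows. The function names and the threshold value 0.6 are hypothetical; the source only says that θ is an empirical value.

```python
def is_similar(m1, m2, theta):
    # Two-way variant: similar if either direction exceeds the threshold.
    return m1 > theta or m2 > theta


def is_similar_one_way(m1, theta):
    # One-way variant: only the similarity of the suspected text is tested.
    return m1 > theta


THETA = 0.6  # illustrative value, not taken from the source

print(is_similar(1.0, 0.5, THETA))     # one direction is above the threshold
print(is_similar(0.3, 0.5, THETA))     # neither direction is above it
print(is_similar_one_way(0.5, THETA))
```

The two-way test catches the case where a short text is copied in full from a longer one, since then only one of the two similarities is high.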

In other embodiments, the division scales used to divide the two texts into text fragments may differ from those of Embodiment one, for example going directly from paragraphs to words, or directly from sentences to words, or using division scales other than paragraphs, sentences, and words. Two corresponding embodiments are given below:

Embodiment two:

S210, obtain the text T1 and the text T2 whose similarity is to be determined.

S220, divide text T1 and text T2 into paragraphs, compare every paragraph of text T1 with every paragraph of text T2, and denote the number of identical paragraphs as k3.

In this embodiment, the number of paragraphs in text T1 is denoted k1 and the number of paragraphs in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, the i-th paragraph of text T1 is compared with the j-th paragraph of text T2 to check whether they are identical, and the number of identical paragraphs is denoted k3.

S230, delete the identical paragraphs from text T1 and text T2; text T1 becomes text T3 after the deletion, and text T2 becomes text T4.

S240, divide text T3 and text T4 into words, compare every word of text T3 with every word of text T4, and denote the number of identical words as k6.

In this embodiment, the number of words in text T3 is denoted k4 and the number of words in text T4 is denoted k5. For i from 1 to k4 and j from 1 to k5, the i-th word of text T3 is compared with the j-th word of text T4 to check whether they are identical, and the number of identical words is denoted k6.

S250, calculate the comprehensive similarity of text T1 and text T2, and the comprehensive similarity of text T2 and text T1.

In this embodiment, the comprehensive similarity M1 of text T1 and text T2 and the comprehensive similarity M2 of text T2 and text T1 are calculated by the following formulas:

M1=k3/k1*c1+(1-k3/k1*c1)*k6/k4

M2=k3/k2*c1+(1-k3/k2*c1)*k6/k5

Here c1 is the weight of the paragraph scale in the comprehensive similarity. A suitable empirical value may be chosen, provided that c1>0, 1-k3/k1*c1>0, and 1-k3/k2*c1>0.
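The two-scale formula of this embodiment can be sketched as a short function. The name `two_scale_similarity`, the sample counts, and the handling of a zero denominator are assumptions made for illustration.

```python
def two_scale_similarity(k3, k_first, k6, k_second, c1=1.0):
    """M = k3/k_first*c1 + (1 - k3/k_first*c1) * k6/k_second."""
    # Assumption: a ratio with a zero denominator is treated as 0.
    x = (k3 / k_first if k_first else 0.0) * c1
    y = k6 / k_second if k_second else 0.0
    return x + (1 - x) * y


# Illustrative counts: 1 of 4 paragraphs matches, then 3 of 6 remaining words.
m_full_weight = two_scale_similarity(1, 4, 3, 6, c1=1.0)
m_half_weight = two_scale_similarity(1, 4, 3, 6, c1=0.5)
print(m_full_weight, m_half_weight)
```

With c1 = 1 the result is 0.625, and with c1 = 0.5 it is 0.5625, which shows how lowering c1 reduces the contribution of the paragraph scale.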

In this embodiment, the following step is also performed after step S250:

Judge whether the comprehensive similarity of text T1 and text T2 is greater than a similarity threshold θ and whether the comprehensive similarity of text T2 and text T1 is greater than the similarity threshold θ. If either of the two is greater than θ, text T1 is judged to be similar to text T2. The similarity threshold θ may be an empirical value, and its value depends on c1.

In other embodiments, only one comprehensive similarity may be calculated (for example, the comprehensive similarity of text T1 and text T2), and only that comprehensive similarity is compared with the similarity threshold θ.

Embodiment three:

S310, obtain the text T1 and the text T2 whose similarity is to be determined.

S320, divide text T1 and text T2 into sentences, compare every sentence of text T1 with every sentence of text T2, and denote the number of identical sentences as k3.

In this embodiment, the number of sentences in text T1 is denoted k1 and the number of sentences in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, the i-th sentence of text T1 is compared with the j-th sentence of text T2 to check whether they are identical, and the number of identical sentences is denoted k3.

S330, delete the identical sentences from text T1 and text T2; text T1 becomes text T3 after the deletion, and text T2 becomes text T4.

S340, divide text T3 and text T4 into words, compare every word of text T3 with every word of text T4, and denote the number of identical words as k6.

S350, calculate the comprehensive similarity of text T1 and text T2, and the comprehensive similarity of text T2 and text T1.

M1=k3/k1*c1+(1-k3/k1*c1)*k6/k4

M2=k3/k2*c1+(1-k3/k2*c1)*k6/k5

Here c1 is the weight of the sentence scale in the comprehensive similarity. A suitable empirical value may be chosen, provided that c1>0, 1-k3/k1*c1>0, and 1-k3/k2*c1>0.
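The sentence-to-word variant can be tried on a text whose sentence order and word order have been shuffled. The sketch below is illustrative only: the helper names are hypothetical, sentences are split on terminal punctuation, words are split on whitespace and lowercased (an assumption, since the source leaves word segmentation to prior-art algorithms), and a ratio with a zero denominator is treated as 0.

```python
import re
from collections import Counter


def split_sentences(text):
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]


def match_and_remove(a, b):
    """Return (number of shared fragments, rest of a, rest of b)."""
    common = Counter(a) & Counter(b)

    def rest(frags):
        budget, out = Counter(common), []
        for f in frags:
            if budget[f] > 0:
                budget[f] -= 1
            else:
                out.append(f)
        return out

    return sum(common.values()), rest(a), rest(b)


def sentence_word_similarity(t1, t2, c1=1.0):
    s1, s2 = split_sentences(t1), split_sentences(t2)
    k1 = len(s1)
    k3, r1, r2 = match_and_remove(s1, s2)               # S320, S330
    w1 = [w.lower() for s in r1 for w in s.split()]
    w2 = [w.lower() for s in r2 for w in s.split()]
    k4 = len(w1)
    k6, _, _ = match_and_remove(w1, w2)                 # S340
    x = (k3 / k1 if k1 else 0.0) * c1
    y = k6 / k4 if k4 else 0.0
    return x + (1 - x) * y                              # M1 of S350


original = "The sky is blue. Birds sing at dawn. Rivers run to the sea."
shuffled = "Rivers run to the sea. The sky is blue. At dawn birds sing."
unrelated = "Cats sleep all day."

sim_shuffled = sentence_word_similarity(original, shuffled)
sim_unrelated = sentence_word_similarity(original, unrelated)
print(sim_shuffled, sim_unrelated)
```

Two of the three sentences match exactly despite the reordering, and the words of the third sentence all reappear in the shuffled text, so the shuffled pair scores 1.0 while the unrelated pair scores 0.0.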

In this embodiment, a further step is also performed after step S350:

The embodiments described above represent only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make a number of variations and improvements without departing from the inventive concept, all of which fall within the scope of protection of the present invention. The scope of protection of this patent is therefore defined by the appended claims.

Claims

1. A statistical method for text similarity, comprising:

obtaining a first text and a second text whose similarity is to be determined;

dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments identical in the first text and the second text at the first division scale to the total number of text fragments of the first text;

deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively;

dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments identical in the first remaining text and the second remaining text at the second division scale to the total number of text fragments of the first remaining text, the second division scale being smaller than the first division scale; and

multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, subtracting the similarity at the first division scale from one, multiplying the result by y1, and then adding the similarity at the first division scale, to calculate the comprehensive similarity of the first text and the second text.

2. The statistical method for text similarity according to claim 1, characterized in that the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.

3. The statistical method for text similarity according to claim 1, characterized in that the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into sentences, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.

4. The statistical method for text similarity according to claim 1, characterized in that the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into sentences;

the statistical method for text similarity further comprises the step of deleting the identical sentences from the first remaining text and the second remaining text to obtain a text T5 and a text T6, respectively, dividing text T5 and text T6 into words, comparing all words of text T5 with all words of text T6, and calculating the ratio z1 of the number of words identical in text T5 and text T6 to the total number of words in text T5;

the step of calculating the comprehensive similarity of the first text and the second text uses the following formula: comprehensive similarity M1 = x1*c1 + (1 - x1*c1) * [y1*c2 + (1 - y1*c2) * z1], where c1 is the weight of the paragraph scale in the comprehensive similarity and c2 is the weight of the sentence scale in the comprehensive similarity.

5. The statistical method for text similarity according to any one of claims 1-4, characterized by further comprising the step of judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and, if so, judging that the first text is similar to the second text.

6. The statistical method for text similarity according to any one of claims 1-3, characterized by further comprising the following steps:

calculating the ratio x2 of the number of text fragments identical in the first text and the second text at the first division scale to the total number of text fragments of the second text;

calculating the ratio y2 of the number of text fragments identical in the first remaining text and the second remaining text at the second division scale to the total number of text fragments of the second remaining text;

multiplying x2 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, subtracting the similarity at the first division scale from one, multiplying the result by y2, and then adding the similarity at the first division scale, to calculate the comprehensive similarity of the second text and the first text; and

judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and whether the comprehensive similarity of the second text and the first text is greater than the similarity threshold, and, if either of the two is greater than the similarity threshold, judging that the first text is similar to the second text.

7. A statistical system for text similarity, characterized by comprising:

a reading module for obtaining a first text and a second text whose similarity is to be determined;

a first segmentation comparison module for dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments identical in the first text and the second text at the first division scale to the total number of text fragments of the first text;

a first deletion module for deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively;

a second segmentation comparison module for dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments identical in the first remaining text and the second remaining text at the second division scale to the total number of text fragments of the first remaining text, the second division scale being smaller than the first division scale; and

a comprehensive similarity calculation module for multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, subtracting the similarity at the first division scale from one, multiplying the result by y1, and then adding the similarity at the first division scale, to calculate the comprehensive similarity of the first text and the second text.

8. The statistical system for text similarity according to claim 7, characterized by further comprising a judging module for judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and, if so, judging that the first text is similar to the second text.