Summary of the invention
In view of this, conventional statistical methods for text similarity have difficulty accurately reflecting the degree of similarity between texts whose word, sentence, or paragraph order has been deliberately scrambled. It is therefore necessary to provide a statistical method for text similarity that reflects the degree of similarity between such texts more accurately.
A statistical method for text similarity comprises: obtaining a first text and a second text whose similarity is to be determined; dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments that are identical in the first text and the second text to the total number of text fragments in the first text; deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively; dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments that are identical in the first remaining text and the second remaining text to the total number of text fragments in the first remaining text, the second division scale being smaller than the first division scale; and multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, multiplying one minus the similarity at the first division scale by y1, and adding the similarity at the first division scale to the product, thereby calculating the comprehensive similarity of the first text to the second text.
In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.
In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into sentences, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.
In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs, and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into sentences. The statistical method further comprises the step of deleting the identical sentences from the first remaining text and the second remaining text to obtain text T5 and text T6, respectively, dividing text T5 and text T6 into words, comparing all words of text T5 with all words of text T6, and calculating the ratio z1 of the number of words that are identical in text T5 and text T6 to the total number of words in text T5. The comprehensive similarity of the first text to the second text is then calculated by the following formula: M1 = x1*c1 + (1 - x1*c1)*[y1*c2 + (1 - y1*c2)*z1], where c1 is the weight of the paragraph scale in the comprehensive similarity and c2 is the weight of the sentence scale in the comprehensive similarity.
In one embodiment, the method further comprises the step of judging whether the comprehensive similarity of the first text to the second text is greater than a similarity threshold and, if so, judging that the first text is similar to the second text.
In one embodiment, the method further comprises the following steps: calculating the ratio x2 of the number of text fragments that are identical in the first text and the second text at the first division scale to the total number of text fragments in the second text; calculating the ratio y2 of the number of text fragments that are identical in the first remaining text and the second remaining text at the second division scale to the total number of text fragments in the second remaining text; multiplying x2 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, multiplying one minus the similarity at the first division scale by y2, and adding the similarity at the first division scale to the product, thereby calculating the comprehensive similarity of the second text to the first text; and judging whether the comprehensive similarity of the first text to the second text is greater than a similarity threshold and whether the comprehensive similarity of the second text to the first text is greater than the similarity threshold, and, if either of the two is greater than the similarity threshold, judging that the first text is similar to the second text.
The present invention also provides a corresponding statistical system for text similarity.
A statistical system for text similarity comprises: a reading module for obtaining a first text and a second text whose similarity is to be determined; a first segmentation comparison module for dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments that are identical in the first text and the second text to the total number of text fragments in the first text; a first deletion module for deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively; a second segmentation comparison module for dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments that are identical in the first remaining text and the second remaining text to the total number of text fragments in the first remaining text, the second division scale being smaller than the first division scale; and a comprehensive similarity calculation module for multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, multiplying one minus the similarity at the first division scale by y1, and adding the similarity at the first division scale to the product, thereby calculating the comprehensive similarity of the first text to the second text.
In one embodiment, the system further comprises a judging module for judging whether the comprehensive similarity of the first text to the second text is greater than a similarity threshold and, if so, judging that the first text is similar to the second text.
The above statistical method and system for text similarity take the paragraph, sentence, and word of a text in turn as the division scale, apply a split, compare, and delete procedure to the texts, and then calculate the comprehensive similarity between them. They can therefore reflect fairly accurately the degree of similarity between texts whose word, sentence, or paragraph order has been deliberately scrambled, so that similar texts produced by deliberately shuffling word order, sentence order, or paragraph order can still be detected.
Embodiment
To make the objects, features, and advantages of the present invention clearer, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Embodiment one:
Fig. 1 is a flowchart of the statistical method for text similarity in one embodiment. The method comprises the following steps:
S110, obtain text T1 and text T2 whose similarity is to be determined.
S120, divide text T1 and text T2 into paragraphs, compare all paragraphs of text T1 with all paragraphs of text T2, and denote the number of identical paragraphs as k3.
In this embodiment, the number of paragraphs in text T1 is denoted k1 and the number of paragraphs in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, the i-th paragraph of text T1 is compared with the j-th paragraph of text T2 to determine whether they are identical, and the number of identical paragraphs is denoted k3.
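Step S120 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, paragraphs are assumed to be separated by newlines, and identical paragraphs are assumed to be matched one-to-one (a multiset intersection), which the text does not specify for repeated paragraphs.

```python
from collections import Counter


def split_paragraphs(text):
    # Assumption: one paragraph per line; blank lines are ignored.
    return [p.strip() for p in text.split("\n") if p.strip()]


def count_identical(fragments_a, fragments_b):
    # Multiset intersection: each fragment is matched at most once per side,
    # so the count is the same whichever text is taken first.
    common = Counter(fragments_a) & Counter(fragments_b)
    return sum(common.values())


t1 = "alpha\nbeta\ngamma"
t2 = "gamma\ndelta\nalpha\nepsilon"
p1, p2 = split_paragraphs(t1), split_paragraphs(t2)
k1, k2 = len(p1), len(p2)          # paragraph counts of T1 and T2
k3 = count_identical(p1, p2)       # number of identical paragraphs
```

Because the comparison ignores position, the shuffled paragraphs "alpha" and "gamma" are still counted as identical.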
S130, delete the identical paragraphs from text T1 and text T2; text T1 becomes text T3 after deletion, and text T2 becomes text T4 after deletion.
Each identical paragraph found by the comparison in step S120 is deleted from text T1 and text T2, yielding text T3 and text T4, respectively. After deletion, no identical paragraph remains between text T3 and text T4.
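The deletion in step S130 might look like the sketch below. The helper name `remove_identical` is hypothetical, and the one-to-one matching of repeated paragraphs is an assumption; under that assumption the two remainders share no paragraph, as the step requires.

```python
from collections import Counter


def remove_identical(fragments_a, fragments_b):
    # Fragments present in both lists, matched one-to-one (assumption).
    common = Counter(fragments_a) & Counter(fragments_b)

    def strip(fragments):
        budget = Counter(common)   # how many copies of each fragment to delete
        kept = []
        for fragment in fragments:
            if budget[fragment] > 0:
                budget[fragment] -= 1   # delete this matched copy
            else:
                kept.append(fragment)   # keep unmatched fragments in order
        return kept

    return strip(fragments_a), strip(fragments_b)


t3, t4 = remove_identical(["alpha", "beta", "gamma"],
                          ["gamma", "delta", "alpha", "epsilon"])
```

Here `t3` and `t4` play the roles of text T3 and text T4, held as lists of remaining paragraphs.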
S140, divide text T3 and text T4 into sentences, compare all sentences of text T3 with all sentences of text T4, and denote the number of identical sentences as k6.
In this embodiment, the number of sentences in text T3 is denoted k4 and the number of sentences in text T4 is denoted k5. For i from 1 to k4 and j from 1 to k5, the i-th sentence of text T3 is compared with the j-th sentence of text T4 to determine whether they are identical, and the number of identical sentences is denoted k6.
S150, delete the identical sentences from text T3 and text T4; text T3 becomes text T5 after deletion, and text T4 becomes text T6 after deletion.
Each identical sentence found by the comparison in step S140 is deleted from text T3 and text T4, yielding text T5 and text T6, respectively. After deletion, no identical sentence remains between text T5 and text T6.
S160, divide text T5 and text T6 into words, compare all words of text T5 with all words of text T6, and denote the number of identical words as k9.
The division into words may use any word segmentation algorithm of the prior art. In this embodiment, the number of words in text T5 is denoted k7 and the number of words in text T6 is denoted k8. For i from 1 to k7 and j from 1 to k8, the i-th word of text T5 is compared with the j-th word of text T6 to determine whether they are identical, and the number of identical words is denoted k9.
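A rough sketch of step S160 follows. The regular-expression tokenizer is only a stand-in that suits space-delimited languages; the text leaves the segmentation algorithm open, and Chinese text would need a dedicated word segmenter instead. The name `split_words` and the one-to-one matching of repeated words are assumptions.

```python
import re
from collections import Counter


def split_words(text):
    # Stand-in tokenizer (assumption): runs of word characters are words.
    return re.findall(r"\w+", text)


w5 = split_words("the quick brown fox")       # words of text T5
w6 = split_words("fox jumps over the dog")    # words of text T6
k7, k8 = len(w5), len(w6)
# Identical words, matched one-to-one regardless of position.
k9 = sum((Counter(w5) & Counter(w6)).values())
```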
S170, calculate the comprehensive similarity of text T1 to text T2 and the comprehensive similarity of text T2 to text T1.
The comprehensive similarity M1 of text T1 to text T2 is calculated by the following formula:
M1=k3/k1*c1+(1-k3/k1*c1)*[k6/k4*c2+(1-k6/k4*c2)*k9/k7]
The comprehensive similarity M2 of text T2 to text T1 is calculated by the following formula:
M2=k3/k2*c1+(1-k3/k2*c1)*[k6/k5*c2+(1-k6/k5*c2)*k9/k8]
Here c1 is the weight of the paragraph scale in the comprehensive similarity, and c2 is the weight of the sentence scale in the comprehensive similarity. Suitable empirical values may be chosen to adjust the proportion that each division scale contributes to the comprehensive similarity, provided that c1>0, 1-k3/k1*c1>0, 1-k3/k2*c1>0, c2>0, 1-k6/k4*c2>0, and 1-k6/k5*c2>0.
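The formula for M1 can be written as a small function, sketched below. The function and parameter names are hypothetical, and returning 0 for a ratio whose denominator is zero (an empty remaining text) is an assumption, since the text does not say how that case is handled. M2 is obtained by passing k2, k5, and k8 as the denominators instead.

```python
def ratio(same, total):
    # Assumption: an empty text yields a ratio of 0 instead of a division error.
    return same / total if total else 0.0


def comprehensive_similarity(k3, k1, k6, k4, k9, k7, c1=1.0, c2=1.0):
    paragraph_part = ratio(k3, k1) * c1   # weighted paragraph-scale similarity
    sentence_part = ratio(k6, k4) * c2    # weighted sentence-scale similarity
    word_part = ratio(k9, k7)             # word-scale ratio carries no weight
    return paragraph_part + (1 - paragraph_part) * (
        sentence_part + (1 - sentence_part) * word_part
    )


# Half the paragraphs, a quarter of the sentences, half the words identical.
m_default = comprehensive_similarity(2, 4, 1, 4, 5, 10)
m_low_c1 = comprehensive_similarity(2, 4, 1, 4, 5, 10, c1=0.5)
```

With the illustrative counts above, lowering c1 from 1 to 0.5 reduces the result from 0.8125 to 0.71875, which shows how the weight shifts the contribution of the paragraph scale.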
In one embodiment, c1=c2=1, and the comprehensive similarity of text T1 to text T2 is then:
M1=k3/k1+(1-k3/k1)*[k6/k4+(1-k6/k4)*k9/k7]
The comprehensive similarity of text T2 to text T1 is:
M2=k3/k2+(1-k3/k2)*[k6/k5+(1-k6/k5)*k9/k8]
The comprehensive similarity of text T1 to text T2 need not equal the comprehensive similarity of text T2 to text T1. For example, if text T1 is one half of text T2, then all of text T1 can be found in text T2, whereas only half of text T2 can be found in text T1. In this case the comprehensive similarity of text T1 to text T2 is clearly greater than the comprehensive similarity of text T2 to text T1.
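The asymmetry can be checked numerically with c1=c2=1. The counts below are invented for illustration: text T2 is assumed to have four paragraphs, two of which make up text T1, and the two leftover paragraphs of T2 are assumed to hold 6 sentences and 40 words. Treating a ratio with a zero denominator as 0 is also an assumption.

```python
def ratio(same, total):
    # Assumption: an empty text yields a ratio of 0 instead of a division error.
    return same / total if total else 0.0


def similarity_unit_weights(k_par, n_par, k_sen, n_sen, k_word, n_word):
    # The c1 = c2 = 1 form of the formula.
    x = ratio(k_par, n_par)
    y = ratio(k_sen, n_sen)
    z = ratio(k_word, n_word)
    return x + (1 - x) * (y + (1 - y) * z)


# T1 to T2: both paragraphs of T1 are found in T2, nothing of T1 remains.
m1 = similarity_unit_weights(2, 2, 0, 0, 0, 0)
# T2 to T1: two of four paragraphs match; the rest of T2 has no counterpart.
m2 = similarity_unit_weights(2, 4, 0, 6, 0, 40)
```

Under these assumptions M1 is 1.0 and M2 is 0.5.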
In another embodiment, M1 and M2 may be calculated with different weights, that is:
M1=k3/k1*c1+(1-k3/k1*c1)*[k6/k4*c2+(1-k6/k4*c2)*k9/k7]
M2=k3/k2*c3+(1-k3/k2*c3)*[k6/k5*c4+(1-k6/k5*c4)*k9/k8]
Here c1, c2, c3, and c4 are weights for which suitable empirical values may be chosen, subject to c1>0, c2>0, 1-k3/k1*c1>0, 1-k6/k4*c2>0, c3>0, c4>0, 1-k3/k2*c3>0, and 1-k6/k5*c4>0.
The above statistical method for text similarity takes the paragraph, sentence, and word of a text in turn as the division scale, applies a split, compare, and delete procedure to the texts, and then calculates the comprehensive similarity between them. It can therefore reflect fairly accurately the degree of similarity between texts whose word, sentence, or paragraph order has been deliberately scrambled, so that similar texts produced by deliberately shuffling word order, sentence order, or paragraph order can still be detected.
In this embodiment, the following step is also performed after step S170:
Judge whether the comprehensive similarity of text T1 to text T2 is greater than a similarity threshold θ and whether the comprehensive similarity of text T2 to text T1 is greater than the similarity threshold θ; if either of the two is greater than θ, text T1 is judged to be similar to text T2. The similarity threshold θ may be an empirical value, and its value depends on c1 and c2.
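The judgment step reduces to a strict comparison against θ in either direction, as in this sketch. The function name and the value 0.6 are purely illustrative; the text only says that θ is chosen empirically.

```python
def is_similar(m1, m2, theta):
    # Similar if either directed similarity strictly exceeds the threshold.
    return m1 > theta or m2 > theta


theta = 0.6  # hypothetical empirical threshold
one_direction_high = is_similar(0.9, 0.4, theta)
both_low = is_similar(0.5, 0.4, theta)
```

Taking the larger of the two directions means a short text copied wholesale into a much longer one is still flagged, even though the longer text has a low similarity to the shorter one.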
In other embodiments, only one comprehensive similarity may be calculated (for example, the comprehensive similarity of text T1 to text T2), and only that comprehensive similarity is compared with the similarity threshold θ. This applies, for example, when one of the two texts, say text T1, has already been identified as the one suspected of plagiarism.
In other embodiments, the division scales used to divide the two texts under comparison into text fragments may differ from those of embodiment one: for example, going directly from paragraphs to words, or directly from sentences to words, or using division scales other than paragraph, sentence, and word. Two corresponding embodiments are given below:
Embodiment two:
S210, obtain text T1 and text T2 whose similarity is to be determined.
S220, divide text T1 and text T2 into paragraphs, compare all paragraphs of text T1 with all paragraphs of text T2, and denote the number of identical paragraphs as k3.
In this embodiment, the number of paragraphs in text T1 is denoted k1 and the number of paragraphs in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, the i-th paragraph of text T1 is compared with the j-th paragraph of text T2 to determine whether they are identical, and the number of identical paragraphs is denoted k3.
S230, delete the identical paragraphs from text T1 and text T2; text T1 becomes text T3 after deletion, and text T2 becomes text T4 after deletion.
S240, divide text T3 and text T4 into words, compare all words of text T3 with all words of text T4, and denote the number of identical words as k6.
In this embodiment, the number of words in text T3 is denoted k4 and the number of words in text T4 is denoted k5. For i from 1 to k4 and j from 1 to k5, the i-th word of text T3 is compared with the j-th word of text T4 to determine whether they are identical, and the number of identical words is denoted k6.
S250, calculate the comprehensive similarity of text T1 to text T2 and the comprehensive similarity of text T2 to text T1.
In this embodiment, the comprehensive similarity M1 of text T1 to text T2 is calculated by the following formula:
M1=k3/k1*c1+(1-k3/k1*c1)*k6/k4
The comprehensive similarity M2 of text T2 to text T1 is calculated by the following formula:
M2=k3/k2*c1+(1-k3/k2*c1)*k6/k5
Here c1 is the weight of the paragraph scale in the comprehensive similarity. A suitable empirical value may be chosen, provided that c1>0, 1-k3/k1*c1>0, and 1-k3/k2*c1>0.
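Steps S210 to S250 can be strung together as in the sketch below. It is an illustration under stated assumptions rather than the patented implementation: all function names are hypothetical, paragraphs are taken to be newline-separated, words are found with a simple regular expression, repeated fragments are matched one-to-one, and a ratio with a zero denominator is treated as 0.

```python
import re
from collections import Counter


def split_paragraphs(text):
    return [p.strip() for p in text.split("\n") if p.strip()]


def split_words(text):
    return re.findall(r"\w+", text)  # stand-in tokenizer (assumption)


def match_and_remove(a, b):
    """Return (number of identical fragments, remainder of a, remainder of b)."""
    common = Counter(a) & Counter(b)

    def strip(fragments):
        budget, kept = Counter(common), []
        for fragment in fragments:
            if budget[fragment] > 0:
                budget[fragment] -= 1
            else:
                kept.append(fragment)
        return kept

    return sum(common.values()), strip(a), strip(b)


def ratio(same, total):
    return same / total if total else 0.0


def paragraph_word_similarity(t1, t2, c1=1.0):
    p1, p2 = split_paragraphs(t1), split_paragraphs(t2)        # S220
    k1, k2 = len(p1), len(p2)
    k3, rest1, rest2 = match_and_remove(p1, p2)                # S230
    w3, w4 = split_words(" ".join(rest1)), split_words(" ".join(rest2))  # S240
    k4, k5 = len(w3), len(w4)
    k6, _, _ = match_and_remove(w3, w4)
    m1 = ratio(k3, k1) * c1 + (1 - ratio(k3, k1) * c1) * ratio(k6, k4)   # S250
    m2 = ratio(k3, k2) * c1 + (1 - ratio(k3, k2) * c1) * ratio(k6, k5)
    return m1, m2


t1 = "red green blue\none two three"
t2 = "one two three\nblue red yellow green"
m1, m2 = paragraph_word_similarity(t1, t2)
```

In this example the paragraphs are swapped and the words of one paragraph are reordered with one word inserted; M1 is 1.0 and M2 is 0.875, so the rearranged copy is still recognized.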
In this embodiment, the following step is also performed after step S250:
Judge whether the comprehensive similarity of text T1 to text T2 is greater than a similarity threshold θ and whether the comprehensive similarity of text T2 to text T1 is greater than the similarity threshold θ; if either of the two is greater than θ, text T1 is judged to be similar to text T2. The similarity threshold θ may be an empirical value, and its value depends on c1.
In other embodiments, only one comprehensive similarity may be calculated (for example, the comprehensive similarity of text T1 to text T2), and only that comprehensive similarity is compared with the similarity threshold θ.
Embodiment three:
S310, obtain text T1 and text T2 whose similarity is to be determined.
S320, divide text T1 and text T2 into sentences, compare all sentences of text T1 with all sentences of text T2, and denote the number of identical sentences as k3.
In this embodiment, the number of sentences in text T1 is denoted k1 and the number of sentences in text T2 is denoted k2. For i from 1 to k1 and j from 1 to k2, the i-th sentence of text T1 is compared with the j-th sentence of text T2 to determine whether they are identical, and the number of identical sentences is denoted k3.
S330, delete the identical sentences from text T1 and text T2; text T1 becomes text T3 after deletion, and text T2 becomes text T4 after deletion.
S340, divide text T3 and text T4 into words, compare all words of text T3 with all words of text T4, and denote the number of identical words as k6.
In this embodiment, the number of words in text T3 is denoted k4 and the number of words in text T4 is denoted k5. For i from 1 to k4 and j from 1 to k5, the i-th word of text T3 is compared with the j-th word of text T4 to determine whether they are identical, and the number of identical words is denoted k6.
S350, calculate the comprehensive similarity of text T1 to text T2 and the comprehensive similarity of text T2 to text T1.
In this embodiment, the comprehensive similarity M1 of text T1 to text T2 is calculated by the following formula:
M1=k3/k1*c1+(1-k3/k1*c1)*k6/k4
The comprehensive similarity M2 of text T2 to text T1 is calculated by the following formula:
M2=k3/k2*c1+(1-k3/k2*c1)*k6/k5
Here c1 is the weight of the sentence scale in the comprehensive similarity. A suitable empirical value may be chosen, provided that c1>0, 1-k3/k1*c1>0, and 1-k3/k2*c1>0.
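Since the embodiments differ only in which division scales are used, the split, compare, and delete cascade can be sketched once and parameterized by a list of splitters, shown here for the sentence-to-word case. Everything in this sketch is an assumption made for illustration: the function names, the punctuation-based sentence splitter, the regular-expression tokenizer, the one-to-one matching of repeated fragments, and the treatment of an empty text as a ratio of 0.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: sentences end at . ! ? or their full-width forms.
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]


def split_words(text):
    return re.findall(r"\w+", text)


def cascade_similarity(t1, t2, splitters, weights):
    """Directed similarity of t1 to t2, coarsest splitter first.

    weights holds one weight per scale except the last (finest) one.
    """
    rest1, rest2, ratios = t1, t2, []
    for split in splitters:
        a, b = split(rest1), split(rest2)
        common = Counter(a) & Counter(b)
        ratios.append(sum(common.values()) / len(a) if a else 0.0)
        # Delete the identical fragments before moving to the finer scale.
        rest1 = "\n".join((Counter(a) - common).elements())
        rest2 = "\n".join((Counter(b) - common).elements())
    result = ratios[-1]
    # Fold from the finest scale outward: s + (1 - s) * inner.
    for r, c in zip(reversed(ratios[:-1]), reversed(weights)):
        result = r * c + (1 - r * c) * result
    return result


m1 = cascade_similarity("Cats sleep. Dogs bark loudly.",
                        "Dogs bark loudly! Cats nap.",
                        [split_sentences, split_words], [1.0])
```

One of the two sentences matches despite the changed order and punctuation, and one of the two remaining words matches, so M1 is 0.5 + 0.5 * 0.5 = 0.75.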
In this embodiment, the following step is also performed after step S350:
Judge whether the comprehensive similarity of text T1 to text T2 is greater than a similarity threshold θ and whether the comprehensive similarity of text T2 to text T1 is greater than the similarity threshold θ; if either of the two is greater than θ, text T1 is judged to be similar to text T2. The similarity threshold θ may be an empirical value, and its value depends on c1.
In other embodiments, only one comprehensive similarity may be calculated (for example, the comprehensive similarity of text T1 to text T2), and only that comprehensive similarity is compared with the similarity threshold θ.
The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the concept of the invention, and all of these fall within the scope of protection of the invention. The scope of protection of this patent is therefore defined by the appended claims.