CN103176962B - Statistical method and system for text similarity - Google Patents

Statistical method and system for text similarity

Info

Publication number
CN103176962B
Authority
CN
China
Prior art keywords
text
similarity
residue
yardstick
fragments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310074669.0A
Other languages
Chinese (zh)
Other versions
CN103176962A (en)
Inventor
朱定局
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Internet Service Co ltd
Ourchem Information Consulting Co ltd
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310074669.0A
Publication of CN103176962A
Application granted
Publication of CN103176962B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a statistical method for text similarity, comprising: obtaining a first text and a second text whose similarity is to be determined; dividing the first and second texts into text fragments at a first division scale, and calculating the ratio of the number of text fragments identical in the first and second texts to the total number of text fragments of the first text at the first division scale; deleting the identical text fragments from the first and second texts to obtain a first remaining text and a second remaining text, respectively; dividing the first and second remaining texts into text fragments at a second division scale, and calculating the ratio of the number of text fragments identical in the first and second remaining texts to the total number of text fragments of the first remaining text at the second division scale; and calculating the comprehensive similarity of the first text and the second text. The invention reflects fairly accurately the degree of similarity between texts whose word and sentence order has been deliberately shuffled, so that similar texts whose word order, sentence order, or paragraph order has been deliberately shuffled can be detected.

Description

Statistical method and system for text similarity
Technical field
The present invention relates to text processing, in particular to a statistical method for text similarity, and also to a statistical system for text similarity.
Background
In the prior art, the similarity of two texts is generally judged by segmenting both texts into words and then finding the strings of words and sentences that repeat, in order, in the two texts.
However, if the order of the words and sentences in a text has been deliberately shuffled, the similarity obtained by existing statistical methods is low even between texts that are in fact similar (for example, plagiarized), and it fails to reflect their actual degree of similarity.
Summary of the invention
In view of this, to solve the problem that traditional statistical methods for text similarity cannot accurately reflect the degree of similarity between texts whose word and sentence order has been deliberately shuffled, it is necessary to provide a statistical method for text similarity that reflects that degree of similarity fairly accurately.
A statistical method for text similarity comprises: obtaining a first text and a second text whose similarity is to be determined; dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments identical in the first and second texts to the total number of text fragments of the first text; deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively; dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments identical in the first and second remaining texts to the total number of text fragments of the first remaining text, the second division scale being smaller than the first division scale; and multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, multiplying one minus that similarity by y1, and adding the similarity at the first division scale, thereby obtaining the comprehensive similarity of the first text and the second text.
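The final calculation step is stated in words; written as a formula it reads as below. The symbol c_1 for the weight of the first division scale follows the notation the embodiments use for the paragraph weight, and s_1 is a name introduced here only for the intermediate "similarity at the first division scale".

```latex
s_1 = c_1 x_1, \qquad
M_1 = s_1 + (1 - s_1)\, y_1 = c_1 x_1 + (1 - c_1 x_1)\, y_1
```

Read this way, the second-scale ratio y1 only contributes to the share of the score that the first scale left unexplained.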
In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs; and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.
In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into sentences; and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.
In one embodiment, the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs; the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into sentences; the method further comprises the step of deleting the identical sentences from the first remaining text and the second remaining text to obtain text T5 and text T6, respectively, dividing text T5 and text T6 into words, comparing all words of text T5 with all words of text T6, and calculating the ratio z1 of the number of words identical in text T5 and text T6 to the total number of words in text T5; and the comprehensive similarity of the first text and the second text is calculated by the following formula: comprehensive similarity M1 = x1*c1 + (1-x1*c1)*[y1*c2 + (1-y1*c2)*z1], where c1 is the weight of the paragraph scale in the comprehensive similarity and c2 is the weight of the sentence scale in the comprehensive similarity.
In one embodiment, the method further comprises the step of judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and, if so, determining that the first text is similar to the second text.
In one embodiment, the method further comprises the following steps: calculating the ratio x2 of the number of text fragments identical in the first and second texts at the first division scale to the total number of text fragments of the second text; calculating the ratio y2 of the number of text fragments identical in the first and second remaining texts at the second division scale to the total number of text fragments of the second remaining text; multiplying x2 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, multiplying one minus that similarity by y2, and adding the similarity at the first division scale, thereby obtaining the comprehensive similarity of the second text and the first text; and judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and whether the comprehensive similarity of the second text and the first text is greater than the similarity threshold, and, if either is greater than the similarity threshold, determining that the first text is similar to the second text.
The present invention also provides a corresponding statistical system for text similarity.
A statistical system for text similarity comprises: a reading module for obtaining a first text and a second text whose similarity is to be determined; a first segmentation comparison module for dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments identical in the first and second texts to the total number of text fragments of the first text; a first deletion module for deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively; a second segmentation comparison module for dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments identical in the first and second remaining texts to the total number of text fragments of the first remaining text, the second division scale being smaller than the first division scale; and a comprehensive similarity calculation module for multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, multiplying one minus that similarity by y1, and adding the similarity at the first division scale, thereby obtaining the comprehensive similarity of the first text and the second text.
In one embodiment, the system further comprises a judging module for judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and, if so, determining that the first text is similar to the second text.
The above statistical method and system for text similarity take the paragraph, the sentence, and the word of a text as successive scales; at each scale the texts are split, compared, and stripped of their identical fragments, and the comprehensive similarity between the texts is then calculated. This reflects fairly accurately the degree of similarity between texts whose word and sentence order has been deliberately shuffled, so that similar texts whose word order, sentence order, or paragraph order has been deliberately shuffled can also be detected.
Brief description of the drawings
Fig. 1 is a flowchart of the statistical method for text similarity in embodiment one;
Fig. 2 is a flowchart of the statistical method for text similarity in embodiment two;
Fig. 3 is a flowchart of the statistical method for text similarity in embodiment three.
Detailed description
To make the objects, features, and advantages of the invention clearer, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Embodiment one:
Fig. 1 is a flowchart of the statistical method for text similarity in one embodiment, which comprises the following steps:
S110: obtain text T1 and text T2 whose similarity is to be determined.
S120: divide text T1 and text T2 into paragraphs, compare all paragraphs of text T1 with all paragraphs of text T2, and record the number of identical paragraphs as k3.
In this embodiment, the number of paragraphs in text T1 is recorded as k1 and the number of paragraphs in text T2 as k2. For i from 1 to k1 and j from 1 to k2, the i-th paragraph of text T1 is compared with the j-th paragraph of text T2, and the number of identical paragraphs is recorded as k3.
S130: delete the identical paragraphs from text T1 and text T2; after deletion, text T1 becomes text T3 and text T2 becomes text T4.
Each identical paragraph found by the comparison in step S120 is deleted from text T1 and text T2, yielding text T3 and text T4, respectively. After deletion, text T3 and text T4 have no paragraph in common.
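Steps S120 and S130 can be sketched for a single scale as follows. This is a minimal illustration, not the patent's implementation: the name `match_and_remove` is hypothetical, paragraphs are assumed to be separated by newlines, and repeated paragraphs are assumed to be matched one-to-one (a multiset intersection), a detail the text leaves open.

```python
from collections import Counter

def match_and_remove(frags_a, frags_b):
    """Return (number of identical fragments, leftovers of a, leftovers of b).

    Assumption: duplicates are paired one-to-one, so the count is the size
    of the multiset intersection and is the same in both directions.
    """
    common = Counter(frags_a) & Counter(frags_b)

    def leftovers(frags):
        budget = Counter(common)          # matched occurrences still to drop
        rest = []
        for frag in frags:
            if budget[frag] > 0:
                budget[frag] -= 1         # delete one matched occurrence
            else:
                rest.append(frag)         # keep unmatched fragment, in order
        return rest

    return sum(common.values()), leftovers(frags_a), leftovers(frags_b)

# T1 and T2 share two paragraphs, in a different order.
t1 = "Intro.\nMethod.\nResults."
t2 = "Results.\nIntro.\nDiscussion."
k3, t3, t4 = match_and_remove(t1.split("\n"), t2.split("\n"))
```

Because the comparison ignores position, reordering paragraphs does not change k3; only the paragraphs that differ survive into T3 and T4.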
S140: divide text T3 and text T4 into sentences, compare all sentences of text T3 with all sentences of text T4, and record the number of identical sentences as k6.
In this embodiment, the number of sentences in text T3 is recorded as k4 and the number of sentences in text T4 as k5. For i from 1 to k4 and j from 1 to k5, the i-th sentence of text T3 is compared with the j-th sentence of text T4, and the number of identical sentences is recorded as k6.
S150: delete the identical sentences from text T3 and text T4; after deletion, text T3 becomes text T5 and text T4 becomes text T6.
Each identical sentence found by the comparison in step S140 is deleted from text T3 and text T4, yielding text T5 and text T6, respectively. After deletion, text T5 and text T6 have no sentence in common.
S160: divide text T5 and text T6 into words, compare all words of text T5 with all words of text T6, and record the number of identical words as k9.
Word segmentation may use any prior-art algorithm. In this embodiment, the number of words in text T5 is recorded as k7 and the number of words in text T6 as k8. For i from 1 to k7 and j from 1 to k8, the i-th word of text T5 is compared with the j-th word of text T6, and the number of identical words is recorded as k9.
S170: calculate the comprehensive similarity of text T1 relative to text T2, and the comprehensive similarity of text T2 relative to text T1.
The comprehensive similarity M1 of text T1 relative to text T2 is calculated by the following formula:
M1=k3/k1*c1+(1-k3/k1*c1)*[k6/k4*c2+(1-k6/k4*c2)*k9/k7]
The comprehensive similarity M2 of text T2 relative to text T1 is calculated by the following formula:
M2=k3/k2*c1+(1-k3/k2*c1)*[k6/k5*c2+(1-k6/k5*c2)*k9/k8]
Here c1 is the weight of the paragraph scale in the comprehensive similarity and c2 is the weight of the sentence scale. Suitable empirical values may be chosen to adjust the share that each division scale contributes to the comprehensive similarity, provided that c1>0, 1-k3/k1*c1>0, 1-k3/k2*c1>0, c2>0, 1-k6/k4*c2>0, and 1-k6/k5*c2>0.
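The two formulas differ only in their denominators, so one function can serve both directions. The sketch below takes the counts k1..k9 directly; the function names are hypothetical, and treating a ratio with a zero denominator (an empty remaining text) as 0 is an assumption, since the text does not cover that case.

```python
def ratio(matched, total):
    """Share of identical fragments; 0.0 for an empty text (assumption)."""
    return matched / total if total else 0.0

def comprehensive_similarity(k_par, n_par, k_sen, n_sen, k_wrd, n_wrd,
                             c1=1.0, c2=1.0):
    """M = x*c1 + (1 - x*c1) * [y*c2 + (1 - y*c2) * z]."""
    par = ratio(k_par, n_par) * c1        # weighted paragraph-scale similarity
    sen = ratio(k_sen, n_sen) * c2        # weighted sentence-scale similarity
    wrd = ratio(k_wrd, n_wrd)             # word-scale ratio (unweighted)
    return par + (1 - par) * (sen + (1 - sen) * wrd)

# Illustrative counts only: k3=1, k6=1, k9=3; k1=4, k4=2, k7=4; k2=2, k5=4, k8=6.
m1 = comprehensive_similarity(1, 4, 1, 2, 3, 4)   # uses k1, k4, k7
m2 = comprehensive_similarity(1, 2, 1, 4, 3, 6)   # uses k2, k5, k8
```

With these counts M1 and M2 differ even though k3, k6, and k9 are shared, because each direction normalizes by the size of its own text.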
In one embodiment, c1=c2=1, and the comprehensive similarity of text T1 relative to text T2 is:
M1=k3/k1+(1-k3/k1)*[k6/k4+(1-k6/k4)*k9/k7]
The comprehensive similarity of text T2 relative to text T1 is:
M2=k3/k2+(1-k3/k2)*[k6/k5+(1-k6/k5)*k9/k8]
The comprehensive similarity of text T1 relative to text T2 need not equal that of text T2 relative to text T1. For example, if text T1 is one half of text T2, all of text T1 can be found in text T2, whereas only half of text T2 can be found in text T1; in this case the comprehensive similarity of text T1 relative to text T2 is clearly greater than that of text T2 relative to text T1.
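A worked instance of this asymmetry, under assumed numbers: T2 has four paragraphs (k2 = 4), T1 consists of two of them (k1 = 2, k3 = 2), the two paragraphs left in T2 share no sentence or word with anything in T1 (k6 = k9 = 0), and c1 = c2 = 1. Because T3 is empty, the bracketed term of M1 is multiplied by zero whatever its value.

```latex
\begin{align*}
M_1 &= \frac{k_3}{k_1} + \Bigl(1 - \frac{k_3}{k_1}\Bigr)\,[\,\cdots\,]
     = \frac{2}{2} + 0 \cdot [\,\cdots\,] = 1 \\
M_2 &= \frac{k_3}{k_2} + \Bigl(1 - \frac{k_3}{k_2}\Bigr)\,\bigl[\,0 + (1 - 0)\cdot 0\,\bigr]
     = \frac{2}{4} + \frac{1}{2}\cdot 0 = 0.5
\end{align*}
```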
In another embodiment, M1 and M2 may be calculated with different weights, that is:
M1=k3/k1*c1+(1-k3/k1*c1)*[k6/k4*c2+(1-k6/k4*c2)*k9/k7]
M2=k3/k2*c3+(1-k3/k2*c3)*[k6/k5*c4+(1-k6/k5*c4)*k9/k8]
Here c1, c2, c3, and c4 are weights for which suitable empirical values may be chosen, subject to c1>0, c2>0, 1-k3/k1*c1>0, 1-k6/k4*c2>0, c3>0, c4>0, 1-k3/k2*c3>0, and 1-k6/k5*c4>0.
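The constraints on the weights all have the same shape (weight positive, and one minus ratio times weight positive), so they can be checked in one pass. A minimal sketch; the helper name `weights_are_valid` and the sample values are hypothetical.

```python
def weights_are_valid(ratios, weights):
    """Check c > 0 and 1 - r*c > 0 for each weighted scale.

    ratios:  identical-fragment ratios, e.g. [k3/k1, k6/k4] for M1
    weights: the matching weights,      e.g. [c1, c2]
    """
    return all(c > 0 and 1 - r * c > 0 for r, c in zip(ratios, weights))

ok = weights_are_valid([0.5, 0.25], [1.0, 1.0])       # both terms stay positive
too_large = weights_are_valid([0.5, 0.25], [2.0, 1.0])  # 1 - 0.5*2.0 == 0
non_positive = weights_are_valid([0.5], [0.0])          # c must exceed 0
```

Checking M1 and M2 separately (with [c1, c2] and [c3, c4]) covers all eight inequalities listed above.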
The above statistical method for text similarity takes the paragraph, the sentence, and the word of a text as successive scales; at each scale the texts are split, compared, and stripped of their identical fragments, and the comprehensive similarity between the texts is then calculated. This reflects fairly accurately the degree of similarity between texts whose word and sentence order has been deliberately shuffled, so that similar texts whose word order, sentence order, or paragraph order has been deliberately shuffled can also be detected.
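Steps S110 to S170 can be put together in a short end-to-end sketch. Everything below is illustrative: the function names are hypothetical, paragraphs are assumed to be newline-separated, sentences are assumed to end at `.`, `!` or `?`, words are taken as runs of word characters (Chinese text would need a real word segmenter), duplicates are matched one-to-one, and a ratio with a zero denominator is taken as 0.

```python
import re
from collections import Counter

def match_and_remove(a, b):
    """Return (identical count, leftovers of a, leftovers of b)."""
    common = Counter(a) & Counter(b)      # assumed one-to-one matching rule
    def leftovers(frags):
        budget, rest = Counter(common), []
        for f in frags:
            if budget[f] > 0:
                budget[f] -= 1
            else:
                rest.append(f)
        return rest
    return sum(common.values()), leftovers(a), leftovers(b)

def paragraphs(text):
    return [p.strip() for p in text.split("\n") if p.strip()]

def sentences(frags):
    found = (s.strip() for f in frags for s in re.findall(r"[^.!?]+[.!?]?", f))
    return [s for s in found if s]

def words(frags):
    return [w for f in frags for w in re.findall(r"\w+", f)]

def ratio(n, total):
    return n / total if total else 0.0    # empty text: ratio taken as 0

def similarities(t1, t2, c1=1.0, c2=1.0):
    """Return (M1, M2) following the paragraph -> sentence -> word cascade."""
    p1, p2 = paragraphs(t1), paragraphs(t2)            # S120
    k3, r1, r2 = match_and_remove(p1, p2)              # S130
    s1, s2 = sentences(r1), sentences(r2)              # S140
    k6, r1, r2 = match_and_remove(s1, s2)              # S150
    w1, w2 = words(r1), words(r2)                      # S160
    k9, _, _ = match_and_remove(w1, w2)
    def combine(x, y, z):                              # S170
        return x * c1 + (1 - x * c1) * (y * c2 + (1 - y * c2) * z)
    m1 = combine(ratio(k3, len(p1)), ratio(k6, len(s1)), ratio(k9, len(w1)))
    m2 = combine(ratio(k3, len(p2)), ratio(k6, len(s2)), ratio(k9, len(w2)))
    return m1, m2

# Same content with paragraphs and sentences reordered.
shuffled = similarities("Alpha beta. Gamma delta.\nEpsilon zeta.",
                        "Epsilon zeta.\nGamma delta. Alpha beta.")
# Same words in a different order.
reworded = similarities("one two three", "three one two")
```

In both examples every fragment of one text is eventually found in the other at some scale, so the reordering does not lower the score.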
In this embodiment, the following step is also included after step S170:
Judge whether the comprehensive similarity of text T1 relative to text T2 is greater than a similarity threshold θ and whether the comprehensive similarity of text T2 relative to text T1 is greater than the similarity threshold θ; if either is greater than θ, text T1 is determined to be similar to text T2. The similarity threshold θ may be an empirical value, and its value depends on c1 and c2.
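The decision rule is a strict comparison of either score against θ. A minimal sketch; the function name and the threshold value 0.8 are hypothetical, since the text only says θ is chosen empirically.

```python
def is_similar(m1, m2, theta):
    """True if either directional similarity strictly exceeds the threshold."""
    return m1 > theta or m2 > theta

THETA = 0.8  # hypothetical value; the source gives no concrete threshold

one_sided = is_similar(0.9, 0.4, THETA)    # only M1 exceeds the threshold
borderline = is_similar(0.8, 0.8, THETA)   # equal to the threshold is not "greater"
```

Taking the larger of the two directions means a short text copied wholesale into a longer one is still flagged, even though the longer text scores low relative to the shorter one.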
In other embodiments, only one comprehensive similarity may be calculated (for example, that of text T1 relative to text T2), and only that value is compared with the similarity threshold θ. This applies, for example, when text T1 is the one of the two texts suspected of plagiarism.
In other embodiments, the division scales used to split the two texts into text fragments may differ from those of embodiment one: for example, going directly from paragraphs to words, or directly from sentences to words, or using division scales other than paragraph, sentence, and word. Two corresponding embodiments are given below:
Embodiment two:
S210: obtain text T1 and text T2 whose similarity is to be determined.
S220: divide text T1 and text T2 into paragraphs, compare all paragraphs of text T1 with all paragraphs of text T2, and record the number of identical paragraphs as k3.
In this embodiment, the number of paragraphs in text T1 is recorded as k1 and the number of paragraphs in text T2 as k2. For i from 1 to k1 and j from 1 to k2, the i-th paragraph of text T1 is compared with the j-th paragraph of text T2, and the number of identical paragraphs is recorded as k3.
S230: delete the identical paragraphs from text T1 and text T2; after deletion, text T1 becomes text T3 and text T2 becomes text T4.
S240: divide text T3 and text T4 into words, compare all words of text T3 with all words of text T4, and record the number of identical words as k6.
In this embodiment, the number of words in text T3 is recorded as k4 and the number of words in text T4 as k5. For i from 1 to k4 and j from 1 to k5, the i-th word of text T3 is compared with the j-th word of text T4, and the number of identical words is recorded as k6.
S250: calculate the comprehensive similarity of text T1 relative to text T2, and the comprehensive similarity of text T2 relative to text T1.
In this embodiment, the comprehensive similarity M1 of text T1 relative to text T2 is calculated by the following formula:
M1=k3/k1*c1+(1-k3/k1*c1)*k6/k4
The comprehensive similarity M2 of text T2 relative to text T1 is calculated by the following formula:
M2=k3/k2*c1+(1-k3/k2*c1)*k6/k5
Here c1 is the weight of the paragraph scale in the comprehensive similarity; a suitable empirical value may be chosen, provided that c1>0, 1-k3/k1*c1>0, and 1-k3/k2*c1>0.
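The two-scale formula here and the three-scale formula of embodiment one share one nesting pattern, which suggests a single function over any number of scales. The sketch below is a generalization made for illustration and is not stated in the source; the name `nested_similarity` is hypothetical. Ratios are listed from the coarsest to the finest scale, with one weight per scale except the finest.

```python
def nested_similarity(ratios, weights):
    """Fold M = r*c + (1 - r*c) * (similarity of the finer scales).

    ratios:  identical-fragment ratios, coarsest scale first
    weights: one weight per scale except the finest (which is unweighted)
    """
    if len(weights) != len(ratios) - 1:
        raise ValueError("need exactly one weight per scale except the finest")
    sim = ratios[-1]                      # finest scale enters unweighted
    for r, c in zip(reversed(ratios[:-1]), reversed(weights)):
        sim = r * c + (1 - r * c) * sim   # wrap the next coarser scale around it
    return sim

two_scale = nested_similarity([0.5, 0.5], [1.0])              # k3/k1, k6/k4
three_scale = nested_similarity([0.5, 0.5, 0.5], [1.0, 1.0])  # adds k9/k7
```

With two ratios this reduces to M1 = k3/k1*c1 + (1-k3/k1*c1)*k6/k4, and with three ratios to the formula of embodiment one.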
In this embodiment, the following step is also included after step S250:
Judge whether the comprehensive similarity of text T1 relative to text T2 is greater than a similarity threshold θ and whether the comprehensive similarity of text T2 relative to text T1 is greater than the similarity threshold θ; if either is greater than θ, text T1 is determined to be similar to text T2. The similarity threshold θ may be an empirical value, and its value depends on c1.
In other embodiments, only one comprehensive similarity may be calculated (for example, that of text T1 relative to text T2), and only that value is compared with the similarity threshold θ.
Embodiment three:
S310: obtain text T1 and text T2 whose similarity is to be determined.
S320: divide text T1 and text T2 into sentences, compare all sentences of text T1 with all sentences of text T2, and record the number of identical sentences as k3.
In this embodiment, the number of sentences in text T1 is recorded as k1 and the number of sentences in text T2 as k2. For i from 1 to k1 and j from 1 to k2, the i-th sentence of text T1 is compared with the j-th sentence of text T2, and the number of identical sentences is recorded as k3.
S330: delete the identical sentences from text T1 and text T2; after deletion, text T1 becomes text T3 and text T2 becomes text T4.
S340: divide text T3 and text T4 into words, compare all words of text T3 with all words of text T4, and record the number of identical words as k6.
In this embodiment, the number of words in text T3 is recorded as k4 and the number of words in text T4 as k5. For i from 1 to k4 and j from 1 to k5, the i-th word of text T3 is compared with the j-th word of text T4, and the number of identical words is recorded as k6.
S350: calculate the comprehensive similarity of text T1 relative to text T2, and the comprehensive similarity of text T2 relative to text T1.
In this embodiment, the comprehensive similarity M1 of text T1 relative to text T2 is calculated by the following formula:
M1=k3/k1*c1+(1-k3/k1*c1)*k6/k4
The comprehensive similarity M2 of text T2 relative to text T1 is calculated by the following formula:
M2=k3/k2*c1+(1-k3/k2*c1)*k6/k5
Here c1 is the weight of the sentence scale in the comprehensive similarity; a suitable empirical value may be chosen, provided that c1>0, 1-k3/k1*c1>0, and 1-k3/k2*c1>0.
In this embodiment, the following step is also included after step S350:
Judge whether the comprehensive similarity of text T1 relative to text T2 is greater than a similarity threshold θ and whether the comprehensive similarity of text T2 relative to text T1 is greater than the similarity threshold θ; if either is greater than θ, text T1 is determined to be similar to text T2. The similarity threshold θ may be an empirical value, and its value depends on c1.
In other embodiments, only one comprehensive similarity may be calculated (for example, that of text T1 relative to text T2), and only that value is compared with the similarity threshold θ.
The above embodiments express only several implementations of the invention. Although they are described in specific detail, they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, all of which fall within the scope of protection of the invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims (8)

1. A statistical method for text similarity, comprising:
obtaining a first text and a second text whose similarity is to be determined;
dividing the first text and the second text into text fragments at a first division scale, comparing all text fragments of the first text with all text fragments of the second text at the first division scale, and calculating the ratio x1 of the number of text fragments identical in the first and second texts to the total number of text fragments of the first text;
deleting the identical text fragments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively;
dividing the first remaining text and the second remaining text into text fragments at a second division scale, comparing all text fragments of the first remaining text with all text fragments of the second remaining text at the second division scale, and calculating the ratio y1 of the number of text fragments identical in the first and second remaining texts to the total number of text fragments of the first remaining text, the second division scale being smaller than the first division scale; and
multiplying x1 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, multiplying one minus that similarity by y1, and adding the similarity at the first division scale, thereby obtaining the comprehensive similarity of the first text and the second text.
2. The statistical method for text similarity according to claim 1, characterized in that the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs; and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.
3. The statistical method for text similarity according to claim 1, characterized in that the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into sentences; and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into words.
4. The statistical method for text similarity according to claim 1, characterized in that the step of dividing the first text and the second text into text fragments at the first division scale divides the first text and the second text into paragraphs; and the step of dividing the first remaining text and the second remaining text into text fragments at the second division scale divides the first remaining text and the second remaining text into sentences;
the method further comprises the step of deleting the identical sentences from the first remaining text and the second remaining text to obtain text T5 and text T6, respectively, dividing text T5 and text T6 into words, comparing all words of text T5 with all words of text T6, and calculating the ratio z1 of the number of words identical in text T5 and text T6 to the total number of words in text T5; and
the comprehensive similarity of the first text and the second text is calculated by the following formula: comprehensive similarity M1 = x1*c1 + (1-x1*c1)*[y1*c2 + (1-y1*c2)*z1], where c1 is the weight of the paragraph scale in the comprehensive similarity and c2 is the weight of the sentence scale in the comprehensive similarity.
5. The statistical method for text similarity according to any one of claims 1-4, characterized by further comprising the step of judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and, if so, determining that the first text is similar to the second text.
6. The statistical method for text similarity according to any one of claims 1-3, characterized by further comprising the following steps:
calculating the ratio x2 of the number of text fragments identical in the first and second texts at the first division scale to the total number of text fragments of the second text;
calculating the ratio y2 of the number of text fragments identical in the first and second remaining texts at the second division scale to the total number of text fragments of the second remaining text;
multiplying x2 by the weight of the first division scale in the comprehensive similarity to obtain the similarity at the first division scale, multiplying one minus that similarity by y2, and adding the similarity at the first division scale, thereby obtaining the comprehensive similarity of the second text and the first text; and
judging whether the comprehensive similarity of the first text and the second text is greater than a similarity threshold and whether the comprehensive similarity of the second text and the first text is greater than the similarity threshold, and, if either is greater than the similarity threshold, determining that the first text is similar to the second text.
7. A statistical system for text similarity, characterized in that it comprises:
A reading module, for obtaining the first text and the second text whose similarity needs to be determined;
A first segmentation comparison module, for dividing said first text and second text each into several text fragments at a first division yardstick, comparing all text fragments of the first text with all text fragments of the second text under the first division yardstick, and calculating the ratio x1 of the number of text fragments that are identical in the first text and the second text under the first division yardstick to the total number of text fragments of the first text;
A first removing module, for deleting the identical text fragments from the first text and the second text, to obtain a first residue text and a second residue text respectively;
A segmentation comparison module, for dividing the first residue text and the second residue text each into several text fragments at a second division yardstick, comparing all text fragments of the first residue text with all text fragments of the second residue text under the second division yardstick, and calculating the ratio y1 of the number of text fragments that are identical in the first residue text and the second residue text under the second division yardstick to the total number of text fragments of the first residue text; said second division yardstick is smaller than the first division yardstick;
A comprehensive similarity computing module, for multiplying x1 by the weight of the first division yardstick in the comprehensive similarity to obtain the similarity under the first division yardstick; subtracting that similarity from one, multiplying the difference by y1, and then adding the similarity under the first division yardstick, thereby calculating the comprehensive similarity of the first text and the second text.
8. The statistical system for text similarity according to claim 7, characterized in that it further comprises a judging module, for judging whether the comprehensive similarity of said first text and second text is greater than a similarity threshold, and if so, judging that said first text is similar to the second text.
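
As an informal illustration only (not part of the claims), the calculation recited above can be sketched in Python. With w as the weight of the first division yardstick, the comprehensive similarity reads as w*x1 + (1 - w*x1)*y1. The sketch assumes sentences as the first division yardstick and whitespace-separated words as the second; these choices, the function names, the default weight of 1.0, and the default threshold of 0.8 are all hypothetical and are not taken from the patent. It also assumes that a fragment of the first text counts as identical when it occurs anywhere in the second text.

```python
import re


def split_sentences(text):
    # Hypothetical first (coarser) division yardstick: sentence-like fragments.
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]


def split_words(text):
    # Hypothetical second (finer) division yardstick: whitespace-separated words.
    return text.split()


def shared_ratio(frags_a, frags_b):
    # Share of A's fragments that also occur in B; 0.0 when A has no fragments.
    if not frags_a:
        return 0.0
    in_b = set(frags_b)
    return sum(1 for f in frags_a if f in in_b) / len(frags_a)


def comprehensive_similarity(text_a, text_b, weight=1.0):
    # Ratio x at the first division yardstick.
    sent_a, sent_b = split_sentences(text_a), split_sentences(text_b)
    x = shared_ratio(sent_a, sent_b)

    # Delete the identical fragments from both texts, then divide the
    # residue texts at the second (smaller) division yardstick.
    common = set(sent_a) & set(sent_b)
    rest_a = [w for s in sent_a if s not in common for w in split_words(s)]
    rest_b = [w for s in sent_b if s not in common for w in split_words(s)]
    y = shared_ratio(rest_a, rest_b)

    # weight * x, plus (1 - weight * x) * y.
    first_scale = weight * x
    return first_scale + (1.0 - first_scale) * y


def is_similar(text_a, text_b, threshold=0.8, weight=1.0):
    # Similar if the similarity in either direction exceeds the threshold.
    return (comprehensive_similarity(text_a, text_b, weight) > threshold
            or comprehensive_similarity(text_b, text_a, weight) > threshold)
```

For example, comparing "The cat sat. The dog ran. Birds fly high." with "Birds fly high. The cat sat. A dog ran." gives x1 = 2/3 (two of three sentences match despite the changed order) and y1 = 2/3 on the residue words, so the comprehensive similarity is 2/3 + (1/3)(2/3) = 8/9 under the assumed weight of 1.0.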
CN201310074669.0A 2013-03-08 2013-03-08 The statistical method of text similarity and system Active CN103176962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310074669.0A CN103176962B (en) 2013-03-08 2013-03-08 The statistical method of text similarity and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310074669.0A CN103176962B (en) 2013-03-08 2013-03-08 The statistical method of text similarity and system

Publications (2)

Publication Number Publication Date
CN103176962A CN103176962A (en) 2013-06-26
CN103176962B CN103176962B (en) 2015-11-04

Family

ID=48636848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310074669.0A Active CN103176962B (en) 2013-03-08 2013-03-08 The statistical method of text similarity and system

Country Status (1)

Country Link
CN (1) CN103176962B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133842A (en) * 2014-06-24 2014-11-05 国家电网公司 Data processing method and data processing system with intelligent expert detection function
CN104133840A (en) * 2014-06-24 2014-11-05 国家电网公司 Data processing method and data processing system with system detection and biological recognition functions
CN104156386A (en) * 2014-06-24 2014-11-19 国家电网公司 Data processing method and system with image recognition function
CN104133838A (en) * 2014-06-24 2014-11-05 国家电网公司 Data processing method and system with system detection function
CN104133839A (en) * 2014-06-24 2014-11-05 国家电网公司 Data processing method and system with intelligent detection function
CN106202055A (en) * 2016-07-27 2016-12-07 湖南蚁坊软件有限公司 A kind of similarity determination method for long text
CN107329947B (en) * 2017-05-15 2019-07-26 中国移动通信集团湖北有限公司 The determination method, device and equipment of Similar Text
CN108304378B (en) * 2018-01-12 2019-09-24 深圳壹账通智能科技有限公司 Text similarity computing method, apparatus, computer equipment and storage medium
CN108363767A (en) * 2018-02-07 2018-08-03 深圳中兴网信科技有限公司 File input method, device, computer equipment and readable storage medium storing program for executing
CN109615001B (en) * 2018-12-05 2020-03-10 上海恺英网络科技有限公司 Method and device for identifying similar articles
CN112528630B (en) * 2019-09-19 2024-09-20 北京国双科技有限公司 Text similarity determination method and device, storage medium and electronic equipment
CN111914532B (en) * 2020-09-14 2024-05-03 北京阅神智能科技有限公司 Chinese composition scoring method
CN115209188B (en) * 2022-09-07 2023-01-20 北京达佳互联信息技术有限公司 Detection method, device, server and storage medium for simultaneous live broadcast of multiple accounts

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000079426A1 (en) * 1999-06-18 2000-12-28 The Trustees Of Columbia University In The City Of New York System and method for detecting text similarity over short passages
CN102081598A (en) * 2011-01-27 2011-06-01 北京邮电大学 Method for detecting duplicated texts
CN102214232A (en) * 2011-06-28 2011-10-12 东软集团股份有限公司 Method and device for calculating similarity of text data
CN102314418A (en) * 2011-10-09 2012-01-11 北京航空航天大学 Method for comparing Chinese similarity based on context relation

Also Published As

Publication number Publication date
CN103176962A (en) 2013-06-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230103

Address after: Room 301, No. 235, Kexue Avenue, Huangpu District, Guangzhou, Guangdong 510000

Patentee after: OURCHEM INFORMATION CONSULTING CO.,LTD.

Address before: No. 1068 Xueyuan Avenue, Xili University Town, Nanshan District, Shenzhen, Guangdong 518055

Patentee before: SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY

TR01 Transfer of patent right

Effective date of registration: 20230103

Address after: Room 606-609, Compound Office Complex Building, No. 757, Dongfeng East Road, Yuexiu District, Guangzhou, Guangdong 510000 (not for plant use)

Patentee after: China Southern Power Grid Internet Service Co.,Ltd.

Address before: Room 301, No. 235, Kexue Avenue, Huangpu District, Guangzhou, Guangdong 510000

Patentee before: OURCHEM INFORMATION CONSULTING CO.,LTD.