CN102081598B - Method for detecting duplicated texts

Info

Publication number: CN102081598B
Application number: CN2011100294938A
Authority: CN
Inventors: 李蕾; 聂洋; 赵青
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2011-01-27
Filing date: 2011-01-27
Publication date: 2012-07-04
Anticipated expiration: 2031-01-27
Also published as: CN102081598A

Abstract

The invention discloses a method for detecting duplicated texts, comprising the following steps: obtaining the weight of each word from its term frequency (TF) value in a text and from whether the word occurs in the title, and extracting, in order of occurrence, the words with the highest weights from the text to form a keyword set frame; for any two texts whose keyword set frames have been obtained, judging in turn whether each word in the keyword set frame of one text is in the keyword set frame of the other text, adding 1 to a matching value representing the matching degree whenever the word belongs to the keyword set frames of both texts and its weights in the two texts match, until the last word in the keyword set frame of the one text has been checked, and obtaining the similarity of the two texts from the resulting matching value; and judging whether the two texts are duplicates according to the similarity and a similarity threshold. The method can effectively detect texts with duplicated information and improves the efficiency of searching for useful information among a large number of texts.

Description

A method for detecting duplicated texts

Technical field

The present invention relates to the technical field of text processing, and in particular to a method for detecting duplicated texts.

Background art

At present, more and more texts are being produced in every industry, and the volume of textual information is effectively unbounded. In general, useful information needs to be found quickly among a large number of texts. However, many existing texts are duplicates, which seriously slows down the search for useful information. How to find useful information quickly among a large number of texts has therefore become a problem in urgent need of a solution.

Summary of the invention

In view of this, the invention provides a method for detecting duplicated texts that can effectively detect texts with duplicated information, thereby improving the efficiency of searching for useful information among a large number of texts.

To achieve the above object, the technical solution of the present invention is implemented as follows:

A method for detecting duplicated texts, the method comprising:

For each text:

The term frequency (TF) value of each word in the text is used as the weight of that word, and the weights of the words that appear in the title of the text are adjusted according to the number of sentences in the text. After all words are arranged in descending order of weight, a certain number of top-ranked words are taken as the keyword set of the text. The words belonging to the keyword set are then taken out of the text in order of occurrence, and all the words so taken out form the keyword set frame of the text;

For any two texts, text A and text B, whose keyword set frames have been obtained:

It is judged in turn whether each word in the keyword set frame of text B is in the keyword set frame of text A. If it is, it is then judged whether the weights of this word in text A and in text B match. If they match, the matched word and all words before it are deleted from the keyword set frame of text A, the frame remaining after the deletion is used as the keyword set frame of text A, and the matching value representing the matching degree is incremented by 1. This continues until the last word in the keyword set frame of text B has been processed. Judging whether a word in the keyword set frame of text B is in the keyword set frame of text A means comparing the current word of text B in turn with each word in the keyword set frame of text A;

The sum of the total number of words in the keyword set frame of text A before its first deletion and the total number of words in the keyword set frame of text B is calculated. The matching value obtained is divided by half of this sum, and the quotient is used as the similarity representing how similar text A and text B are;

The similarity obtained is compared with a similarity threshold. When the similarity is greater than the similarity threshold, text A and text B are judged to be duplicated texts; otherwise, they are judged not to be duplicated texts.
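The similarity and the decision rule described above can be written compactly. The notation is introduced here for illustration only and does not appear in the source: M is the final matching value, |F_A| is the number of words in the keyword set frame of text A before its first deletion, |F_B| is the number of words in the keyword set frame of text B, and θ is the similarity threshold.

```latex
S(A,B) = \frac{M}{\tfrac{1}{2}\left(|F_A| + |F_B|\right)}
       = \frac{2M}{|F_A| + |F_B|},
\qquad
A \text{ and } B \text{ are duplicates} \iff S(A,B) > \theta
```

Each match consumes one word of F_B and removes at least one word from F_A, so M cannot exceed the smaller of |F_A| and |F_B|, and S(A,B) therefore never exceeds 1.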

The TF value of each word in a text is obtained by dividing the total number of times this word occurs in the text by the total number of occurrences of all words in the text.

Adjusting the weights of the words that appear in the title of the text according to the number of sentences in the text comprises:

Multiplying the weight of each word in the title of the text by half of the total number of sentences in the text, and using the product as the weight of that title word.
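The two definitions above can be stated as formulas. The symbols are assumptions made for illustration and are not part of the source: n_w is the number of occurrences of word w in the text, the sum runs over all distinct words v in the text, and N_s is the total number of sentences in the text.

```latex
\mathrm{TF}(w) = \frac{n_w}{\sum_{v} n_v},
\qquad
\mathrm{weight}(w) =
\begin{cases}
\mathrm{TF}(w) \cdot \dfrac{N_s}{2}, & \text{if } w \text{ appears in the title},\\[4pt]
\mathrm{TF}(w), & \text{otherwise}.
\end{cases}
```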

Taking a certain number of top-ranked words as the keyword set of the text comprises: choosing the number of words according to the length of the text. When the text has 320 words or fewer, the 6 top-ranked words are chosen as the keyword set of the text; when the text has more than 320 words, one keyword is taken for every 70 to 90 words of text, and the corresponding number of top-ranked words is taken as the keyword set of the text.

Judging whether the weights of this word in text A and text B match comprises: when the difference between the weights of this word in the two texts is less than 1/4 of the total number of sentences in one of the texts, the weights of this word in the two texts are judged to match; otherwise, they are judged not to match.

The similarity threshold is determined according to the precision and recall obtained from experimental statistics.

The similarity threshold is 0.4 to 0.6.

To sum up, the method for detecting duplicated texts adopted by the present invention proceeds as follows. First, the weight of each word is determined from the TF value of the word in the text and from whether the word appears in the title; the top-ranked words by weight are selected as the keyword set of the text, and all words belonging to the keyword set are taken out of the text in order of occurrence to form the keyword set frame of the text. Second, for any two texts whose keyword set frames have been obtained, a matching value representing their matching degree is set up, and it is judged in turn whether each word in the keyword set frame of one text is in the keyword set frame of the other text; when a word belongs to the keyword set frames of both texts and its weights in the two texts match, the matching value is incremented by 1, until the last word in the keyword set frame of the one text has been judged. Third, the similarity of the two texts is obtained from the matching value. Finally, whether the two texts are duplicates is determined from the similarity and a similarity threshold. Because the method compares the keyword set frames of two texts, and a word found in both frames is counted as belonging to both texts only if its weights in the two texts also match, the method can effectively detect texts with duplicated information. The duplicated texts among a large number of texts can then be deleted, which improves the efficiency of searching for useful information among them.

Description of drawings

Fig. 1 is a flowchart of an embodiment of the keyword set extraction method of the present invention;

Fig. 2 is a flowchart of an embodiment of the text similarity calculation method of the present invention.

Embodiment

To solve the problems in the prior art, the present invention proposes a new method for detecting duplicated texts. First, the weight of each word is determined from the TF value of the word in the text and from whether the word appears in the title; the top-ranked words by weight are selected as the keyword set of the text, and all words belonging to the keyword set are taken out of the text in order of occurrence to form the keyword set frame of the text. Second, for any two texts whose keyword set frames have been obtained, a matching value representing their matching degree is set up, and it is judged in turn whether each word in the keyword set frame of one text is in the keyword set frame of the other text; when a word belongs to the keyword set frames of both texts and its weights in the two texts match, the matching value is incremented by 1, until the last word in the keyword set frame of the one text has been judged. Third, the similarity of the two texts is obtained from the matching value. Finally, whether the two texts are duplicates is determined from the similarity and a similarity threshold.

Before the specific implementation is described, the concepts of keyword set, keyword set frame, and text similarity are introduced. For a specific field, the keyword set is the set of all words that best express the meaning to be conveyed in that field. For a text, the keyword set frame consists of all words belonging to the keyword set of the text, taken out of the text in order of occurrence. Text similarity refers to how close two texts are in meaning.

Based on the above, the specific implementation of the solution of the present invention comprises:

For each text:

It is judged in turn whether each word in the keyword set frame of text B is in the keyword set frame of text A. If it is, it is then judged whether the weights of this word in text A and in text B match. If they match, the matched word and all words before it are deleted from the keyword set frame of text A, the frame remaining after the deletion is used as the keyword set frame of text A, and the matching value representing the matching degree is incremented by 1. This continues until the last word in the keyword set frame of text B has been processed;

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments.

Fig. 1 is a flowchart of the keyword set extraction method of the present invention. As shown in Fig. 1, the flow comprises:

Step 101: Perform word segmentation on a text to obtain the segmented text.

Because existing texts are very numerous and cover every field, this step takes one of the texts and performs word segmentation on it; that is, the text is converted from a sequence of sentences into a sequence of words. Obtaining the segmented text thus means obtaining all the words in the text.

Step 102: Calculate the TF value of each word in the segmented text, and use the TF value of each word as the weight of that word.

In this step, the TF value of a word is obtained by dividing the total number of times the word occurs in the text by the total number of occurrences of all words in the text.
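Step 102 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `term_frequencies` and the sample tokens are hypothetical, and the input is assumed to be the already segmented word list produced by step 101.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each word to (occurrences of the word) / (occurrences of all words)."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


# Hypothetical segmented text; the segmenter of step 101 is not shown.
tokens = ["text", "duplicate", "text", "detect"]
tf = term_frequencies(tokens)
```

Because every occurrence is counted in the denominator, the TF values of a text always sum to 1.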

Step 103: Adjust the weights of the words that appear in the title of the text according to the number of sentences in the text.

In general, the words that appear in the title are very important, so their weights need to be adjusted. In this step the adjustment is based on the number of sentences in the text, specifically: the weight of each word in the title is multiplied by half of the total number of sentences in the text, and the product is used as the weight of that title word. In practice, the weights of title words may also be adjusted in other ways, provided that the implementation of the embodiment of the invention is not affected.
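The title adjustment of step 103 might look as follows. The sketch assumes the weights are held in a dictionary and the title words in a set; `adjust_title_weights` and the sample values are illustrative names and data, not taken from the source.

```python
def adjust_title_weights(weights, title_words, sentence_count):
    """Multiply the weight of every title word by half the total sentence count."""
    factor = sentence_count / 2
    return {
        word: weight * factor if word in title_words else weight
        for word, weight in weights.items()
    }


# Hypothetical weights from step 102 for a text with 8 sentences.
weights = {"text": 0.5, "duplicate": 0.25, "detect": 0.25}
adjusted = adjust_title_weights(weights, {"duplicate"}, sentence_count=8)
```

Note that the factor is below 1 for a text with fewer than two sentences; the source does not say how such texts are handled.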

Step 104: Arrange all words in descending order of weight.

After the weights of the title words have been adjusted, all words can be arranged in descending order of weight. The specific sorting method may vary, provided that the implementation of the embodiment of the invention is not affected.

Step 105: Take out a number of top-ranked words determined by the length of the text, and use all the words taken out as the keyword set of the text.

Texts of different lengths carry different amounts of information, and a longer text needs more information to characterize it than a shorter one. In this step, therefore, the number of words in the keyword set is determined by the length of the text. In general, when the text length is below a limit in the range of 280 to 320 words, the 6 top-ranked words are chosen as the keyword set of the text. When the text has more than 320 words, one keyword is taken for every 70 to 90 words (80 is the optimum value), and the result is rounded to the nearest integer. For example, if the length of the text is 670 and one keyword is taken per 80 words, then 670 ÷ 80 = 8.375 ≈ 8, so the keyword set of the text contains 8 words; that is, the 8 words with the highest weights are taken as the keyword set of the text.
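Steps 104 and 105 can be sketched together. The function and parameter names below are hypothetical. The sketch assumes the short-text limit is 320, that a length of exactly 320 counts as short, and that rounding is to the nearest integer, which is what the 670 ÷ 80 example suggests; the text length is measured in whatever unit the method counts.

```python
def keyword_count(text_length, short_limit=320, short_count=6, length_per_keyword=80):
    """Number of keywords to keep for a text of the given length."""
    if text_length <= short_limit:
        return short_count
    # Round half up; Python's built-in round() rounds halves to even instead.
    return int(text_length / length_per_keyword + 0.5)


def select_keywords(weights, text_length):
    """Sort words by descending weight (step 104) and keep the top ones (step 105)."""
    ranked = sorted(weights, key=lambda word: weights[word], reverse=True)
    return ranked[:keyword_count(text_length)]


# Hypothetical adjusted weights for a short text.
weights = {"a": 0.05, "b": 0.30, "c": 0.20, "d": 0.15, "e": 0.10, "f": 0.12, "g": 0.08}
keywords = select_keywords(weights, text_length=300)
```

With these assumptions `keyword_count(670)` gives 8, matching the worked example in the text. Ties in weight keep their original dictionary order because `sorted` is stable.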

Step 106: Take the words belonging to the keyword set out of the text in order of occurrence, and use all the words so taken out as the keyword set frame of the text.

After the keyword set of the text has been obtained, all words belonging to the keyword set must be taken out of the text in order, specifically: the text is scanned from the beginning; if a scanned word is present in the keyword set, it is taken out, and otherwise it is skipped. After the end of the text has been reached, all the words taken out, in order, form the keyword set frame of the text.

It should be noted that, when the keyword set frame of a text is obtained, the weight of each word in the frame must also be retained; that is, every word in the resulting keyword set frame carries a weight attribute.
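The scan of step 106 can be illustrated as below. Representing the frame as a list of `(word, weight)` pairs is an assumption made here so that each word keeps its weight attribute; the function name and sample data are hypothetical.

```python
def keyword_set_frame(tokens, keywords, weights):
    """Scan the text from the start and keep every occurrence of a keyword,
    in order, together with its weight."""
    keyword_lookup = set(keywords)
    return [(word, weights[word]) for word in tokens if word in keyword_lookup]


# Hypothetical segmented text, keyword set, and weights.
tokens = ["a", "b", "c", "a", "d"]
weights = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.2}
frame = keyword_set_frame(tokens, ["a", "d"], weights)
```

A keyword that occurs several times in the text appears several times in the frame, so the frame preserves both the order and the repetition of keywords.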

This completes the whole workflow of the keyword set extraction method of the present invention.

It should be noted that the extraction process shown in Fig. 1 applies to a single text. In practice, every text that is to be checked for duplication must undergo the above processing, so that the keyword set frame of each text is obtained.

After the keyword set frames of the texts have been obtained, the similarity of the texts can be calculated from them; see the workflow of the text similarity calculation method embodiment in Fig. 2. The flow is explained using two texts, text A and text B, as an example. To determine whether text B is similar to text A, text A is taken as the text compared against and text B as the text being compared. As shown in Fig. 2, the flow comprises:

Step 201: Take the first word in the keyword set frame of text B as the current word, and set a current matching value, representing the matching degree of text A and text B, to 0.

It should be noted that the matching value in this step describes the matching degree of text A and text B in absolute terms, and its initial value should be 0.

Steps 202-203: Compare the current word of text B in turn with each word in the keyword set frame of text A to judge whether the current word is in the keyword set frame of text A. If so, go to step 204; otherwise, go to step 206.

Step 204: When the current word is in the keyword set frame of text A, further judge whether the weight of the current word in text A matches its weight in text B. If so, go to step 205; otherwise, go to step 206.

Because every word in a keyword set frame carries a weight attribute, once the current word has been found in the keyword set frame of text A it is also necessary to judge whether its weight in text A matches its weight in text B, specifically: when the difference between the weights of the current word in text A and text B is less than 1/5 to 1/3 (optimum value 1/4) of the total number of sentences in text A or text B, the two weights are judged to match; otherwise, they are judged not to match.
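The weight test of step 204 reduces to one comparison. In this sketch the name `weights_match` is hypothetical, `fraction` defaults to the stated optimum of 1/4, and the caller decides which text's sentence count to pass, since the source allows either text A or text B.

```python
def weights_match(weight_a, weight_b, sentence_count, fraction=0.25):
    """True when the two weights differ by less than `fraction` of the sentence count."""
    return abs(weight_a - weight_b) < sentence_count * fraction


# Hypothetical weights for a text with 8 sentences: the tolerance is 8 * 0.25 = 2.
close = weights_match(1.0, 0.2, sentence_count=8)
far = weights_match(3.0, 0.5, sentence_count=8)
```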

Step 205: Delete from the keyword set frame of text A the matched current word and all words before it, use the remaining frame as the keyword set frame of text A, and add 1 to the current matching value of text A and text B.

When the weight of the current word in text A matches its weight in text B, the matched current word and all words before it are deleted from the keyword set frame of text A, and the frame remaining after the deletion is used as the keyword set frame of text A. At the same time, the current matching value of text A and text B is incremented by 1, and the result becomes the new current matching value.

Step 206: Judge whether the current word is the last word in the keyword set frame of text B. If so, go to step 207; otherwise, go to step 208.

Step 207: Calculate the sum of the total number of words in the keyword set frame of text A before its first deletion and the total number of words in the keyword set frame of text B; divide the current matching value by half of this sum; use the quotient as the similarity of text A and text B; and end the workflow.

When the current word is the last word in the keyword set frame of text B, the current matching value can be taken as the final matching value of text A and text B. However, a matching value of 3 when the keyword set frames of both texts contain no more than 5 words clearly means something different from a matching value of 3 when both frames contain more than 10 words; the matching value alone cannot accurately reflect how similar two texts are. In this step, therefore, the current matching value obtained in step 205 is used to derive the similarity of text A and text B, specifically: the sum of the total number of words in the keyword set frame of text A before its first deletion and the total number of words in the keyword set frame of text B is calculated, the current matching value is divided by half of this sum, the quotient is used as the similarity of text A and text B, and the workflow ends.
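A short numeric sketch shows why the matching value is normalized in step 207. The function name `similarity` and the frame sizes are hypothetical; returning 0.0 for two empty frames is an assumption, since the source does not cover that case.

```python
def similarity(match_value, frame_a_size, frame_b_size):
    """Matching value divided by half the combined size of the two frames."""
    half_sum = (frame_a_size + frame_b_size) / 2
    return match_value / half_sum if half_sum else 0.0


# The same matching value of 3 for small frames and for larger frames.
small_frames = similarity(3, frame_a_size=5, frame_b_size=5)
large_frames = similarity(3, frame_a_size=12, frame_b_size=12)
```

Three matches out of frames of 5 words give a much higher similarity than three matches out of frames of 12 words, which is the distinction the raw matching value cannot express.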

Step 208: Take the word following the current word in the keyword set frame of text B as the new current word and return to step 202, until the current word is the last word in the keyword set frame of text B.

This completes the whole workflow of the text similarity calculation method of the present invention.
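Steps 201 to 208 can be put together as the following sketch. It is one possible reading of the flow, not the patent's implementation: frames are assumed to be lists of `(word, weight)` pairs, the first occurrence of the current word in the remaining frame of text A is the one tested, and `sentence_count` and `fraction` are illustrative parameters for the weight test.

```python
def frame_similarity(frame_a, frame_b, sentence_count, fraction=0.25):
    """Similarity of two keyword set frames following steps 201-208."""
    remaining_a = list(frame_a)   # working copy; the original size is needed in step 207
    match_value = 0               # step 201
    for word_b, weight_b in frame_b:                    # steps 206 and 208
        for index, (word_a, weight_a) in enumerate(remaining_a):
            if word_a == word_b:                        # steps 202-203
                if abs(weight_a - weight_b) < sentence_count * fraction:  # step 204
                    # Step 205: drop the matched word and everything before it.
                    remaining_a = remaining_a[index + 1:]
                    match_value += 1
                break
    half_sum = (len(frame_a) + len(frame_b)) / 2        # step 207
    return match_value / half_sum if half_sum else 0.0


# Hypothetical frames; a sentence count of 4 gives a weight tolerance of 1.0.
frame_a = [("x", 0.3), ("y", 0.2), ("z", 0.1), ("x", 0.3)]
frame_b = [("y", 0.25), ("x", 0.3), ("w", 0.1)]
score = frame_similarity(frame_a, frame_b, sentence_count=4)

# Same words in opposite order: deleting the prefix makes the order matter.
swapped = frame_similarity([("x", 0.3), ("y", 0.2)], [("y", 0.2), ("x", 0.3)], sentence_count=4)
```

Deleting the matched word together with everything before it means that keywords must occur in the same relative order in both texts to be counted, so the score reflects shared sequence and not only shared vocabulary.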

After the similarity of two texts has been obtained, whether the texts are duplicates can be judged from it. In this embodiment, the similarity is compared with a similarity threshold: when the similarity is greater than the threshold, the two texts are judged to be duplicates; otherwise, they are judged not to be duplicates.

Once it is known whether texts are duplicates, the duplicated texts can be handled accordingly, for example deleted, so that the number of duplicated texts is reduced and the efficiency of finding useful information among them is improved.
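Deleting duplicates from a collection could be organized as in the sketch below, which keeps the first text of each duplicate group. Everything here is an illustrative assumption: `remove_duplicates` is a hypothetical helper, the default threshold of 0.5 is the value this embodiment mentions, and `toy_similarity` is a plain word-overlap measure standing in for the keyword-set-frame similarity so that the example is self-contained.

```python
def remove_duplicates(texts, similarity_fn, threshold=0.5):
    """Keep a text only if it is not a duplicate of any text already kept."""
    kept = []
    for text in texts:
        if all(similarity_fn(earlier, text) <= threshold for earlier in kept):
            kept.append(text)
    return kept


def toy_similarity(text_a, text_b):
    """Stand-in measure (shared words / all words); NOT the patent's similarity."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Hypothetical corpus: the second text nearly repeats the first.
corpus = [
    "stock prices rise sharply",
    "stock prices rise sharply today",
    "rain expected tomorrow",
]
unique = remove_duplicates(corpus, toy_similarity)
```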

It should be noted that, in this embodiment, the similarity threshold is determined from the precision and recall obtained from experimental statistics. Precision reflects what percentage of the results judged to be duplicates are correct; recall reflects what percentage of all expected results (the reference answers) were correctly identified. Experimental analysis shows that, with precision and recall both reasonably good, a similarity threshold of 0.4 to 0.6 is appropriate; this embodiment may use 0.5 as the similarity threshold.
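Choosing a threshold from precision and recall might be explored as follows. The source does not describe its experiments, so the scored pairs and reference answers below are made-up illustrative values, and the function names are hypothetical.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Precision: share of predicted duplicates that are correct.
    Recall: share of true duplicates that were predicted."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


def evaluate_thresholds(scored_pairs, true_pairs, thresholds):
    """Precision and recall for each candidate similarity threshold."""
    results = {}
    for threshold in thresholds:
        predicted = [pair for pair, score in scored_pairs.items() if score > threshold]
        results[threshold] = precision_recall(predicted, true_pairs)
    return results


# Illustrative similarities for four text pairs and the pairs known to be duplicates.
scored_pairs = {("d1", "d2"): 0.8, ("d1", "d3"): 0.45, ("d2", "d3"): 0.2, ("d3", "d4"): 0.55}
true_pairs = [("d1", "d2"), ("d3", "d4")]
results = evaluate_thresholds(scored_pairs, true_pairs, [0.4, 0.5, 0.6])
```

On this toy data a low threshold admits a false duplicate and lowers precision, while a high threshold misses a true duplicate and lowers recall, which is the trade-off the threshold range is meant to balance.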

In summary, the method for detecting duplicated texts adopted by the present invention proceeds as follows. First, the weight of each word is determined from the TF value of the word in the text and from whether the word appears in the title; the top-ranked words by weight are selected as the keyword set of the text, and all words belonging to the keyword set are taken out of the text in order of occurrence to form the keyword set frame of the text. Second, for any two texts whose keyword set frames have been obtained, a matching value representing their matching degree is set up, and it is judged in turn whether each word in the keyword set frame of one text is in the keyword set frame of the other text; when a word belongs to the keyword set frames of both texts and its weights in the two texts match, the matching value is incremented by 1, until the last word in the keyword set frame of the one text has been judged. Third, the similarity of the two texts is obtained from the matching value. Finally, whether the two texts are duplicates is determined from the similarity and a similarity threshold. Because the method compares the keyword set frames of two texts, and a word found in both frames is counted as belonging to both texts only if its weights in the two texts also match, the method can effectively detect texts with duplicated information. The duplicated texts among a large number of texts can then be deleted, which improves the efficiency of searching for useful information among them.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for detecting duplicated texts, characterized in that the method comprises:

For each text:

2. The method according to claim 1, characterized in that the TF value of each word in a text is obtained by dividing the total number of times this word occurs in the text by the total number of occurrences of all words in the text.

3. The method according to claim 1, characterized in that adjusting the weights of the words that appear in the title of the text according to the number of sentences in the text comprises:

4. The method according to claim 3, characterized in that taking a certain number of top-ranked words as the keyword set of the text comprises: choosing the number of words according to the length of the text; when the text has 320 words or fewer, choosing the 6 top-ranked words as the keyword set of the text; when the text has more than 320 words, taking one keyword for every 70 to 90 words of text and taking the corresponding number of top-ranked words as the keyword set of the text.

5. The method according to claim 1, characterized in that judging whether the weights of this word in text A and text B match comprises: when the difference between the weights of this word in the two texts is less than 1/4 of the total number of sentences in one of the texts, judging that the weights of this word in the two texts match; otherwise, judging that they do not match.

6. The method according to claim 1, characterized in that the similarity threshold is determined according to the precision and recall obtained from experimental statistics.

7. The method according to claim 6, characterized in that the similarity threshold is 0.4 to 0.6.