CN102650986A

CN102650986A - Synonym expansion method and device both used for text duplication detection

Info

Publication number: CN102650986A
Application number: CN2011100462577A
Authority: CN
Inventors: 孙星明
Original assignee: Individual
Current assignee: Individual
Priority date: 2011-02-27
Filing date: 2011-02-27
Publication date: 2012-08-29

Abstract

The invention discloses a synonym expansion method and device for text duplication detection. A text preprocessing unit deletes stop words from a suspected text and tags parts of speech, taking verbs, nouns, and adjectives as the objects to be processed. The synonyms of each single word are retrieved and their Cartesian product is computed to obtain the initial expansion set of all word collocations in the suspected text. The initial expansion set is then compared against an actual corpus to filter out word collocations that cannot occur in a real language environment, reducing the set to the final expansion set. During duplication detection, words are given different weights according to the collocation matching results, and these weights serve as the basis for computing the detection result. Applying the method or device of the embodiment effectively overcomes the problem of synonym replacement in text duplication, with higher efficiency and greatly improved detection accuracy.

Description

A synonym expansion method and device for text copy detection

Technical field

The present invention relates generally to synonym expansion techniques in text copy detection, and in particular to a method and device that prevent the expansion set from growing too large during synonym expansion.

Background art

With the rapid development of computer technology and the Internet, the volume of digital information has grown enormously, and preventing digital information from being illegally copied and distributed has become an urgent problem. Among copies of digital information, text copying is the most common. The purpose of text copy detection is to find the plagiarized parts of a text by comparing the suspicious text against a specified corpus. This comparison works well for text that has been copied directly, but it is powerless against synonym replacement in the text. To address this, some copy detection methods have introduced synonym expansion.

Synonym expansion relies on a semantic dictionary, which contains rich semantic information from which the taxonomic and similarity relations between words can be obtained. The usual approach is to look up the word to be expanded in a synonym dictionary and obtain an expansion set for it. This set contains all words that are semantically close to the word being expanded. In text copy detection, the words in these expansion sets can be used in the comparison between texts, which helps to some extent in detecting text in which synonyms have been replaced.

The drawback of this kind of synonym expansion is that the resulting expansion set is usually large. If every word in the text under detection is expanded in this way, the number of words to be checked becomes so large that it reduces the efficiency, and even the accuracy, of detection. Moreover, given the context in real language use, most words in the expansion set would make a sentence incoherent or change its meaning if used to copy text, so a plagiarist would not adopt them. The key problem is therefore how to filter out the words in the expansion set that have no detection value.

Summary of the invention

In view of this, the embodiment of the invention provides an effective synonym expansion method that filters the expansion set of each word according to the word's context in the text and uses the filtered expansion set for text copy detection. The method overcomes the problem that an oversized expansion set reduces detection efficiency and detection accuracy during synonym expansion.

The embodiment of the invention is realized through the following technical scheme:

Preprocess the text;

Obtain, through a semantic dictionary, the initial expansion set of each word to be expanded;

Filter the initial expansion set against a real text corpus, taking into account the context of the word to be expanded in the text under detection;

Compute weights for the synonym collocations according to the matching results during copy detection.

The embodiment of the invention also provides a synonym expansion device for text copy detection, comprising a text preprocessing module, an initial expansion set acquisition module, a filtering module, and a weight calculation module. Wherein:

The text preprocessing module filters the stop words out of the text under detection, obtains the words to be expanded, and tags the verbs, nouns, and adjectives;

The initial expansion set acquisition module obtains, for each word to be expanded, the corresponding initial expansion set through a semantic dictionary;

The filtering module obtains, from the preprocessed text, the context relation (bigram) of each word to be expanded, and obtains all of its possible expanded collocations by computing the intersection of the initial expansion sets of the words in the bigram. It then filters the expanded collocations against a text corpus to obtain the final expansion set;

The weight calculation module assigns different weights to the resulting final expansion set according to the matching results when text copy detection is carried out.

As can be seen from the technical scheme above, the embodiment of the invention takes the context of real language use into account when expanding words and screens out expanded words that form no synonym collocation, so that the final expansion set contains only synonym collocations that can occur in real language. This effectively improves the efficiency of copy detection and improves the effect of synonym expansion on detection accuracy.

Description of drawings

Fig. 1 is a flowchart of text preprocessing in the embodiment of the invention.

Fig. 2 is a diagram of the computation of the initial expansion set in the embodiment of the invention.

Fig. 3 is a diagram of the computation of the final expansion set in the embodiment of the invention.

Detailed description of the embodiment

To make the object, technical scheme, and advantages of the invention clearer, the technical scheme proposed by the embodiment is described in detail below with reference to the accompanying drawings.

The first stage of the embodiment is text preprocessing, which, with reference to Fig. 1, comprises the following steps:

Step 1: Segment the suspicious text into words using an existing natural language processing tool.

Step 2: Delete the stop words from the suspicious text using a stop-word list.

Step 3: Tag the verbs, nouns, and adjectives in the resulting text using an existing natural language processing tool.

Applying the above preprocessing steps to a given suspicious text yields the preprocessed text.
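The three preprocessing steps can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the stop-word list `STOP_WORDS`, the toy part-of-speech table `POS_LEXICON`, and whitespace segmentation are all assumptions standing in for the "existing natural language processing tool" the text mentions.

```python
# Toy resources: both tables are illustrative assumptions, not from the patent.
STOP_WORDS = {"the", "a", "an", "of", "to", "is"}
POS_LEXICON = {
    "quick": "ADJ", "lazy": "ADJ",
    "fox": "NOUN", "dog": "NOUN",
    "jumps": "VERB", "over": "ADP",
}
CONTENT_TAGS = {"VERB", "NOUN", "ADJ"}


def preprocess(text):
    """Return (word, tag) pairs for the verbs, nouns and adjectives in text."""
    # Step 1: word segmentation (a whitespace split stands in for a real segmenter).
    tokens = text.lower().split()
    # Step 2: delete stop words using the stop-word list.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 3: tag parts of speech; unknown words get the placeholder tag "OTHER".
    tagged = [(t, POS_LEXICON.get(t, "OTHER")) for t in tokens]
    # Keep only verbs, nouns and adjectives as the words to be expanded.
    return [(word, tag) for word, tag in tagged if tag in CONTENT_TAGS]
```

In practice the lookup table would be replaced by a real segmenter and tagger; the sketch only shows the order of operations.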

With reference to Fig. 2, synonym expansion is carried out on the preprocessed text. Because contextual information must be introduced in this process, the unit of expansion here is the bigram extracted from that text.

Step 1: Split the preprocessed text into bigrams, obtaining the bigrams it contains.

Step 2: For a given bigram, expand each of its two words through the semantic dictionary, obtaining the synonym set of each word.

Step 3: Compute the Cartesian product of the two synonym sets to obtain the initial expansion set of the bigram.

Characteristics of the initial expansion set:

1. Expansion takes the bigram as its basic unit, so the context in which a word occurs is taken into account.

2. The set obtained from the Cartesian product contains all synonym collocations of the two adjacent words.
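The bigram splitting and Cartesian product described above can be illustrated with a short sketch. The synonym table `SYNONYMS` is a hypothetical stand-in for a semantic dictionary, and including the original word in its own synonym set is an assumption made here so that the unmodified collocation stays in the expansion set.

```python
from itertools import product

# Hypothetical synonym table standing in for a semantic dictionary.
SYNONYMS = {
    "big": {"large", "huge"},
    "house": {"home", "building"},
}


def bigrams(words):
    """Split a word sequence into pairs of adjacent words."""
    return list(zip(words, words[1:]))


def synonym_set(word):
    """The word itself plus its dictionary synonyms (assumption: self included)."""
    return {word} | SYNONYMS.get(word, set())


def initial_expansion(bigram):
    """Cartesian product of the synonym sets of the bigram's two words."""
    first, second = bigram
    return set(product(synonym_set(first), synonym_set(second)))
```

With three candidates for each word, `initial_expansion(("big", "house"))` holds nine collocations, which shows how quickly the initial set grows before any filtering.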

With reference to Fig. 3, the word collocations that cannot occur in real language are deleted from the initial expansion set to obtain the final expansion set. The generation rules are as follows:

Step 1: Split the given corpus into bigrams, obtaining the set of corpus bigrams.

Step 2: Build a bigram index over that set.

Step 3: For each bigram in the initial expansion set, query the index. If the bigram is present in the index, keep it; otherwise delete it from the set.

Step 4: Repeat Step 3 until all bigrams in the initial expansion set have been processed, which yields the final expansion set.
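A minimal sketch of this corpus-based filtering is given below. It assumes the corpus is a list of whitespace-separated sentences and uses a plain Python set as the bigram index; the function names are illustrative and a production system would likely use a persistent index instead.

```python
def build_bigram_index(corpus_sentences):
    """Steps 1-2: split each corpus sentence into bigrams and index them in a set."""
    index = set()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        index.update(zip(words, words[1:]))
    return index


def filter_expansion(initial_set, index):
    """Steps 3-4: keep only the collocations that occur in the corpus index."""
    return {pair for pair in initial_set if pair in index}
```

Because set membership is a constant-time lookup on average, filtering costs roughly one query per collocation in the initial expansion set.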

Filtering the initial expansion set with the above steps has the following advantage:

The initial expansion set is obtained by computing the Cartesian product of the synonym sets of the words in a bigram, so it contains too many collocations, and the overwhelming majority of them do not occur in real language. Comparing the set against a real corpus filters out most of these collocations. The final expansion set contains all synonym collocations that do occur in real language, and it is much smaller than the initial expansion set, so the efficiency of copy detection improves without affecting its accuracy.

For the synonym collocations in the final expansion set, the weight depends on the matching result during copy detection. If the suspicious text and the target text contain an identical bigram, the weight takes the maximum value, 2. If the match is only partial or there is no match, the calculation rules are as follows:

Step 1: For a word collocation in the suspicious text, if the same word collocation also exists in the target text, the weight is 2.

Step 2: If the word collocation does not exist in the target text but one of its two words does, the weight is 1.

Step 3: If neither the word collocation nor either of its words exists in the target text, but a collocation from its expansion set does, the weight is computed from the number of expanded word collocations in the set.

Characteristics: the final copy detection takes full account of both direct copying and synonym replacement of words, and assigns different weights to the different cases. A direct bigram match has the highest weight and a partial match the next highest. If there is neither a direct nor a partial match, the probability of synonym replacement is computed from the size of the word's expansion set and used as the weight of the match.
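The weighting rules can be sketched as a single function. The weights 2 and 1 come from the text above. The exact Step 3 formula is not recoverable from this text, so the sketch assumes a weight of 1/n, where n is the number of collocations in the final expansion set, and a weight of 0 when nothing matches; both are labeled assumptions, as are all function and parameter names.

```python
def collocation_weight(bigram, final_set, target_words, target_bigrams):
    """Weight of one suspicious-text bigram against a target text.

    final_set is the final expansion set of the bigram; target_words and
    target_bigrams are the sets of words and bigrams found in the target text.
    """
    first, second = bigram
    # Step 1: the identical collocation exists in the target text.
    if bigram in target_bigrams:
        return 2.0
    # Step 2: the collocation is absent but one of its words is present.
    if first in target_words or second in target_words:
        return 1.0
    # Step 3: only an expanded (synonym) collocation is present.
    if any(pair in target_bigrams for pair in final_set):
        # ASSUMPTION: 1/n stands in for the formula missing from the source.
        return 1.0 / len(final_set)
    # ASSUMPTION: no match of any kind contributes nothing.
    return 0.0
```

Under the 1/n assumption, a bigram with few plausible synonym collocations earns a higher weight for a synonym match than one with many, which agrees with the description of the weight as a replacement probability derived from the expansion set size.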

In summary, the embodiment of the invention provides a synonym expansion method for text copy detection. Unlike ordinary synonym expansion, this method considers not only the expansion of words but also the context in which each word occurs during expansion.

The above is merely a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.

Claims

1. A synonym expansion method and device for text copy detection, characterized by comprising: a text preprocessing module, which filters the stop words out of the text under detection, obtains the words to be expanded, and tags the verbs, nouns, and adjectives; an initial expansion set acquisition module, which obtains, for each word to be expanded, the corresponding initial expansion set through a semantic dictionary; a filtering module, which obtains, from the preprocessed text, the context relation (bigram) of each word to be expanded, obtains all of its possible expanded collocations by computing the intersection of the initial expansion sets of the words in the bigram, and filters the expanded collocations against a text corpus to obtain the final expansion set; and a weight calculation module, which assigns different weights to the resulting final expansion set according to the matching results when text copy detection is carried out.

2. The text preprocessing unit of claim 1, characterized in that the Cartesian product is computed from the synonym sets of the words in each bigram obtained by splitting, yielding the expansion set of word collocations.

3. The filtering unit of claim 1, characterized in that all word collocations in the initial expansion set are filtered against a real corpus to exclude the word collocations that cannot occur in real language.

4. The weight calculation unit of claim 1, characterized in that, according to the matching results in copy detection, a successful match of the original word collocation is given the highest weight; a partial match of the original words is given the next highest weight; and for words that cannot match the original set but can match the expansion set, a probability is computed from the size of the expansion set and used as the weight.