CN102650986A - Synonym expansion method and device both used for text duplication detection - Google Patents

Synonym expansion method and device both used for text duplication detection

Info

Publication number
CN102650986A
Authority
CN
China
Prior art keywords
text
vocabulary
collocation
synonym
expansion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100462577A
Other languages
Chinese (zh)
Inventor
孙星明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN2011100462577A
Publication of CN102650986A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a synonym expansion method and a synonym expansion device both used for text duplication detection, which include a text preprocessing unit used for deleting stop words in a suspected text and tagging the part-of-speech, wherein verbs, nouns, and adjectives are taken as the to-be-processed objects; through retrieving synonyms of single words, computing the Cartesian product and obtaining the initial expansion set of all word collocations in the suspected text; through comparing the initial expansion set and an actual corpus, filtering word collocations impossible in an actual language environment, simplifying the set, and obtaining the final expansion set; and during the duplication detection, according to different collocation results, giving the words different weights which are taken as the computation base for the duplication detection results. Through applying the method or the device disclosed by the embodiment of the invention, the problem of synonym replacement in text duplication can be efficiently overcome, the efficiency is higher, and the accuracy of the duplication detection is greatly improved.

Description

A synonym expansion method and device for text copy detection
Technical field
The present invention relates generally to synonym expansion techniques in text copy detection, and in particular to a method and device for preventing the expansion set from growing too large during synonym expansion.
Background
With the rapid development of computer technology and the Internet, the volume of digital information has grown enormously, and preventing the illegal copying and dissemination of digital information has become an urgent problem. Among the forms of digital copying, text copying is the most common. The purpose of text copy detection is to find the plagiarized parts of a text by comparing a suspicious text against a specified corpus. This comparison works well for directly copied text, but it is powerless against synonym replacement within the text. To address this, some copy detection methods have introduced synonym expansion.
Synonym expansion relies on a semantic dictionary, which contains rich semantic information from which the taxonomic and similarity relations between words can be obtained. The usual approach is to look up the word to be expanded in a synonym dictionary and obtain an expansion set for it. This set contains all words that are semantically close to the word being expanded. In text copy detection, the words in these expansion sets can be used when comparing texts, which helps to some extent in detecting text in which synonyms have been substituted.
The drawback of this kind of synonym expansion is that the resulting expansion set is usually large. If every word in the text under detection is expanded in this way, the number of words to check becomes excessive, which hurts detection efficiency and even accuracy. Moreover, given the context found in real language use, most words in the expansion set would, if used to copy the text, make the sentence incoherent or change its meaning, and so would not be adopted by a plagiarist. The key problem is therefore how to filter out the words in the expansion set that have no detection value.
Summary of the invention
In view of this, the embodiment of the invention provides an effective synonym expansion method that uses the context within the text to filter the expansion set of each word, and uses the filtered expansion set for text copy detection. The method overcomes the problem that, in synonym expansion, an oversized expansion set degrades detection efficiency and detection accuracy.
The embodiment of the invention is realized through the following technical scheme:
Preprocessing the text;
Obtaining, through a semantic dictionary, the initial expansion set of each word to be expanded;
Filtering the initial expansion set against a real text corpus, taking into account the context of the word to be expanded within the text under detection;
Calculating weights for the synonym collocations according to the match conditions in copy detection.
 
The embodiment of the invention also provides a synonym expansion device for text copy detection, comprising a text preprocessing module, an initial expansion set acquisition module, a filtering module, and a weight calculation module, wherein:
The text preprocessing module is used to filter the stop words out of the text under detection, obtain the words to be expanded, and tag verbs, nouns and adjectives;
The initial expansion set acquisition module obtains, for each word to be expanded, the corresponding initial expansion set through a semantic dictionary;
The filtering module obtains from the preprocessed text the context relation (bigram) of each word to be expanded, and obtains all of its possible expanded collocations by combining the initial expansion sets of the two words in the bigram. It then filters the expanded collocations against a text corpus to obtain the final expansion set;
The weight calculation module assigns different weights to the resulting final expansion set according to the match conditions when text copy detection is performed.
 
As the technical scheme above shows, the embodiment of the invention takes the contextual relations of a real language environment into account when expanding words, and screens out expanded words that do not form a synonym collocation. The final expansion set therefore contains only synonym collocations that can occur in a real language environment. This effectively improves the efficiency of copy detection and improves the effect of synonym expansion on copy detection accuracy.
    
Description of drawings
Fig. 1 is a flowchart of text preprocessing in an embodiment of the invention.
Fig. 2 is a diagram of the computation of the initial expansion set in an embodiment of the invention.
Fig. 3 is a diagram of the computation of the final expansion set in an embodiment of the invention.
Embodiment
To make the purpose, technical scheme and advantages of the invention clearer, the technical scheme proposed by the embodiment of the invention is described in detail below with reference to the accompanying drawings.
The first step of the embodiment of the invention is text preprocessing, which, with reference to Fig. 1, comprises the following steps:
Step 1: Segment the suspicious text into words using an existing natural language processing tool.
Step 2: Delete the stop words in the suspicious text using a stop-word list.
Step 3: Using an existing natural language processing tool, tag the verbs, nouns and adjectives in the text produced by the preceding steps.
Applying these preprocessing steps to a given suspicious text yields the preprocessed text.
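The three preprocessing steps can be illustrated with a minimal Python sketch. It is not the tool chain used by the invention, which only refers to an existing natural language processing tool. The whitespace tokenizer, the `STOP_WORDS` list, the `POS_LEXICON` table and the function names are all hypothetical stand-ins chosen so that the example runs without external libraries.

```python
# Hypothetical toy resources; a real system would use a word segmenter,
# a full stop-word list and a part-of-speech tagger from an NLP toolkit.
STOP_WORDS = {"the", "a", "an", "of", "is", "and"}
POS_LEXICON = {
    "quick": "ADJ", "lazy": "ADJ",
    "fox": "NOUN", "dog": "NOUN",
    "jumps": "VERB",
}
CONTENT_TAGS = {"VERB", "NOUN", "ADJ"}


def preprocess(text):
    """Return (word, tag) pairs after segmentation, stop-word removal and tagging."""
    # Step 1: segmentation (whitespace split stands in for a real segmenter).
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    # Step 2: delete stop words (and tokens left empty by punctuation stripping).
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    # Step 3: tag each remaining word; unknown words get the tag "OTHER".
    return [(t, POS_LEXICON.get(t, "OTHER")) for t in tokens]


def content_words(tagged):
    """Keep only the verbs, nouns and adjectives, the words to be expanded."""
    return [word for word, tag in tagged if tag in CONTENT_TAGS]


tagged = preprocess("The quick fox jumps over the lazy dog.")
words = content_words(tagged)
```

Here `words` holds the verbs, nouns and adjectives that later steps would treat as candidates for expansion.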
 
With reference to Fig. 2, synonym expansion is then performed on the preprocessed text. Because contextual information must be introduced in this process, the unit of expansion here is the bigram extracted from the text.
Step 1: Split the preprocessed text into bigrams, obtaining the bigrams it contains.
Step 2: For a given bigram, expand each of its two words through the semantic dictionary, obtaining a synonym set for each word.
Step 3: Compute the Cartesian product of the two synonym sets to obtain the initial expansion set.
 
Characteristics of the initial expansion set:
1. Expansion takes the bigram as its basic unit, so the context in which a word occurs is taken into account.
2. The set obtained from the Cartesian product contains all synonym collocations of the two adjacent words.
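A short Python sketch of the bigram splitting and Cartesian product described above follows. The `SYNONYMS` table is a hypothetical stand-in for the semantic dictionary, and including each original word in its own synonym set is an assumption made here so that the original collocation stays in the result.

```python
from itertools import product

# Hypothetical stand-in for a semantic dictionary lookup.
SYNONYMS = {
    "buy": {"purchase", "acquire"},
    "car": {"automobile", "vehicle"},
}


def bigrams(words):
    """Split a word sequence into adjacent pairs."""
    return list(zip(words, words[1:]))


def initial_expansion(bigram, synonyms):
    """Cartesian product of the synonym sets of the two words in a bigram."""
    w1, w2 = bigram
    # Assumption: each word counts as a synonym of itself.
    s1 = {w1} | synonyms.get(w1, set())
    s2 = {w2} | synonyms.get(w2, set())
    return set(product(s1, s2))


pairs = bigrams(["buy", "car", "today"])
expanded = initial_expansion(("buy", "car"), SYNONYMS)
```

With three candidates for each word, the initial expansion set of this single bigram already holds nine collocations, which shows how quickly the set grows before filtering.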
 
With reference to Fig. 3, word collocations that cannot occur in a real language environment are deleted from the set to obtain the final expansion set. The generation rules are as follows:
Step 1: Split the given corpus into bigrams, obtaining a set of bigrams.
Step 2: Build a bigram index over that set.
Step 3: For each bigram in the initial expansion set, query the index. If the bigram is present in the index, keep it; otherwise delete it from the set.
Step 4: Repeat Step 3 until all bigrams in the set have been processed, yielding the final expansion set.
 
Filtering the initial expansion set with the above steps has the following advantage:
The initial expansion set is obtained as the Cartesian product of the synonym sets of the words in a bigram, so it contains too many collocations, the overwhelming majority of which do not occur in a real language environment. Comparing it against a real corpus filters out most of these collocations. The final expansion set contains only synonym collocations that exist in a real language environment and is much smaller than the initial expansion set, which improves the efficiency of copy detection without affecting its accuracy.
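The corpus filtering can be sketched in Python as follows. A plain `set` plays the role of the bigram index here, which is an assumption since the text does not specify the index structure, and the corpus is assumed to be already segmented into word lists. All names are hypothetical.

```python
def build_bigram_index(corpus_sentences):
    """Steps 1 and 2: collect every adjacent word pair of the corpus in an index."""
    index = set()
    for words in corpus_sentences:
        index.update(zip(words, words[1:]))
    return index


def filter_expansion(initial_set, index):
    """Steps 3 and 4: keep only the collocations that occur in the corpus."""
    return {pair for pair in initial_set if pair in index}


# Toy corpus, already segmented into words.
corpus = [
    ["i", "purchase", "vehicle", "today"],
    ["they", "buy", "car"],
]
index = build_bigram_index(corpus)
initial = {("buy", "car"), ("purchase", "vehicle"), ("acquire", "automobile")}
final = filter_expansion(initial, index)
```

The collocation that never occurs in the corpus is dropped, so `final` is smaller than `initial` while still covering the attested synonym collocations.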
 
For the synonym collocations in the final expansion set, the weight depends on the match conditions found during copy detection. If the suspicious text and the target text contain an identical bigram, the weight takes its maximum value of 2. If the match is only partial or there is no direct match, the calculation rules are as follows:
Step 1: For a word collocation in the suspicious text, if the same collocation also exists in the target text, the weight is 2.
Step 2: If the collocation does not exist in the target text, but either of its two words does, the weight is 1.
Step 3: If the collocation does not exist in the target text and neither of its two words exists there either, but a collocation from its expansion set does exist in the target text, the weight is computed from the number of expanded collocations in the set (the formula itself is missing from this text).
Characteristics: the final copy detection takes full account of both direct copying and synonym replacement, and assigns different weights to the different cases. A direct bigram match receives the highest weight and a partial match the next highest. If there is neither a direct nor a partial match, the probability of synonym replacement is calculated from the size of the expansion set and used as the match weight.
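The weighting rules can be sketched in Python as below. The weights 2 and 1 come from the steps above. The formula for the third case is not reproduced in this text, so the value `1 / n`, where `n` is the size of the final expansion set, is purely an assumption for illustration, as is the weight 0 when nothing matches. All function and parameter names are hypothetical.

```python
def collocation_weight(bigram, final_set, target_bigrams, target_words):
    """Weight of one suspicious-text collocation against a target text."""
    w1, w2 = bigram
    # Step 1: the same collocation occurs in the target text.
    if bigram in target_bigrams:
        return 2.0
    # Step 2: only one of the two words occurs in the target text.
    if w1 in target_words or w2 in target_words:
        return 1.0
    # Step 3: an expanded collocation occurs in the target text.
    if any(pair in target_bigrams for pair in final_set):
        # ASSUMPTION: 1/n stands in for the formula missing from the text.
        return 1.0 / len(final_set)
    # ASSUMPTION: no match of any kind contributes nothing.
    return 0.0


def weight_against(bigram, final_set, target):
    """Convenience wrapper taking the target text as a list of words."""
    return collocation_weight(
        bigram, final_set, set(zip(target, target[1:])), set(target)
    )


final_set = {("buy", "car"), ("purchase", "vehicle")}
direct = weight_against(("buy", "car"), final_set, ["they", "buy", "car"])
partial = weight_against(("buy", "car"), final_set, ["they", "buy", "bread"])
synonym = weight_against(("buy", "car"), final_set, ["he", "will", "purchase", "vehicle"])
```

Under these assumptions a larger expansion set yields a smaller weight for a synonym-only match, which matches the idea of treating the weight as a replacement probability.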
 
In summary, the embodiment of the invention provides a synonym expansion method for text copy detection. Unlike ordinary synonym expansion, this method considers not only the expansion of each word but also the context in which the word occurs.
The above is merely a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.

Claims (4)

1. A synonym expansion method and device for text copy detection, characterized by comprising: a text preprocessing module, used to filter the stop words out of the text under detection, obtain the words to be expanded, and tag verbs, nouns and adjectives; an initial expansion set acquisition module, which obtains, for each word to be expanded, the corresponding initial expansion set through a semantic dictionary; a filtering module, which obtains from the preprocessed text the context relation (bigram) of each word to be expanded, obtains all of its possible expanded collocations by combining the initial expansion sets of the words in the bigram, and filters the expanded collocations against a text corpus to obtain the final expansion set; and a weight calculation module, which assigns different weights to the resulting final expansion set according to the match conditions when text copy detection is performed.
2. The text preprocessing unit according to claim 1, characterized in that the Cartesian product is computed from the synonym sets of the words in each bigram obtained by splitting, yielding the expansion set of word collocations.
3. The filtering unit according to claim 1, characterized in that all word collocations in the initial expansion set are filtered against a real corpus, excluding those that cannot occur in a real language environment.
4. The weight calculation unit according to claim 1, characterized in that, according to the match conditions in copy detection, the highest weight is given when the original words are matched successfully, a lower weight is given when the original words are partially matched, and, for words that cannot match the original set but can match the expansion set, a probability is computed from the size of the expansion set and used as the weight.
CN2011100462577A 2011-02-27 2011-02-27 Synonym expansion method and device both used for text duplication detection Pending CN102650986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100462577A CN102650986A (en) 2011-02-27 2011-02-27 Synonym expansion method and device both used for text duplication detection

Publications (1)

Publication Number Publication Date
CN102650986A true CN102650986A (en) 2012-08-29

Family

ID=46692994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100462577A Pending CN102650986A (en) 2011-02-27 2011-02-27 Synonym expansion method and device both used for text duplication detection

Country Status (1)

Country Link
CN (1) CN102650986A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1529263A (en) * 2003-09-18 2004-09-15 北京邮电大学 Chinese text auto-segmenting and text plagiarism discrimination device and method
CN101404037A (en) * 2008-11-18 2009-04-08 西安交通大学 Method for detecting and positioning electronic text contents plagiary
CN101878476A (en) * 2007-06-22 2010-11-03 谷歌公司 Machine translation for query expansion
CN201654778U (en) * 2009-04-22 2010-11-24 同方知网(北京)技术有限公司 Text copying detecting device
CN101980196A (en) * 2010-10-25 2011-02-23 中国农业大学 Article comparison method and device
US20110047149A1 (en) * 2009-08-21 2011-02-24 Vaeaenaenen Mikko Method and means for data searching and language translation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李旭 等: "基于指纹和语义特征的文档复制检测方法", 《燕山大学学报》 *
甘灿 等: "一种改进的基于同义词替换的中文文本信息隐藏方法", 《东南大学学报(自然科学版)》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530345A (en) * 2013-10-08 2014-01-22 北京百度网讯科技有限公司 Short text characteristic extension and fitting characteristic library building method and device
CN105095222A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Unit word replacing method, search method and replacing apparatus
CN105095222B (en) * 2014-04-25 2019-10-15 阿里巴巴集团控股有限公司 Uniterm replacement method, searching method and device
US20160055145A1 (en) * 2014-08-19 2016-02-25 Sandeep Chauhan Essay manager and automated plagiarism detector
DE102014114845A1 (en) 2014-10-14 2016-04-14 Deutsche Telekom Ag Method for interpreting automatic speech recognition
EP3010014A1 (en) 2014-10-14 2016-04-20 Deutsche Telekom AG Method for interpretation of automatic speech recognition
CN105159931B (en) * 2015-08-06 2018-06-22 上海智臻智能网络科技股份有限公司 For generating the method and apparatus of synonym
CN105159931A (en) * 2015-08-06 2015-12-16 上海智臻智能网络科技股份有限公司 Method and apparatus for generating synonyms
CN108090169A (en) * 2017-12-14 2018-05-29 上海智臻智能网络科技股份有限公司 Question sentence extended method and device, storage medium, terminal
CN108491406A (en) * 2018-01-23 2018-09-04 深圳市阿西莫夫科技有限公司 Information classification approach, device, computer equipment and storage medium
CN108491406B (en) * 2018-01-23 2021-09-24 深圳市阿西莫夫科技有限公司 Information classification method and device, computer equipment and storage medium
CN110633372A (en) * 2019-09-23 2019-12-31 珠海格力电器股份有限公司 Text augmentation processing method and device and storage medium
WO2021239114A1 (en) * 2020-05-29 2021-12-02 支付宝(杭州)信息技术有限公司 Method for synonym editing and determining creator of text

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120829