CN102650986A - Synonym expansion method and device both used for text duplication detection - Google Patents
- Publication number
- CN102650986A (application number CN2011100462577A / CN201110046257A)
- Authority
- CN
- China
- Prior art keywords
- text
- vocabulary
- collocation
- synonym
- expansion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a synonym expansion method and device for text duplication detection. A text preprocessing unit deletes the stop words from the suspect text and tags parts of speech, taking verbs, nouns, and adjectives as the objects to be processed. The synonyms of each single word are retrieved and their Cartesian product is computed to obtain the initial expansion set of all word collocations in the suspect text. The initial expansion set is then compared against an actual corpus to filter out word collocations that cannot occur in a real language environment, which shrinks the set and yields the final expansion set. During duplication detection, the words are given different weights according to the collocation matching result, and these weights serve as the basis for computing the detection result. The method or device of the embodiments can effectively handle synonym replacement in text duplication, with higher efficiency and greatly improved detection accuracy.
Description
Technical field
The present invention relates generally to synonym expansion in text duplication detection, and in particular to a method and device that prevent the expansion set from growing too large during synonym expansion.
Background technology
With the rapid development of computer technology and the Internet, the volume of digital information keeps growing, and preventing digital information from being illegally copied and distributed has become an urgent problem. Among these forms of copying, text duplication is the most common. The purpose of text duplication detection is to find the plagiarized parts of a text by comparing the suspect text with a specified corpus. This comparison works well for directly copied text, but it is powerless against synonym replacement in the text. To address this, some duplication detection methods have introduced synonym expansion.
Synonym expansion relies on a semantic dictionary, which contains rich semantic information from which the classification relations and similarity relations between words can be obtained. The usual approach is to look up the word to be expanded in a synonym dictionary and obtain an expansion set for that word. This set contains all words that are semantically close to the word being expanded. In text duplication detection, the words in these expansion sets can be used when comparing texts, which helps to some extent in detecting text in which synonyms have been replaced.
The drawback of this kind of synonym expansion is that the expansion set obtained is usually large. If every word in the text under detection is expanded this way, there are too many words to check, which hurts the efficiency and even the accuracy of detection. Moreover, given the context in real language use, most words in the expansion set would make the sentence incoherent or change its meaning if used for text duplication, so a plagiarist would not use them. How to filter out the words in the expansion set that have no detection value therefore becomes the key problem.
Summary of the invention
In view of this, the embodiments of the invention provide an effective synonym expansion method that filters the expansion set of a word according to the word's context in the text and uses the filtered expansion set for text duplication detection. The method overcomes the problem that, in synonym expansion, an oversized expansion set degrades detection efficiency and detection accuracy.
The embodiments of the invention are realized through the following technical scheme:
Preprocessing the text;
Obtaining, through a semantic dictionary, the initial expansion set of each word to be expanded;
Filtering the initial expansion set against a real text corpus, taking into account the context of each word to be expanded in the text under detection;
Computing weights for the synonym collocations according to how they match during duplication detection.
The embodiments of the invention also provide a synonym expansion device for text duplication detection, comprising a text preprocessing module, an initial expansion set acquisition module, a filtering module, and a weight calculation module, wherein:
The text preprocessing module filters the stop words out of the text under detection, obtains the words to be expanded, and marks the verbs, nouns, and adjectives;
The initial expansion set acquisition module obtains, for each word to be expanded, the corresponding initial expansion set through the semantic dictionary;
The filtering module obtains, from the preprocessed text, the context relation (bigram) of each word to be expanded, combines the initial expansion sets of the two words in each bigram to obtain all of its possible expanded collocations, and then filters those collocations against a text corpus to obtain the final expansion set;
The weight calculation module assigns different weights to the collocations in the final expansion set according to how they match when text duplication detection is carried out.
As the technical scheme above shows, the embodiments of the invention take the context relation in a real language environment into account when expanding words and screen out expanded words that have no valid synonym collocation, so the final expansion set contains only synonym collocations that can occur in a real language environment. This effectively improves the efficiency of duplication detection and reduces the negative effect of synonym expansion on detection accuracy.
Description of drawings
Fig. 1 is a flow chart of text preprocessing in an embodiment of the invention.
Fig. 2 is a diagram of the initial expansion set computation in an embodiment of the invention.
Fig. 3 is a diagram of the final expansion set computation in an embodiment of the invention.
Embodiment
To make the purpose, technical scheme, and advantages of the invention clearer, the technical scheme proposed by the embodiments is described in detail below with reference to the accompanying drawings.
The first stage of the embodiment is text preprocessing, which, with reference to Fig. 1, comprises the following steps:
Step 1: Segment the suspect text into words using an existing natural language processing tool.
Step 2: Delete the stop words from the suspect text using a stop-word list.
Step 3: Mark the verbs, nouns, and adjectives in the resulting text using an existing natural language processing tool.
Applying these preprocessing steps to a given suspect text yields the preprocessed text.
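The three preprocessing steps can be sketched as follows. This is only a minimal illustration: the patent refers to "existing natural language processing tools" without naming them, so the whitespace tokenizer, the `STOP_WORDS` list, and the `POS_LEXICON` lookup table below are hypothetical stand-ins rather than the tools the authors used.

```python
# Hypothetical resources: a real system would use a word segmenter, a
# stop-word list, and a POS tagger. These tiny tables are assumptions.
STOP_WORDS = {"the", "a", "of", "is"}
POS_LEXICON = {
    "quick": "ADJ", "fox": "NOUN", "jumps": "VERB",
    "over": "ADP", "lazy": "ADJ", "dog": "NOUN",
}
CONTENT_TAGS = {"VERB", "NOUN", "ADJ"}


def preprocess(text):
    """Return (token, tag) pairs; tag is None for words that are not marked."""
    # Step 1: segmentation (whitespace splitting stands in for a segmenter).
    tokens = text.lower().split()
    # Step 2: delete stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 3: mark verbs, nouns, and adjectives; leave other words unmarked.
    tagged = []
    for t in tokens:
        tag = POS_LEXICON.get(t)
        tagged.append((t, tag if tag in CONTENT_TAGS else None))
    return tagged


result = preprocess("The quick fox jumps over the lazy dog")
```

Only the words carrying a tag would be candidates for synonym expansion in the later stages.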
With reference to Fig. 2, synonym expansion is then carried out on the preprocessed text. Because contextual information has to be introduced in this process, the unit of expansion here is the bigram extracted from that text.
Step 1: Split the preprocessed text into bigrams to obtain the set of bigrams it contains.
Step 2: For a given bigram, expand each of its two words through the semantic dictionary to obtain a synonym set for each word.
Step 3: Compute the Cartesian product of the two synonym sets to obtain the initial expansion set.
Characteristics of the initial expansion set:
1. Expansion takes the bigram as its basic unit, so the context in which a word occurs is taken into account.
2. The set produced by the Cartesian product contains all synonym collocations of the two adjacent words.
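The bigram splitting and Cartesian product steps above can be sketched as follows. The `SYNONYMS` table is a hypothetical stand-in for the semantic dictionary, and including the original word in its own synonym set is an assumption made here so that partially replaced collocations are also covered.

```python
from itertools import product

# Hypothetical synonym dictionary standing in for the semantic dictionary.
SYNONYMS = {
    "big": ["large", "huge"],
    "house": ["home", "building"],
}


def bigrams(tokens):
    """Split a token sequence into adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))


def synonym_set(word):
    """Synonyms of a word; the word itself is included (an assumption)."""
    return [word] + SYNONYMS.get(word, [])


def initial_expansion(bigram):
    """Cartesian product of the synonym sets of the two words in a bigram."""
    first, second = bigram
    return set(product(synonym_set(first), synonym_set(second)))


pairs = bigrams(["big", "house"])
expansion = initial_expansion(pairs[0])
```

With three candidates for each word, the single bigram expands to nine collocations, which shows how quickly the initial set grows before any filtering.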
With reference to Fig. 3, the word collocations that cannot occur in a real language environment are deleted from the set to obtain the final expansion set. The rules for generating it are as follows:
Step 1: Split the given corpus into bigrams to obtain the set of corpus bigrams.
Step 2: Build a bigram index over that set.
Step 3: For each bigram in the initial expansion set, query the index. If the bigram is present in the index, keep it; otherwise delete it from the initial expansion set.
Step 4: Repeat Step 3 until all bigrams in the initial expansion set have been processed; the result is the final expansion set.
Filtering the initial expansion set with the above steps has the following advantage:
The initial expansion set is obtained by computing the Cartesian product of the synonym sets of the words in a bigram, so it contains too many collocations, the overwhelming majority of which do not occur in a real language environment. Comparing it against a real corpus filters out most of these collocations. The final expansion set contains all the synonym collocations that do occur in a real language environment and is much smaller than the initial expansion set, which improves the efficiency of duplication detection without affecting its accuracy.
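The corpus filtering steps can be sketched as follows. A Python `set` serves as the bigram index here, which is an assumption: the patent does not say what index structure is used. The corpus sentences and function names are hypothetical.

```python
def build_bigram_index(corpus_sentences):
    """Steps 1-2: split each corpus sentence into bigrams and index them."""
    index = set()
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()
        index.update(zip(tokens, tokens[1:]))
    return index


def filter_expansion(initial_expansion, index):
    """Steps 3-4: keep only the collocations that occur in the corpus index."""
    return {pair for pair in initial_expansion if pair in index}


# Hypothetical corpus and initial expansion set for illustration.
corpus = [
    "They bought a large home",
    "A huge building stood there",
    "The big house",
]
index = build_bigram_index(corpus)
initial = {
    ("big", "house"), ("large", "home"), ("huge", "home"),
    ("huge", "building"), ("large", "building"),
}
final = filter_expansion(initial, index)
```

In practice the corpus would go through the same preprocessing as the suspect text, so that the bigrams on both sides are comparable.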
For the synonym collocations in the final expansion set, the weight depends on how they match during duplication detection. If the suspect text and the target text contain an identical bigram, the weight takes its maximum value, 2. If the match is only partial or there is no direct match at all, the weight is computed by the following rules:
Step 1: For a word collocation in the suspect text, if the same collocation also occurs in the target text, the weight is 2.
Step 2: If the collocation does not occur in the target text but one of its two words does, the weight is 1.
Step 3: If the collocation does not occur in the target text and neither of its words does, but one of its expanded collocations occurs in the target text, the weight is computed from the number of expanded collocations in the final expansion set.
Characteristics: the final duplication detection takes both direct copying of words and synonym replacement fully into account and assigns different weights to the different cases. A direct bigram match has the highest weight and a partial match the next highest; if there is neither a direct nor a partial match, the probability of a synonym replacement is computed from the size of the word's expansion set and used as the match weight.
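The weighting rules can be sketched as follows. The formula for the third case is not recoverable from this text, which only says that the weight is a probability derived from the size of the expansion set; the value `1 / n` used below is therefore an assumption, as are the fallback weight of 0 and all function and parameter names.

```python
def collocation_weight(bigram, final_expansion, target_bigrams, target_words):
    """Weight of one suspect-text collocation against a target text."""
    first, second = bigram
    # Step 1: the identical collocation occurs in the target text.
    if bigram in target_bigrams:
        return 2.0
    # Step 2: partial match, one of the two words occurs in the target text.
    if first in target_words or second in target_words:
        return 1.0
    # Step 3: only an expanded collocation occurs in the target text.
    n = len(final_expansion)
    if n and any(pair in target_bigrams for pair in final_expansion):
        return 1.0 / n  # ASSUMPTION: the source does not give this formula
    # No match of any kind (fallback value is an assumption).
    return 0.0


# Hypothetical target text and final expansion set for ("big", "house").
target_tokens = ["large", "home", "stood"]
target_bigrams = set(zip(target_tokens, target_tokens[1:]))
target_words = set(target_tokens)
final_expansion = {("big", "house"), ("large", "home"), ("huge", "building")}

w_synonym = collocation_weight(
    ("big", "house"), final_expansion, target_bigrams, target_words)
w_direct = collocation_weight(
    ("large", "home"), final_expansion, target_bigrams, target_words)
w_partial = collocation_weight(
    ("large", "flat"), set(), target_bigrams, target_words)
```

Under the `1 / n` assumption, a smaller final expansion set gives a higher weight to a synonym match, which is consistent with the stated aim of keeping the set small.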
In summary, the embodiments of the invention provide a synonym expansion method for text duplication detection. Unlike ordinary synonym expansion, the method considers not only the expansion of each word but also the context in which the word occurs.
The above is merely a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.
Claims (4)
1. A synonym expansion method and device for text duplication detection, characterized by comprising: a text preprocessing module, which filters the stop words out of the text under detection, obtains the words to be expanded, and marks the verbs, nouns, and adjectives; an initial expansion set acquisition module, which obtains, for each word to be expanded, the corresponding initial expansion set through a semantic dictionary; a filtering module, which obtains, from the preprocessed text, the context relation (bigram) of each word to be expanded, combines the initial expansion sets of the words in each bigram to obtain all of its possible expanded collocations, and filters those collocations against a text corpus to obtain the final expansion set; and a weight calculation module, which assigns different weights to the collocations in the final expansion set according to how they match when text duplication detection is carried out.
2. The text preprocessing unit as described in claim 1, characterized in that the Cartesian product of the synonym sets of the words in each bigram obtained by splitting is computed to obtain the expanded set of word collocations.
3. The filtering unit as described in claim 1, characterized in that all word collocations in the initial expansion set are filtered against a real corpus to exclude the collocations that cannot occur in a real language environment.
4. The weight calculation unit as described in claim 1, characterized in that, according to how collocations match during duplication detection, a successful match of the original words is given the highest weight, a partial match of the original words is given the next highest weight, and for words that cannot match the original set but can match the expansion set, a probability is computed from the size of the expansion set and used as the weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100462577A CN102650986A (en) | 2011-02-27 | 2011-02-27 | Synonym expansion method and device both used for text duplication detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102650986A (en) | 2012-08-29 |
Family
ID=46692994
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011100462577A Pending CN102650986A (en) | 2011-02-27 | 2011-02-27 | Synonym expansion method and device both used for text duplication detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102650986A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1529263A (en) * | 2003-09-18 | 2004-09-15 | 北京邮电大学 | Chinese text auto-segmenting and text plagiarism discrimination device and method |
CN101404037A (en) * | 2008-11-18 | 2009-04-08 | 西安交通大学 | Method for detecting and positioning electronic text contents plagiary |
CN101878476A (en) * | 2007-06-22 | 2010-11-03 | 谷歌公司 | Machine translation for query expansion |
CN201654778U (en) * | 2009-04-22 | 2010-11-24 | 同方知网(北京)技术有限公司 | Text copying detecting device |
CN101980196A (en) * | 2010-10-25 | 2011-02-23 | 中国农业大学 | Article comparison method and device |
US20110047149A1 (en) * | 2009-08-21 | 2011-02-24 | Vaeaenaenen Mikko | Method and means for data searching and language translation |
Non-Patent Citations (2)
Title |
---|
李旭 等: "基于指纹和语义特征的文档复制检测方法", 《燕山大学学报》 * |
甘灿 等: "一种改进的基于同义词替换的中文文本信息隐藏方法", 《东南大学学报(自然科学版)》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103530345A (en) * | 2013-10-08 | 2014-01-22 | 北京百度网讯科技有限公司 | Short text characteristic extension and fitting characteristic library building method and device |
CN105095222A (en) * | 2014-04-25 | 2015-11-25 | 阿里巴巴集团控股有限公司 | Unit word replacing method, search method and replacing apparatus |
CN105095222B (en) * | 2014-04-25 | 2019-10-15 | 阿里巴巴集团控股有限公司 | Uniterm replacement method, searching method and device |
US20160055145A1 (en) * | 2014-08-19 | 2016-02-25 | Sandeep Chauhan | Essay manager and automated plagiarism detector |
DE102014114845A1 (en) | 2014-10-14 | 2016-04-14 | Deutsche Telekom Ag | Method for interpreting automatic speech recognition |
EP3010014A1 (en) | 2014-10-14 | 2016-04-20 | Deutsche Telekom AG | Method for interpretation of automatic speech recognition |
CN105159931B (en) * | 2015-08-06 | 2018-06-22 | 上海智臻智能网络科技股份有限公司 | For generating the method and apparatus of synonym |
CN105159931A (en) * | 2015-08-06 | 2015-12-16 | 上海智臻智能网络科技股份有限公司 | Method and apparatus for generating synonyms |
CN108090169A (en) * | 2017-12-14 | 2018-05-29 | 上海智臻智能网络科技股份有限公司 | Question sentence extended method and device, storage medium, terminal |
CN108491406A (en) * | 2018-01-23 | 2018-09-04 | 深圳市阿西莫夫科技有限公司 | Information classification approach, device, computer equipment and storage medium |
CN108491406B (en) * | 2018-01-23 | 2021-09-24 | 深圳市阿西莫夫科技有限公司 | Information classification method and device, computer equipment and storage medium |
CN110633372A (en) * | 2019-09-23 | 2019-12-31 | 珠海格力电器股份有限公司 | Text augmentation processing method and device and storage medium |
WO2021239114A1 (en) * | 2020-05-29 | 2021-12-02 | 支付宝(杭州)信息技术有限公司 | Method for synonym editing and determining creator of text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120829 |