CN103823862A - Cross-linguistic electronic text plagiarism detection system and detection method - Google Patents

Cross-linguistic electronic text plagiarism detection system and detection method

Info

Publication number
CN103823862A
Authority
CN
China
Prior art keywords
sequence
concept
evidence
multiple concept
plagiarism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410062327.1A
Other languages
Chinese (zh)
Other versions
CN103823862B (en)
Inventor
鲍军鹏
张昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201410062327.1A
Publication of CN103823862A
Application granted
Publication of CN103823862B
Legal status: Expired - Fee Related (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-language electronic text plagiarism detection system and a detection method thereof. The method comprises the following steps: dividing the electronic text to be tested and the reference electronic text into paragraphs to obtain a test paragraph set and a reference paragraph set; looking up, in a cross-language ontology, the concepts corresponding to the words in the two paragraph sets and, based on the concepts found, representing the paragraph sets as a test multiple concept sequence and reference multiple concept sequences; retrieving, for the test multiple concept sequence, the reference multiple concept sequence that shares the most concepts with it; running detection on the multiple concept sequences to generate a plagiarism evidence list; merging and organizing the plagiarism evidence list to generate a detection result; and outputting and displaying the detection result. The multiple concept sequences built by the invention allow the electronic text to be tested to be searched thoroughly against the reference electronic text, which improves detection accuracy.

Description

A cross-language electronic text plagiarism detection system and detection method thereof
Technical field
The invention belongs to the fields of intelligent information processing and computer technology, and relates in particular to a cross-language electronic text plagiarism detection system and a detection method thereof.
Background art
With the rapid development of information technology, the Internet holds a massive and steadily growing volume of electronic text, and protecting the intellectual property of electronic text has become a consensus in China and abroad. Text copy detection, also called text plagiarism detection, is the technology of judging whether a text copies one or more other texts; it provides technical support for protecting that intellectual property. As internationalization deepens, copying is no longer confined to a single language, and cross-language copying by translation has become very common. Cross-language text copy detection is therefore of great significance for protecting the intellectual property of electronic text.
In cross-language text copy detection, the text to be tested and the reference text are written in different languages. Monolingual text copy detection relies mainly on string matching and statistics. Across languages, however, the character strings differ greatly, so methods based on string matching do not work. Languages also differ widely in grammar; for example, word order may change when translating between Chinese and English. Cross-language text copy detection is therefore a very difficult problem.
One approach to cross-language text copy detection is machine translation: the texts are first machine-translated into a single language, and a monolingual copy detection algorithm is then applied. The drawback of this approach is that translation quality has a critical effect on the detection result. At present, machine translation of long passages is still inaccurate and falls far short of human translation. Although machine translation converts texts in different languages into one language, it introduces mistranslations, synonym substitutions, and word-order changes, all of which substantially degrade the subsequent copy detection.
Summary of the invention
In view of the above defects and deficiencies, the object of the invention is to provide a cross-language electronic text plagiarism detection method that can detect cross-language text copying.
To achieve this object, the technical solution of the invention is:
A cross-language electronic text plagiarism detection method, comprising the following steps:
Step 1: divide the electronic text to be tested and the reference electronic text into paragraphs, obtaining a test paragraph set and a reference paragraph set;
Step 2: using a cross-language ontology, look up the concepts corresponding to the words in the test paragraph set and the reference paragraph set and, based on the concepts found, represent the test paragraph set and the reference paragraph set as a test multiple concept sequence and a reference multiple concept sequence;
Step 3: based on the test multiple concept sequence, retrieve the reference multiple concept sequence that shares the most concepts with it;
Step 4: run detection on the retrieved reference multiple concept sequence that shares the most concepts with the test multiple concept sequence, and generate a plagiarism evidence list;
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
Step 6: output and display the detection result.
Step 2 specifically comprises the following steps:
1) perform word segmentation and stop-word filtering on the test paragraph set and the reference paragraph set, obtaining a test paragraph word sequence and a reference paragraph word sequence respectively;
2) use the cross-language ontology to look up the concepts corresponding to each word in each word sequence, and add all concepts of the word to a candidate concept array;
3) if the candidate concept array of the word contains concepts of only one part of speech, select at most N concepts from the candidate concept array and store them in the multiple concept sequence; if it contains concepts of M parts of speech, select at most N concepts for each part of speech and store these at most M × N concepts in the multiple concept sequence;
4) repeat steps 2) to 3) until all words in the word sequence have been processed, forming the test multiple concept sequence and the reference multiple concept sequence.
In Step 4, detection on the retrieved reference multiple concept sequence that shares the most concepts with the test multiple concept sequence specifically comprises the following steps:
1) create a candidate plagiarism evidence list and a plagiarism evidence list;
2) build a position index for the reference multiple concept sequence with the most shared concepts; the position index is organized as a hash table, so that the positions at which a concept of the test multiple concept sequence occurs in the reference multiple concept sequence can be looked up through the index;
3) define a current gap variable G and set it to 0;
4) take the concept array at one position of the test multiple concept sequence and look up all of its concepts in the position index, obtaining a position set;
5) if the position set is empty, add 1 to the gap variable G and go to step 8); otherwise set G to 0;
6) pair the position in the test multiple concept sequence with each position in the position set to form position pairs, and update each piece of evidence in the candidate plagiarism evidence list with the position pairs;
7) when the distance between a position pair's position in the reference multiple concept sequence and every piece of evidence in the candidate plagiarism evidence list is greater than a preset position threshold, create new evidence from that position pair and add it to the candidate plagiarism evidence list;
8) if the position in the test multiple concept sequence has reached the end of a sentence, or the gap variable G is greater than a preset threshold, check the candidate plagiarism evidence list: add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, then set G to 0 and empty the candidate plagiarism evidence list;
9) repeat steps 4) to 8) until all positions in the test multiple concept sequence have been processed;
10) merge the evidence in the plagiarism evidence list, then remove evidence whose length is less than the preset position threshold.
Plagiarism evidence that meets the density requirement satisfies the following:
1) the plagiarism evidence comprises a test multiple concept sequence fragment and a reference multiple concept sequence fragment;
2) with Ls the total number of positions in the test multiple concept sequence fragment and Ns the number of detected positions in it, Ns/Ls is not less than the density threshold T;
3) with Lr the total number of positions in the reference multiple concept sequence fragment and Nr the number of detected positions in it, Nr/Lr is not less than the density threshold T.
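The density requirement can be illustrated with a minimal Python sketch. The `Evidence` class, its field names, and the function `meets_density` are hypothetical names chosen for illustration; the patent only fixes the quantities Ls, Ns, Lr, Nr and the threshold T.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """Hypothetical evidence record: one fragment in each sequence."""
    s_start: int   # first position of the test fragment (inclusive)
    s_end: int     # last position of the test fragment (inclusive)
    r_start: int   # first position of the reference fragment (inclusive)
    r_end: int     # last position of the reference fragment (inclusive)
    s_hits: set    # detected (matched) positions in the test fragment
    r_hits: set    # detected (matched) positions in the reference fragment

def meets_density(ev: Evidence, threshold: float) -> bool:
    """Return True when Ns/Ls >= T and Nr/Lr >= T."""
    ls = ev.s_end - ev.s_start + 1   # Ls: total positions in the test fragment
    lr = ev.r_end - ev.r_start + 1   # Lr: total positions in the reference fragment
    ns = len(ev.s_hits)              # Ns: detected positions in the test fragment
    nr = len(ev.r_hits)              # Nr: detected positions in the reference fragment
    return ns / ls >= threshold and nr / lr >= threshold
```

Both ratios must pass, so a fragment that is dense on the test side but sparse on the reference side is rejected.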
The detection result is generated by the following steps:
(1) merge the plagiarism evidence of the same test document according to positions in the test multiple concept sequence;
(2) map positions in the reference multiple concept sequence to positions in the text character stream;
(3) compute the similarity between the text to be tested and the reference text.
Each position of the multiple concept sequence holds one or more concepts. The multiple concept sequence is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multiple concept sequence, Carray_n is the n-th concept array, located at the n-th position of the MCS, and n is a positive integer.
The basic unit of the cross-language ontology is the concept; a concept represents a definite meaning or sense.
The electronic text to be tested and the reference electronic text are natural-language texts in different languages.
A cross-language electronic text plagiarism detection system, comprising:
an electronic text preprocessing module, for converting the input electronic text to a unified encoding format and dividing the electronic text to be tested and the reference electronic text into paragraphs, obtaining a test paragraph set and a reference paragraph set;
a conceptualization module, for looking up, in a cross-language ontology, the concepts corresponding to the words in the test paragraph set and the reference paragraph set and, based on the concepts found, representing the test paragraph set and the reference paragraph set as a test multiple concept sequence and a reference multiple concept sequence;
a retrieval module, for retrieving, based on the test multiple concept sequence, the reference multiple concept sequence that shares the most concepts with it;
a detection result generation module, for running detection on the retrieved reference multiple concept sequence that shares the most concepts with the test multiple concept sequence, and generating a plagiarism evidence list;
a detection result display module, for merging and organizing the plagiarism evidence list and generating a detection result.
Compared with the prior art, the beneficial effects of the invention are:
By using a cross-language ontology to model texts in different languages at the concept level, the electronic text to be tested and the reference electronic text can be given a unified representation at that level. Because a concept represents a definite meaning, synonymous words are mapped to the same concept, which solves the synonym substitution problem to some extent. The detection algorithm then performs cross-language text copy detection on top of this concept model. Furthermore, the multiple concept sequences built in the invention allow the electronic text to be tested to be searched thoroughly against the reference electronic text, which improves detection accuracy.
Brief description of the drawings
Fig. 1 is the overall module diagram of the method of the invention;
Fig. 2 is a schematic diagram of the structure of the multiple concept sequence of the invention;
Fig. 3 is the flowchart of multiple concept sequence construction in the invention;
Fig. 4 is the flowchart of multiple concept sequence detection in the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings.
The invention provides a cross-language electronic text plagiarism detection method, comprising the following steps:
Step 1: divide the electronic text to be tested and the reference electronic text into paragraphs, obtaining a test paragraph set and a reference paragraph set;
Specifically, the input electronic text to be tested and reference electronic text are converted to a unified encoding format, such as UTF-8. The electronic texts under detection are natural-language texts in Chinese, English, French, German, Russian, Japanese, Spanish, or other languages, not audio, video, pictures, or similar data. The text to be tested and the reference text are natural-language texts in different languages, not in the same language.
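A minimal Python sketch of Step 1 follows. The patent does not state how paragraph boundaries are found, so splitting on blank lines is an assumption made here for illustration; the function names `to_text` and `split_paragraphs` are likewise hypothetical.

```python
import re

def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode input bytes from their source encoding into one unified
    in-memory representation (a Python str, which can be saved as UTF-8)."""
    return raw.decode(encoding, errors="replace")

def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    parts = re.split(r"\n\s*\n", normalized)
    return [p.strip() for p in parts if p.strip()]
```

The same two functions would be applied to both the text to be tested and the reference text, giving the test paragraph set and the reference paragraph set.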
Step 2: using a cross-language ontology, look up the concepts corresponding to the words in the test paragraph set and the reference paragraph set and, based on the concepts found, represent the test paragraph set and the reference paragraph set as a test multiple concept sequence and a reference multiple concept sequence;
A cross-language ontology supplies the background knowledge. The basic unit of the cross-language ontology is the concept; a concept represents a definite meaning or sense. The cross-language ontology covers at least two languages, so that words of different languages can be mapped to unified concepts.
Each text paragraph is represented as a multiple concept sequence. Each position of a multiple concept sequence can hold one or more concepts, rather than exactly one. A multiple concept sequence can be regarded as a sequence of concept arrays and is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multiple concept sequence, Carray_n is the n-th concept array, located at the n-th position of the MCS, and n is a positive integer.
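One way to picture this definition is as a list of lists. The sketch below is only an illustration: the type aliases, the concept identifiers such as `c_copy`, and the helper `concepts_at` are all hypothetical.

```python
from typing import List

ConceptArray = List[str]     # Carray: all concepts kept for one word
MCS = List[ConceptArray]     # multiple concept sequence: one Carray per position

# Hypothetical three-position sequence; the concept identifiers are made up.
mcs: MCS = [
    ["c_copy", "c_imitate"],      # Carray_1: two concepts at position 1
    ["c_text"],                   # Carray_2: a single concept
    ["c_detect", "c_discover"],   # Carray_3
]

def concepts_at(seq: MCS, n: int) -> ConceptArray:
    """Return Carray_n using the 1-based position numbering of the definition."""
    return seq[n - 1]
```

Keeping several concepts per position is what lets an ambiguous word match any of its senses in the other language.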
Step 2 specifically comprises the following steps:
1) perform word segmentation and stop-word filtering on the test paragraph set and the reference paragraph set, obtaining a test paragraph word sequence and a reference paragraph word sequence respectively;
2) use the cross-language ontology to look up the concepts corresponding to each word in each word sequence, and add all concepts of the word to a candidate concept array;
3) if the candidate concept array of the word contains concepts of only one part of speech, select at most N concepts from the candidate concept array and store them in the multiple concept sequence; if it contains concepts of M parts of speech, select at most N concepts for each part of speech and store these at most M × N concepts in the multiple concept sequence;
4) repeat steps 2) to 3) until all words in the word sequence have been processed, forming the test multiple concept sequence and the reference multiple concept sequence.
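The sub-steps above can be sketched in Python as follows. Several points are assumptions, not statements of the patent: the toy ontology `TOY_ONTOLOGY` and its concept identifiers are invented, the input is taken to be already segmented into words, the "at most N" selection simply keeps the first N concepts in ontology order, and words missing from the ontology are skipped.

```python
from collections import defaultdict

# Hypothetical toy ontology: word -> list of (concept_id, part_of_speech).
TOY_ONTOLOGY = {
    "copy": [("C_DUPLICATE", "verb"), ("C_IMITATE", "verb"), ("C_REPLICA", "noun")],
    "抄袭": [("C_PLAGIARIZE", "verb"), ("C_DUPLICATE", "verb")],
    "text": [("C_TEXT", "noun")],
    "文本": [("C_TEXT", "noun")],
}
STOP_WORDS = {"the", "a", "of", "的"}

def build_mcs(words, ontology, n_max, stop_words=frozenset()):
    """Build a multiple concept sequence from an already segmented word list."""
    mcs = []
    for word in words:
        if word in stop_words:               # stop-word filtering
            continue
        by_pos = defaultdict(list)           # candidate concepts grouped by part of speech
        for concept, pos in ontology.get(word, []):
            by_pos[pos].append(concept)
        concept_array = []
        for concepts in by_pos.values():     # keep at most N concepts per part of speech
            concept_array.extend(concepts[:n_max])
        if concept_array:                    # assumption: unknown words are skipped
            mcs.append(concept_array)
    return mcs
```

With one part of speech the loop keeps at most N concepts, and with M parts of speech at most M × N, matching sub-step 3). In this toy data "copy" and "抄袭" share the concept `C_DUPLICATE`, which is how words of different languages end up comparable.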
Step 3: based on the test multiple concept sequence, retrieve the reference multiple concept sequence that shares the most concepts with it;
1) the retrieved reference multiple concept sequence and the test multiple concept sequence share enough concepts;
2) in the test multiple concept sequence, the number of positions holding at least one concept that occurs in the reference multiple concept sequence exceeds a preset threshold;
3) in the reference multiple concept sequence, the number of positions holding at least one concept that occurs in the test multiple concept sequence exceeds a preset threshold.
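A rough Python sketch of these retrieval conditions is given below. It assumes the reference sequences are available as a dictionary keyed by a hypothetical identifier, uses one threshold `min_positions` for both directions, and ranks the survivors by the smaller of the two position counts; the patent does not specify the ranking.

```python
def shared_position_count(a, b):
    """Count positions of `a` holding at least one concept that occurs anywhere in `b`."""
    concepts_b = {c for array in b for c in array}
    return sum(1 for array in a if concepts_b.intersection(array))

def retrieve(test_mcs, reference_mcss, min_positions):
    """Return ids of reference sequences passing both position-count conditions,
    ordered from most to least shared positions (ties broken by id)."""
    scored = []
    for ref_id, ref_mcs in reference_mcss.items():
        s_count = shared_position_count(test_mcs, ref_mcs)   # condition 2)
        r_count = shared_position_count(ref_mcs, test_mcs)   # condition 3)
        if s_count > min_positions and r_count > min_positions:
            scored.append((min(s_count, r_count), ref_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [ref_id for _, ref_id in scored]
```

Filtering in both directions keeps out reference sequences in which a single frequent concept happens to match many test positions.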
Step 4: run detection on the retrieved reference multiple concept sequence that shares the most concepts with the test multiple concept sequence, and generate a plagiarism evidence list;
Detection on the multiple concept sequences specifically comprises the following steps:
1) create a candidate plagiarism evidence list and a plagiarism evidence list;
2) build a position index for the reference multiple concept sequence with the most shared concepts; the position index is organized as a hash table, so that the positions at which a concept of the test multiple concept sequence occurs in the reference multiple concept sequence can be looked up through the index;
3) define a current gap variable G and set it to 0;
4) take the concept array at one position of the test multiple concept sequence and look up all of its concepts in the position index, obtaining a position set;
5) if the position set is empty, add 1 to the gap variable G and go to step 8); otherwise set G to 0;
6) pair the position in the test multiple concept sequence with each position in the position set to form position pairs, and update each piece of evidence in the candidate plagiarism evidence list with the position pairs;
7) when the distance between a position pair's position in the reference multiple concept sequence and every piece of evidence in the candidate plagiarism evidence list is greater than a preset position threshold, create new evidence from that position pair and add it to the candidate plagiarism evidence list;
8) if the position in the test multiple concept sequence has reached the end of a sentence, or the gap variable G is greater than a preset threshold, check the candidate plagiarism evidence list: add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, then set G to 0 and empty the candidate plagiarism evidence list. Plagiarism evidence that meets the density requirement has the following characteristics:
(1) the plagiarism evidence comprises a test multiple concept sequence fragment and a reference multiple concept sequence fragment;
(2) with Ls the total number of positions in the test multiple concept sequence fragment and Ns the number of detected positions in it, Ns/Ls is not less than the density threshold T;
(3) with Lr the total number of positions in the reference multiple concept sequence fragment and Nr the number of detected positions in it, Nr/Lr is not less than the density threshold T.
9) repeat steps 4) to 8) until all positions in the test multiple concept sequence have been processed;
10) merge the evidence in the plagiarism evidence list, then remove evidence whose length is less than a certain threshold.
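The detection loop can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the patented implementation: the whole test sequence is treated as one sentence (candidates are checked only at the end or when the gap grows too large), "distance" is measured on the reference side only, evidence length is measured on the test side, and the merge of step 10) is omitted. The names `Evidence`, `detect`, `max_gap`, `max_dist`, `density` and `min_len` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Evidence:
    s_start: int
    s_end: int
    r_start: int
    r_end: int
    s_hits: set = field(default_factory=set)   # matched test positions
    r_hits: set = field(default_factory=set)   # matched reference positions

def ref_distance(ev, r_loc):
    """Distance from a reference position to the evidence's reference fragment."""
    if ev.r_start <= r_loc <= ev.r_end:
        return 0
    return min(abs(r_loc - ev.r_start), abs(r_loc - ev.r_end))

def dense_enough(ev, t):
    ls = ev.s_end - ev.s_start + 1
    lr = ev.r_end - ev.r_start + 1
    return len(ev.s_hits) / ls >= t and len(ev.r_hits) / lr >= t

def detect(test_mcs, ref_mcs, max_gap, max_dist, density, min_len):
    index = defaultdict(list)                       # step 2: concept -> reference positions
    for r_loc, array in enumerate(ref_mcs):
        for concept in array:
            index[concept].append(r_loc)

    evidence, candidates = [], []                   # step 1
    gap = 0                                         # step 3: gap variable G
    last = len(test_mcs) - 1
    for s_loc, array in enumerate(test_mcs):        # steps 4 and 9
        r_locs = sorted({r for c in array for r in index.get(c, [])})
        if not r_locs:
            gap += 1                                # step 5: nothing matched here
        else:
            gap = 0
            for r_loc in r_locs:                    # step 6: position pair (s_loc, r_loc)
                near = [ev for ev in candidates if ref_distance(ev, r_loc) <= max_dist]
                for ev in near:                     # extend evidence; order may be reversed
                    ev.s_end = max(ev.s_end, s_loc)
                    ev.r_start = min(ev.r_start, r_loc)
                    ev.r_end = max(ev.r_end, r_loc)
                    ev.s_hits.add(s_loc)
                    ev.r_hits.add(r_loc)
                if not near:                        # step 7: too far from every candidate
                    candidates.append(Evidence(s_loc, s_loc, r_loc, r_loc, {s_loc}, {r_loc}))
        if s_loc == last or gap > max_gap:          # step 8: check and flush candidates
            evidence.extend(ev for ev in candidates if dense_enough(ev, density))
            candidates, gap = [], 0
    # step 10 (merge omitted): drop evidence that is too short on the test side
    return [ev for ev in evidence if ev.s_end - ev.s_start + 1 >= min_len]
```

Because a position pair may extend an evidence fragment in either direction on the reference side, a translated passage whose word order was rearranged can still be collected into one piece of evidence.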
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
The detection result is generated by the following steps:
(1) merge the plagiarism evidence of the same test document according to positions in the test multiple concept sequence;
(2) map positions in the reference multiple concept sequence to positions in the text character stream;
(3) compute the similarity between the text to be tested and the reference text.
Step 6: output and display the detection result.
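The patent does not give the merge rule or the similarity formula, so the following Python sketch rests on two assumptions: evidence spans on the test side are merged when they overlap or touch, and similarity is taken as the fraction of test positions covered by merged evidence. The function names are hypothetical.

```python
def merge_spans(spans):
    """Merge overlapping or adjacent inclusive (start, end) spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def coverage_similarity(spans, total_positions):
    """Assumed similarity measure: share of test positions covered by evidence."""
    if total_positions == 0:
        return 0.0
    covered = sum(end - start + 1 for start, end in merge_spans(spans))
    return covered / total_positions
```

Mapping sequence positions back to character offsets would additionally require the offsets recorded for each word during segmentation, which this sketch leaves out.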
The invention also provides a cross-language electronic text plagiarism detection system, comprising:
an electronic text preprocessing module 10, for converting the input electronic text to a unified encoding format and dividing the electronic text to be tested and the reference electronic text into paragraphs, obtaining a test paragraph set and a reference paragraph set;
a conceptualization module 20, for looking up, in a cross-language ontology, the concepts corresponding to the words in the test paragraph set and the reference paragraph set and, based on the concepts found, representing the test paragraph set and the reference paragraph set as a test multiple concept sequence and a reference multiple concept sequence;
a retrieval module 30, for retrieving, based on the test multiple concept sequence, the reference multiple concept sequence that shares the most concepts with it;
a detection result generation module 40, for running detection on the retrieved reference multiple concept sequence that shares the most concepts with the test multiple concept sequence, and generating a plagiarism evidence list;
a detection result display module 50, for merging and organizing the plagiarism evidence list and generating a detection result.
The following is a preferred embodiment provided by the inventors.
Referring to Fig. 1, which is the overall module diagram of the method of the invention, the method comprises at least an electronic text preprocessing module 10, a conceptualization module 20, a retrieval module 30, a detection result generation module 40, and a detection result display module 50. The electronic text preprocessing module 10 is connected to the conceptualization module 20, the conceptualization module 20 to the retrieval module 30, the retrieval module 30 to the detection result generation module 40, and the detection result generation module 40 to the detection result display module 50.
In the text preprocessing module 10, the electronic text is converted to a unified encoding format and then divided into paragraphs, giving the test paragraph set and the reference paragraph set.
In the conceptualization module 20, the cross-language ontology is used to represent text paragraphs of different languages as multiple concept sequences.
In the retrieval module 30, for each test multiple concept sequence to be checked, several reference multiple concept sequences are retrieved from the set of reference multiple concept sequences; these share enough concepts with the test multiple concept sequence.
In the detection result generation module 40, copy detection is carried out on the multiple concept sequences to obtain the detection result.
Finally, the detection result display module 50 shows the detection result to the user.
Referring to Fig. 2, which is a schematic diagram of the structure of the multiple concept sequence of the invention: a multiple concept sequence can be regarded as a sequence of concept arrays. Each concept array holds the concepts corresponding to one word and occupies one position in the sequence. For example, in the multiple concept sequence MCS = <a_1, a_2, …, a_n>, concept array a_1 is at position 1, concept array a_2 is at position 2, and so on. Each position of the multiple concept sequence can hold several concepts, rather than exactly one.
Referring to Fig. 3, which is the flowchart of multiple concept sequence construction in the invention:
Step 301 reads a text paragraph p into the computer. Step 302 performs word segmentation and stop-word filtering on p. Step 303 creates a word sequence and adds the words of p to it. Step 304 takes a word w from the word sequence. Step 305 creates a candidate concept array a, looks up the concepts corresponding to w in the cross-language ontology, and adds them to a. Step 306 judges whether a contains concepts of only one part of speech; if so, the flow goes to step 307, otherwise to step 309. Step 307 selects at most N concepts from a, and step 308 adds the selected concepts to the multiple concept sequence. Step 309 selects at most N concepts for each of the M parts of speech in a, and step 310 adds the at most M × N selected concepts to the multiple concept sequence. Step 311 judges whether all words in the word sequence have been processed. If so, construction of the multiple concept sequence for paragraph p is finished; otherwise the flow returns to step 304 and the loop continues until all words in the word sequence have been processed.
Referring to Fig. 4, which is the flowchart of multiple concept sequence detection in the invention:
Step 401 creates the plagiarism evidence list list and the candidate plagiarism evidence list list2. Step 402 builds a position index for the reference multiple concept sequence mcs2. The position index is a hash table whose keys are concepts and whose values are the sets of all positions at which each concept occurs in the reference multiple concept sequence. Step 403 takes all concepts at one position sLoc of the test multiple concept sequence mcs1. Step 404 looks up the positions in mcs2 at which the concepts at sLoc occur and stores them in the array rLocArray. Step 405 pairs sLoc with each position rLoc in rLocArray to form position pairs (sLoc, rLoc). Step 406 updates the candidate plagiarism evidence list list2 with the position pairs (sLoc, rLoc). Step 407 judges whether list2 needs to be checked; if so, the flow goes to step 408, otherwise to step 409. Step 408 checks the evidence in list2, adds the evidence that meets the requirements to the plagiarism evidence list list, and then empties list2. Step 409 judges whether all positions of mcs1 have been processed. If so, the flow goes to step 410; otherwise it returns to step 403 and the loop continues until all positions of mcs1 have been processed. Step 410 merges the evidence in list and removes evidence whose length is less than a certain threshold.
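The position index of steps 402 to 404 maps naturally onto a Python dictionary, which is a hash table. The sketch below is illustrative; the function names are hypothetical, and positions are 0-based here, unlike the 1-based numbering used in the sequence definition.

```python
from collections import defaultdict

def build_position_index(ref_mcs):
    """Step 402: map each concept to every position where it occurs in mcs2."""
    index = defaultdict(list)
    for r_loc, concept_array in enumerate(ref_mcs):
        for concept in concept_array:
            index[concept].append(r_loc)
    return dict(index)

def lookup_positions(index, concept_array):
    """Step 404: reference positions sharing any concept with one test position."""
    found = set()
    for concept in concept_array:
        found.update(index.get(concept, ()))
    return sorted(found)   # plays the role of rLocArray
```

Building the index once per reference sequence makes each lookup cost proportional to the number of concepts at the test position rather than to the length of the reference sequence.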
The basic idea of the cross-language electronic text plagiarism detection method of the invention is as follows. First, a multiple concept sequence is built for each text in each language by means of the cross-language ontology. A multiple concept sequence represents text at the concept level, which removes the problem that texts in different languages differ at the character-string level. In addition, because a concept represents the meaning of a word, synonymous words are mapped to the same concept, which to some extent handles the frequent phenomenon of synonym substitution. Copy detection is then carried out on the multiple concept sequences. A hash table is used to build a position index for the reference multiple concept sequence, and each position of the test multiple concept sequence is checked in turn to find which positions of the reference multiple concept sequence share a concept with it. A position in the test multiple concept sequence and a position in the reference multiple concept sequence form a position pair, and the candidate evidence list is built and maintained from these position pairs. When a position pair is used to update candidate evidence, the inserted pairs are not required to arrive in order; instead, a certain extension interval is allowed around the boundary of the existing evidence. This to some extent solves the word-order inconsistency found in cross-language copying by translation. The candidate evidence list check filters out evidence that does not meet the density requirement and adds suitable evidence to the evidence list. Finally, the evidence lists of the same test document are merged and organized to obtain the detection result, which comprises the concrete plagiarism evidence and the text similarity.

Claims (9)

1. A cross-language electronic text plagiarism detection method, characterized by comprising the following steps:
Step 1: dividing the electronic text under test and the reference electronic text into paragraphs, to obtain a paragraph set under test and a reference paragraph set;
Step 2: according to a cross-language ontology, looking up the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, based on the concepts found, representing the paragraph set under test and the reference paragraph set as a multiple concept sequence under test and a reference multiple concept sequence;
Step 3: according to the multiple concept sequence under test, retrieving the reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test;
Step 4: performing detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test, and generating a plagiarism evidence list;
Step 5: merging and organizing the plagiarism evidence list to generate a detection result;
Step 6: outputting and displaying the detection result.

2. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that step 2 specifically comprises the following steps:
1) performing word segmentation and stop-word filtering on the paragraph set under test and the reference paragraph set, to obtain a word sequence of the paragraphs under test and a word sequence of the reference paragraphs, respectively;
2) using the cross-language ontology to look up the concepts corresponding to the words in each word sequence, and adding all concepts of a word to a candidate concept array;
3) if the candidate concept array of the word contains concepts of only one part of speech, selecting at most N concepts from the candidate concept array and storing them in the multiple concept sequence; if the candidate concept array of the word contains concepts of M parts of speech, selecting at most N concepts from the candidate concept array for each part of speech, and storing these at most M×N concepts in the multiple concept sequence;
4) repeating steps 2) to 3) until all words in the word sequence have been processed, thereby forming the multiple concept sequence under test and the reference multiple concept sequence.

3. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that, in step 4, performing detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test specifically comprises the following steps:
1) creating a candidate plagiarism evidence list and a plagiarism evidence list;
2) building a position index for the reference multiple concept sequence that shares the most concepts, the position index being organized as a hash table, so that the positions at which a concept of the multiple concept sequence under test occurs in the reference multiple concept sequence can be looked up through the position index;
3) presetting a current gap variable G and setting it to 0;
4) taking the concept array at a position of the multiple concept sequence under test, and looking up all concepts of the concept array in the position index, to obtain a position set;
5) if the position set is empty, incrementing the gap variable G by 1 and going to step 8); otherwise, setting the gap variable G to 0;
6) forming position pairs from the concept of the multiple concept sequence under test and the positions in the position set, and, for each piece of evidence in the candidate plagiarism evidence list, updating the evidence with the position pairs;
7) when the distance between the position pair of a concept in the reference multiple concept sequence and every piece of evidence in the candidate plagiarism evidence list is greater than a preset position threshold, creating new evidence from the position pair and adding the new evidence to the candidate plagiarism evidence list;
8) if the position in the multiple concept sequence under test reaches the end of a sentence, or the gap variable G is greater than a preset threshold, performing a check operation on the candidate plagiarism evidence list, adding the plagiarism evidence that meets the density requirement to the plagiarism evidence list, and then setting the gap variable G to 0 and clearing the candidate plagiarism evidence list;
9) repeating steps 4) to 8) until all positions in the multiple concept sequence under test have been processed;
10) merging the evidence in the plagiarism evidence list, and then removing the evidence whose length is less than the preset position threshold.

4. The cross-language electronic text plagiarism detection method according to claim 3, characterized in that the plagiarism evidence that meets the density requirement satisfies the following:
1) the plagiarism evidence comprises a fragment of the multiple concept sequence under test and a fragment of the reference multiple concept sequence;
2) with the total number of positions in the fragment of the multiple concept sequence under test denoted Ls and the number of detected positions denoted Ns, Ns/Ls is not less than a density threshold T;
3) with the total number of positions in the fragment of the reference multiple concept sequence denoted Lr and the number of detected positions denoted Nr, Nr/Lr is not less than the density threshold T.

5. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the process of generating the detection result is carried out in the following steps:
(1) merging the plagiarism evidence of the same document under test according to the positions in the multiple concept sequence under test;
(2) mapping the positions in the reference multiple concept sequence to positions in the text character stream;
(3) calculating the similarity between the text under test and the reference text.

6. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that each position of the multiple concept sequence holds one or more concepts, and the multiple concept sequence is defined as:
MCS = <Carray1, Carray2, …, Carrayn>
where MCS is the multiple concept sequence, Carrayn is the n-th concept array, located at the n-th position of MCS, and n is a positive integer.

7. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the basic unit of the cross-language ontology is the concept, and a concept represents one definite meaning or semantics.

8. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the electronic text under test and the reference electronic text are natural-language texts in different languages.

9. A cross-language electronic text plagiarism detection system, characterized by comprising:
an electronic text preprocessing module, for converting the input electronic text into a unified encoding format and dividing the electronic text under test and the reference electronic text into paragraphs, to obtain a paragraph set under test and a reference paragraph set;
a conceptualization module, for looking up, according to a cross-language ontology, the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, based on the concepts found, representing the paragraph set under test and the reference paragraph set as a multiple concept sequence under test and a reference multiple concept sequence;
a retrieval module, for retrieving, according to the multiple concept sequence under test, the reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test;
a detection result generation module, for performing detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test, and generating a plagiarism evidence list;
a detection result display module, for merging and organizing the plagiarism evidence list and generating the detection result.
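The density requirement of claim 4 (Ns/Ls and Nr/Lr both at least the threshold T) can be illustrated with a short sketch. It makes assumptions the claims do not state: an evidence is represented simply as a list of (position under test, reference position) pairs, and the fragment length on each side is taken as the span between the smallest and largest matched position. The names `density` and `meets_density` are hypothetical.

```python
def density(positions):
    """Fraction of positions inside the covered span that were actually matched (N / L)."""
    distinct = set(positions)
    span_length = max(distinct) - min(distinct) + 1  # assumed fragment length L
    return len(distinct) / span_length

def meets_density(pairs, threshold):
    """Check that both sides of an evidence reach the density threshold T.

    pairs: list of (suspect_position, reference_position) tuples.
    """
    if not pairs:
        return False  # an empty evidence cannot satisfy the requirement
    suspect_positions = [s for s, _ in pairs]
    reference_positions = [r for _, r in pairs]
    return (density(suspect_positions) >= threshold
            and density(reference_positions) >= threshold)

# A compact evidence: every position in both fragments is matched.
dense = [(0, 2), (1, 1), (2, 0)]
# A scattered evidence: only 2 of 10 positions matched on the suspect side.
sparse = [(0, 0), (9, 1)]

print(meets_density(dense, 0.5))   # both densities are 1.0
print(meets_density(sparse, 0.5))  # suspect-side density is 0.2
```

Checking both sides separately mirrors the two conditions in the claim: a fragment that is dense in the text under test but thinly spread across the reference text is still rejected.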
CN201410062327.1A 2014-02-24 2014-02-24 Cross-linguistic electronic text plagiarism detection system and detection method Expired - Fee Related CN103823862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410062327.1A CN103823862B (en) 2014-02-24 2014-02-24 Cross-linguistic electronic text plagiarism detection system and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410062327.1A CN103823862B (en) 2014-02-24 2014-02-24 Cross-linguistic electronic text plagiarism detection system and detection method

Publications (2)

Publication Number Publication Date
CN103823862A true CN103823862A (en) 2014-05-28
CN103823862B CN103823862B (en) 2017-02-15

Family

ID=50758926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410062327.1A Expired - Fee Related CN103823862B (en) 2014-02-24 2014-02-24 Cross-linguistic electronic text plagiarism detection system and detection method

Country Status (1)

Country Link
CN (1) CN103823862B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
CN104699785A (en) * 2015-03-10 2015-06-10 中国石油大学(华东) Paper similarity detection method
CN105224518A (en) * 2014-06-17 2016-01-06 腾讯科技(深圳)有限公司 Text similarity calculation method and system and similar text search method and system
CN107862045A (en) * 2017-11-07 2018-03-30 哈尔滨工程大学 Cross-language plagiarism detection method based on multiple features
CN109492228A (en) * 2017-06-28 2019-03-19 三角兽(北京)科技有限公司 Information processing apparatus and word segmentation processing method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404037A (en) * 2008-11-18 2009-04-08 西安交通大学 Method for detecting and positioning electronic text contents plagiary
US20100114924A1 (en) * 2008-10-17 2010-05-06 Software Analysis And Forensic Engineering Corporation Searching The Internet For Common Elements In A Document In Order To Detect Plagiarism
CN102360372A (en) * 2011-10-09 2012-02-22 北京航空航天大学 Cross-language document similarity detection method
CN103544326A (en) * 2013-11-14 2014-01-29 上海交通大学 Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100114924A1 (en) * 2008-10-17 2010-05-06 Software Analysis And Forensic Engineering Corporation Searching The Internet For Common Elements In A Document In Order To Detect Plagiarism
CN101404037A (en) * 2008-11-18 2009-04-08 西安交通大学 Method for detecting and positioning electronic text contents plagiary
CN102360372A (en) * 2011-10-09 2012-02-22 北京航空航天大学 Cross-language document similarity detection method
CN103544326A (en) * 2013-11-14 2014-01-29 上海交通大学 Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
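何文垒 (He Wenlei): "Research on Chinese-English cross-language text similarity based on WordNet" (基于WordNet的中英文跨语言文本相似度研究), China Master's Theses Full-text Database, Information Science and Technology series *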

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224518A (en) * 2014-06-17 2016-01-06 腾讯科技(深圳)有限公司 Text similarity calculation method and system and similar text search method and system
CN105224518B (en) * 2014-06-17 2020-03-17 腾讯科技(深圳)有限公司 Text similarity calculation method and system and similar text search method and system
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
CN104657350B (en) * 2015-03-04 2017-06-09 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
CN104699785A (en) * 2015-03-10 2015-06-10 中国石油大学(华东) Paper similarity detection method
CN109492228A (en) * 2017-06-28 2019-03-19 三角兽(北京)科技有限公司 Information processing apparatus and word segmentation processing method thereof
CN109492228B (en) * 2017-06-28 2020-01-14 三角兽(北京)科技有限公司 Information processing apparatus and word segmentation processing method thereof
CN107862045A (en) * 2017-11-07 2018-03-30 哈尔滨工程大学 Cross-language plagiarism detection method based on multiple features
CN107862045B (en) * 2017-11-07 2022-01-14 哈尔滨工程大学 Cross-language plagiarism detection method based on multiple features

Also Published As

Publication number Publication date
CN103823862B (en) 2017-02-15

Similar Documents

Publication Publication Date Title
Liu et al. Opinion target extraction using word-based translation model
Derczynski et al. Microblog-genre noise and impact on semantic annotation accuracy
Cook et al. Novel word-sense identification
CN111488466B (en) Chinese language marking error corpus generating method, computing device and storage medium
Smith et al. Evaluating visual representations for topic understanding and their effects on manually generated topic labels
Piperski et al. Big and diverse is beautiful: A large corpus of Russian to study linguistic variation
CN104978332B (en) User-generated content label data generation method, device and correlation technique and device
US20180081861A1 (en) Smart document building using natural language processing
CN101093478A (en) Method and system for identifying Chinese full name based on Chinese shortened form of entity
El-Shishtawy et al. An accurate arabic root-based lemmatizer for information retrieval purposes
CN100524293C (en) Method and system for obtaining word pair translation from bilingual sentence
CN103823862A (en) Cross-linguistic electronic text plagiarism detection system and detection method
Bosco et al. Detecting happiness in Italian tweets: Towards an evaluation dataset for sentiment analysis in Felicitta
CN103294663B (en) A kind of text coherence detection method and device
CN104166550A (en) Software maintenance oriented method for re-customizing modification request
Antici et al. A corpus for sentence-level subjectivity detection on english news articles
Savary et al. Populating a multilingual ontology of proper names from open sources
Kurmi et al. Text summarization using enhanced MMR technique
Zong et al. Research on alignment in the construction of parallel corpus
Li et al. Interpersonal interface system of multimedia intelligent English translation based on deep learning
Zhang et al. Phrasal paraphrase based question reformulation for archived question retrieval
Luotolahti et al. Finnish Internet Parsebank
Wu et al. Summarizing the differences in Chinese-Vietnamese Bilingual news
Ghosh et al. Improving ir performance from ocred text using cooccurrence
Daneshfar et al. Construction of an annotated corpus for KurdishAbstractive text summarization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170215