CN103823862A - Cross-linguistic electronic text plagiarism detection system and detection method

Info

Publication number: CN103823862A
Application number: CN201410062327.1A
Authority: CN
Inventors: 鲍军鹏; 张昭
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2014-02-24
Filing date: 2014-02-24
Publication date: 2014-05-28
Anticipated expiration: 2034-02-24
Also published as: CN103823862B

Abstract

The invention discloses a cross-language electronic text plagiarism detection system and a detection method thereof. The method comprises: dividing the electronic text to be tested and the reference electronic text into paragraphs to obtain a paragraph set to be tested and a reference paragraph set; according to a cross-language ontology, finding the concepts corresponding to the words in both paragraph sets and, from the concepts found, representing the two sets as a multiple concept sequence to be tested and a reference multiple concept sequence; retrieving, according to the multiple concept sequence to be tested, the reference multiple concept sequence that shares the most concepts with it; running detection on the retrieved sequence to generate a plagiarism evidence list; merging and organizing the plagiarism evidence list to generate a detection result; and outputting and displaying the detection result. Because the multiple concept sequences represent both texts at the concept level, the electronic text to be tested can be retrieved thoroughly against the reference electronic text, which improves detection accuracy.

Description

A cross-language electronic text plagiarism detection system and detection method thereof

Technical field

The invention belongs to the fields of intelligent information processing and computer technology, and in particular relates to a cross-language electronic text plagiarism detection system and a detection method thereof.

Background technology

With the rapid development of information technology, a massive amount of electronic text is available on the Internet, and the amount keeps growing. Protecting the intellectual property of electronic text has become a consensus at home and abroad. Text copy detection, also called text plagiarism detection, is the technology of judging whether a text copies one or more other texts, and it provides technical support for protecting the intellectual property of electronic text. As internationalization deepens, text copying is no longer confined to a single language; copying by translation across languages is also very common. Cross-language text copy detection is therefore of great significance for protecting the intellectual property of electronic text.

In cross-language text copy detection, the text to be tested and the reference text are written in different languages. Monolingual text copy detection is mainly based on string matching and statistics. Across languages, however, the character strings differ so greatly that methods based on string matching no longer work. Different languages also differ widely in grammar; for example, the order of words may change when translating between Chinese and English. Cross-language text copy detection is therefore a very difficult problem.

One approach to cross-language text copy detection is machine translation. The texts in different languages are first translated into the same language by machine translation, and a monolingual text copy detection algorithm is then applied. The problem with this approach is that the quality of the machine translation has a critical impact on the detection result. At present, machine translation is still quite inaccurate on long passages, and its quality falls far short of human translation. So although machine translation converts texts in different languages into the same language, it introduces mistranslations, synonym substitutions and changes in word order, all of which considerably degrade the quality of the subsequent copy detection.

Summary of the invention

In view of the above defects and deficiencies, the object of the present invention is to provide a cross-language electronic text plagiarism detection method that can detect text copied across languages.

To achieve the above object, the technical solution of the present invention is as follows:

A cross-language electronic text plagiarism detection method, comprising the following steps:

Step 1: divide the electronic text to be tested and the reference electronic text into paragraphs, obtaining a paragraph set to be tested and a reference paragraph set;

Step 2: according to a cross-language ontology, find the concepts corresponding to the words in the paragraph set to be tested and in the reference paragraph set, and, according to the concepts found, represent the paragraph set to be tested as multiple concept sequences to be tested and the reference paragraph set as reference multiple concept sequences;

Step 3: according to the multiple concept sequence to be tested, retrieve the reference multiple concept sequence that has the most concepts in common with it;

Step 4: run detection on the retrieved reference multiple concept sequence that has the most concepts in common with the multiple concept sequence to be tested, and generate a plagiarism evidence list;

Step 5: merge and organize the plagiarism evidence list to generate a detection result;

Step 6: output and display the detection result.

Step 2 specifically comprises the following steps:

1) perform word segmentation and stop-word filtering on the paragraph set to be tested and on the reference paragraph set, obtaining a word sequence for each paragraph to be tested and for each reference paragraph;

2) use the cross-language ontology to look up the concepts corresponding to a word in each word sequence, and add all concepts of the word to a candidate concept array;

3) if the candidate concept array of the word contains concepts of only one part of speech, choose at most N concepts from the candidate concept array and store them in the multiple concept sequence; if the candidate concept array contains concepts of M parts of speech, choose at most N concepts for each part of speech and store these at most M × N concepts in the multiple concept sequence;

4) repeat steps 2) to 3) until all words in the word sequence have been processed, forming the multiple concept sequence to be tested and the reference multiple concept sequence.
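Steps 2) to 4) can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy ontology, the concept identifiers, the stop-word list, the name build_mcs and the rule of taking the first N concepts per part of speech are all hypothetical, since the text does not say how the N concepts are chosen. Word segmentation is assumed to have been done already, and a word missing from the ontology is assumed to keep its position with an empty concept array.

```python
from collections import defaultdict

# Hypothetical toy cross-language ontology: word -> list of (concept, part of speech).
TOY_ONTOLOGY = {
    "copy": [("C_DUPLICATE", "verb"), ("C_IMITATE", "verb"), ("C_REPRODUCTION", "noun")],
    "复制": [("C_DUPLICATE", "verb")],
    "text": [("C_TEXT", "noun")],
    "文本": [("C_TEXT", "noun")],
}
# Hypothetical stop-word list.
STOP_WORDS = {"the", "a", "of", "的"}


def build_mcs(words, ontology=TOY_ONTOLOGY, stop_words=STOP_WORDS, max_per_pos=2):
    """Build a multiple concept sequence (a list of concept arrays) from segmented words."""
    mcs = []
    for word in words:
        if word in stop_words:
            continue  # stop-word filtering
        # Step 2): collect all candidate concepts of the word, grouped by part of speech.
        by_pos = defaultdict(list)
        for concept, pos in ontology.get(word, []):
            by_pos[pos].append(concept)
        # Step 3): keep at most N concepts per part of speech (at most M x N in total).
        # A single part of speech is simply the case M = 1.
        concept_array = []
        for concepts in by_pos.values():
            concept_array.extend(concepts[:max_per_pos])
        # Assumption: unknown words keep their position with an empty concept array.
        mcs.append(concept_array)
    return mcs


english = build_mcs(["the", "copy", "of", "text"], max_per_pos=1)
chinese = build_mcs(["复制", "文本"])
```

With this toy data, the English and Chinese word sequences share the concepts C_DUPLICATE and C_TEXT, which is what makes later matching at the concept level possible.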

In step 4, detection on the retrieved reference multiple concept sequence that has the most concepts in common with the multiple concept sequence to be tested specifically comprises the following steps:

1) create a candidate plagiarism evidence list and a plagiarism evidence list;

2) build a position index for the reference multiple concept sequence with the most common concepts; the position index is organized as a hash table, so that the positions at which a concept of the multiple concept sequence to be tested occurs in the reference multiple concept sequence can be found through the index;

3) initialize the current gap variable G to 0;

4) take the concept array at one position of the multiple concept sequence to be tested, look up all concepts of that array in the position index, and obtain a position set;

5) if the position set is empty, increase the gap variable G by 1 and go to step 8); otherwise set G to 0;

6) form position pairs from the position in the multiple concept sequence to be tested and each position in the position set, and, for each evidence in the candidate plagiarism evidence list, update the evidence with the position pairs;

7) when the distance between the reference-sequence position of a position pair and every evidence in the candidate plagiarism evidence list is greater than a preset position threshold, create a new evidence from this position pair and add it to the candidate plagiarism evidence list;

8) if the position in the multiple concept sequence to be tested has reached the end of a sentence, or the gap variable G is greater than a preset threshold, check the candidate plagiarism evidence list: add the evidence that meets the density requirement to the plagiarism evidence list, then set G to 0 and clear the candidate plagiarism evidence list;

9) repeat steps 4) to 8) until all positions in the multiple concept sequence to be tested have been processed;

10) merge the evidence in the plagiarism evidence list, then remove evidence whose length is less than the preset position threshold.
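The loop in steps 1) to 9) can be sketched in Python as below. The text leaves several details open, so this is a sketch under assumptions rather than the patented procedure: an evidence is represented as two sets of matched positions; the distance in step 7) is measured on the reference side against the evidence's current range; the end of the sequence stands in for the end of a sentence; density is computed as matched positions over the span of the fragment; and step 10) is omitted. The function name detect and all parameter names and default values are hypothetical.

```python
from collections import defaultdict


def detect(test_mcs, ref_mcs, gap_threshold=2, position_threshold=3, density=0.5):
    """Return evidence tuples (s_start, s_end, r_start, r_end), 0-based and inclusive."""
    # Step 2): hash-table position index over the reference sequence.
    index = defaultdict(set)
    for r_loc, concepts in enumerate(ref_mcs):
        for concept in concepts:
            index[concept].add(r_loc)

    evidence, candidates = [], []  # step 1)
    gap = 0                        # step 3)

    def flush():
        # Step 8): keep candidates whose density reaches the threshold on both sides.
        for ev in candidates:
            ls = max(ev["s"]) - min(ev["s"]) + 1
            lr = max(ev["r"]) - min(ev["r"]) + 1
            if len(ev["s"]) / ls >= density and len(ev["r"]) / lr >= density:
                evidence.append((min(ev["s"]), max(ev["s"]), min(ev["r"]), max(ev["r"])))
        candidates.clear()

    for s_loc, concepts in enumerate(test_mcs):  # steps 4) and 9)
        r_locs = set()
        for concept in concepts:
            r_locs |= index.get(concept, set())
        if not r_locs:
            gap += 1                             # step 5): nothing matched here
        else:
            gap = 0
            for r_loc in sorted(r_locs):
                # Step 6): update every candidate whose reference range is near r_loc.
                near = [ev for ev in candidates
                        if min(ev["r"]) - position_threshold <= r_loc
                        <= max(ev["r"]) + position_threshold]
                for ev in near:
                    ev["s"].add(s_loc)
                    ev["r"].add(r_loc)
                if not near:
                    # Step 7): far from all candidates, so start a new evidence.
                    candidates.append({"s": {s_loc}, "r": {r_loc}})
        # Step 8): check when the gap is too large or the sequence ends (assumption).
        if gap > gap_threshold or s_loc == len(test_mcs) - 1:
            flush()
            gap = 0
    return evidence


# The first two concepts are swapped in the reference, as can happen in translation.
result = detect([["A"], ["B"], ["X"], ["C"]], [["B"], ["A"], ["C"], ["Q"]])
```

In this example the swapped order of A and B does not prevent the match, because a position pair only has to fall within the extension range of an existing candidate, not after it.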

Plagiarism evidence that meets the density requirement satisfies the following:

1) the plagiarism evidence comprises a fragment of the multiple concept sequence to be tested and a fragment of the reference multiple concept sequence;

2) let Ls be the total number of positions in the fragment of the multiple concept sequence to be tested and Ns the number of detected positions; Ns/Ls is not less than the density threshold T;

3) let Lr be the total number of positions in the fragment of the reference multiple concept sequence and Nr the number of detected positions; Nr/Lr is not less than the density threshold T.
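The two density conditions amount to a pair of ratio checks. A minimal Python sketch, in which the function name meets_density and the treatment of empty fragments are assumptions:

```python
def meets_density(ls, ns, lr, nr, threshold):
    """Check Ns/Ls >= T and Nr/Lr >= T for one piece of plagiarism evidence."""
    if ls <= 0 or lr <= 0:
        return False  # assumption: an empty fragment never qualifies
    return ns / ls >= threshold and nr / lr >= threshold


# 6 of 10 test positions and 6 of 12 reference positions detected, with T = 0.5.
ok = meets_density(10, 6, 12, 6, 0.5)
```

Requiring the threshold on both fragments rejects evidence that is dense in the text to be tested but stretched thinly over the reference text, and vice versa.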

The detection result is generated as follows:

(1) according to positions in the multiple concept sequence to be tested, merge the plagiarism evidence belonging to the same document to be tested;

(2) map positions in the reference multiple concept sequence to positions in the text character stream;

(3) calculate the similarity between the text to be tested and the reference text.

Each position of the multiple concept sequence holds one or more concepts. The multiple concept sequence is defined as:

MCS = <Carray_1, Carray_2, …, Carray_n>

where MCS is the multiple concept sequence and Carray_n is the n-th concept array, located at the n-th position of MCS; n is a positive integer.

The basic unit of the cross-language ontology is the concept; a concept represents a definite meaning or sense.

The electronic text to be tested and the reference electronic text are natural language texts in different languages.

A cross-language electronic text plagiarism detection system, comprising:

an electronic text preprocessing module, for converting the input electronic text into a unified encoding format and dividing the electronic text to be tested and the reference electronic text into paragraphs, obtaining a paragraph set to be tested and a reference paragraph set;

a conceptualization module, for finding, according to a cross-language ontology, the concepts corresponding to the words in the paragraph set to be tested and the reference paragraph set, and, according to the concepts found, representing the paragraph set to be tested as multiple concept sequences to be tested and the reference paragraph set as reference multiple concept sequences;

a retrieval module, for retrieving, according to the multiple concept sequence to be tested, the reference multiple concept sequence that has the most concepts in common with it;

a detection result generation module, for running detection on the retrieved reference multiple concept sequence that has the most concepts in common with the multiple concept sequence to be tested, and generating a plagiarism evidence list;

a detection result display module, for merging and organizing the plagiarism evidence list and generating the detection result.

Compared with the prior art, the beneficial effects of the present invention are:

Using a cross-language ontology, texts in different languages are modeled at the concept level, so that the electronic text to be tested and the reference electronic text receive a unified representation at the concept level. Because a concept represents a definite meaning, words that are synonyms are mapped to the same concept, which solves the synonym substitution problem to a certain extent. The detection algorithm then performs cross-language text copy detection on the concept model. Furthermore, the multiple concept sequences built in the present invention allow the electronic text to be tested to be retrieved thoroughly against the reference electronic text, which improves detection accuracy.

Brief description of the drawings

Fig. 1 is the overall module diagram of the method of the invention;

Fig. 2 is a schematic diagram of the structure of the multiple concept sequence of the present invention;

Fig. 3 is the flowchart of multiple concept sequence construction of the present invention;

Fig. 4 is the flowchart of multiple concept sequence detection of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings.

The invention provides a cross-language electronic text plagiarism detection method, comprising the following steps:

Specifically, the input electronic text to be tested and the reference electronic text are converted into a unified encoding format, such as UTF-8. The electronic texts examined are natural language texts in Chinese, English, French, German, Russian, Japanese, Spanish or other languages, not audio, video, pictures or similar information. The text to be tested and the reference text are natural language texts in different languages, not in the same language.

A cross-language ontology provides the background knowledge. The basic unit of the cross-language ontology is the concept; a concept represents a definite meaning or sense. The cross-language ontology covers at least two languages, and words of different languages can be mapped to unified concepts.

A text paragraph is represented as a multiple concept sequence. Each position of a multiple concept sequence can hold one or more concepts, rather than only one. A multiple concept sequence can be regarded as a sequence of concept arrays and is defined as:

MCS = <Carray_1, Carray_2, …, Carray_n>

Step 2 specifically comprises the following steps:

4) repeat steps 2) to 3) until all words in the word sequence have been processed, forming the multiple concept sequence to be tested and the reference multiple concept sequence.

1) the retrieved reference multiple concept sequence and the multiple concept sequence to be tested have enough concepts in common;

2) in the multiple concept sequence to be tested, the number of positions holding at least one concept that occurs in the reference multiple concept sequence exceeds a preset threshold;

3) in the reference multiple concept sequence, the number of positions holding at least one concept that occurs in the multiple concept sequence to be tested exceeds a preset threshold.
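The three retrieval conditions can be sketched in Python as follows. The function names, parameter names and threshold values are hypothetical, and the reading of conditions 2) and 3) as counts of positions that hold at least one shared concept is an interpretation of the text, not something it states precisely. Results are ordered by the number of common concepts, so the first entry is the reference sequence with the most common concepts.

```python
def concept_set(mcs):
    """All concepts that occur anywhere in a multiple concept sequence."""
    return {concept for concept_array in mcs for concept in concept_array}


def covered_positions(mcs, other_concepts):
    """Number of positions holding at least one concept from other_concepts."""
    return sum(1 for concept_array in mcs
               if any(concept in other_concepts for concept in concept_array))


def retrieve_candidates(test_mcs, reference_mcs_list, min_common=2, position_threshold=1):
    """Return indices of qualifying reference sequences, most common concepts first."""
    test_concepts = concept_set(test_mcs)
    hits = []
    for idx, ref_mcs in enumerate(reference_mcs_list):
        ref_concepts = concept_set(ref_mcs)
        common = test_concepts & ref_concepts
        if len(common) < min_common:                                       # condition 1)
            continue
        if covered_positions(test_mcs, ref_concepts) <= position_threshold:  # condition 2)
            continue
        if covered_positions(ref_mcs, test_concepts) <= position_threshold:  # condition 3)
            continue
        hits.append((len(common), idx))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [idx for _, idx in hits]


test_sequence = [["A"], ["B"], ["C"]]
references = [[["A"], ["X"]], [["B"], ["A"], ["C"]], [["Y"]]]
best = retrieve_candidates(test_sequence, references)
```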

Detection on the multiple concept sequence specifically comprises the following steps:

1) create a candidate plagiarism evidence list and a plagiarism evidence list;

3) initialize the current gap variable G to 0;

8) if the position in the multiple concept sequence to be tested has reached the end of a sentence, or the gap variable G is greater than a preset threshold, check the candidate plagiarism evidence list: add the evidence that meets the density requirement to the plagiarism evidence list, then set G to 0 and clear the candidate plagiarism evidence list; wherein plagiarism evidence that meets the density requirement has the following characteristics:

(1) the plagiarism evidence comprises a fragment of the multiple concept sequence to be tested and a fragment of the reference multiple concept sequence;

(2) let Ls be the total number of positions in the fragment of the multiple concept sequence to be tested and Ns the number of detected positions; Ns/Ls is not less than the density threshold T;

(3) let Lr be the total number of positions in the fragment of the reference multiple concept sequence and Nr the number of detected positions; Nr/Lr is not less than the density threshold T.

10) merge the evidence in the plagiarism evidence list, then remove evidence whose length is less than a certain threshold.

Step 5: merge and organize the plagiarism evidence list to generate a detection result;

The detection result is generated as follows:

(3) calculate the similarity between the text to be tested and the reference text;

Step 6: output and display the detection result.

The present invention also provides a cross-language electronic text plagiarism detection system, comprising:

an electronic text preprocessing module 10, for converting the input electronic text into a unified encoding format and dividing the electronic text to be tested and the reference electronic text into paragraphs, obtaining a paragraph set to be tested and a reference paragraph set;

a conceptualization module 20, for finding, according to a cross-language ontology, the concepts corresponding to the words in the paragraph set to be tested and the reference paragraph set, and, according to the concepts found, representing the paragraph set to be tested as multiple concept sequences to be tested and the reference paragraph set as reference multiple concept sequences;

a retrieval module 30, for retrieving, according to the multiple concept sequence to be tested, the reference multiple concept sequence that has the most concepts in common with it;

a detection result generation module 40, for running detection on the retrieved reference multiple concept sequence that has the most concepts in common with the multiple concept sequence to be tested, and generating a plagiarism evidence list;

a detection result display module 50, for merging and organizing the plagiarism evidence list and generating the detection result.

The following is a preferred embodiment provided by the inventors.

Referring to Fig. 1, the overall module diagram of the method of the invention, the method comprises at least an electronic text preprocessing module 10, a conceptualization module 20, a retrieval module 30, a detection result generation module 40 and a detection result display module 50. The electronic text preprocessing module 10 is connected to the conceptualization module 20, the conceptualization module 20 to the retrieval module 30, the retrieval module 30 to the detection result generation module 40, and the detection result generation module 40 to the detection result display module 50.

In the text preprocessing module 10, the electronic text is converted into a unified encoding format and then divided into paragraphs, obtaining the paragraph set to be tested and the reference paragraph set.
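A minimal Python sketch of this preprocessing step follows. It assumes the source encoding of each input is known and that paragraphs are separated by blank lines; neither assumption comes from the text, and the function name preprocess is hypothetical. Decoding to a Python str gives a single Unicode representation, which can be written back out as UTF-8 if a byte format is needed.

```python
import re


def preprocess(raw: bytes, source_encoding: str = "utf-8"):
    """Decode raw bytes to Unicode text and split the text into paragraphs."""
    text = raw.decode(source_encoding)
    text = text.replace("\r\n", "\n")  # normalize Windows line endings
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = [part.strip() for part in re.split(r"\n\s*\n", text)]
    return [part for part in paragraphs if part]


# A GBK-encoded Chinese input and a UTF-8 English input end up in the same form.
test_paragraphs = preprocess("第一段。\n\n第二段。".encode("gbk"), "gbk")
reference_paragraphs = preprocess(b"First paragraph.\r\n\r\nSecond paragraph.\r\n")
```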

In the conceptualization module 20, text paragraphs in different languages are represented as multiple concept sequences by means of the cross-language ontology.

In the retrieval module 30, for each multiple concept sequence to be tested, a number of reference multiple concept sequences are retrieved from the set of reference multiple concept sequences; these reference sequences have enough concepts in common with the multiple concept sequence to be tested.

In the detection result generation module 40, copy detection is carried out on the multiple concept sequences to obtain the detection result.

Finally, the detection result display module 50 shows the detection result to the user.

Referring to Fig. 2, a schematic diagram of the structure of the multiple concept sequence of the present invention: a multiple concept sequence can be regarded as a sequence of concept arrays. A concept array contains the concepts corresponding to one word, and each concept array has a position in the sequence. For example, in the multiple concept sequence MCS = <a_1, a_2, …, a_n>, concept array a_1 is at the 1st position, concept array a_2 at the 2nd position, and so on. Each position of a multiple concept sequence can hold multiple concepts, rather than only one.
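As a data structure, a multiple concept sequence maps naturally onto a list of lists. The Python sketch below is only an illustration: the type aliases, the concept identifiers and the helper concepts_at are hypothetical, and the helper converts the 1-based positions used in the text to Python's 0-based indexing.

```python
from typing import List

ConceptArray = List[str]                    # all concepts kept for one word
MultipleConceptSequence = List[ConceptArray]

# Two positions: the first holds two concepts, the second holds one.
mcs: MultipleConceptSequence = [
    ["C_DUPLICATE", "C_IMITATE"],
    ["C_TEXT"],
]


def concepts_at(sequence: MultipleConceptSequence, position: int) -> ConceptArray:
    """Return the concept array at a 1-based position, as positions are counted in the text."""
    return sequence[position - 1]


first = concepts_at(mcs, 1)
```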

Referring to Fig. 3, the flowchart of multiple concept sequence construction of the present invention:

First, in step 301, a text paragraph p is read into the computer. In step 302, word segmentation and stop-word filtering are applied to p. In step 303, a word sequence is created and the words of p are added to it. In step 304, a word w is taken from the word sequence. In step 305, a candidate concept array a is created; the concepts corresponding to w are looked up in the cross-language ontology and added to a. Step 306 judges whether a contains concepts of only one part of speech; if so, go to step 307, otherwise go to step 309. In step 307, at most N concepts are chosen from a, and in step 308 the chosen concepts are added to the multiple concept sequence. In step 309, at most N concepts are chosen for each of the M parts of speech in a, and in step 310 the chosen concepts, at most M × N, are added to the multiple concept sequence. Step 311 judges whether all words in the word sequence have been processed. If so, construction of the multiple concept sequence for paragraph p is finished; otherwise return to step 304 and continue the loop until every word in the word sequence has been processed.

Referring to Fig. 4, the flowchart of multiple concept sequence detection of the present invention:

First, in step 401, a plagiarism evidence list, list, and a candidate plagiarism evidence list, list2, are created. In step 402, a position index is built for the reference multiple concept sequence mcs2. The position index is a hash table whose keys are concepts and whose values are the sets of all positions at which each concept occurs in the reference multiple concept sequence. In step 403, all concepts at one position sLoc of the multiple concept sequence to be tested, mcs1, are taken. In step 404, the positions in mcs2 at which the concepts of sLoc occur are looked up and stored in the array rLocArray. In step 405, sLoc and each position rLoc in rLocArray are combined into a position pair (sLoc, rLoc). In step 406, the candidate plagiarism evidence list list2 is updated with the position pairs (sLoc, rLoc). Step 407 judges whether list2 needs to be checked; if so, go to step 408, otherwise go to step 409. In step 408, the evidence in list2 is checked, the evidence that meets the requirements is added to the plagiarism evidence list list, and list2 is cleared. Step 409 judges whether all positions of mcs1 have been processed. If so, go to step 410; otherwise return to step 403 and continue the loop until all positions of mcs1 have been processed. In step 410, the plagiarism evidence list list is merged, and evidence whose length is less than a certain threshold is removed.
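The final merge-and-filter operation of step 410 is not spelled out in the text. One possible reading, sketched in Python under loud assumptions: each evidence is a tuple (s_start, s_end, r_start, r_end) of inclusive positions; two evidences are merged when their ranges overlap or touch on both the test side and the reference side; and length is measured on the test side. The name merge_evidence and the default threshold are hypothetical.

```python
def merge_evidence(evidence, min_length=2):
    """Merge overlapping or adjacent evidence, then drop evidence that is too short."""
    merged = []
    for s0, s1, r0, r1 in sorted(evidence):
        if merged:
            ps0, ps1, pr0, pr1 = merged[-1]
            # Assumption: merge only when both the test ranges and the
            # reference ranges overlap or are directly adjacent.
            if s0 <= ps1 + 1 and r0 <= pr1 + 1 and r1 >= pr0 - 1:
                merged[-1] = (ps0, max(ps1, s1), min(pr0, r0), max(pr1, r1))
                continue
        merged.append((s0, s1, r0, r1))
    # Assumption: evidence length is the number of positions on the test side.
    return [ev for ev in merged if ev[1] - ev[0] + 1 >= min_length]


final = merge_evidence([(0, 3, 0, 2), (4, 6, 3, 5), (20, 20, 9, 9)])
```

Here the first two evidences are adjacent on both sides and become one, while the isolated single-position evidence is removed as too short.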

The basic idea of the cross-language electronic text plagiarism detection method of the present invention is as follows. First, multiple concept sequences are built for the texts in different languages by means of a cross-language ontology. A multiple concept sequence represents text at the concept level, which overcomes the differences between languages at the character-string level. In addition, because a concept represents the meaning of a word, synonyms are mapped to the same concept, which to a certain extent handles the frequent phenomenon of synonym substitution. Copy detection is then carried out on the multiple concept sequences. A hash table is used to build a position index of the reference multiple concept sequence, and for each position in the multiple concept sequence to be tested it is determined which positions in the reference multiple concept sequence share a concept with it. A position in the sequence to be tested and a position in the reference sequence form a position pair, and the candidate evidence list is built and maintained from these position pairs. When a position pair is used to update a candidate evidence, the inserted position pairs are not required to be in order; instead, a certain extension interval is allowed around the boundary of the existing evidence. This solves, to a certain extent, the word-order inconsistency found in copying by cross-language translation. The check on the candidate evidence list filters out evidence that does not meet the density requirement and adds the suitable evidence to the evidence list. Finally, the evidence lists of the same document to be tested are merged and organized to obtain the detection result, which comprises the concrete plagiarism evidence and the text similarity.
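The hash-table position index and the position pairs mentioned in this summary can be sketched in Python as follows. The function names are hypothetical, a Python dict stands in for the hash table, and positions are 0-based here whereas the text counts from the 1st position.

```python
from collections import defaultdict


def build_position_index(reference_mcs):
    """Map each concept to the set of reference positions at which it occurs."""
    index = defaultdict(set)
    for r_loc, concept_array in enumerate(reference_mcs):
        for concept in concept_array:
            index[concept].add(r_loc)
    return index


def position_pairs(test_mcs, index):
    """List (sLoc, rLoc) pairs for positions that share at least one concept."""
    pairs = []
    for s_loc, concept_array in enumerate(test_mcs):
        r_locs = set()
        for concept in concept_array:
            r_locs |= index.get(concept, set())
        pairs.extend((s_loc, r_loc) for r_loc in sorted(r_locs))
    return pairs


reference = [["A"], ["B", "C"], ["A"]]
index = build_position_index(reference)
pairs = position_pairs([["C"], ["Z"], ["A", "B"]], index)
```

Each lookup in the index takes roughly constant time per concept, so every position of the sequence to be tested can be checked without scanning the whole reference sequence.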

Claims

1. A cross-language electronic text plagiarism detection method, characterized in that it comprises the following steps:

Step 1: the electronic text to be tested and the reference electronic text are each divided into paragraphs, obtaining a paragraph set to be tested and a reference paragraph set;

Step 2: according to the cross-language ontology, find the concepts corresponding to the words in the paragraph set to be tested and the reference paragraph set, and, according to the concepts found, represent the paragraph set to be tested and the reference paragraph set as the multiple concept sequence to be tested and the reference multiple concept sequence;

Step 3: according to the multiple concept sequence to be tested, retrieve the reference multiple concept sequence that has the most common concepts with the multiple concept sequence to be tested;

Step 4: detect the retrieved reference multiple concept sequence that has the most common concepts with the multiple concept sequence to be tested, and generate a plagiarism evidence list;

Step 5: merge and organize the plagiarism evidence list to generate a detection result;

Step 6: output and display the detection result.

2. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that step 2 specifically comprises the following steps:

1) Word segmentation and stop word filtering are performed on the test paragraph set and the reference paragraph set to obtain the word sequence of the test paragraph and the reference paragraph word sequence respectively;

2) Use the cross-language ontology to find the concepts corresponding to the words in each word sequence, and add all the concepts of the words to the candidate concept array;

3) If the candidate concept array of the word contains concepts of only one part of speech, select at most N concepts from the candidate concept array and store them in the multiple concept sequence; if the candidate concept array of the word contains concepts of M parts of speech, select at most N concepts for each part of speech from the candidate concept array and store these at most M×N concepts in the multiple concept sequence;

4) Repeat steps 2) to 3) above until all the words in the word sequence are processed, forming the multiple concept sequence to be tested and the reference multiple concept sequence.

3. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that, in step 4, detecting the retrieved reference multiple concept sequence that has the most common concepts with the multiple concept sequence to be tested specifically comprises the following steps:

1) Create a list of candidate plagiarism evidence and a list of plagiarism evidence;

2) Establish a position index for the reference multiple concept sequence with the most common concepts; the position index is organized as a hash table, so that the positions at which a concept of the multiple concept sequence to be tested appears in the reference multiple concept sequence can be found through the position index;

3) Initialize the current gap variable G to 0;

4) Take the concept array at one position of the multiple concept sequence to be tested, look up all the concepts of the concept array in the position index, and obtain a position set;

5) If the position set is empty, add 1 to the gap variable G and go to step 8); otherwise set the gap variable G to 0;

6) Form position pairs from the position in the multiple concept sequence to be tested and the positions in the position set, and, for each piece of evidence in the candidate plagiarism evidence list, update the evidence with the position pairs;

7) When the distance between the reference-sequence position of a position pair and every piece of evidence in the candidate plagiarism evidence list is greater than the preset position threshold, use the position pair to create new evidence, and add the new evidence to the candidate plagiarism evidence list;

8) If the position in the multiple concept sequence to be tested reaches the end of a sentence or the gap variable G is greater than the preset threshold, check the candidate plagiarism evidence list, add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, and then set the gap variable G to 0 and clear the candidate plagiarism evidence list;

9) Repeat steps 4) to 8) above until all the positions in the multiple concept sequence to be tested are processed;

10) Merge the evidence in the plagiarism evidence list, and then remove the evidence whose length is less than the preset position threshold.

4. The cross-language electronic text plagiarism detection method according to claim 3, characterized in that the plagiarism evidence that meets the density requirement comprises:

1) The plagiarism evidence includes a fragment of the multiple concept sequence to be tested and a fragment of the reference multiple concept sequence;

2) Let Ls be the total number of positions in the fragment of the multiple concept sequence to be tested and Ns the number of detected positions; Ns/Ls is not less than the density threshold T;

3) Let Lr be the total number of positions in the fragment of the reference multiple concept sequence and Nr the number of detected positions; Nr/Lr is not less than the density threshold T.

5. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the detection result is generated by the following steps:

(1) According to the position of the multiple concept sequences to be tested, the plagiarism evidence of the same document to be tested is merged;

(2) Map reference multiple concept sequence positions to positions in the text character stream;

(3) Calculate the similarity between the test text and the reference text.

6. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that each position of the multiple concept sequence holds one or more concepts, and the multiple concept sequence is defined as:

MCS = <Carray_1, Carray_2, …, Carray_n>

where MCS is the multiple concept sequence and Carray_n is the n-th concept array, located at the n-th position of MCS; n is a positive integer.

7. The cross-language electronic text plagiarism detection method according to claim 1, wherein the basic unit of the cross-language ontology is a concept, and a concept represents a definite meaning or sense.

8. The cross-language electronic text plagiarism detection method according to claim 1, wherein the electronic text to be tested and the reference electronic text are natural language texts in different languages.

9. A cross-language electronic text plagiarism detection system, characterized in that it comprises:

The electronic text preprocessing module is used to convert the input electronic text into a unified coding format, and divides the electronic text to be tested and the reference electronic text into paragraphs respectively, so as to obtain the paragraph set to be tested and the reference paragraph set;

The conceptualization module is used to find the concepts corresponding to the words in the paragraph set to be tested and the reference paragraph set according to the cross-language ontology, and, according to the concepts found, to express the paragraph set to be tested and the reference paragraph set as the multiple concept sequence to be tested and the reference multiple concept sequence;

The retrieval module is used to retrieve the reference multiple concept sequence having the most common concepts with the multiple concept sequence to be tested according to the multiple concept sequence to be tested;

The detection result generation module is used to detect the reference multiple concept sequence that has the most common concepts with the multiple concept sequence to be tested, and generate a plagiarism evidence list;

The detection result display module is used for merging and sorting the plagiarism evidence list and generating the detection result.