CN103823862A - Cross-linguistic electronic text plagiarism detection system and detection method - Google Patents

Cross-linguistic electronic text plagiarism detection system and detection method

Publication number
CN103823862A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410062327.1A
Other languages
Chinese (zh)
Other versions
CN103823862B (en)
Inventor
鲍军鹏
张昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201410062327.1A
Publication of CN103823862A
Application granted
Publication of CN103823862B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-language electronic text plagiarism detection system and detection method. The method comprises: dividing the electronic text under test and the reference electronic text into paragraphs to obtain a paragraph set under test and a reference paragraph set; looking up, in a cross-language ontology, the concepts corresponding to the words in both paragraph sets, and representing the paragraph sets as multiple concept sequences under test and reference multiple concept sequences according to the concepts found; retrieving, for a multiple concept sequence under test, the reference multiple concept sequence that shares the most concepts with it; running detection on the multiple concept sequences to generate a plagiarism evidence list; merging and organizing the plagiarism evidence list to generate a detection result; and outputting and displaying the detection result. The multiple concept sequences built in this way allow the electronic text under test and the reference electronic text to be retrieved thoroughly, which improves detection accuracy.

Description

A cross-language electronic text plagiarism detection system and detection method
Technical field
The invention belongs to the fields of intelligent information processing and computer technology, and in particular relates to a cross-language electronic text plagiarism detection system and its detection method.
Background
With the rapid development of information technology, a massive amount of electronic text exists on the Internet, and the amount keeps growing. Protecting the intellectual property of electronic text has become a consensus in all circles, at home and abroad. Text copy detection, also called text plagiarism detection, is the technology of judging whether a text copies one or more other texts, and it provides technical support for protecting the intellectual property of electronic text. As internationalization deepens, text copying is no longer confined to a single language, and cross-language copying by translation is also very common. Cross-language text copy detection is therefore of great significance for protecting the intellectual property of electronic text.
In cross-language text copy detection, the text under test and the reference text are written in different languages. Monolingual text copy detection relies mainly on string matching and statistics. In cross-language copy detection, however, the character strings of different languages differ greatly, so methods based on string matching do not work. In addition, different languages differ widely in grammar; for example, the order of words may change when translating between Chinese and English. Cross-language text copy detection is therefore a very difficult problem.
One approach to cross-language text copy detection is machine translation. Texts in different languages are first machine-translated into the same language, and a monolingual text copy detection algorithm is then applied. The problem with this approach is that machine translation quality has a critical effect on the detection result. At present, machine translation is still inaccurate on long passages, and its quality lags far behind that of human translation. So although machine translation converts texts in different languages into the same language, it introduces mistranslations, synonym substitutions and word-order changes. All of these errors significantly degrade the quality of the subsequent text copy detection.
Summary of the invention
In view of the above defects or deficiencies, the object of the invention is to provide a cross-language electronic text plagiarism detection method that can detect text copied across languages.
To achieve this object, the technical solution of the invention is as follows:
A cross-language electronic text plagiarism detection method comprises the following steps:
Step 1: divide the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
Step 2: using a cross-language ontology, look up the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, according to the concepts found, represent the paragraph set under test and the reference paragraph set as multiple concept sequences under test and reference multiple concept sequences;
Step 3: for a multiple concept sequence under test, retrieve the reference multiple concept sequence that shares the most concepts with it;
Step 4: run detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test, and generate a plagiarism evidence list;
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
Step 6: output and display the detection result.
Step 2 specifically comprises the following steps:
1) Perform word segmentation and stop-word filtering on the paragraph set under test and the reference paragraph set, obtaining word sequences for the paragraphs under test and for the reference paragraphs;
2) Using the cross-language ontology, look up the concepts corresponding to a word in each word sequence, and add all concepts of the word to a candidate concept array;
3) If the candidate concept array of the word contains concepts of only one part of speech, choose at most N concepts from the candidate concept array and store them in the multiple concept sequence; if the candidate concept array contains concepts of M parts of speech, choose at most N concepts for each part of speech and store these at most M × N concepts in the multiple concept sequence;
4) Repeat steps 2) to 3) until all words in the word sequence have been processed, forming the multiple concept sequence under test and the reference multiple concept sequence.
In step 4, detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test specifically comprises the following steps:
1) Create a candidate plagiarism evidence list and a plagiarism evidence list;
2) Build a location index for the reference multiple concept sequence that shares the most concepts. The location index is organized as a hash table, so that the positions at which a concept of the multiple concept sequence under test occurs in the reference multiple concept sequence can be looked up through the index;
3) Initialize the current gap variable G to 0;
4) Take the concept array at one position of the multiple concept sequence under test and look up all of its concepts in the location index, obtaining a position set;
5) If the position set is empty, increment the gap variable G by 1 and go to step 8); otherwise set G to 0;
6) Form position pairs from the current position in the multiple concept sequence under test and each position in the position set; for each evidence in the candidate plagiarism evidence list, update the evidence with the position pairs;
7) If the distance between a position pair's position in the reference multiple concept sequence and every evidence in the candidate plagiarism evidence list exceeds a preset position threshold, create a new evidence from that position pair and add it to the candidate plagiarism evidence list;
8) If the position in the multiple concept sequence under test reaches the end of a sentence, or the gap variable G exceeds a preset threshold, check the candidate plagiarism evidence list: add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, then reset G to 0 and clear the candidate plagiarism evidence list;
9) Repeat steps 4) to 8) until all positions in the multiple concept sequence under test have been processed;
10) Merge the evidence in the plagiarism evidence list, then remove evidence whose length is less than the preset position threshold.
Plagiarism evidence that meets the density requirement satisfies the following:
1) The plagiarism evidence comprises a fragment of the multiple concept sequence under test and a fragment of the reference multiple concept sequence;
2) Let Ls be the total number of positions in the fragment of the multiple concept sequence under test and Ns the number of positions detected in it; Ns/Ls is not less than the density threshold T;
3) Let Lr be the total number of positions in the fragment of the reference multiple concept sequence and Nr the number of positions detected in it; Nr/Lr is not less than the density threshold T.
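The density requirement above can be written as a short check. The following is only an illustrative sketch: the function name and the handling of empty fragments are assumptions, not part of the patent text.

```python
def meets_density(ns: int, ls: int, nr: int, lr: int, t: float) -> bool:
    """Return True when both fragments reach the density threshold T.

    ns / ls: detected positions / total positions in the fragment under test.
    nr / lr: detected positions / total positions in the reference fragment.
    """
    # Assumption: an empty fragment can never satisfy the requirement.
    if ls <= 0 or lr <= 0:
        return False
    return ns / ls >= t and nr / lr >= t
```

For example, with T = 0.5, a fragment pair with Ns/Ls = 4/5 and Nr/Lr = 3/6 passes, while one with Nr/Lr = 2/6 is rejected.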
The detection result is generated according to the following steps:
(1) According to positions in the multiple concept sequence under test, merge the plagiarism evidence belonging to the same document under test;
(2) Map positions in the reference multiple concept sequence to positions in the text character stream;
(3) Calculate the similarity between the text under test and the reference text.
Each position of the multiple concept sequence holds one or more concepts. The multiple concept sequence is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multiple concept sequence, Carray_n is the n-th concept array, located at the n-th position of the MCS, and n is a positive integer.
The basic unit of the cross-language ontology is the concept; a concept represents a definite meaning or sense.
The electronic text under test and the reference electronic text are natural-language texts in different languages.
A cross-language electronic text plagiarism detection system comprises:
An electronic text preprocessing module, which converts the input electronic text to a unified encoding and divides the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
A conceptualization module, which uses a cross-language ontology to look up the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set and, according to the concepts found, represents the two paragraph sets as multiple concept sequences under test and reference multiple concept sequences;
A retrieval module, which, for a multiple concept sequence under test, retrieves the reference multiple concept sequence that shares the most concepts with it;
A detection result generation module, which runs detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test and generates a plagiarism evidence list;
A detection result display module, which merges and organizes the plagiarism evidence list to generate a detection result.
Compared with the prior art, the beneficial effects of the invention are as follows:
Using a cross-language ontology, texts in different languages are modeled at the concept level, so the electronic text under test and the reference electronic text receive a unified representation at the concept level. Because a concept represents a definite meaning, words with a synonym relationship are mapped to the same concept, which addresses the synonym substitution problem to some extent. A detection algorithm then performs cross-language text copy detection on the concept model. Furthermore, the multiple concept sequences built in the invention allow the electronic text under test and the reference electronic text to be retrieved thoroughly, which improves detection accuracy.
Brief description of the drawings
Fig. 1 is the overall module diagram of the method of the invention;
Fig. 2 is a schematic diagram of the structure of the multiple concept sequence of the invention;
Fig. 3 is the flowchart of multiple concept sequence construction in the invention;
Fig. 4 is the flowchart of multiple concept sequence detection in the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings.
The invention provides a cross-language electronic text plagiarism detection method, comprising the following steps:
Step 1: divide the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
Specifically, the input electronic text under test and reference electronic text are first converted to a unified encoding, such as UTF-8. The electronic text being checked is natural-language text in Chinese, English, French, German, Russian, Japanese, Spanish or another language, not audio, video, images or similar information. The text under test and the reference text are natural-language texts in different languages, not texts in the same language.
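Step 1 can be illustrated with a minimal sketch. The patent does not say how paragraphs are delimited; splitting on blank lines, the function name and the `encoding` parameter are all assumptions made for illustration.

```python
import re


def to_paragraphs(raw: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode raw bytes to a Python str (Unicode) and split into paragraphs.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    text = raw.decode(encoding, errors="replace")
    # Normalise Windows and old Mac line endings before splitting.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Once both inputs are decoded this way, the text under test and the reference text share one internal representation regardless of their original encodings.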
Step 2: using a cross-language ontology, look up the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, according to the concepts found, represent the paragraph set under test and the reference paragraph set as multiple concept sequences under test and reference multiple concept sequences;
The cross-language ontology supplies the background knowledge. The basic unit of the cross-language ontology is the concept; a concept represents a definite meaning or sense. The cross-language ontology covers at least two languages, and words of different languages can be mapped to unified concepts.
A text paragraph is represented as a multiple concept sequence. Each position of a multiple concept sequence may hold one or more concepts, rather than exactly one concept. A multiple concept sequence can be regarded as a sequence of concept arrays, and is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multiple concept sequence, Carray_n is the n-th concept array, located at the n-th position of the MCS, and n is a positive integer.
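As a data structure, the definition above maps naturally onto a list of concept collections. The sketch below is an assumption about representation only: the concept identifiers are invented, and `frozenset` is chosen for convenient intersection even though the text speaks of arrays.

```python
from typing import FrozenSet, List

# One concept array per position; each array holds one or more concept ids.
ConceptArray = FrozenSet[str]
MCS = List[ConceptArray]

# Hypothetical three-position sequence; the second word is ambiguous and
# therefore contributes two concepts to the same position.
mcs: MCS = [
    frozenset({"c:plagiarism"}),
    frozenset({"c:detect", "c:discover"}),
    frozenset({"c:text"}),
]


def concepts_at(seq: MCS, n: int) -> ConceptArray:
    """Return Carray_n using the 1-based position of the definition."""
    return seq[n - 1]
```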
Step 2 specifically comprises the following steps:
1) Perform word segmentation and stop-word filtering on the paragraph set under test and the reference paragraph set, obtaining word sequences for the paragraphs under test and for the reference paragraphs;
2) Using the cross-language ontology, look up the concepts corresponding to a word in each word sequence, and add all concepts of the word to a candidate concept array;
3) If the candidate concept array of the word contains concepts of only one part of speech, choose at most N concepts from the candidate concept array and store them in the multiple concept sequence; if the candidate concept array contains concepts of M parts of speech, choose at most N concepts for each part of speech and store these at most M × N concepts in the multiple concept sequence;
4) Repeat steps 2) to 3) until all words in the word sequence have been processed, forming the multiple concept sequence under test and the reference multiple concept sequence.
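The construction steps above can be sketched as follows. This is a toy illustration under stated assumptions: `ONTOLOGY`, `STOP_WORDS` and the concept ids are hypothetical, word segmentation is assumed to have been done already, "choose at most N" is taken to mean the first N entries in ontology order, and words missing from the ontology keep an empty concept array so that positions stay aligned with words.

```python
from collections import defaultdict

# Hypothetical toy ontology: word -> [(concept id, part of speech), ...].
ONTOLOGY = {
    "copy": [("C_DUPLICATE", "verb"), ("C_IMITATE", "verb"), ("C_REPLICA", "noun")],
    "抄袭": [("C_DUPLICATE", "verb")],
    "text": [("C_TEXT", "noun")],
    "文本": [("C_TEXT", "noun")],
}
STOP_WORDS = {"the", "a", "of"}


def build_mcs(words, ontology=ONTOLOGY, stop_words=STOP_WORDS, n=1):
    """Build a multiple concept sequence from an already segmented word list."""
    mcs = []
    for word in words:
        if word in stop_words:          # step 1): stop-word filtering
            continue
        by_pos = defaultdict(list)      # step 2): candidate concepts, grouped by part of speech
        for concept, pos in ontology.get(word, []):
            by_pos[pos].append(concept)
        chosen = []                     # step 3): at most N concepts per part of speech
        for concepts in by_pos.values():
            chosen.extend(concepts[:n])
        mcs.append(chosen)              # one concept array per remaining word
    return mcs
```

With this toy ontology, the English words "copy text" and the Chinese words "抄袭 文本" both yield sequences whose positions share the concepts C_DUPLICATE and C_TEXT, which is what makes the later comparison language-independent.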
Step 3: for a multiple concept sequence under test, retrieve the reference multiple concept sequence that shares the most concepts with it. A retrieved reference multiple concept sequence satisfies the following conditions:
1) The retrieved reference multiple concept sequence and the multiple concept sequence under test share enough common concepts;
2) In the multiple concept sequence under test, more than a preset threshold number of positions hold at least one concept that occurs in the reference multiple concept sequence;
3) In the reference multiple concept sequence, more than a preset threshold number of positions hold at least one concept that occurs in the multiple concept sequence under test.
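One possible reading of the retrieval conditions is sketched below. It assumes that "enough common concepts" is measured by counting matched positions in both directions and that candidates are ranked by the count on the side under test; the function names and the `threshold` parameter are illustrative, not taken from the patent.

```python
def matched_positions(a, b):
    """Count positions of sequence a holding at least one concept that occurs in b."""
    b_concepts = set().union(*b)
    return sum(1 for arr in a if b_concepts.intersection(arr))


def retrieve(suspect, references, threshold=1):
    """Return ids of reference sequences passing both position checks, best first.

    suspect: multiple concept sequence under test (list of concept arrays).
    references: dict mapping a reference id to its multiple concept sequence.
    """
    scored = []
    for ref_id, ref in references.items():
        s_count = matched_positions(suspect, ref)   # condition 2)
        r_count = matched_positions(ref, suspect)   # condition 3)
        if s_count > threshold and r_count > threshold:
            scored.append((s_count, ref_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [ref_id for _, ref_id in scored]
```

A production system would normally answer this query from an inverted index over the whole reference collection rather than scanning every reference sequence.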
Step 4: run detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test, and generate a plagiarism evidence list;
Detection on the multiple concept sequences specifically comprises the following steps:
1) Create a candidate plagiarism evidence list and a plagiarism evidence list;
2) Build a location index for the reference multiple concept sequence that shares the most concepts. The location index is organized as a hash table, so that the positions at which a concept of the multiple concept sequence under test occurs in the reference multiple concept sequence can be looked up through the index;
3) Initialize the current gap variable G to 0;
4) Take the concept array at one position of the multiple concept sequence under test and look up all of its concepts in the location index, obtaining a position set;
5) If the position set is empty, increment the gap variable G by 1 and go to step 8); otherwise set G to 0;
6) Form position pairs from the current position in the multiple concept sequence under test and each position in the position set; for each evidence in the candidate plagiarism evidence list, update the evidence with the position pairs;
7) If the distance between a position pair's position in the reference multiple concept sequence and every evidence in the candidate plagiarism evidence list exceeds a preset position threshold, create a new evidence from that position pair and add it to the candidate plagiarism evidence list;
8) If the position in the multiple concept sequence under test reaches the end of a sentence, or the gap variable G exceeds a preset threshold, check the candidate plagiarism evidence list: add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, then reset G to 0 and clear the candidate plagiarism evidence list. Plagiarism evidence that meets the density requirement has the following characteristics:
(1) The plagiarism evidence comprises a fragment of the multiple concept sequence under test and a fragment of the reference multiple concept sequence;
(2) Let Ls be the total number of positions in the fragment of the multiple concept sequence under test and Ns the number of positions detected in it; Ns/Ls is not less than the density threshold T;
(3) Let Lr be the total number of positions in the fragment of the reference multiple concept sequence and Nr the number of positions detected in it; Nr/Lr is not less than the density threshold T.
9) Repeat steps 4) to 8) until all positions in the multiple concept sequence under test have been processed;
10) Merge the evidence in the plagiarism evidence list, then remove evidence whose length is less than a given threshold.
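The detection steps above can be condensed into a simplified sketch. Several choices here are assumptions rather than the patent's method: an evidence is modeled as a pair of position ranges plus the sets of matched positions, "distance" is measured against the evidence's reference range, the end of the sequence stands in for the end of a sentence, and the merge of step 10) is left out so that only the length filter remains. All names and default thresholds are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    s_lo: int                                   # fragment bounds in the sequence under test
    s_hi: int
    r_lo: int                                   # fragment bounds in the reference sequence
    r_hi: int
    s_hits: set = field(default_factory=set)    # matched positions under test (Ns)
    r_hits: set = field(default_factory=set)    # matched reference positions (Nr)

    def density_ok(self, t):
        ls = self.s_hi - self.s_lo + 1
        lr = self.r_hi - self.r_lo + 1
        return len(self.s_hits) / ls >= t and len(self.r_hits) / lr >= t


def detect(suspect, reference, max_gap=2, max_dist=2, density=0.5, min_len=2):
    index = {}                                  # step 2): concept -> reference positions
    for r, arr in enumerate(reference):
        for concept in arr:
            index.setdefault(concept, set()).add(r)

    accepted, candidates, gap = [], [], 0       # steps 1) and 3)
    for s, arr in enumerate(suspect):
        r_positions = set()                     # step 4)
        for concept in arr:
            r_positions |= index.get(concept, set())
        if not r_positions:                     # step 5)
            gap += 1
        else:
            gap = 0
            for r in sorted(r_positions):
                near = [e for e in candidates
                        if e.r_lo - max_dist <= r <= e.r_hi + max_dist]
                if not near:                    # step 7): too far from every candidate
                    candidates.append(Evidence(s, s, r, r, {s}, {r}))
                for e in near:                  # step 6): extend, order not required
                    e.s_hi = s
                    e.r_lo, e.r_hi = min(e.r_lo, r), max(e.r_hi, r)
                    e.s_hits.add(s)
                    e.r_hits.add(r)
        if s == len(suspect) - 1 or gap > max_gap:   # step 8)
            accepted += [e for e in candidates if e.density_ok(density)]
            candidates, gap = [], 0
    # Step 10), length filter only; merging is omitted in this sketch.
    return [e for e in accepted if e.s_hi - e.s_lo + 1 >= min_len]
```

Because a new reference position may fall on either side of an existing evidence, a fragment whose concepts appear in a different order in the reference text is still collected into one evidence.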
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
The detection result is generated according to the following steps:
(1) According to positions in the multiple concept sequence under test, merge the plagiarism evidence belonging to the same document under test;
(2) Map positions in the reference multiple concept sequence to positions in the text character stream;
(3) Calculate the similarity between the text under test and the reference text.
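The patent does not give the merge rule or the similarity formula. The sketch below therefore only shows one plausible choice: evidence is reduced to (start, end) position ranges in the sequence under test, overlapping or adjacent ranges are merged, and similarity is taken as the fraction of positions covered. Both function names and the coverage measure are assumptions.

```python
def merge_spans(spans):
    """Merge overlapping or adjacent (start, end) position ranges, inclusive."""
    merged = []
    for lo, hi in sorted(spans):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(span) for span in merged]


def coverage_similarity(spans, total_positions):
    """Assumed similarity: share of positions under test covered by evidence."""
    if total_positions <= 0:
        return 0.0
    covered = sum(hi - lo + 1 for lo, hi in merge_spans(spans))
    return covered / total_positions
```

Mapping the merged ranges back to character offsets, as required by sub-step (2), would additionally need the character span of every word recorded during word segmentation.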
Step 6: output and display the detection result.
The invention also provides a cross-language electronic text plagiarism detection system, comprising:
An electronic text preprocessing module 10, which converts the input electronic text to a unified encoding and divides the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
A conceptualization module 20, which uses a cross-language ontology to look up the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set and, according to the concepts found, represents the two paragraph sets as multiple concept sequences under test and reference multiple concept sequences;
A retrieval module 30, which, for a multiple concept sequence under test, retrieves the reference multiple concept sequence that shares the most concepts with it;
A detection result generation module 40, which runs detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test and generates a plagiarism evidence list;
A detection result display module 50, which merges and organizes the plagiarism evidence list to generate a detection result.
The following is the preferred embodiment provided by the inventors.
Referring to Fig. 1, which is the overall module diagram of the method of the invention, the method comprises at least an electronic text preprocessing module 10, a conceptualization module 20, a retrieval module 30, a detection result generation module 40 and a detection result display module 50. The electronic text preprocessing module 10 is connected to the conceptualization module 20, the conceptualization module 20 to the retrieval module 30, the retrieval module 30 to the detection result generation module 40, and the detection result generation module 40 to the detection result display module 50.
In the electronic text preprocessing module 10, the electronic text is converted to a unified encoding and then divided into paragraphs, obtaining a paragraph set under test and a reference paragraph set.
In the conceptualization module 20, text paragraphs in different languages are represented as multiple concept sequences by means of the cross-language ontology.
In the retrieval module 30, for each multiple concept sequence under test that needs to be checked, several reference multiple concept sequences are retrieved from the set of reference multiple concept sequences; these reference multiple concept sequences share enough common concepts with the multiple concept sequence under test.
In the detection result generation module 40, copy detection is carried out on the multiple concept sequences to obtain the detection result.
Finally, the detection result display module 50 shows the detection result to the user.
Referring to Fig. 2, which is a schematic diagram of the structure of the multiple concept sequence of the invention: a multiple concept sequence can be regarded as a sequence of concept arrays, where each concept array holds the concepts corresponding to one word and occupies one position in the sequence. For example, in the multiple concept sequence MCS = <a_1, a_2, …, a_n>, the concept array a_1 is at the 1st position, the concept array a_2 is at the 2nd position, and so on. Each position of a multiple concept sequence may hold several concepts, rather than exactly one.
Referring to Fig. 3, which is the flowchart of multiple concept sequence construction in the invention:
First, in step 301, a text paragraph p is read into the computer. In step 302, word segmentation and stop-word filtering are applied to p. In step 303, a word sequence is created and the words of p are added to it. In step 304, a word w is taken from the word sequence. In step 305, a candidate concept array a is created, the concepts corresponding to w are looked up in the cross-language ontology, and these concepts are added to a. Step 306 checks whether a contains concepts of only one part of speech; if so, the flow goes to step 307, otherwise to step 309. In step 307, at most N concepts are chosen from a, and in step 308 the chosen concepts are added to the multiple concept sequence. In step 309, at most N concepts are chosen for each of the M parts of speech in a, and in step 310 the chosen concepts, at most M × N in total, are added to the multiple concept sequence. Step 311 checks whether all words in the word sequence have been processed. If so, the construction of the multiple concept sequence of p is finished; otherwise the flow returns to step 304 and the loop continues until all words in the word sequence have been handled.
Referring to Fig. 4, which is the flowchart of multiple concept sequence detection in the invention:
First, in step 401, the plagiarism evidence list list and the candidate plagiarism evidence list list2 are created. In step 402, a location index is built for the reference multiple concept sequence mcs2. The location index is a hash table whose key is a concept and whose value is the set of all positions at which that concept occurs in the reference multiple concept sequence. In step 403, all concepts at one position sLoc of the multiple concept sequence under test mcs1 are taken out. In step 404, the positions in the reference multiple concept sequence mcs2 at which the concepts of sLoc occur are looked up and stored in the array rLocArray. In step 405, sLoc and each position rLoc in rLocArray are combined into a position pair (sLoc, rLoc). In step 406, the candidate plagiarism evidence list list2 is updated with the position pairs (sLoc, rLoc). Step 407 checks whether list2 needs to be inspected; if so, the flow goes to step 408, otherwise to step 409. In step 408, the evidence in list2 is inspected, the evidence that meets the requirements is added to the plagiarism evidence list list, and list2 is then cleared. Step 409 checks whether all positions of mcs1 have been processed. If so, the flow goes to step 410; otherwise it returns to step 403 and the loop continues until all positions of mcs1 have been handled. In step 410, the plagiarism evidence list list is merged and evidence whose length is less than a given threshold is removed.
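Steps 402 to 405 rely on the hash-table location index. The following minimal sketch uses a Python `dict` as the hash table; the function names are illustrative, while `s_loc` and the returned pairs mirror sLoc and (sLoc, rLoc) in the description.

```python
def build_position_index(reference_mcs):
    """Step 402: map each concept to the set of reference positions holding it."""
    index = {}
    for pos, concept_array in enumerate(reference_mcs):
        for concept in concept_array:
            index.setdefault(concept, set()).add(pos)
    return index


def position_pairs(s_loc, concept_array, index):
    """Steps 403 to 405: pair s_loc with every reference position sharing a concept."""
    r_locs = set()
    for concept in concept_array:
        r_locs |= index.get(concept, set())
    return [(s_loc, r_loc) for r_loc in sorted(r_locs)]
```

Building the index costs one pass over the reference sequence, after which each lookup takes time proportional to the number of concepts at the current position rather than to the length of the reference sequence.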
The basic idea of the cross-language electronic text plagiarism detection method of the invention is as follows. First, multiple concept sequences are built for the texts in different languages by means of the cross-language ontology. A multiple concept sequence represents a text at the concept level, which removes the problem that texts in different languages differ at the character-string level. In addition, because a concept represents the meaning of a word, words with a synonym relationship are mapped to the same concept, which addresses the frequent synonym substitution phenomenon to some extent. Copy detection is then carried out on the multiple concept sequences. A hash table is used to build the location index of the reference multiple concept sequence, and each position of the multiple concept sequence under test is examined in turn to find which positions of the reference multiple concept sequence share a concept with it. A position in the multiple concept sequence under test and a position in the reference multiple concept sequence form a position pair, and the candidate evidence list is built and maintained from these position pairs. When a candidate evidence is updated with a position pair, the inserted position pairs are not required to be in order; instead, a certain extension interval is allowed around the boundary of the existing evidence. This addresses, to some extent, the word-order inconsistency found in cross-language copying by translation. The inspection of the candidate evidence list filters out evidence that does not meet the density requirement and adds the qualifying evidence to the evidence list. Finally, the evidence lists of the same document under test are merged and organized to obtain the detection result, which comprises the concrete plagiarism evidence and the text similarity.

Claims (9)

1. A cross-language electronic text plagiarism detection method, characterized in that it comprises the following steps:
Step 1: divide the electronic text to be tested and the reference electronic text into paragraphs, obtaining a test paragraph set and a reference paragraph set;
Step 2: using a cross-language ontology, look up the concepts corresponding to the words in the test paragraph set and the reference paragraph set and, according to the concepts found, represent the test paragraph set and the reference paragraph set as test multiple concept sequences and reference multiple concept sequences, respectively;
Step 3: for the test multiple concept sequence, retrieve the reference multiple concept sequence that shares the most concepts with it;
Step 4: run detection against the retrieved reference multiple concept sequence that shares the most concepts with the test multiple concept sequence, and generate a plagiarism evidence list;
Step 5: merge and collate the plagiarism evidence lists to generate the detection result;
Step 6: output and display the detection result.
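Step 3 of claim 1 can be pictured with a minimal Python sketch. It assumes that a multiple concept sequence is a list of sets of concept IDs and that "common concepts" are counted as distinct shared concept IDs; the names `MCS`, `concept_set` and `best_reference` are hypothetical and do not come from the patent.

```python
from typing import List, Set

# Assumption: a multiple concept sequence (MCS) is one set of concept IDs per position.
MCS = List[Set[str]]


def concept_set(mcs: MCS) -> Set[str]:
    """Collect every concept that occurs anywhere in the sequence."""
    concepts: Set[str] = set()
    for carray in mcs:
        concepts |= carray
    return concepts


def best_reference(test: MCS, references: List[MCS]) -> int:
    """Return the index of the reference MCS sharing the most distinct concepts.

    Raises ValueError if `references` is empty.
    """
    test_concepts = concept_set(test)
    shared = [len(test_concepts & concept_set(ref)) for ref in references]
    return max(range(len(references)), key=shared.__getitem__)


test_mcs = [{"c1"}, {"c2", "c3"}]
reference_mcss = [[{"c9"}], [{"c2"}, {"c1"}], [{"c3"}]]
print(best_reference(test_mcs, reference_mcss))  # prints 1
```

A real retrieval module would more likely query an inverted index over all reference sequences than scan them one by one; the linear scan is used here only to keep the sketch short.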
2. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that step 2 specifically comprises the following steps:
1) perform word segmentation and stop-word filtering on the test paragraph set and the reference paragraph set, obtaining test paragraph word sequences and reference paragraph word sequences, respectively;
2) use the cross-language ontology to look up the concepts corresponding to a word in each word sequence, and add all concepts of the word to a candidate concept array;
3) if the candidate concept array of the word contains concepts of only one part of speech, select at most N concepts from the candidate concept array and store them in the multiple concept sequence; if the candidate concept array contains concepts of M parts of speech, select at most N concepts for each part of speech and store these at most M × N concepts in the multiple concept sequence;
4) repeat steps 2) and 3) until all words in the word sequence have been processed, forming the test multiple concept sequence and the reference multiple concept sequence.
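The selection rule in steps 2) to 4) of claim 2 can be sketched as follows. This is an illustration only: the ontology is modelled as a plain dictionary from a word to `(concept_id, part_of_speech)` pairs, the "first N in ontology order" choice is an assumption (the claim does not say how the N concepts are picked), and words with no concept are skipped, which is also an assumption. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: toy ontology format, word -> [(concept_id, part_of_speech), ...]
Ontology = Dict[str, List[Tuple[str, str]]]


def select_concepts(candidates: List[Tuple[str, str]], n: int) -> List[str]:
    """Keep at most n concepts per part of speech (at most M x n in total)."""
    per_pos: Dict[str, List[str]] = defaultdict(list)
    for concept_id, pos in candidates:
        if len(per_pos[pos]) < n:  # assumption: take the first n encountered
            per_pos[pos].append(concept_id)
    return [cid for ids in per_pos.values() for cid in ids]


def build_mcs(words: List[str], ontology: Ontology, n: int = 2) -> List[List[str]]:
    """Turn a segmented, stop-word-filtered word sequence into a multiple concept sequence."""
    mcs: List[List[str]] = []
    for word in words:
        candidates = ontology.get(word, [])  # the candidate concept array
        if candidates:  # assumption: words unknown to the ontology are skipped
            mcs.append(select_concepts(candidates, n))
    return mcs


toy_ontology: Ontology = {
    "bank": [("c_bank_fin", "n"), ("c_bank_river", "n"),
             ("c_bank_slope", "n"), ("c_bank_tilt", "v")],
    "银行": [("c_bank_fin", "n")],
}
print(build_mcs(["bank", "xyz", "银行"], toy_ontology, n=2))
# [['c_bank_fin', 'c_bank_river', 'c_bank_tilt'], ['c_bank_fin']]
```

Because the English word and the Chinese word both carry `c_bank_fin`, the two sequences share a concept even though the surface strings have nothing in common.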
3. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that, in step 4, detection against the retrieved reference multiple concept sequence that shares the most concepts with the test multiple concept sequence specifically comprises the following steps:
1) create a candidate plagiarism evidence list and a plagiarism evidence list;
2) build a position index for the reference multiple concept sequence that shares the most concepts, the position index being organized as a hash table, so that the positions at which a concept of the test multiple concept sequence occurs in the reference multiple concept sequence can be looked up through the position index;
3) initialize the current gap variable G to 0;
4) take the concept array at one position of the test multiple concept sequence and look up every concept of the concept array in the position index, obtaining a position set;
5) if the position set is empty, increment the gap variable G by 1 and go to step 8); otherwise set the gap variable G to 0;
6) form position pairs from the position of the concept in the test multiple concept sequence and the positions in the position set and, for each evidence in the candidate plagiarism evidence list, update the evidence with the position pairs;
7) when the distance from the position of the concept in the reference multiple concept sequence to every evidence in the candidate plagiarism evidence list is greater than a preset position threshold, create a new evidence from the position pair and add it to the candidate plagiarism evidence list;
8) if the position in the test multiple concept sequence reaches the end of a sentence or the gap variable G is greater than a preset threshold, perform the candidate plagiarism evidence list check operation: add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, then set the gap variable G to 0 and clear the candidate plagiarism evidence list;
9) repeat steps 4) to 8) until all positions in the test multiple concept sequence have been processed;
10) merge the evidence in the plagiarism evidence list, then remove the evidence whose length is less than the preset position threshold.
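The detection loop of claim 3 can be sketched in simplified form as below. This is not the patented implementation: the `Evidence` class, the parameter names (`max_gap`, `ext`, `density`, `min_len`) and their default values are hypothetical, the only "sentence end" recognised is the end of the sequence, and the final merging of step 10) is left out (only the minimum-length filter is kept).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Evidence:
    s_lo: int  # first and last position in the test sequence
    s_hi: int
    r_lo: int  # first and last position in the reference sequence
    r_hi: int
    s_hits: Set[int] = field(default_factory=set)  # matched test positions
    r_hits: Set[int] = field(default_factory=set)  # matched reference positions


def build_index(ref: List[Set[str]]) -> Dict[str, List[int]]:
    """Hash-table position index: concept -> positions in the reference sequence."""
    index: Dict[str, List[int]] = {}
    for pos, carray in enumerate(ref):
        for concept in carray:
            index.setdefault(concept, []).append(pos)
    return index


def is_dense(e: Evidence, t: float) -> bool:
    """Density requirement: matched positions / span length >= t on both sides."""
    ls = e.s_hi - e.s_lo + 1
    lr = e.r_hi - e.r_lo + 1
    return len(e.s_hits) / ls >= t and len(e.r_hits) / lr >= t


def detect(test: List[Set[str]], ref: List[Set[str]], max_gap: int = 3,
           ext: int = 3, density: float = 0.5, min_len: int = 2) -> List[Evidence]:
    index = build_index(ref)
    candidates: List[Evidence] = []
    accepted: List[Evidence] = []
    gap = 0
    for s, carray in enumerate(test):
        positions = {p for c in carray for p in index.get(c, [])}
        if not positions:
            gap += 1
        else:
            gap = 0
            for r in sorted(positions):
                # Pairs need not arrive in order: r only has to fall inside
                # the evidence's reference span widened by `ext` on each side.
                near = [e for e in candidates if e.r_lo - ext <= r <= e.r_hi + ext]
                for e in near:
                    e.s_hi = s
                    e.r_lo, e.r_hi = min(e.r_lo, r), max(e.r_hi, r)
                    e.s_hits.add(s)
                    e.r_hits.add(r)
                if not near:
                    candidates.append(Evidence(s, s, r, r, {s}, {r}))
        # Assumption: the end of the sequence stands in for "end of sentence".
        if gap > max_gap or s == len(test) - 1:
            accepted += [e for e in candidates
                         if is_dense(e, density) and e.s_hi - e.s_lo + 1 >= min_len]
            candidates, gap = [], 0
    return accepted


ref_mcs = [{"a"}, {"b"}, {"c"}, {"d"}, {"x"}]
test_mcs = [{"b"}, {"a"}, {"d"}, {"c"}]  # same concepts, word order shuffled
print(detect(test_mcs, ref_mcs))
```

In the example the test sequence visits the reference positions in the order 1, 0, 3, 2, yet a single evidence covering test positions 0 to 3 and reference positions 0 to 3 is produced, which is the word-order tolerance the extension interval is meant to provide.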
4. The cross-language electronic text plagiarism detection method according to claim 3, characterized in that plagiarism evidence that meets the density requirement satisfies the following:
1) the plagiarism evidence comprises a test multiple concept sequence fragment and a reference multiple concept sequence fragment;
2) let the total number of positions in the test multiple concept sequence fragment be Ls and the number of detected positions be Ns; then Ns/Ls is not less than the density threshold T;
3) let the total number of positions in the reference multiple concept sequence fragment be Lr and the number of detected positions be Nr; then Nr/Lr is not less than the density threshold T.
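As a small worked illustration of the density requirement in claim 4, the check can be written as a single predicate. The function name and the sample numbers are hypothetical; only the two ratios Ns/Ls and Nr/Lr and the threshold T come from the claim.

```python
def meets_density(ns: int, ls: int, nr: int, lr: int, t: float) -> bool:
    """True when both fragments are dense enough: Ns/Ls >= T and Nr/Lr >= T."""
    if ls <= 0 or lr <= 0:
        return False  # assumption: an empty fragment never qualifies
    return ns / ls >= t and nr / lr >= t


# Test fragment: 6 of 10 positions matched; reference fragment: 3 of 10 matched.
print(meets_density(6, 10, 3, 10, t=0.5))  # False: 3/10 is below T on the reference side
print(meets_density(6, 10, 5, 10, t=0.5))  # True: 0.6 >= 0.5 and 0.5 >= 0.5
```

Requiring the threshold on both sides means a long, sparsely matched reference fragment is rejected even when the test fragment itself is densely matched.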
5. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the detection result is generated according to the following steps:
(1) merge the plagiarism evidence of the same document to be tested according to positions in the test multiple concept sequence;
(2) map the reference multiple concept sequence positions to positions in the text character stream;
(3) calculate the similarity between the text to be tested and the reference text.
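The three result-generation steps of claim 5 might look roughly like the sketch below. The claim does not give the similarity formula, so `coverage_similarity` (the share of sequence positions covered by merged evidence) is purely an assumed stand-in; the span representation, the rule that adjacent spans are merged, and the per-position character offsets are likewise hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) positions in a multiple concept sequence


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping or directly adjacent evidence spans."""
    merged: List[Span] = []
    for lo, hi in sorted(spans):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def to_char_span(span: Span, offsets: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Map sequence positions to character offsets.

    offsets[i] is the (start, end) character range of the word at position i,
    assumed to have been recorded during word segmentation.
    """
    return offsets[span[0]][0], offsets[span[1]][1]


def coverage_similarity(spans: List[Span], total_positions: int) -> float:
    """Assumed similarity measure: fraction of positions covered by evidence."""
    if total_positions == 0:
        return 0.0
    covered = sum(hi - lo + 1 for lo, hi in merge_spans(spans))
    return covered / total_positions


evidence_spans = [(5, 8), (0, 2), (3, 4), (10, 12)]
print(merge_spans(evidence_spans))              # [(0, 8), (10, 12)]
print(coverage_similarity(evidence_spans, 20))  # 0.6
print(to_char_span((1, 2), [(0, 3), (4, 9), (10, 12)]))  # (4, 12)
```

Mapping back to character offsets is what lets the display step highlight the suspected passages in the original texts rather than in the concept sequences.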
6. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that each position of the multiple concept sequence holds one or more concepts, and the multiple concept sequence is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multiple concept sequence, Carray_n is the n-th concept array, located at the n-th position of the MCS, and n is a positive integer.
7. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the basic unit of the cross-language ontology is the concept, and a concept represents a definite sense, semantics or meaning.
8. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the electronic text to be tested and the reference electronic text are natural-language texts in different languages.
9. A cross-language electronic text plagiarism detection system, characterized in that it comprises:
an electronic text preprocessing module, for converting the input electronic text to a unified encoding format and dividing the electronic text to be tested and the reference electronic text into paragraphs, obtaining a test paragraph set and a reference paragraph set;
a conceptualization module, for looking up, according to a cross-language ontology, the concepts corresponding to the words in the test paragraph set and the reference paragraph set and, according to the concepts found, representing the test paragraph set and the reference paragraph set as test multiple concept sequences and reference multiple concept sequences, respectively;
a retrieval module, for retrieving, according to the test multiple concept sequence, the reference multiple concept sequence that shares the most concepts with the test multiple concept sequence;
a detection result generation module, for running detection against the retrieved reference multiple concept sequence that shares the most concepts with the test multiple concept sequence and generating a plagiarism evidence list;
a detection result display module, for merging and collating the plagiarism evidence lists and generating the detection result.
CN201410062327.1A 2014-02-24 2014-02-24 Cross-linguistic electronic text plagiarism detection system and detection method Expired - Fee Related CN103823862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410062327.1A CN103823862B (en) 2014-02-24 2014-02-24 Cross-linguistic electronic text plagiarism detection system and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410062327.1A CN103823862B (en) 2014-02-24 2014-02-24 Cross-linguistic electronic text plagiarism detection system and detection method

Publications (2)

Publication Number Publication Date
CN103823862A true CN103823862A (en) 2014-05-28
CN103823862B CN103823862B (en) 2017-02-15

Family

ID=50758926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410062327.1A Expired - Fee Related CN103823862B (en) 2014-02-24 2014-02-24 Cross-linguistic electronic text plagiarism detection system and detection method

Country Status (1)

Country Link
CN (1) CN103823862B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
CN104699785A (en) * 2015-03-10 2015-06-10 中国石油大学(华东) Paper similarity detection method
CN105224518A (en) * 2014-06-17 2016-01-06 腾讯科技(深圳)有限公司 Text similarity calculation method and system and similar text search method and system
CN107862045A (en) * 2017-11-07 2018-03-30 哈尔滨工程大学 Cross-language plagiarism detection method based on multiple features
CN109492228A (en) * 2017-06-28 2019-03-19 三角兽(北京)科技有限公司 Information processing apparatus and word segmentation processing method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404037A (en) * 2008-11-18 2009-04-08 西安交通大学 Method for detecting and positioning electronic text contents plagiary
US20100114924A1 (en) * 2008-10-17 2010-05-06 Software Analysis And Forensic Engineering Corporation Searching The Internet For Common Elements In A Document In Order To Detect Plagiarism
CN102360372A (en) * 2011-10-09 2012-02-22 北京航空航天大学 Cross-language document similarity detection method
CN103544326A (en) * 2013-11-14 2014-01-29 上海交通大学 Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100114924A1 (en) * 2008-10-17 2010-05-06 Software Analysis And Forensic Engineering Corporation Searching The Internet For Common Elements In A Document In Order To Detect Plagiarism
CN101404037A (en) * 2008-11-18 2009-04-08 西安交通大学 Method for detecting and positioning electronic text contents plagiary
CN102360372A (en) * 2011-10-09 2012-02-22 北京航空航天大学 Cross-language document similarity detection method
CN103544326A (en) * 2013-11-14 2014-01-29 上海交通大学 Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
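He Wenlei (何文垒): "Research on Chinese-English cross-language text similarity based on WordNet" (基于WordNet的中英文跨语言文本相似度研究), China Master's Theses Full-text Database, Information Science and Technology series (《中国优秀硕士学位论文全文数据库 信息科技辑》) *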

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224518A (en) * 2014-06-17 2016-01-06 腾讯科技(深圳)有限公司 Text similarity calculation method and system and similar text search method and system
CN105224518B (en) * 2014-06-17 2020-03-17 腾讯科技(深圳)有限公司 Text similarity calculation method and system and similar text search method and system
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
CN104657350B (en) * 2015-03-04 2017-06-09 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
CN104699785A (en) * 2015-03-10 2015-06-10 中国石油大学(华东) Paper similarity detection method
CN109492228A (en) * 2017-06-28 2019-03-19 三角兽(北京)科技有限公司 Information processing apparatus and word segmentation processing method thereof
CN109492228B (en) * 2017-06-28 2020-01-14 三角兽(北京)科技有限公司 Information processing apparatus and word segmentation processing method thereof
CN107862045A (en) * 2017-11-07 2018-03-30 哈尔滨工程大学 Cross-language plagiarism detection method based on multiple features
CN107862045B (en) * 2017-11-07 2022-01-14 哈尔滨工程大学 Cross-language plagiarism detection method based on multiple features

Also Published As

Publication number Publication date
CN103823862B (en) 2017-02-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170215
