CN103823862A - Cross-linguistic electronic text plagiarism detection system and detection method - Google Patents
- Publication number
- CN103823862A (application CN201410062327.1A)
- Authority
- CN
- China
- Prior art keywords
- concept
- measured
- sequence
- text
- evidence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a cross-lingual electronic text plagiarism detection system and detection method. The method comprises the following steps: the electronic text under test and the reference electronic text are each divided into paragraphs, yielding a paragraph set under test and a reference paragraph set; the concepts corresponding to the words in both paragraph sets are looked up in a cross-lingual ontology, and, according to the concepts found, the paragraph set under test and the reference paragraph set are represented as multiple-concept sequences under test and reference multiple-concept sequences; for each multiple-concept sequence under test, the reference multiple-concept sequence sharing the most concepts with it is retrieved; the retrieved multiple-concept sequences are examined to generate a plagiarism evidence list; the plagiarism evidence list is merged and organized to generate a detection result; and the detection result is output and displayed. Because the multiple-concept sequences represent both texts at the concept level, the electronic text under test and the reference electronic text can be searched thoroughly, which improves detection accuracy.
Description
Technical field
The invention belongs to the fields of intelligent information processing and computer technology, and relates in particular to a cross-lingual electronic text plagiarism detection system and its detection method.
Background art
With the rapid development of information technology, a massive amount of electronic text exists on the Internet, and the amount keeps growing. Protecting the intellectual property of electronic text has become a consensus at home and abroad. Text copy detection, also called text plagiarism detection, is the technology of determining whether a text copies one or more other texts, and it provides technical support for protecting the intellectual property of electronic text. As internationalization deepens, text copying is no longer confined to a single language; copying by translation across languages is also very common. Cross-lingual text copy detection is therefore of great significance for protecting the intellectual property of electronic text.
In cross-lingual text copy detection, the text under test and the reference text are written in different languages. Monolingual text copy detection relies mainly on string matching and statistics. In the cross-lingual setting, however, the character strings of different languages differ greatly, so methods based on string matching do not work. In addition, languages differ widely in grammar; for example, the order of words may change when translating between Chinese and English. Cross-lingual text copy detection is therefore a very difficult problem.
One approach to cross-lingual text copy detection is machine translation. Texts in different languages are first translated into the same language by machine translation, and a monolingual text copy detection algorithm is then applied. The problem with this approach is that the quality of the machine translation critically affects the detection result. At present, machine translation of long passages is still quite inaccurate, and its quality falls far short of human translation. Thus, although machine translation converts texts in different languages into the same language, it introduces mistranslations, synonym substitutions, and word reordering, all of which substantially degrade the quality of the subsequent copy detection.
Summary of the invention
In view of the above defects and deficiencies, the object of the present invention is to provide a cross-lingual electronic text plagiarism detection method that can detect text copying across languages.
To achieve this object, the technical solution of the present invention is as follows:
A cross-lingual electronic text plagiarism detection method, comprising the following steps:
Step 1: divide the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
Step 2: using a cross-lingual ontology, look up the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, according to the concepts found, represent the paragraph set under test and the reference paragraph set as multiple-concept sequences under test and reference multiple-concept sequences, respectively;
Step 3: for a multiple-concept sequence under test, retrieve the reference multiple-concept sequence that shares the most concepts with it;
Step 4: examine the retrieved reference multiple-concept sequence that shares the most concepts with the multiple-concept sequence under test, and generate a plagiarism evidence list;
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
Step 6: output and display the detection result.
Step 2 specifically comprises the following steps:
1) perform word segmentation and stop-word filtering on the paragraph set under test and the reference paragraph set, obtaining a word sequence for each paragraph under test and for each reference paragraph;
2) for a word in a word sequence, look up its corresponding concepts in the cross-lingual ontology and add all of the word's concepts to a candidate concept array;
3) if the concepts in the word's candidate concept array belong to only one part of speech, select at most N concepts from the candidate concept array and store them in the multiple-concept sequence; if the concepts in the candidate concept array belong to M parts of speech, select at most N concepts for each part of speech and store these at most M × N concepts in the multiple-concept sequence;
4) repeat steps 2) and 3) until every word in the word sequence has been processed, forming the multiple-concept sequence under test and the reference multiple-concept sequence.
In step 4, examining the retrieved reference multiple-concept sequence that shares the most concepts with the multiple-concept sequence under test specifically comprises the following steps:
1) create a candidate plagiarism evidence list and a plagiarism evidence list;
2) build a position index for the reference multiple-concept sequence that shares the most concepts; the position index is organized as a hash table, so that the positions at which a concept of the multiple-concept sequence under test occurs in the reference multiple-concept sequence can be looked up through the index;
3) initialize the current gap variable G to 0;
4) take the concept array at one position of the multiple-concept sequence under test, look up every concept of the array in the position index, and obtain a position set;
5) if the position set is empty, increment the gap variable G by 1 and go to step 8); otherwise set G to 0;
6) pair the position of the concept array in the multiple-concept sequence under test with each position in the position set to form position pairs, and use the position pairs to update each evidence item in the candidate plagiarism evidence list;
7) when the distance between the reference-sequence position of a position pair and every evidence item in the candidate plagiarism evidence list exceeds a preset position threshold, create a new evidence item from that position pair and add it to the candidate plagiarism evidence list;
8) if the current position in the multiple-concept sequence under test reaches the end of a sentence, or the gap variable G exceeds a preset threshold, check the candidate plagiarism evidence list: add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, then set G to 0 and empty the candidate plagiarism evidence list;
9) repeat steps 4) to 8) until every position in the multiple-concept sequence under test has been processed;
10) merge the evidence in the plagiarism evidence list, then remove evidence whose length is less than the preset position threshold.
Plagiarism evidence that meets the density requirement satisfies the following:
1) the plagiarism evidence comprises a fragment of the multiple-concept sequence under test and a fragment of the reference multiple-concept sequence;
2) let Ls be the total number of positions in the fragment of the multiple-concept sequence under test and Ns the number of those positions that were detected (matched); Ns/Ls is not less than the density threshold T;
3) let Lr be the total number of positions in the fragment of the reference multiple-concept sequence and Nr the number of those positions that were detected (matched); Nr/Lr is not less than the density threshold T.
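As a minimal sketch of the density requirement above, the following Python function checks both ratios. The function and parameter names are illustrative and do not come from the patent, and treating a fragment with no positions as failing the check is an assumption.

```python
def meets_density(ns: int, ls: int, nr: int, lr: int, t: float) -> bool:
    """Return True when Ns/Ls >= T and Nr/Lr >= T.

    ns, ls: matched and total positions in the fragment under test.
    nr, lr: matched and total positions in the reference fragment.
    t:      density threshold T.
    """
    # Assumption: an empty fragment can never be valid evidence.
    if ls <= 0 or lr <= 0:
        return False
    return ns / ls >= t and nr / lr >= t
```

For example, with T = 0.5, a fragment pair with 4 of 5 positions matched on the side under test but only 2 of 5 on the reference side would be rejected, because the second ratio (0.4) falls below T.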
The detection result is generated according to the following steps:
(1) merge the plagiarism evidence belonging to the same document under test according to position in the multiple-concept sequence under test;
(2) map positions in the reference multiple-concept sequence to positions in the text character stream;
(3) calculate the similarity between the text under test and the reference text.
Each position of the multiple-concept sequence holds one or more concepts. The multiple-concept sequence is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multiple-concept sequence and Carray_n is the n-th concept array, located at the n-th position of MCS; n is a positive integer.
The basic unit of the cross-lingual ontology is the concept; a concept represents a definite meaning or sense.
The electronic text under test and the reference electronic text are natural-language texts in different languages.
A cross-lingual electronic text plagiarism detection system, comprising:
an electronic text preprocessing module, configured to convert input electronic text to a unified encoding format and to divide the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
a conceptualization module, configured to look up, in a cross-lingual ontology, the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, according to the concepts found, to represent the two paragraph sets as multiple-concept sequences under test and reference multiple-concept sequences;
a retrieval module, configured to retrieve, for a multiple-concept sequence under test, the reference multiple-concept sequence that shares the most concepts with it;
a detection result generation module, configured to examine the retrieved reference multiple-concept sequence that shares the most concepts with the multiple-concept sequence under test, and to generate a plagiarism evidence list;
a detection result display module, configured to merge and organize the plagiarism evidence list and generate a detection result.
Compared with the prior art, the beneficial effects of the present invention are as follows:
By modeling texts in different languages at the concept level with a cross-lingual ontology, the electronic text under test and the reference electronic text are given a unified concept-level representation. Because a concept represents a definite meaning, words that are synonyms map to the same concept, which addresses the synonym substitution problem to a certain extent. A detection algorithm then performs cross-lingual text copy detection on top of this concept model. Furthermore, the multiple-concept sequences built in the present invention allow the electronic text under test and the reference electronic text to be searched thoroughly, which improves detection accuracy.
Brief description of the drawings
Fig. 1 is the overall module diagram of the method of the present invention;
Fig. 2 is a schematic diagram of the structure of the multiple-concept sequence of the present invention;
Fig. 3 is the flowchart for constructing a multiple-concept sequence in the present invention;
Fig. 4 is the flowchart for detection on multiple-concept sequences in the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings.
The invention provides a cross-lingual electronic text plagiarism detection method, comprising the following steps:
Step 1: divide the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
Specifically, the input electronic text under test and reference electronic text are converted to a unified encoding format, such as UTF-8. The electronic text to be examined is natural-language text in Chinese, English, French, German, Russian, Japanese, Spanish, or another language, not information such as audio, video, or pictures. The text under test and the reference text are natural-language texts in different languages, not texts in the same language.
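A minimal sketch of this preprocessing step, in Python. The patent does not say how the source encoding is determined or how paragraph boundaries are found; here the source encoding is supplied by the caller and paragraphs are assumed to be separated by blank lines. The function names are hypothetical.

```python
import re
from typing import List


def to_unified_encoding(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def split_paragraphs(utf8_bytes: bytes) -> List[str]:
    """Split UTF-8 text into paragraphs.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    text = utf8_bytes.decode("utf-8").replace("\r\n", "\n")
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

The same two functions would be applied to both the text under test and the reference text, so that all later steps work on paragraphs in a single encoding.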
Step 2: using a cross-lingual ontology, look up the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, according to the concepts found, represent the paragraph set under test and the reference paragraph set as multiple-concept sequences under test and reference multiple-concept sequences, respectively;
The cross-lingual ontology supplies background knowledge. The basic unit of the cross-lingual ontology is the concept, and a concept represents a definite meaning or sense. The cross-lingual ontology covers at least two languages, so that words of different languages can be mapped to the same concepts.
A text paragraph is represented as a multiple-concept sequence. Each position of a multiple-concept sequence can hold one or more concepts, rather than exactly one. A multiple-concept sequence can be regarded as a sequence of concept arrays and is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multiple-concept sequence and Carray_n is the n-th concept array, located at the n-th position of MCS; n is a positive integer.
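To make the definition concrete, the sketch below shows one possible in-memory form of a multiple-concept sequence in Python: a list whose n-th element is the concept array at position n. The concept identifiers are invented for illustration and do not come from any real ontology.

```python
from typing import List, Set

ConceptArray = Set[str]       # all concepts stored at one position
MCS = List[ConceptArray]      # MCS = <Carray_1, ..., Carray_n>

# Hypothetical concept IDs for an English and a Chinese paragraph.
mcs_en: MCS = [{"C_BANK_FINANCE", "C_BANK_RIVER"}, {"C_LOAN"}]
mcs_zh: MCS = [{"C_BANK_FINANCE"}, {"C_LOAN", "C_CREDIT"}]


def shares_concept(a: ConceptArray, b: ConceptArray) -> bool:
    """True when two concept arrays have at least one concept in common."""
    return not a.isdisjoint(b)
```

Because each position may carry several concepts, an ambiguous word in one language can still match its translation as long as one of its senses coincides, as with position 1 of the two sequences above.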
Step 2 specifically comprises the following steps:
1) perform word segmentation and stop-word filtering on the paragraph set under test and the reference paragraph set, obtaining a word sequence for each paragraph under test and for each reference paragraph;
2) for a word in a word sequence, look up its corresponding concepts in the cross-lingual ontology and add all of the word's concepts to a candidate concept array;
3) if the concepts in the word's candidate concept array belong to only one part of speech, select at most N concepts from the candidate concept array and store them in the multiple-concept sequence; if the concepts in the candidate concept array belong to M parts of speech, select at most N concepts for each part of speech and store these at most M × N concepts in the multiple-concept sequence;
4) repeat steps 2) and 3) until every word in the word sequence has been processed, forming the multiple-concept sequence under test and the reference multiple-concept sequence.
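The sub-steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the toy ontology, the stop-word set, and the function name are invented; the input is assumed to be already segmented into words; "at most N concepts" is taken to mean the first N in ontology order, since the patent does not say how the N are chosen; and a word with no concepts is assumed to keep an empty concept array so that positions stay aligned with words.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy ontology: word -> list of (concept_id, part_of_speech).
ONTOLOGY: Dict[str, List[Tuple[str, str]]] = {
    "bank": [("C_BANK_FINANCE", "n"), ("C_BANK_RIVER", "n"), ("C_BANK_TILT", "v")],
    "银行": [("C_BANK_FINANCE", "n")],
    "loan": [("C_LOAN", "n"), ("C_LEND", "v")],
    "贷款": [("C_LOAN", "n"), ("C_LEND", "v")],
}
STOP_WORDS = {"the", "a", "的"}


def build_mcs(words: List[str], n_max: int) -> List[List[str]]:
    """Build a multiple-concept sequence from an already segmented word list."""
    mcs: List[List[str]] = []
    for word in words:
        if word in STOP_WORDS:           # stop-word filtering
            continue
        by_pos: Dict[str, List[str]] = defaultdict(list)
        for concept, pos in ONTOLOGY.get(word, []):
            by_pos[pos].append(concept)  # candidate concepts grouped by part of speech
        chosen: List[str] = []
        for concepts in by_pos.values():
            chosen.extend(concepts[:n_max])  # at most N per part of speech
        mcs.append(chosen)               # at most M x N concepts at this position
    return mcs
```

The single-part-of-speech case needs no special branch here: it is simply the case M = 1 of the per-part-of-speech loop.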
Step 3: for a multiple-concept sequence under test, retrieve the reference multiple-concept sequence that shares the most concepts with it. A retrieved reference multiple-concept sequence satisfies the following:
1) the retrieved reference multiple-concept sequence and the multiple-concept sequence under test have sufficiently many concepts in common;
2) in the multiple-concept sequence under test, the number of positions holding at least one concept that also occurs in the reference multiple-concept sequence exceeds a preset threshold;
3) in the reference multiple-concept sequence, the number of positions holding at least one concept that also occurs in the multiple-concept sequence under test exceeds a preset threshold.
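One way to read the retrieval conditions above is sketched below in Python. The patent does not give a scoring formula, so this is only an assumption: candidates are first filtered by the two position-count conditions (using a single hypothetical threshold min_positions for both), and the survivor with the largest number of distinct common concepts is returned. All names are illustrative.

```python
from typing import List, Optional, Sequence, Set


def concept_set(mcs: Sequence[Set[str]]) -> Set[str]:
    """All distinct concepts occurring anywhere in a multiple-concept sequence."""
    return set().union(*mcs)


def matched_positions(mcs: Sequence[Set[str]], other: Set[str]) -> int:
    """Number of positions holding at least one concept that occurs in `other`."""
    return sum(1 for arr in mcs if not other.isdisjoint(arr))


def retrieve(test_mcs: Sequence[Set[str]],
             reference_mcss: List[Sequence[Set[str]]],
             min_positions: int) -> Optional[int]:
    """Return the index of the best reference sequence, or None if none qualifies."""
    test_concepts = concept_set(test_mcs)
    best, best_common = None, 0
    for idx, ref in enumerate(reference_mcss):
        ref_concepts = concept_set(ref)
        if matched_positions(test_mcs, ref_concepts) <= min_positions:
            continue  # condition 2) not met
        if matched_positions(ref, test_concepts) <= min_positions:
            continue  # condition 3) not met
        common = len(test_concepts & ref_concepts)
        if common > best_common:
            best, best_common = idx, common
    return best
```

A production system would more plausibly use an inverted index over the reference collection rather than this linear scan; the scan is used here only to keep the example short.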
Step 4: examine the retrieved reference multiple-concept sequence that shares the most concepts with the multiple-concept sequence under test, and generate a plagiarism evidence list;
Detection on the multiple-concept sequences specifically comprises the following steps:
1) create a candidate plagiarism evidence list and a plagiarism evidence list;
2) build a position index for the reference multiple-concept sequence that shares the most concepts; the position index is organized as a hash table, so that the positions at which a concept of the multiple-concept sequence under test occurs in the reference multiple-concept sequence can be looked up through the index;
3) initialize the current gap variable G to 0;
4) take the concept array at one position of the multiple-concept sequence under test, look up every concept of the array in the position index, and obtain a position set;
5) if the position set is empty, increment the gap variable G by 1 and go to step 8); otherwise set G to 0;
6) pair the position of the concept array in the multiple-concept sequence under test with each position in the position set to form position pairs, and use the position pairs to update each evidence item in the candidate plagiarism evidence list;
7) when the distance between the reference-sequence position of a position pair and every evidence item in the candidate plagiarism evidence list exceeds a preset position threshold, create a new evidence item from that position pair and add it to the candidate plagiarism evidence list;
8) if the current position in the multiple-concept sequence under test reaches the end of a sentence, or the gap variable G exceeds a preset threshold, check the candidate plagiarism evidence list: add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, then set G to 0 and empty the candidate plagiarism evidence list. Plagiarism evidence that meets the density requirement has the following characteristics:
(1) the plagiarism evidence comprises a fragment of the multiple-concept sequence under test and a fragment of the reference multiple-concept sequence;
(2) let Ls be the total number of positions in the fragment of the multiple-concept sequence under test and Ns the number of those positions that were detected (matched); Ns/Ls is not less than the density threshold T;
(3) let Lr be the total number of positions in the fragment of the reference multiple-concept sequence and Nr the number of those positions that were detected (matched); Nr/Lr is not less than the density threshold T.
9) repeat steps 4) to 8) until every position in the multiple-concept sequence under test has been processed;
10) merge the evidence in the plagiarism evidence list, then remove evidence whose length is less than a certain threshold.
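The detection procedure above can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not in the patent: an evidence item is modeled as two sets of matched positions; a position pair updates every candidate whose reference span lies within a hypothetical window of the reference position; the candidate check is triggered only by the gap threshold or the end of the sequence (sentence boundaries are not modeled); and the final merge of step 10) is omitted, leaving only the length filter. All names and parameters are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Set


def _span(hits: Set[int]) -> int:
    """Number of positions covered from the first to the last hit."""
    return max(hits) - min(hits) + 1


@dataclass
class Evidence:
    s_hits: Set[int] = field(default_factory=set)  # matched positions, sequence under test
    r_hits: Set[int] = field(default_factory=set)  # matched positions, reference sequence

    def near(self, r: int, window: int) -> bool:
        # Reference position r lies within `window` of this evidence's reference span.
        return min(self.r_hits) - window <= r <= max(self.r_hits) + window

    def dense(self, t: float) -> bool:
        # Density requirement: Ns/Ls >= T and Nr/Lr >= T.
        return (len(self.s_hits) / _span(self.s_hits) >= t
                and len(self.r_hits) / _span(self.r_hits) >= t)


def detect(test_mcs: Sequence[Set[str]], ref_mcs: Sequence[Set[str]],
           gap_max: int, window: int, density: float, min_len: int) -> List[Evidence]:
    # Step 2): position index of the reference sequence (concept -> positions).
    index: Dict[str, Set[int]] = {}
    for pos, concepts in enumerate(ref_mcs):
        for c in concepts:
            index.setdefault(c, set()).add(pos)

    accepted: List[Evidence] = []    # plagiarism evidence list
    candidates: List[Evidence] = []  # candidate plagiarism evidence list
    gap = 0                          # step 3): gap variable G
    last = len(test_mcs) - 1
    for s, concepts in enumerate(test_mcs):
        r_positions: Set[int] = set()
        for c in concepts:           # step 4): look up all concepts at position s
            r_positions |= index.get(c, set())
        if not r_positions:
            gap += 1                 # step 5): nothing matched at this position
        else:
            gap = 0
            for r in sorted(r_positions):
                owners = [e for e in candidates if e.near(r, window)]
                if not owners:       # step 7): too far from every candidate
                    owners = [Evidence()]
                    candidates.append(owners[0])
                for e in owners:     # step 6): update evidence with pair (s, r)
                    e.s_hits.add(s)
                    e.r_hits.add(r)
        if gap > gap_max or s == last:   # step 8): check the candidates
            accepted.extend(e for e in candidates if e.dense(density))
            candidates, gap = [], 0
    # Step 10), length filter only: drop evidence spanning fewer than min_len positions.
    return [e for e in accepted if _span(e.s_hits) >= min_len]
```

Because the update in step 6) only asks whether the reference position falls near an existing evidence span, position pairs may arrive out of order, which is how the sketch tolerates the word reordering typical of translated text.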
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
The detection result is generated according to the following steps:
(1) merge the plagiarism evidence belonging to the same document under test according to position in the multiple-concept sequence under test;
(2) map positions in the reference multiple-concept sequence to positions in the text character stream;
(3) calculate the similarity between the text under test and the reference text.
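The patent does not state how evidence is merged or which similarity formula is used. As one possible illustration only, the Python sketch below merges evidence spans that overlap or touch in the sequence under test, and then takes the similarity to be the fraction of positions under test that are covered by merged evidence. Both choices, and all names, are assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) positions in the sequence under test


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge spans that overlap or are directly adjacent."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_similarity(spans: List[Span], total_positions: int) -> float:
    """Assumed similarity: share of positions covered by merged evidence."""
    if total_positions <= 0:
        return 0.0
    covered = sum(end - start + 1 for start, end in merge_spans(spans))
    return covered / total_positions
```

Merging before counting ensures that positions shared by several evidence items are not counted twice.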
Step 6: output and display the detection result.
The present invention also provides a cross-lingual electronic text plagiarism detection system, comprising:
a detection result generation module 40, configured to examine the retrieved reference multiple-concept sequence that shares the most concepts with the multiple-concept sequence under test, and to generate a plagiarism evidence list;
a detection result display module 50, configured to merge and organize the plagiarism evidence list and generate a detection result.
The following is a preferred embodiment provided by the inventors.
Referring to Fig. 1, which is the overall module diagram of the method of the present invention: the method involves at least an electronic text preprocessing module 10, a conceptualization module 20, a retrieval module 30, a detection result generation module 40, and a detection result display module 50. The electronic text preprocessing module 10 is connected to the conceptualization module 20, the conceptualization module 20 to the retrieval module 30, the retrieval module 30 to the detection result generation module 40, and the detection result generation module 40 to the detection result display module 50.
In the text preprocessing module 10, the electronic text is converted to a unified encoding format and then divided into paragraphs, yielding the paragraph set under test and the reference paragraph set.
In the conceptualization module 20, text paragraphs in different languages are represented as multiple-concept sequences using the cross-lingual ontology.
In the retrieval module 30, for each multiple-concept sequence under test that needs to be examined, a number of reference multiple-concept sequences are retrieved from the set of reference multiple-concept sequences; these reference sequences share sufficiently many concepts with the sequence under test.
In the detection result generation module 40, copy detection is performed on the multiple-concept sequences to obtain the detection result.
Finally, the detection result display module 50 presents the detection result to the user.
Referring to Fig. 2, which is a schematic diagram of the structure of the multiple-concept sequence of the present invention: a multiple-concept sequence can be regarded as a sequence of concept arrays, where each concept array contains the concepts corresponding to one word and occupies one position in the sequence. For example, in the multiple-concept sequence MCS = <a_1, a_2, …, a_n>, concept array a_1 is at position 1, concept array a_2 is at position 2, and so on. Each position of a multiple-concept sequence can hold multiple concepts, rather than exactly one.
Referring to Fig. 3, which is the flowchart for constructing a multiple-concept sequence in the present invention.
First, in step 301, a text paragraph p is read into the computer. In step 302, word segmentation and stop-word filtering are performed on p. In step 303, a word sequence is created and the words of p are added to it. In step 304, a word w is taken from the word sequence. In step 305, a candidate concept array a is created, the concepts corresponding to w are looked up in the cross-lingual ontology, and these concepts are added to a. Step 306 determines whether the concepts in a belong to only one part of speech; if so, the process goes to step 307, otherwise to step 309. In step 307, at most N concepts are selected from a, and in step 308 the selected concepts are added to the multiple-concept sequence. In step 309, at most N concepts are selected for each of the M parts of speech in a, and in step 310 the selected concepts, at most M × N in total, are added to the multiple-concept sequence. Step 311 determines whether all words in the word sequence have been processed. If so, construction of the multiple-concept sequence for paragraph p is complete; otherwise the process returns to step 304 and the loop continues until every word in the word sequence has been processed.
Referring to Fig. 4, which is the flowchart for detection on multiple-concept sequences in the present invention.
First, in step 401, a plagiarism evidence list, list, and a candidate plagiarism evidence list, list2, are created. In step 402, a position index is built for the reference multiple-concept sequence mcs2. The position index uses a hash table: each key is a concept, and the corresponding value is the set of all positions at which that concept occurs in the reference multiple-concept sequence. In step 403, all concepts at a position sLoc of the multiple-concept sequence under test mcs1 are taken out. In step 404, the positions in mcs2 of the concepts at sLoc are looked up and stored in the array rLocArray. In step 405, sLoc is paired with each position rLoc in rLocArray to form position pairs (sLoc, rLoc). In step 406, the candidate plagiarism evidence list list2 is updated with the position pairs (sLoc, rLoc). Step 407 determines whether list2 needs to be checked; if so, the process goes to step 408, otherwise to step 409. In step 408, the evidence in list2 is checked, evidence that meets the requirements is added to the plagiarism evidence list list, and list2 is then emptied. Step 409 determines whether all positions of mcs1 have been processed. If so, the process goes to step 410; otherwise it returns to step 403 and the loop continues until all positions of mcs1 have been processed. In step 410, the evidence in the plagiarism evidence list list is merged, and evidence whose length is below a certain threshold is removed.
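Steps 402 to 404 describe a hash table keyed by concept. A minimal Python sketch of such a position index follows, using a dictionary as the hash table. The function names are illustrative; the names mcs2, sLoc, and rLocArray from the text correspond to ref_mcs, the queried concept array, and the returned list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def build_position_index(ref_mcs: Sequence[Iterable[str]]) -> Dict[str, List[int]]:
    """Map each concept to every position where it occurs in the reference sequence."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos, concepts in enumerate(ref_mcs):
        for concept in set(concepts):   # avoid recording one position twice
            index[concept].append(pos)
    return index


def lookup_positions(index: Dict[str, List[int]], concept_array: Iterable[str]) -> List[int]:
    """Reference positions sharing at least one concept with the given concept array."""
    found = set()
    for concept in concept_array:
        found.update(index.get(concept, ()))
    return sorted(found)
```

Building the index takes one pass over the reference sequence, after which each lookup costs time proportional to the number of concepts at the queried position plus the number of positions returned.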
The basic idea of the cross-lingual electronic text plagiarism detection method of the present invention is as follows. First, multiple-concept sequences are built for texts in different languages using the cross-lingual ontology. A multiple-concept sequence represents text at the concept level, which resolves the differences between languages at the character-string level. In addition, because a concept represents the meaning of a word, synonyms map to the same concept, which to a certain extent handles the frequently occurring phenomenon of synonym substitution. Copy detection is then performed on the multiple-concept sequences. A hash table is used to build a position index of the reference multiple-concept sequence, and each position of the multiple-concept sequence under test is examined in turn to determine which positions of the reference multiple-concept sequence share a concept with it. A position in the sequence under test and a position in the reference sequence form a position pair, and the candidate evidence list is built and maintained from these position pairs. When a position pair is used to update a candidate evidence item, successive position pairs are not required to be in order; instead, a certain extension interval is allowed around the boundary of the existing evidence. This addresses, to a certain extent, the word-order inconsistency found in translation-based cross-lingual copying. The candidate evidence list check filters out evidence that does not meet the density requirement and adds suitable evidence to the evidence list. Finally, the multiple evidence lists of the same document under test are merged and organized to obtain the detection result, which comprises the specific plagiarism evidence and the text similarity.
Claims (9)
1. A cross-language electronic text plagiarism detection method, characterized by comprising the following steps:
Step 1: divide the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
Step 2: using a cross-language ontology, look up the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, according to the concepts found, represent the paragraph set under test and the reference paragraph set as multiple concept sequences under test and reference multiple concept sequences, respectively;
Step 3: using the multiple concept sequence under test, retrieve the reference multiple concept sequence that shares the most concepts with it;
Step 4: perform detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test, and generate a plagiarism evidence list;
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
Step 6: output and display the detection result.
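Step 3 can be sketched as follows. The claim does not say how "the most common concepts" is counted, so this sketch assumes the count of distinct shared concepts; `retrieve_best_reference` and the sample identifiers are hypothetical.

```python
def concept_set(mcs):
    """All distinct concepts that occur anywhere in a multiple concept sequence."""
    return {concept for array in mcs for concept in array}

def retrieve_best_reference(test_mcs, reference_mcs_by_id):
    """Return (reference id, overlap) for the reference sharing the most concepts.

    Assumption: overlap = number of distinct concepts in common.
    """
    test_concepts = concept_set(test_mcs)
    best_id, best_overlap = None, 0
    for ref_id, ref_mcs in reference_mcs_by_id.items():
        overlap = len(test_concepts & concept_set(ref_mcs))
        if overlap > best_overlap:
            best_id, best_overlap = ref_id, overlap
    return best_id, best_overlap

references = {
    "r1": [["a"], ["b"]],
    "r2": [["a", "x"], ["c"], ["d"]],
}
best = retrieve_best_reference([["a"], ["c", "y"], ["d"]], references)
print(best)
```

A production system would more likely query an inverted index over all reference sequences than scan them one by one; the linear scan is used here only to keep the sketch short.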
2. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that step 2 specifically comprises the following steps:
1) perform word segmentation and stop-word filtering on the paragraph set under test and the reference paragraph set, obtaining word sequences of the paragraphs under test and word sequences of the reference paragraphs, respectively;
2) using the cross-language ontology, look up the concepts corresponding to a word in each word sequence, and add all concepts of the word to a candidate concept array;
3) if the candidate concept array of the word contains concepts of only one part of speech, select at most N concepts from the candidate concept array and store them in the multiple concept sequence; if the candidate concept array contains concepts of M parts of speech, select at most N concepts for each part of speech and store these at most M × N concepts in the multiple concept sequence;
4) repeat steps 2) and 3) until all words in the word sequence have been processed, forming the multiple concept sequence under test and the reference multiple concept sequence.
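The steps of claim 2 can be sketched as below. Everything concrete here is an assumption made for illustration: the toy ontology, the stop-word list, whitespace tokenization (Chinese text would need a real word segmenter), taking the first N concepts per part of speech in ontology order, and skipping words that have no concept.

```python
# Hypothetical toy ontology: word -> list of (concept id, part of speech).
TOY_ONTOLOGY = {
    "dog": [("c_dog", "n"), ("c_chase", "v"), ("c_fellow", "n"), ("c_cad", "n")],
    "狗": [("c_dog", "n")],
    "runs": [("c_run", "v"), ("c_operate", "v")],
}
STOP_WORDS = {"the", "a"}

def select_concepts(candidates, n):
    """Keep at most n concepts per part of speech (at most M x n in total)."""
    per_pos = {}
    for concept, pos in candidates:
        bucket = per_pos.setdefault(pos, [])
        if len(bucket) < n:
            bucket.append(concept)
    return [concept for bucket in per_pos.values() for concept in bucket]

def to_mcs(paragraph, ontology=TOY_ONTOLOGY, n=2):
    """Turn one paragraph into a multiple concept sequence (list of concept arrays)."""
    words = [w for w in paragraph.lower().split() if w not in STOP_WORDS]
    mcs = []
    for word in words:
        candidates = ontology.get(word, [])
        if candidates:  # assumption: words without a concept are skipped
            mcs.append(select_concepts(candidates, n))
    return mcs

print(to_mcs("The dog runs"))
print(to_mcs("狗"))
```

With N = 2, the word "dog" keeps two noun concepts and one verb concept, and the Chinese word maps to the same `c_dog` concept, which is what makes the later comparison language independent.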
3. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that, in step 4, performing detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test specifically comprises the following steps:
1) create a candidate plagiarism evidence list and a plagiarism evidence list;
2) build a position index for the reference multiple concept sequence that shares the most concepts, the position index being organized as a hash table so that the positions at which a concept of the multiple concept sequence under test occurs in the reference multiple concept sequence can be looked up through the index;
3) initialize the current gap variable G to 0;
4) take the concept array at one position of the multiple concept sequence under test and look up all concepts of that array in the position index to obtain a position set;
5) if the position set is empty, increment the gap variable G by 1 and go to step 8); otherwise, set the gap variable G to 0;
6) form position pairs from the position of the concept in the multiple concept sequence under test and the positions in the position set, and, for each evidence in the candidate plagiarism evidence list, update the evidence with the position pairs;
7) when the distance between the position of the concept in the reference multiple concept sequence and every evidence in the candidate plagiarism evidence list is greater than a preset position threshold, create a new evidence from the position pair and add it to the candidate plagiarism evidence list;
8) if the position in the multiple concept sequence under test reaches the end of a sentence, or the gap variable G is greater than a preset threshold, perform the candidate plagiarism evidence list check operation, add the plagiarism evidence that meets the density requirement to the plagiarism evidence list, then set the gap variable G to 0 and empty the candidate plagiarism evidence list;
9) repeat steps 4) to 8) until all positions in the multiple concept sequence under test have been processed;
10) merge the evidence in the plagiarism evidence list, then remove any evidence whose length is less than the preset position threshold.
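The scan in steps 1) to 10) can be approximated by the simplified Python sketch below. It is not the patented implementation: the `Evidence` class, the parameter names and the default values are hypothetical, the end of the sequence stands in for "end of sentence", evidence length is measured on the side under test, and the merge part of step 10) is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    t_hits: set = field(default_factory=set)  # matched positions, text under test
    r_hits: set = field(default_factory=set)  # matched positions, reference text

    def add(self, t_pos, r_pos):
        self.t_hits.add(t_pos)
        self.r_hits.add(r_pos)

    @staticmethod
    def span(hits):
        return max(hits) - min(hits) + 1

    def near(self, r_pos, threshold):
        # Extension interval around the reference-side boundary of the evidence.
        return min(self.r_hits) - threshold <= r_pos <= max(self.r_hits) + threshold

    def dense(self, t):
        return (len(self.t_hits) / self.span(self.t_hits) >= t
                and len(self.r_hits) / self.span(self.r_hits) >= t)

def detect(test_mcs, ref_mcs, max_gap=2, position_threshold=2, density=0.5):
    index = {}  # step 2: hash-table position index over the reference
    for r_pos, concepts in enumerate(ref_mcs):
        for concept in concepts:
            index.setdefault(concept, set()).add(r_pos)
    candidates, accepted, gap = [], [], 0  # steps 1 and 3
    last = len(test_mcs) - 1
    for t_pos, concepts in enumerate(test_mcs):  # step 4
        hits = sorted(set().union(*(index.get(c, set()) for c in concepts)))
        if not hits:
            gap += 1  # step 5
        else:
            gap = 0
            for r_pos in hits:  # steps 6 and 7
                close = [ev for ev in candidates if ev.near(r_pos, position_threshold)]
                for ev in close:
                    ev.add(t_pos, r_pos)
                if not close:
                    new_ev = Evidence()
                    new_ev.add(t_pos, r_pos)
                    candidates.append(new_ev)
        if t_pos == last or gap > max_gap:  # step 8
            accepted.extend(ev for ev in candidates if ev.dense(density))
            candidates, gap = [], 0
    # Step 10 without merging: drop evidence shorter than the position threshold.
    return [ev for ev in accepted if Evidence.span(ev.t_hits) >= position_threshold]

ref = [["a"], ["b"], ["c"], ["d"], ["z"]]
test = [["x"], ["b"], ["a"], ["c"], ["q"], ["q"], ["q"], ["q"], ["d"]]
result = detect(test, ref)
print([(sorted(ev.t_hits), sorted(ev.r_hits)) for ev in result])
```

In the sample, the text under test reproduces the reference concepts in the order b, a, c; the extension interval still groups them into one evidence, while the isolated match on "d" is discarded by the length filter.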
4. The cross-language electronic text plagiarism detection method according to claim 3, characterized in that plagiarism evidence meeting the density requirement satisfies the following:
1) the plagiarism evidence comprises a fragment of the multiple concept sequence under test and a fragment of the reference multiple concept sequence;
2) with Ls denoting the total number of positions in the fragment of the multiple concept sequence under test and Ns the number of detected positions, Ns/Ls is not less than a density threshold T;
3) with Lr denoting the total number of positions in the fragment of the reference multiple concept sequence and Nr the number of detected positions, Nr/Lr is not less than the density threshold T.
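As a small worked example of the density requirement, the check can be written as a single predicate. The function name and the sample numbers are hypothetical; only the two ratios and the threshold T come from the claim.

```python
def meets_density(ns, ls, nr, lr, t):
    """True if both fragments satisfy Ns/Ls >= T and Nr/Lr >= T."""
    if ls <= 0 or lr <= 0:
        return False  # an empty fragment cannot be evidence
    return ns / ls >= t and nr / lr >= t

# Fragment under test: 6 of 10 positions detected; reference: 7 of 10.
print(meets_density(6, 10, 7, 10, 0.5))
# Same reference fragment, but only 4 of 10 positions detected under test.
print(meets_density(4, 10, 7, 10, 0.5))
```

Requiring the threshold on both sides rejects evidence in which a few scattered matches in one text are stretched over a long span of the other.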
5. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the detection result is generated by the following steps:
(1) merge the plagiarism evidence of the same document under test according to positions in the multiple concept sequence under test;
(2) map the reference multiple concept sequence positions to positions in the text character stream;
(3) calculate the similarity between the text under test and the reference text.
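The three steps can be sketched as follows. The claim does not give the merge rule, the mapping table or the similarity formula, so this sketch assumes that overlapping or adjacent position intervals are merged, that each sequence position has a known character span, and that similarity is the fraction of positions covered by evidence. All names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) position intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def to_char_range(interval, offsets):
    """Map a position interval to a character range via per-position offsets."""
    start, end = interval
    return offsets[start][0], offsets[end][1]

def similarity(merged, total_positions):
    """Assumed measure: share of sequence positions covered by evidence."""
    if total_positions == 0:
        return 0.0
    covered = sum(end - start + 1 for start, end in merged)
    return covered / total_positions

merged = merge_intervals([(5, 8), (0, 2), (2, 4), (10, 12)])
offsets = [(i * 4, i * 4 + 3) for i in range(16)]  # hypothetical 4-character words
print(merged)
print(to_char_range(merged[0], offsets))
print(similarity(merged, 16))
```

Other similarity definitions (for example, one weighted by character length) would fit the claim equally well; the coverage ratio is used here only because it is easy to verify by hand.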
6. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that each position of the multiple concept sequence holds one or more concepts, and the multiple concept sequence is defined as:
MCS = <Carray1, Carray2, …, Carrayn>
where MCS is the multiple concept sequence, Carrayn is the n-th concept array, located at the n-th position of the MCS, and n is a positive integer.
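One way to picture this definition is as a list of non-empty concept arrays. The type aliases, the validity check and the concept identifiers below are hypothetical illustrations, not part of the claim.

```python
from typing import List

ConceptArray = List[str]                       # Carray_i: concepts at position i
MultipleConceptSequence = List[ConceptArray]   # MCS = <Carray_1, ..., Carray_n>

def is_valid_mcs(mcs: MultipleConceptSequence) -> bool:
    """n must be positive and every position must hold at least one concept."""
    return len(mcs) >= 1 and all(len(array) >= 1 for array in mcs)

# An ambiguous word contributes several concepts at a single position.
example: MultipleConceptSequence = [["c_dog", "c_fellow"], ["c_run"]]
print(is_valid_mcs(example))
```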
7. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the basic unit of the cross-language ontology is the concept, and a concept represents a definite meaning or semantics.
8. The cross-language electronic text plagiarism detection method according to claim 1, characterized in that the electronic text under test and the reference electronic text are natural language texts in different languages.
9. A cross-language electronic text plagiarism detection system, characterized by comprising:
an electronic text preprocessing module, configured to convert the input electronic texts into a unified encoding format and to divide the electronic text under test and the reference electronic text into paragraphs, obtaining a paragraph set under test and a reference paragraph set;
a conceptualization module, configured to look up, using a cross-language ontology, the concepts corresponding to the words in the paragraph set under test and in the reference paragraph set, and, according to the concepts found, to represent the paragraph set under test and the reference paragraph set as multiple concept sequences under test and reference multiple concept sequences, respectively;
a retrieval module, configured to retrieve, using the multiple concept sequence under test, the reference multiple concept sequence that shares the most concepts with it;
a detection result generation module, configured to perform detection on the retrieved reference multiple concept sequence that shares the most concepts with the multiple concept sequence under test, and to generate a plagiarism evidence list;
a detection result display module, configured to merge and organize the plagiarism evidence list and to generate the detection result.
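The preprocessing module can be sketched in a few lines. The claim fixes neither the target encoding nor the paragraph rule, so this sketch assumes that the source encoding is known, that texts are decoded to Python's Unicode strings as the unified format, and that paragraphs are separated by blank lines; `preprocess` is a hypothetical name.

```python
import re

def preprocess(raw: bytes, encoding: str) -> list:
    """Decode to a unified (Unicode) form and split into non-empty paragraphs."""
    text = raw.decode(encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line ends
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# A GBK-encoded Chinese text and a UTF-8 English text end up in the same form.
zh_paragraphs = preprocess("第一段。\r\n\r\n第二段。".encode("gbk"), "gbk")
en_paragraphs = preprocess(b"First paragraph.\n\nSecond paragraph.", "utf-8")
print(zh_paragraphs, en_paragraphs)
```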
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410062327.1A CN103823862B (en) | 2014-02-24 | 2014-02-24 | Cross-linguistic electronic text plagiarism detection system and detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103823862A (en) | 2014-05-28 |
CN103823862B (en) | 2017-02-15 |
Family
ID=50758926
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410062327.1A Expired - Fee Related CN103823862B (en) | 2014-02-24 | 2014-02-24 | Cross-linguistic electronic text plagiarism detection system and detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103823862B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100114924A1 (en) * | 2008-10-17 | 2010-05-06 | Software Analysis And Forensic Engineering Corporation | Searching The Internet For Common Elements In A Document In Order To Detect Plagiarism |
CN101404037A (en) * | 2008-11-18 | 2009-04-08 | 西安交通大学 | Method for detecting and positioning electronic text contents plagiary |
CN102360372A (en) * | 2011-10-09 | 2012-02-22 | 北京航空航天大学 | Cross-language document similarity detection method |
CN103544326A (en) * | 2013-11-14 | 2014-01-29 | 上海交通大学 | Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations |
Non-Patent Citations (1)
Title |
---|
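何文垒 (He Wenlei): "Research on Chinese-English Cross-Language Text Similarity Based on WordNet", China Master's Theses Full-text Database, Information Science and Technology Series * |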
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105224518A (en) * | 2014-06-17 | 2016-01-06 | 腾讯科技(深圳)有限公司 | The lookup method of the computing method of text similarity and system, Similar Text and system |
CN105224518B (en) * | 2014-06-17 | 2020-03-17 | 腾讯科技(深圳)有限公司 | Text similarity calculation method and system and similar text search method and system |
CN104657350A (en) * | 2015-03-04 | 2015-05-27 | 中国科学院自动化研究所 | Hash learning method for short text integrated with implicit semantic features |
CN104657350B (en) * | 2015-03-04 | 2017-06-09 | 中国科学院自动化研究所 | Merge the short text Hash learning method of latent semantic feature |
CN104699785A (en) * | 2015-03-10 | 2015-06-10 | 中国石油大学(华东) | Paper similarity detection method |
CN109492228A (en) * | 2017-06-28 | 2019-03-19 | 三角兽(北京)科技有限公司 | Information processing unit and its participle processing method |
CN109492228B (en) * | 2017-06-28 | 2020-01-14 | 三角兽(北京)科技有限公司 | Information processing apparatus and word segmentation processing method thereof |
CN107862045A (en) * | 2017-11-07 | 2018-03-30 | 哈尔滨工程大学 | A kind of across language plagiarism detection method based on multiple features |
CN107862045B (en) * | 2017-11-07 | 2022-01-14 | 哈尔滨工程大学 | Cross-language plagiarism detection method based on multiple features |
Also Published As
Publication number | Publication date |
---|---|
CN103823862B (en) | 2017-02-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170215 |