Summary of the invention
In view of the above defects or deficiencies, the object of the present invention is to provide a cross-language electronic text plagiarism detection method that can detect plagiarism between texts written in different languages.
To achieve the above object, the technical solution of the present invention is as follows:
A cross-language electronic text plagiarism detection method, comprising the following steps:
Step 1: divide the electronic text under test and the reference electronic text into paragraphs, obtaining a set of paragraphs under test and a set of reference paragraphs;
Step 2: using a cross-language ontology, look up the concepts corresponding to the words in the set of paragraphs under test and in the set of reference paragraphs and, according to the concepts found, represent the set of paragraphs under test as multi-concept sequences under test and the set of reference paragraphs as reference multi-concept sequences;
Step 3: for a multi-concept sequence under test, retrieve the reference multi-concept sequences that share the most concepts with it;
Step 4: perform detection on the retrieved reference multi-concept sequences that share the most concepts with the multi-concept sequence under test, and generate a plagiarism evidence list;
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
Step 6: output and display the detection result.
Step 2 specifically comprises the following sub-steps:
1) perform word segmentation and stop-word filtering on the set of paragraphs under test and on the set of reference paragraphs, obtaining word sequences for the paragraphs under test and word sequences for the reference paragraphs, respectively;
2) using the cross-language ontology, look up the concepts corresponding to a word in each word sequence, and add all concepts of the word to a candidate concept array;
3) if the candidate concept array of the word contains concepts of only one part of speech, select at most N concepts from the candidate concept array and store them in the multi-concept sequence; if the candidate concept array of the word contains concepts of M parts of speech, select at most N concepts for each part of speech and store these at most M × N concepts in the multi-concept sequence;
4) repeat sub-steps 2) and 3) until all words in the word sequence have been processed, forming the multi-concept sequences under test and the reference multi-concept sequences.
In step 4, detection on a retrieved reference multi-concept sequence that shares the most concepts with the multi-concept sequence under test specifically comprises the following sub-steps:
1) create a candidate plagiarism evidence list and a plagiarism evidence list;
2) build a position index for the reference multi-concept sequence that shares the most concepts; the position index is organized as a hash table, so that the positions at which a concept of the multi-concept sequence under test occurs in the reference multi-concept sequence can be looked up through the index;
3) initialize the current gap variable G to 0;
4) take the concept array at one position of the multi-concept sequence under test, look up every concept of that concept array in the position index, and obtain a position set;
5) if the position set is empty, increase the gap variable G by 1 and go to sub-step 8); otherwise set the gap variable G to 0;
6) pair the position of the concept array in the multi-concept sequence under test with each position in the position set to form position pairs, and use the position pairs to update each evidence item in the candidate plagiarism evidence list;
7) when the distance between the reference-sequence position of a position pair and every evidence item in the candidate plagiarism evidence list is greater than a preset position threshold, create a new evidence item from this position pair and add it to the candidate plagiarism evidence list;
8) if the current position in the multi-concept sequence under test reaches the end of a sentence, or the gap variable G is greater than a preset threshold, check the candidate plagiarism evidence list: add the evidence items that meet the density requirement to the plagiarism evidence list, then set the gap variable G to 0 and empty the candidate plagiarism evidence list;
9) repeat sub-steps 4) to 8) until all positions in the multi-concept sequence under test have been processed;
10) merge the evidence items in the plagiarism evidence list, then remove the evidence items whose length is less than the preset position threshold.
A plagiarism evidence item that meets the density requirement satisfies the following:
1) the evidence item comprises a fragment of the multi-concept sequence under test and a fragment of the reference multi-concept sequence;
2) let Ls be the total number of positions in the fragment of the multi-concept sequence under test and Ns the number of those positions that were detected as matching; then Ns/Ls is not less than a density threshold T;
3) let Lr be the total number of positions in the fragment of the reference multi-concept sequence and Nr the number of those positions that were detected as matching; then Nr/Lr is not less than the density threshold T.
The detection result is generated according to the following steps:
(1) merge the plagiarism evidence belonging to the same document under test according to positions in the multi-concept sequence under test;
(2) map the positions in the reference multi-concept sequence to positions in the text character stream;
(3) calculate the similarity between the text under test and the reference text.
Each position of the multi-concept sequence holds one or more concepts. The multi-concept sequence is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multi-concept sequence, Carray_n is the n-th concept array, located at the n-th position of the MCS, and n is a positive integer.
The basic unit of the cross-language ontology is the concept; a concept represents one definite meaning.
The electronic text under test and the reference electronic text are natural-language texts in different languages.
A cross-language electronic text plagiarism detection system, comprising:
an electronic text preprocessing module, configured to convert the input electronic texts into a unified encoding format and to divide the electronic text under test and the reference electronic text into paragraphs, obtaining a set of paragraphs under test and a set of reference paragraphs;
a conceptualization module, configured to look up, using a cross-language ontology, the concepts corresponding to the words in the set of paragraphs under test and in the set of reference paragraphs and, according to the concepts found, to represent the set of paragraphs under test as multi-concept sequences under test and the set of reference paragraphs as reference multi-concept sequences;
a retrieval module, configured to retrieve, for a multi-concept sequence under test, the reference multi-concept sequences that share the most concepts with it;
a detection result generation module, configured to perform detection on the retrieved reference multi-concept sequences that share the most concepts with the multi-concept sequence under test and to generate a plagiarism evidence list;
a detection result display module, configured to merge and organize the plagiarism evidence list and to generate a detection result.
Compared with the prior art, the beneficial effects of the present invention are as follows:
By using a cross-language ontology, texts in different languages are modeled at the concept level, so that the electronic text under test and the reference electronic text receive a unified representation at that level. Because a concept represents one definite meaning, synonymous words are mapped to the same concept, which solves the synonym replacement problem to a certain extent. The detection algorithm then performs cross-language plagiarism detection on the basis of this concept model. Furthermore, the multi-concept sequences built in the present invention allow the electronic text under test to be retrieved and compared thoroughly against the reference electronic text, which improves detection accuracy.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings.
The present invention provides a cross-language electronic text plagiarism detection method, comprising the following steps:
Step 1: divide the electronic text under test and the reference electronic text into paragraphs, obtaining a set of paragraphs under test and a set of reference paragraphs;
Specifically, the input electronic text under test and the reference electronic text are converted into a unified encoding format, such as UTF-8. The electronic texts to be checked are natural-language texts in Chinese, English, French, German, Russian, Japanese, Spanish, or another language, not audio, video, pictures, or similar information. The text under test and the reference text are natural-language texts in different languages, not texts in the same language.
Step 2: using a cross-language ontology, look up the concepts corresponding to the words in the set of paragraphs under test and in the set of reference paragraphs and, according to the concepts found, represent the set of paragraphs under test as multi-concept sequences under test and the set of reference paragraphs as reference multi-concept sequences;
A cross-language ontology is used to provide background knowledge. The basic unit of the cross-language ontology is the concept, and a concept represents one definite meaning. The cross-language ontology covers at least two languages, so that words of different languages can be mapped to unified concepts.
Text paragraphs are represented as multi-concept sequences. Each position of a multi-concept sequence may hold one or more concepts, rather than being limited to a single concept. A multi-concept sequence can be regarded as a sequence of concept arrays and is defined as:
MCS = <Carray_1, Carray_2, …, Carray_n>
where MCS is the multi-concept sequence, Carray_n is the n-th concept array, located at the n-th position of the MCS, and n is a positive integer.
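As a minimal, non-limiting sketch of the definition above, a multi-concept sequence may be held in memory as a list of concept sets, one set per position. The concept identifiers (such as "c_bank_fin") and the helper name share_concept are hypothetical and used only for illustration; the specification does not fix a concrete identifier format.

```python
from typing import FrozenSet, List

# Assumption: a concept is identified by an opaque string ID.
ConceptArray = FrozenSet[str]
MCS = List[ConceptArray]

# Hypothetical example: the first word is ambiguous and keeps two concepts,
# the second word maps to a single concept.
mcs: MCS = [
    frozenset({"c_bank_fin", "c_bank_river"}),  # position 1
    frozenset({"c_loan"}),                      # position 2
]


def share_concept(a: ConceptArray, b: ConceptArray) -> bool:
    """Return True when two concept arrays have at least one concept in common."""
    return not a.isdisjoint(b)
```

Under this representation, two positions from texts in different languages can be compared with a single set operation, regardless of the surface words involved.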
Step 2 specifically comprises the following sub-steps:
1) perform word segmentation and stop-word filtering on the set of paragraphs under test and on the set of reference paragraphs, obtaining word sequences for the paragraphs under test and word sequences for the reference paragraphs, respectively;
2) using the cross-language ontology, look up the concepts corresponding to a word in each word sequence, and add all concepts of the word to a candidate concept array;
3) if the candidate concept array of the word contains concepts of only one part of speech, select at most N concepts from the candidate concept array and store them in the multi-concept sequence; if the candidate concept array of the word contains concepts of M parts of speech, select at most N concepts for each part of speech and store these at most M × N concepts in the multi-concept sequence;
4) repeat sub-steps 2) and 3) until all words in the word sequence have been processed, forming the multi-concept sequences under test and the reference multi-concept sequences.
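The sub-steps of step 2 can be sketched as follows. This is an illustrative sketch only: the toy ontology, the stop-word list, and the function name build_mcs are invented for the example; whitespace splitting stands in for real word segmentation; concepts are taken in ontology order because the specification does not state how the at most N concepts are chosen; and words absent from the ontology are simply skipped, which is also an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Toy cross-language ontology: word -> list of (concept_id, part_of_speech).
# Every entry is hypothetical.
ONTOLOGY: Dict[str, List[Tuple[str, str]]] = {
    "bank": [("c_bank_fin", "n"), ("c_bank_river", "n"), ("c_bank_tilt", "v")],
    "banque": [("c_bank_fin", "n")],
    "loan": [("c_loan", "n")],
    "prêt": [("c_loan", "n"), ("c_ready", "adj")],
}
STOP_WORDS: Set[str] = {"the", "a", "la", "le", "un"}


def build_mcs(paragraph: str, n: int = 2) -> List[List[str]]:
    """Build one concept array per content word, keeping at most n concepts
    for each part of speech (so at most M * n concepts per position)."""
    mcs: List[List[str]] = []
    for word in paragraph.lower().split():  # stand-in for word segmentation
        if word in STOP_WORDS:              # stop-word filtering
            continue
        by_pos: Dict[str, List[str]] = defaultdict(list)
        for concept, pos in ONTOLOGY.get(word, []):  # candidate concept array
            by_pos[pos].append(concept)
        chosen = [c for concepts in by_pos.values() for c in concepts[:n]]
        if chosen:                          # assumption: skip unknown words
            mcs.append(chosen)
    return mcs
```

With this toy ontology, the English "bank" and the French "banque" both yield the concept "c_bank_fin", which is what allows the later steps to compare paragraphs across languages.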
Step 3: for a multi-concept sequence under test, retrieve the reference multi-concept sequences that share the most concepts with it; a retrieved reference multi-concept sequence satisfies the following conditions:
1) the retrieved reference multi-concept sequence and the multi-concept sequence under test have enough common concepts;
2) in the multi-concept sequence under test, the number of positions holding at least one concept that occurs in the reference multi-concept sequence exceeds a preset threshold;
3) in the reference multi-concept sequence, the number of positions holding at least one concept that occurs in the multi-concept sequence under test exceeds a preset threshold.
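One possible reading of the retrieval conditions above is sketched below. The function names, the use of a single threshold for both directions, and the ranking by matched positions are assumptions made for illustration; the specification does not prescribe a particular retrieval structure or ranking.

```python
from typing import Dict, List, Set

MCS = List[Set[str]]


def matched_positions(mcs: MCS, other: MCS) -> int:
    """Count the positions of `mcs` holding at least one concept found in `other`."""
    other_concepts: Set[str] = set().union(*other)
    return sum(1 for array in mcs if array & other_concepts)


def retrieve(test: MCS, references: Dict[str, MCS], threshold: int = 1) -> List[str]:
    """Return the IDs of reference sequences passing both position-count
    conditions, ordered by decreasing number of matched test positions."""
    scored = []
    for ref_id, ref in references.items():
        s = matched_positions(test, ref)  # condition on the sequence under test
        r = matched_positions(ref, test)  # condition on the reference sequence
        if s > threshold and r > threshold:
            scored.append((s, ref_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [ref_id for _, ref_id in scored]
```

A linear scan is used here only for brevity; a practical system would more likely query an inverted index over the reference collection.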
Step 4: perform detection on the retrieved reference multi-concept sequences that share the most concepts with the multi-concept sequence under test, and generate a plagiarism evidence list;
Detection on the multi-concept sequences specifically comprises the following sub-steps:
1) create a candidate plagiarism evidence list and a plagiarism evidence list;
2) build a position index for the reference multi-concept sequence that shares the most concepts; the position index is organized as a hash table, so that the positions at which a concept of the multi-concept sequence under test occurs in the reference multi-concept sequence can be looked up through the index;
3) initialize the current gap variable G to 0;
4) take the concept array at one position of the multi-concept sequence under test, look up every concept of that concept array in the position index, and obtain a position set;
5) if the position set is empty, increase the gap variable G by 1 and go to sub-step 8); otherwise set the gap variable G to 0;
6) pair the position of the concept array in the multi-concept sequence under test with each position in the position set to form position pairs, and use the position pairs to update each evidence item in the candidate plagiarism evidence list;
7) when the distance between the reference-sequence position of a position pair and every evidence item in the candidate plagiarism evidence list is greater than a preset position threshold, create a new evidence item from this position pair and add it to the candidate plagiarism evidence list;
8) if the current position in the multi-concept sequence under test reaches the end of a sentence, or the gap variable G is greater than a preset threshold, check the candidate plagiarism evidence list: add the evidence items that meet the density requirement to the plagiarism evidence list, then set the gap variable G to 0 and empty the candidate plagiarism evidence list; a plagiarism evidence item that meets the density requirement has the following characteristics:
(1) the evidence item comprises a fragment of the multi-concept sequence under test and a fragment of the reference multi-concept sequence;
(2) let Ls be the total number of positions in the fragment of the multi-concept sequence under test and Ns the number of those positions that were detected as matching; then Ns/Ls is not less than a density threshold T;
(3) let Lr be the total number of positions in the fragment of the reference multi-concept sequence and Nr the number of those positions that were detected as matching; then Nr/Lr is not less than the density threshold T.
9) repeat sub-steps 4) to 8) until all positions in the multi-concept sequence under test have been processed;
10) merge the evidence items in the plagiarism evidence list, then remove the evidence items whose length is less than a given threshold.
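The density requirement used in sub-step 8) can be illustrated with the following sketch. The Evidence class and its field names are hypothetical, and the fragment length L is assumed here to be the span from the first to the last matched position; the specification only states that Ns/Ls and Nr/Lr must both be not less than T.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Evidence:
    # Matched positions in the sequence under test and in the reference sequence.
    s_hits: Set[int] = field(default_factory=set)
    r_hits: Set[int] = field(default_factory=set)


def density(hits: Set[int]) -> float:
    """N / L, where N is the number of matched positions and L is the number
    of positions spanned by the fragment (assumed: first hit to last hit)."""
    if not hits:
        return 0.0
    span = max(hits) - min(hits) + 1
    return len(hits) / span


def meets_density(evidence: Evidence, t: float = 0.5) -> bool:
    """Both fragments must reach the density threshold T."""
    return density(evidence.s_hits) >= t and density(evidence.r_hits) >= t
```

For example, an evidence item matching positions 1, 2 and 4 of the sequence under test has a density of 3/4 on that side, while one matching only positions 1 and 9 has a density of 2/9 and would be filtered out at T = 0.5.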
Step 5: merge and organize the plagiarism evidence list to generate a detection result;
The detection result is generated according to the following steps:
(1) merge the plagiarism evidence belonging to the same document under test according to positions in the multi-concept sequence under test;
(2) map the positions in the reference multi-concept sequence to positions in the text character stream;
(3) calculate the similarity between the text under test and the reference text;
Step 6: output and display the detection result.
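The merging and similarity steps above may be sketched as follows. The specification does not give the similarity formula; the coverage ratio used here (positions covered by merged evidence divided by the total number of positions in the sequence under test) is an assumption chosen for illustration, as are the names merge_spans and similarity. The mapping of positions to character offsets is omitted.

```python
from typing import List, Tuple

# Inclusive (start, end) positions of an evidence item in the sequence under test.
Span = Tuple[int, int]


def merge_spans(spans: List[Span], max_gap: int = 0) -> List[Span]:
    """Merge evidence spans that overlap or lie within max_gap positions of each other."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def similarity(spans: List[Span], total_positions: int) -> float:
    """Assumed measure: fraction of positions covered by merged evidence."""
    if total_positions == 0:
        return 0.0
    covered = sum(end - start + 1 for start, end in merge_spans(spans))
    return covered / total_positions
```

Merging before counting ensures that positions covered by several overlapping evidence items are counted only once.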
The present invention also provides a cross-language electronic text plagiarism detection system, comprising:
an electronic text preprocessing module 10, configured to convert the input electronic texts into a unified encoding format and to divide the electronic text under test and the reference electronic text into paragraphs, obtaining a set of paragraphs under test and a set of reference paragraphs;
a conceptualization module 20, configured to look up, using a cross-language ontology, the concepts corresponding to the words in the set of paragraphs under test and in the set of reference paragraphs and, according to the concepts found, to represent the set of paragraphs under test as multi-concept sequences under test and the set of reference paragraphs as reference multi-concept sequences;
a retrieval module 30, configured to retrieve, for a multi-concept sequence under test, the reference multi-concept sequences that share the most concepts with it;
a detection result generation module 40, configured to perform detection on the retrieved reference multi-concept sequences that share the most concepts with the multi-concept sequence under test and to generate a plagiarism evidence list;
a detection result display module 50, configured to merge and organize the plagiarism evidence list and to generate a detection result.
The following is a preferred embodiment provided by the inventors.
Referring to Fig. 1, Fig. 1 is the overall module diagram of the method of the present invention. The method comprises at least an electronic text preprocessing module 10, a conceptualization module 20, a retrieval module 30, a detection result generation module 40 and a detection result display module 50. The electronic text preprocessing module 10 is connected to the conceptualization module 20, the conceptualization module 20 is connected to the retrieval module 30, the retrieval module 30 is connected to the detection result generation module 40, and the detection result generation module 40 is connected to the detection result display module 50.
In the electronic text preprocessing module 10, the electronic texts are converted into a unified encoding format. The electronic texts are then divided into paragraphs, obtaining a set of paragraphs under test and a set of reference paragraphs.
In the conceptualization module 20, text paragraphs in different languages are represented as multi-concept sequences by means of the cross-language ontology.
In the retrieval module 30, for each multi-concept sequence under test that needs to be checked, several reference multi-concept sequences are retrieved from the set of reference multi-concept sequences; these reference multi-concept sequences have enough concepts in common with the multi-concept sequence under test.
In the detection result generation module 40, plagiarism detection is performed on the basis of the multi-concept sequences, and a detection result is obtained.
Finally, the detection result display module 50 displays the detection result to the user.
Referring to Fig. 2, Fig. 2 is a schematic diagram of the structure of the multi-concept sequence of the present invention. A multi-concept sequence can be regarded as a sequence of concept arrays; a concept array contains the concepts corresponding to one word, and each concept array occupies one position in the sequence. For example, in the multi-concept sequence MCS = <a_1, a_2, …, a_n>, concept array a_1 is at the 1st position, concept array a_2 is at the 2nd position, and so on. Each position of a multi-concept sequence may hold multiple concepts, rather than being limited to a single concept.
Referring to Fig. 3, Fig. 3 is the flow chart of multi-concept sequence construction according to the present invention.
First, step 301 is performed: a text paragraph p is read into the computer. In step 302, word segmentation and stop-word filtering are applied to the text paragraph p. In step 303, a word sequence is created and the words of the text paragraph p are added to it. In step 304, a word w is taken from the word sequence. In step 305, a candidate concept array a is created, the concepts corresponding to the word w are looked up in the cross-language ontology, and these concepts are added to the candidate concept array a. Step 306 determines whether the candidate concept array a contains concepts of only one part of speech. If so, the flow goes to step 307; otherwise it goes to step 309. In step 307, at most N concepts are selected from the candidate concept array a. In step 308, the selected concepts (at most N) are added to the multi-concept sequence. In step 309, at most N concepts are selected for each of the M parts of speech in the candidate concept array a. In step 310, the selected concepts (at most M × N) are added to the multi-concept sequence. Step 311 determines whether all words in the word sequence have been processed. If so, construction of the multi-concept sequence for the text paragraph p ends. Otherwise the flow returns to step 304 and the loop continues until every word in the word sequence has been processed.
Referring to Fig. 4, Fig. 4 is the flow chart of multi-concept sequence detection according to the present invention.
First, step 401 is performed: a plagiarism evidence list list and a candidate plagiarism evidence list list2 are created. In step 402, a position index is built for the reference multi-concept sequence mcs2. The position index uses a hash table structure: the key of the hash table is a concept, and the value is the set of all positions at which that concept occurs in the reference multi-concept sequence. In step 403, all concepts at one position sLoc of the multi-concept sequence under test mcs1 are taken. In step 404, the positions in the reference multi-concept sequence mcs2 at which the concepts of sLoc occur are looked up and stored in the array rLocArray. In step 405, sLoc is paired with each position rLoc in rLocArray to form position pairs (sLoc, rLoc). In step 406, the position pairs (sLoc, rLoc) are used to update the candidate plagiarism evidence list list2. Step 407 determines whether the candidate plagiarism evidence list list2 needs to be checked. If so, the flow goes to step 408; otherwise it goes to step 409. In step 408, the evidence items in the candidate plagiarism evidence list list2 are checked, the evidence items that meet the requirements are added to the plagiarism evidence list list, and the candidate plagiarism evidence list list2 is then emptied. Step 409 determines whether all positions of the multi-concept sequence under test mcs1 have been processed. If so, the flow goes to step 410. Otherwise it returns to step 403 and the loop continues until every position of the multi-concept sequence under test mcs1 has been processed. In step 410, the evidence items in the plagiarism evidence list list are merged, and the evidence items whose length is less than a given threshold are removed.
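Steps 402 to 405, together with the gap variable G described earlier, can be sketched as follows. The function names and the max_gap parameter are hypothetical, positions are 0-based, a Python dict plays the role of the hash table, and the end of the sequence stands in for sentence ends; the evidence update and density check are left out, so each returned group simply collects the position pairs gathered between two checks.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

MCS = List[Set[str]]


def build_position_index(ref_mcs: MCS) -> Dict[str, Set[int]]:
    """Hash table: concept -> set of positions where it occurs in the reference."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for pos, array in enumerate(ref_mcs):
        for concept in array:
            index[concept].add(pos)
    return index


def position_pairs(test_mcs: MCS, ref_mcs: MCS, max_gap: int = 2) -> List[List[Tuple[int, int]]]:
    """Group (sLoc, rLoc) pairs; a group is closed when the run of unmatched
    test positions exceeds max_gap or the end of the sequence is reached."""
    index = build_position_index(ref_mcs)
    groups: List[List[Tuple[int, int]]] = []
    current: List[Tuple[int, int]] = []
    gap = 0
    for s_loc, array in enumerate(test_mcs):
        r_locs: Set[int] = set()
        for concept in array:
            r_locs |= index.get(concept, set())
        if r_locs:
            gap = 0
            current.extend((s_loc, r_loc) for r_loc in sorted(r_locs))
        else:
            gap += 1
        at_end = s_loc == len(test_mcs) - 1
        if current and (gap > max_gap or at_end):
            groups.append(current)
            current, gap = [], 0
    return groups
```

Because the index is keyed by concept, finding every reference position for one test position costs one hash lookup per concept in the concept array, independent of the length of the reference sequence.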
The basic idea of the cross-language electronic text plagiarism detection method of the present invention is as follows. First, multi-concept sequences are built for the texts in different languages by means of the cross-language ontology. A multi-concept sequence represents a text at the concept level, which resolves the differences between languages at the character-string level. In addition, because a concept represents the meaning of a word, synonymous words are mapped to the same concept, which addresses to a certain extent the frequently occurring synonym replacement phenomenon. Plagiarism detection is then performed on the multi-concept sequences. A hash table is used to build a position index of the reference multi-concept sequence, and each position of the multi-concept sequence under test is examined in turn to determine which positions of the reference multi-concept sequence share a concept with it. A position in the multi-concept sequence under test and a position in the reference multi-concept sequence form a position pair, and the candidate evidence list is built and maintained from these position pairs. When a position pair is used to update a candidate evidence item, the inserted position pairs are not required to arrive in order; instead, a certain extension interval is allowed around the boundaries of the existing evidence item. This addresses to a certain extent the word-order inconsistency found in cross-language, translation-based plagiarism. The check of the candidate evidence list filters out the evidence items that do not meet the density requirement and adds the suitable evidence items to the evidence list. Finally, the evidence lists of the same document under test are merged and organized to obtain the detection result. The detection result comprises the specific plagiarism evidence and the text similarity.
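The order-tolerant update of candidate evidence described above may be sketched as follows. The Evidence class, its fields, and the margin parameter (playing the role of the position threshold, or extension interval) are assumptions for illustration; in particular, measuring distance against the reference-side boundaries only is one possible reading of the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    s_start: int  # first position in the sequence under test
    s_end: int    # last position in the sequence under test
    r_start: int  # first position in the reference sequence
    r_end: int    # last position in the reference sequence

    def accepts(self, r: int, margin: int) -> bool:
        """True when r lies within the reference interval extended by margin."""
        return self.r_start - margin <= r <= self.r_end + margin

    def absorb(self, s: int, r: int) -> None:
        """Grow the evidence boundaries so that they cover the pair (s, r)."""
        self.s_start, self.s_end = min(self.s_start, s), max(self.s_end, s)
        self.r_start, self.r_end = min(self.r_start, r), max(self.r_end, r)


def update(candidates: List[Evidence], s: int, r: int, margin: int = 3) -> None:
    """Update every candidate that accepts the pair; otherwise create a new one."""
    absorbed = False
    for evidence in candidates:
        if evidence.accepts(r, margin):
            evidence.absorb(s, r)
            absorbed = True
    if not absorbed:
        candidates.append(Evidence(s, s, r, r))
```

Feeding the pairs (0, 5), (1, 4) and (2, 6) in that order yields a single evidence item even though the reference positions are not monotonic, which is how reordered words in a translated passage can still be grouped together.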