CN110019674A

CN110019674A - Text plagiarism detection method and system

Info

Publication number: CN110019674A
Application number: CN201711167027.XA
Authority: CN
Inventors: 张亿光; 郑杰; 王旭
Original assignee: Shengting Information Technology Shanghai Co Ltd
Current assignee: Shengting Information Technology Shanghai Co Ltd
Priority date: 2017-11-21
Filing date: 2017-11-21
Publication date: 2019-07-16

Abstract

The invention discloses a text plagiarism detection method and system. By deleting short sentences and using truncated character fingerprints, the method reduces both the number and the length of the sentence fingerprints extracted. Sentence fingerprints are extracted after deleting personal names, place names, organization names, times, and other redundant information from each sentence, so that plagiarized content with minor modifications, such as changed personal, place, or organization names, can still be detected accurately, which improves robustness. Compared with traditional text plagiarism detection methods, the technical solution provided by the invention greatly reduces the amount of computation and improves detection speed. It is better suited to quickly retrieving, from a massive collection (on the order of hundreds of millions) of original texts, the places where a file to be detected is identical or similar to copyrighted original texts, and to outputting all the texts it plagiarizes together with the corresponding degree of plagiarism.

Description

Text plagiarism detection method and system

Technical field

The invention discloses a text plagiarism detection method and system, and relates to plagiarism detection of a particular text in a massive-text environment. Because plagiarism detection in a massive-text environment must process a large amount of text data and perform a large number of matching operations, the corresponding method or system must be fast, accurate, and reasonably robust against techniques used to evade plagiarism detection.

Background art

In the granted patent "Electronic homework anti-plagiarism system and method based on paragraph plagiarism detection" (application number 201310631663.9), the text is split into paragraphs, a vector is generated for each paragraph from information such as keyword frequencies, and the similarity between paragraphs is then computed with the cosine function. This approach can detect plagiarism between paragraphs, and if one article plagiarizes several articles, all of the plagiarized articles can be detected. The drawback is that every paragraph vector of the article to be detected must be compared by cosine calculation against every paragraph vector of every copyrighted chapter in the library. The amount of computation is enormous, and once the number of copyrighted chapters entered grows even moderately large, detection becomes very slow.
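As a rough illustration of the paragraph-vector comparison described above, the following Python sketch computes the cosine similarity of two keyword-frequency vectors. It is not taken from the cited patent; the function name and the use of `Counter` as the vector type are assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse keyword-frequency vectors."""
    # Only keywords present in both paragraphs contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty paragraph is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Because every paragraph of the text under test has to be scored against every stored paragraph, the number of such calls grows with the product of the two paragraph counts, which is the cost the background section criticizes.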

As shown in Figure 1 of the accompanying drawings, the patent document under examination with application number CN201510112689.1, entitled "A paper similarity detection method", discloses a text similarity detection method comprising:

Step (a): perform Chinese word segmentation on the text to be detected;

Step (b): apply stop-word processing to the segmented text, deleting every word that is a stop word; the words remaining in the text are keywords;

Step (c): screen the sentences, deleting those whose number of keywords is less than a preset value K;

Step (d): encode each word in the screened text using GB2312 encoding;

Step (e): use a fingerprint selection function to delete unnecessary codes from the encoding, obtaining the fingerprint sequence of the text to be detected;

Step (f): compare the fingerprint sequence with the fingerprint sequences in the paper library; where there is a continuous overlap, the overlapping part is defined as a suspected plagiarized paragraph;

Step (g): locate the suspected plagiarized part in the corresponding paragraph of the corresponding document in the paper library and perform exact matching by string matching; a part confirmed by exact matching is defined as a plagiarized paragraph.

This method extracts fingerprints from the GB2312 encoding of sentences by deleting stop words and applying a fingerprint selection function, then compares them against the fingerprint sequences of all copyrighted chapters in the library. Parts with continuous overlap are defined as suspected plagiarism, and exact string matching is finally performed on the suspected parts. The amount of computation is relatively small, and the method can handle the case where one text plagiarizes several texts. However, because it ultimately requires exact string matching, it is helpless against plagiarized content that has been slightly altered; for example, text in which information such as the time or place has been changed cannot be detected. The method must also compare against every fingerprint in the library, so although it can cope with a medium-sized collection of entered texts, it becomes very slow when the collection is too large.

It can be seen that existing text plagiarism detection schemes require an enormous amount of computation, run slowly, are not robust, cannot accurately detect slightly altered plagiarized content, and are unsuitable for plagiarism detection in a massive-text environment.

Summary of the invention

To overcome the defects that existing text plagiarism detection systems cannot adapt to a massive-text environment and cannot accurately detect slightly modified plagiarized content, the present invention provides a text plagiarism detection system comprising:

A text sentence-splitting module, for splitting a text into several sentences;

A sentence screening and refinement module, for screening and refining the sentences produced by the text sentence-splitting module, which avoids extracting a fingerprint for every sentence in the text and every word in each sentence;

A sentence fingerprint extraction module, for extracting fingerprints from the screened and refined sentences;

A local search engine module, which on the one hand builds a matching fingerprint library by entering into the index of the local search engine, in the form <text identifier, fingerprint set> with each text identifier guaranteed unique, the fingerprint sets obtained by processing the set of original texts with the text sentence-splitting module, the sentence screening and refinement module, and the sentence fingerprint extraction module; and which on the other hand matches the set of fingerprints obtained by processing a text to be detected with the same modules against the matching fingerprint library, determines the sentences of the text to be detected that share a fingerprint with an original text to be plagiarized sentences, and outputs the detection result;

A plagiarized-sentence marking module, for marking, according to the detection result output by the local search engine module, the plagiarized texts and the corresponding plagiarized sentences in them.

Further, the sentence-splitting module treats every symbol in the text that is not Chinese, English, or a digit as a separator and splits the text into individual sentences. Matching at the level of sentences gives good detection results when one text plagiarizes several texts. The sentence screening and refinement module, on the one hand, deletes short sentences; doing so lowers the probability of matches where "any resemblance is purely coincidental" and greatly reduces the total number of fingerprints, which relieves pressure on the search engine. On the other hand, it deletes named-entity information such as personal names, place names, organization names, and times from the sentences. This information passes through two layers of filtering. In the first layer, a named entity recognition model trained with a conditional random field identifies named entities, which are judged to be redundant information and deleted. In the second layer, a common-word dictionary is built (all texts in a large text corpus are segmented into words, and the most frequent words are selected as common words); each sentence is then segmented, and any word in the segmentation result that is absent from the common-word dictionary is judged to be redundant and deleted. The sentence fingerprint extraction module converts the screened sentences into fingerprints. Specifically, it extracts the original fingerprint of each sentence with the MD5 algorithm; then, provided that the overall fingerprint collision rate stays sufficiently low, it truncates the original fingerprint to a certain length and maps the numeric fingerprint to a character fingerprint (that is, each character represents several digits), thereby shortening the fingerprint, reducing the index size of the local search engine, and improving search efficiency. The local search engine module is implemented on the Lucene technical architecture. On the one hand it is used to enter new original texts, and on the other hand to search for the texts plagiarized by a text to be detected; entry and search can run at the same time, and a hot backup of the index can also be made. The plagiarized-sentence marking module determines the texts suspected of being plagiarized from the output of the local search engine module, determines the position of each plagiarized sentence in both the detected text and the suspected source text, and marks it.

Correspondingly, the present invention also provides a text plagiarism detection method comprising the following steps:

A. Split the text into sentences, dividing one text into several sentences;

B. Screen and refine the resulting sentences, which avoids extracting a fingerprint for every sentence in the text and every word in each sentence;

C. Extract fingerprints from the screened and refined sentences;

D. Enter the fingerprint sets extracted from the set of copyrighted original texts according to steps A to C into the search engine system in the form <text identifier, fingerprint set>, ensuring that each text identifier is unique;

E. Input the set of fingerprints extracted from the text to be detected according to steps A to C into the search engine for matching, determine the sentences of the text to be detected that share a fingerprint with a copyrighted original text to be plagiarized sentences, and output the detection result;

F. Mark the plagiarized sentences according to the detection result output by step E.

By screening and refining the sentences obtained after splitting the text, the text plagiarism detection system and method provided by the present invention overcome the drawback of existing text plagiarism detection techniques, which must extract a fingerprint for every sentence in the text and every word in each sentence. They reduce the number of sentences that need fingerprinting and also shorten the sentence fingerprints, and slightly modified plagiarized content can still be detected, which improves the robustness of the system. The numeric fingerprint extracted from each screened and refined sentence is mapped to a character fingerprint, which further shortens the fingerprint, reduces the index size of the local search engine, and improves search efficiency. Compared with the prior art, the text plagiarism detection system and method provided by the present invention are better suited to plagiarism detection in a massive-text environment.

Brief description of the drawings

Fig. 1 is a flow chart of an existing text similarity detection method;

Fig. 2 is a block diagram of the text plagiarism detection system provided by the present invention.

Detailed description of the embodiments

To make the technical problem solved by the present invention, its technical solution, and its beneficial effects clearer, the invention is described in further detail below with reference to the drawings. It should be understood that the specific embodiments described here only explain the invention and do not limit it. Referring to Fig. 2, the present invention provides a text plagiarism detection system comprising: a text sentence-splitting module (1), a sentence screening and refinement module (2), a sentence fingerprint extraction module (3), a local search engine module (4), and a plagiarized-sentence marking module (5).

The text sentence-splitting module (1) splits a text into several sentences. It treats every symbol in the text that is not Chinese, English, or a digit as a separator and divides the text into sentences at those separators. Working at the level of sentences gives good detection results when one text plagiarizes several texts.
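The splitting rule can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name is hypothetical, and treating "Chinese" as the Unicode range U+4E00 to U+9FFF and "English" as ASCII letters is an assumption.

```python
import re

# A separator is any run of characters that is not a CJK ideograph
# (assumed range U+4E00..U+9FFF), an ASCII letter, or an ASCII digit.
_SEPARATOR = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def split_sentences(text: str) -> list[str]:
    """Split text at every non-Chinese, non-English, non-digit run."""
    # re.split can yield empty strings at the edges; drop them.
    return [piece for piece in _SEPARATOR.split(text) if piece]
```

Read literally, the rule also treats spaces as separators, so English phrases are split word by word; the later short-sentence filter would then discard most of those fragments.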

The sentence screening and refinement module (2) screens and refines the sentences produced by the text sentence-splitting module, which avoids extracting a fingerprint for every sentence in the text and every word in each sentence. On the one hand, the module deletes short sentences whose character count is below a certain limit, for example sentences of fewer than 10 characters. On the other hand, it uses a conditional random field algorithm to identify and delete named-entity information such as personal names, place names, organization names, and times. This information passes through two layers of filtering. In the first layer, a named entity recognition model trained with a conditional random field identifies named entities, which are judged to be redundant information and deleted. In the second layer, a common-word dictionary is built (all texts in a large text corpus are segmented into words, and the most frequent words are selected as common words); each sentence is then segmented, and any word in the segmentation result that is absent from the common-word dictionary is judged to be redundant and deleted. Deleting short sentences lowers the probability of matches where "any resemblance is purely coincidental" and greatly reduces the total number of fingerprints, relieving pressure on the search engine. Deleting personal names, place names, organization names, times, and other redundant information from sentences handles the case where a plagiarist alters part of the copied content: plagiarized content with minor modifications, such as changed personal, place, or organization names, can still be detected accurately, which gives the system a degree of robustness.
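A minimal Python sketch of the screening and two-layer filtering follows. The word segmenter (`segment`) and the entity recognizer (`is_entity`) are hypothetical callables standing in for a real segmenter and a CRF-trained model, neither of which is specified here; applying the length check before the filtering is also an assumption, since the text does not fix the order.

```python
from typing import Callable

MIN_LENGTH = 10  # example threshold from the description


def refine_sentence(
    words: list[str],
    is_entity: Callable[[str], bool],
    common_words: set[str],
) -> str:
    """Apply both filtering layers to one already-segmented sentence."""
    # Layer 1: drop words flagged as named entities by the recognizer.
    kept = [w for w in words if not is_entity(w)]
    # Layer 2: drop words that are absent from the common-word dictionary.
    kept = [w for w in kept if w in common_words]
    return "".join(kept)


def screen(
    sentences: list[str],
    segment: Callable[[str], list[str]],
    is_entity: Callable[[str], bool],
    common_words: set[str],
    min_length: int = MIN_LENGTH,
) -> list[str]:
    """Delete short sentences, then refine the remaining ones."""
    refined = []
    for sentence in sentences:
        if len(sentence) < min_length:
            continue  # short sentences are discarded outright
        refined.append(refine_sentence(segment(sentence), is_entity, common_words))
    return refined
```

Two sentences that differ only in names or places reduce to the same refined string, which is what lets the later exact fingerprint match survive such edits.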

The sentence fingerprint extraction module (3) extracts fingerprints from the screened and refined sentences. It extracts the original numeric fingerprint of each sentence with the MD5 algorithm; then, provided that the overall fingerprint collision rate stays sufficiently low, it truncates the original fingerprint to a numeric fingerprint of a certain length and maps it to a character fingerprint (each character represents several digits), thereby shortening the fingerprint, reducing the index size of the local search engine, and improving search efficiency.
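One way to realize the truncate-and-map step is sketched below in Python. The truncation length (64 bits) and the 62-symbol alphabet are assumptions chosen for illustration; the description leaves both unspecified.

```python
import hashlib

# Assumed alphabet: 62 symbols, so one character carries more than one digit.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
TRUNCATE_BITS = 64  # assumed truncation length


def fingerprint(sentence: str, bits: int = TRUNCATE_BITS) -> str:
    """MD5 the sentence, keep the leading bits, re-encode as a short string."""
    digest = hashlib.md5(sentence.encode("utf-8")).digest()  # 128-bit hash
    value = int.from_bytes(digest, "big") >> (128 - bits)  # keep leading bits
    chars = []
    while value:
        value, remainder = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    # An all-zero value would give an empty string; fall back to one symbol.
    return "".join(reversed(chars)) or ALPHABET[0]
```

With these assumed settings a fingerprint is at most 11 characters long, compared with 32 hexadecimal characters for the full MD5 digest.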

The local search engine module (4), on the one hand, builds a matching fingerprint library by entering, for each text and in the form <text identifier, fingerprint set> with each text identifier guaranteed unique, the fingerprint sets obtained by processing the set of original texts with the text sentence-splitting module, the sentence screening and refinement module, and the sentence fingerprint extraction module. On the other hand, it matches the set of fingerprints obtained by processing a text to be detected with the same modules against the matching fingerprint library, determines the sentences of the text to be detected that share a fingerprint with an original text to be plagiarized sentences, and outputs the detection result. The local search engine module (4) allows entry and search to run at the same time and can also make a hot backup of the index; specifically, it can be implemented with the Lucene technical architecture. The detection result output by the local search engine module (4) consists of m plagiarized-text identifier entries sorted in descending order of the number of plagiarized sentences, where each entry has the form <text identifier, number of plagiarized sentences>.
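The entry and lookup behaviour can be illustrated with a plain in-memory inverted index in Python. This sketch stands in for the Lucene-based index and does not use the Lucene API; the class and method names are hypothetical, and concurrency and hot backup are not modelled.

```python
from collections import Counter, defaultdict


class FingerprintIndex:
    """In-memory inverted index: fingerprint -> ids of texts containing it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)
        self._text_ids: set[str] = set()

    def add(self, text_id: str, fingerprints: set[str]) -> None:
        """Enter one <text identifier, fingerprint set> record."""
        if text_id in self._text_ids:
            raise ValueError(f"duplicate text identifier: {text_id}")
        self._text_ids.add(text_id)
        for fp in fingerprints:
            self._postings[fp].add(text_id)

    def search(self, fingerprints: set[str]) -> list[tuple[str, int]]:
        """Return <text identifier, matched sentence count>, most matches first."""
        hits: Counter = Counter()
        for fp in fingerprints:
            for text_id in self._postings.get(fp, ()):
                hits[text_id] += 1
        # Ties are broken by identifier so the order is deterministic.
        return sorted(hits.items(), key=lambda item: (-item[1], item[0]))
```

Because lookup touches only the posting lists of the query's own fingerprints, its cost depends on the size of the query text rather than on the number of indexed texts, which is what makes the inverted index suitable for very large collections.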

The plagiarized-sentence marking module (5) marks, according to the detection result output by the local search engine module, the plagiarized texts and the corresponding plagiarized sentences in them. Specifically, the texts corresponding to the text identifiers in the first M plagiarized-text identifier entries of the local search engine module's output are determined to be the texts suspected of being plagiarized, where M is an integer freely chosen by the user and M ≤ m. From the content of each plagiarized text and the content of the detected text, the position of each plagiarized sentence in both the detected text and the suspected source text is determined and marked.
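The marking step might look like the following Python sketch. It assumes that a position is a sentence index, that `load_sentences` (a hypothetical callable) returns the refined sentences of a stored text, and that `fingerprint` is the same function used at indexing time; none of these details is fixed by the description.

```python
from typing import Callable


def mark_plagiarism(
    query_sentences: list[str],
    ranked_results: list[tuple[str, int]],
    load_sentences: Callable[[str], list[str]],
    fingerprint: Callable[[str], str],
    top_m: int,
) -> dict[str, list[tuple[int, int]]]:
    """Pair matching sentence positions for each of the first M suspected sources."""
    # Map each fingerprint of the detected text to the positions where it occurs.
    query_positions: dict[str, list[int]] = {}
    for i, sentence in enumerate(query_sentences):
        query_positions.setdefault(fingerprint(sentence), []).append(i)

    marks: dict[str, list[tuple[int, int]]] = {}
    for text_id, _count in ranked_results[:top_m]:
        pairs = []
        for j, sentence in enumerate(load_sentences(text_id)):
            for i in query_positions.get(fingerprint(sentence), []):
                pairs.append((i, j))  # (position in detected text, position in source)
        marks[text_id] = sorted(pairs)
    return marks
```

Only the first M sources are reloaded and compared, so the cost of marking is bounded by the user's choice of M rather than by the number of hits.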

Further, the present invention also provides a text plagiarism detection method comprising steps A to F set out above.

Step A, splitting the text into sentences, specifically treats every symbol in the text that is not Chinese, English, or a digit as a separator and divides the text into several sentences. Step B, screening and refining the resulting sentences, specifically comprises: on the one hand, deleting short sentences whose character count is below a certain limit, for example sentences of fewer than 10 characters; on the other hand, using a conditional random field algorithm to identify and delete named-entity information such as personal names, place names, organization names, and times, and building a common-word dictionary (all texts in a large text corpus are segmented into words, and the most frequent words are selected as common words), then segmenting each sentence and deleting as redundant any word in the segmentation result that is absent from the common-word dictionary. Step C, extracting fingerprints from the screened and refined sentences, specifically extracts the original numeric fingerprint of each sentence with the MD5 algorithm; then, provided that the overall fingerprint collision rate stays sufficiently low, it truncates the original fingerprint to a numeric fingerprint of a certain length and maps it to a character fingerprint (each character represents several digits), thereby shortening the fingerprint, reducing the index size of the local search engine, and improving search efficiency. The detection result produced by step E consists of m plagiarized-text identifier entries sorted in descending order of the number of plagiarized sentences, where each entry has the form <text identifier, number of plagiarized sentences>. Step F marks the plagiarized sentences according to the detection result of step E. Specifically, the texts corresponding to the text identifiers in the first M plagiarized-text identifier entries of the detection result are determined to be the texts suspected of being plagiarized, where M is an integer freely chosen by the user and M ≤ m; from the content of each suspected source text and the content of the detected text, the position of each plagiarized sentence in both the detected text and the suspected source text is determined and marked.

Compared with the prior art, the present invention has the following advantages:

1. Deleting short sentences and using truncated character fingerprints greatly reduces the number and length of sentence fingerprints, and the inverted index of a search engine enables fast retrieval over massive text collections. For example, at the scale of hundreds of millions of texts (average length about 2000 characters), retrieval speed reaches tens to hundreds of texts per second, which meets the requirements of industrial application;

2. Because personal names, place names, organization names, times, and other redundant information are deleted from sentences, plagiarized content in which such information has been changed can still be detected;

3. Plagiarized sentences can be accurately identified and marked; because the sentence is the unit of plagiarism determination, the case where one text plagiarizes several texts can be detected;

4. The local search engine implemented on the Lucene technical architecture can enter new texts and perform plagiarism retrieval at the same time.
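The condition attached to fingerprint truncation in advantage 1, that the overall collision rate stay sufficiently low, can be estimated with the standard birthday approximation for uniformly distributed hashes. The figures below are illustrative assumptions and not values given in this document: n is the total number of distinct sentence fingerprints in the index and b is the number of bits kept after truncation.

```latex
E[\text{colliding pairs}] \approx \frac{n(n-1)}{2 \cdot 2^{b}} \approx \frac{n^{2}}{2^{b+1}}

% Assumed example: n = 10^{10} fingerprints
b = 64: \quad \frac{10^{20}}{2^{65}} \approx 2.7
\qquad
b = 48: \quad \frac{10^{20}}{2^{49}} \approx 1.8 \times 10^{5}
```

Under these assumptions a 64-bit truncation leaves only a handful of accidental collisions across the whole index, whereas a 48-bit truncation produces a noticeable number, which is the trade-off between index size and false matches that the truncation length has to balance.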

Claims

1. A text plagiarism detection method, comprising the following steps:

A. Split the text into sentences, dividing one text into several sentences;

B. Screen and refine the resulting sentences, which avoids extracting a fingerprint for every sentence in the text and every word in each sentence;

C. Extract fingerprints from the screened and refined sentences, each sentence corresponding to one fingerprint;

D. Enter the fingerprint sets extracted from the set of copyrighted original texts according to steps A to C into a local search engine in the form <text identifier, fingerprint set>, ensuring that each text identifier is unique;

E. Input the set of fingerprints extracted from the text to be detected according to steps A to C into the local search engine for matching, determine the sentences of the text to be detected that share a fingerprint with a copyrighted original text to be plagiarized sentences, and output the detection result;

F. Mark the plagiarized sentences according to the detection result output by step E.

2. The text plagiarism detection method as claimed in claim 1, wherein step A, splitting the text into sentences, is specifically: treating every symbol in the text that is not Chinese, English, or a digit as a separator and dividing the text into several sentences.

3. The text plagiarism detection method as claimed in claim 1, wherein step B, screening and refining the resulting sentences, is specifically: deleting short sentences whose character count is below a certain limit, and deleting named entities such as personal names, place names, organization names, and times, together with other redundant information, from the sentences, the redundant information passing through two layers of filtering: in the first layer, a named entity recognition model trained with a conditional random field identifies named entities, which are judged to be redundant information and deleted; in the second layer, according to an established common-word dictionary, words in a sentence that are not contained in the common-word dictionary are judged to be redundant and deleted, wherein the common-word dictionary is established by segmenting all texts in a large text corpus and selecting the most frequent words as common words.

4. The text plagiarism detection method as claimed in any one of claims 1-3, wherein step C, extracting fingerprints from the screened and refined sentences, is specifically: extracting the original numeric fingerprint of each sentence with the MD5 algorithm; then, provided that the overall fingerprint collision rate stays sufficiently low, truncating the original fingerprint to a numeric fingerprint of a certain length and mapping it to a character fingerprint, each character corresponding to several digits, which further shortens the fingerprint.

5. The text plagiarism detection method as claimed in claim 4, wherein the local search engine is implemented on the Lucene technical architecture and can enter new texts and perform plagiarism retrieval at the same time.

6. The text plagiarism detection method as claimed in claim 5, wherein the detection result output by step E consists of a plurality of plagiarized-text identifier entries sorted in descending order of the number of plagiarized sentences, each plagiarized-text identifier entry having the form <text identifier, number of plagiarized sentences>.

7. The text plagiarism detection method as claimed in claim 6, wherein step F, marking the plagiarized sentences, is specifically: selecting the first M plagiarized-text identifier entries from the detection result output by step E, and, according to the content of the plagiarized file corresponding to the file identifier in each plagiarized-text identifier entry and the content of the detected text, determining the position of each plagiarized sentence in the detected text and in the plagiarized text and marking it.

8. A text plagiarism detection system, comprising:

A text sentence-splitting module, for splitting a text into several sentences;

A sentence screening and refinement module, for screening and refining the sentences produced by the text sentence-splitting module, which avoids extracting a fingerprint for every sentence in the text and every word in each sentence;

A sentence fingerprint extraction module, for extracting fingerprints from the screened and refined sentences;

A local search engine module, which on the one hand builds a matching fingerprint library by entering, in the form <text identifier, fingerprint set> with each text identifier guaranteed unique, the fingerprint sets obtained by processing the set of original texts with the text sentence-splitting module, the sentence screening and refinement module, and the sentence fingerprint extraction module; and which on the other hand matches the set of fingerprints obtained by processing a text to be detected with the same modules against the matching fingerprint library, determines the sentences of the text to be detected that share a fingerprint with an original text to be plagiarized sentences, and outputs the detection result;

A plagiarized-sentence marking module, for marking, according to the detection result output by the local search engine module, the plagiarized texts and the corresponding plagiarized sentences in them.

9. The text plagiarism detection system as claimed in claim 8, wherein the text sentence-splitting module splits the text into sentences specifically by: treating every symbol in the text that is not Chinese, English, or a digit as a separator and dividing the text into several sentences.

10. The text plagiarism detection system as claimed in claim 8, wherein the sentence screening and refinement module screens and refines the resulting sentences specifically by: deleting short sentences whose character count is below a certain limit, and deleting named entities such as personal names, place names, organization names, and times, together with other redundant information, from the sentences, the redundant information passing through two layers of filtering: in the first layer, a named entity recognition model trained with a conditional random field identifies named entities, which are judged to be redundant information and deleted; in the second layer, according to an established common-word dictionary, words in a sentence that are not contained in the common-word dictionary are judged to be redundant and deleted, wherein the common-word dictionary is established by segmenting all texts in a large text corpus and selecting the most frequent words as common words.

11. The text plagiarism detection system as claimed in any one of claims 8-10, wherein the sentence fingerprint extraction module extracts fingerprints from the screened and refined sentences specifically by: extracting the original numeric fingerprint of each sentence with the MD5 algorithm; then, provided that the overall fingerprint collision rate stays sufficiently low, truncating the original fingerprint to a numeric fingerprint of a certain length and mapping it to a character fingerprint, each character corresponding to several digits, which further shortens the fingerprint.

12. The text plagiarism detection system as claimed in claim 11, wherein the local search engine is implemented on the Lucene technical architecture and can enter new texts and perform plagiarism retrieval at the same time; the detection result it outputs consists of a plurality of plagiarized-text identifier entries sorted in descending order of the number of plagiarized sentences, each plagiarized-text identifier entry having the form <text identifier, number of plagiarized sentences>.

13. The text plagiarism detection system as claimed in claim 12, wherein the plagiarized-sentence marking module marks the plagiarized sentences specifically by: determining the texts corresponding to the text identifiers in the first M plagiarized-text identifier entries of the detection result output by the local search engine module to be the texts suspected of being plagiarized, and, according to the content of the plagiarized file corresponding to the file identifier in each plagiarized-text identifier entry and the content of the detected text, determining the position of each plagiarized sentence in the detected text and in the suspected source text and marking it.