CN101404037A

CN101404037A - Method for detecting and locating plagiarism in electronic text content

Info

Publication number: CN101404037A
Application number: CNA2008102323098A
Authority: CN
Inventors: 鲍军鹏; 冯中慧
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2008-11-18
Filing date: 2008-11-18
Publication date: 2009-04-08
Anticipated expiration: 2028-11-18
Also published as: CN101404037B

Abstract

The invention discloses a method for detecting and locating plagiarism in electronic text content using a computer system. The computer system includes at least: an electronic text input module, a text feature extraction module, a plagiarism evidence extraction module, a text plagiarism judgment module, and a module for displaying detection results and locating plagiarized content. The method first extracts features from the text according to its structural and semantic information to obtain a sequence of items to be detected; it then processes every item in that sequence in turn to obtain suspected plagiarism queues; it then examines all suspected plagiarism queues to obtain plagiarism evidence and generate a plagiarism evidence table; finally it calculates the text similarity from the evidence table and determines whether plagiarism is present. If the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is considered not to. For each text pair judged to be plagiarized, the corresponding evidence is taken from the evidence table and sent to the display module, which shows the specific plagiarized content.

Description

A method for detecting and locating plagiarism in electronic text content

Technical field

The invention belongs to the fields of intelligent information processing and computer technology. It relates to a method for detecting whether an electronic text contains plagiarized content, and in particular to a method for detecting and locating plagiarism in electronic text content. The method can accurately locate the plagiarized passages in the detected electronic text and provide conclusive plagiarism evidence.

Background technology

With the rapid development and spread of networks, electronic texts published on the Internet have become a focus of intellectual property protection. Because electronic texts are easy to copy and download, they are studied and quoted by many people, and cases in which large portions of an electronic text are copied and judged to be plagiarism occur from time to time. At present there are two main kinds of protection for electronic texts on the network: "prevention" methods and "detection" methods.

"Prevention" methods use encryption, watermarking, special carriers and similar means to make protected content hard to copy. For example, IEEE distributes collected papers on disk, and online articles from Chinese journals must be read with special software. Bell Laboratories proposed a "watermark" technique that uses word spacing or encrypted images to identify the authorized user of a document. However, there is no impregnable Maginot Line in this world, and no encryption technique is completely secure. All of the above methods can be cracked, and there is no technical means of preventing an authorized user from using methods such as optical character recognition (OCR) to copy and spread content illegally. "Prevention" methods therefore cannot fully solve the problem of intellectual property protection.

The "detection" approach to protecting intellectual property works differently: it is not concerned with how a file is copied. Instead, it first determines whether the current file contains copied or plagiarized content; if illegal copying or plagiarism is found, appropriate measures are then taken against the source of the copy or the plagiarist. The core of the "detection" approach is copy detection technology. Clearly, "prevention" and "detection" methods are not opposed to each other; they should complement and reinforce each other to protect intellectual property better.

Text copy detection, also called text plagiarism detection, determines whether the content of one text has been plagiarized or copied from one or more other texts. Plagiarism means not only verbatim copying but also reordering the original, replacing words with synonyms, restating it in different words, and so on. Text copy detection currently relies on two basic methods: the "string matching" method and the "word frequency" method.

The string matching method first extracts some characteristic strings from the text, generally called "fingerprints", and then judges from the rate at which these fingerprints match whether the text contains plagiarism. Examples include the COPS system proposed by Brin, Garcia-Molina and others at Stanford University ([1] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD Annual Conference, San Francisco, CA, May 1995) and the KOALA system developed by Heintze at Bell Laboratories ([2] N. Heintze. Scalable Document Fingerprinting. In Proceedings of the Second USENIX Workshop on Electronic Commerce, Oakland, California, 18-21 November 1996).

The word frequency method uses the "bag of words" approach from information retrieval. It first counts the frequency of each word in the text, then applies some measure to the word frequency vectors to obtain the similarity of two texts, and draws the final judgment from that. Examples include the SCAM prototype proposed by Garcia-Molina, Shivakumar and others at Stanford University ([3] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries (DL'95), Austin, Texas, June 1995) and the CHECK prototype built by Si, Leong and others at The Hong Kong Polytechnic University ([4] A. Si, H.V. Leong, R.W.H. Lau. CHECK: A Document Plagiarism Detection System. In Proceedings of ACM Symposium for Applied Computing, pp. 70-77, Feb. 1997).

The string matching method can determine copied content precisely, but its precision drops sharply once individual words in a string are changed or deleted. The word frequency method is fairly robust to noise: small-scale word changes do not significantly affect detection accuracy, and detection is relatively efficient. However, when the copied content makes up only a small proportion of the whole text, the word frequency method has difficulty detecting it, and it is almost ineffective against "n-in-1" partial copying. The string matching method focuses on local features; because local features are generally unstable, its noise robustness is poor. The word frequency method mines global features through word frequencies, and small local adjustments do not affect global features, so its resistance to noise is relatively strong. But because the word frequency method attends only to global features and ignores local ones, it cannot examine two similar (but different) texts in detail, and so it has difficulty detecting plagiarism of a small amount of content (for example "n-in-1" partial copying).
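The two weaknesses of the word frequency method described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems; the function name `cosine_similarity`, the whitespace tokenization and the choice of cosine as the measure are assumptions made only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two bag-of-words frequency vectors (illustrative only)."""
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    # Dot product over the words that the two texts share.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


source = "the quick brown fox jumps over the lazy dog"
reordered = "the lazy dog jumps over the quick brown fox"
# A short copied fragment embedded in otherwise unrelated text.
host = "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu quick brown fox"

print(cosine_similarity(source, reordered))  # word order is ignored entirely
print(cosine_similarity(source, host))       # the copied fragment barely registers
```

The reordered text scores as a near-perfect match because only global frequencies are compared, while the copied fragment contributes little to the score and gives no indication of where the copied words sit.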

In 2003 the applicant filed with the Patent Office of the People's Republic of China an application entitled "A method for detecting electronic text plagiarism using a computer program", for which a patent has been granted (patent No. ZL 03134562.X). That method extracts text features according to the structural and semantic information of the text. It then uses a probe method provided in the text plagiarism judgment module to estimate the maximum common semantics between the features of the text to be detected and the text features in a feature database, and gives a similarity measure for the texts. Finally it judges on that basis whether plagiarism is present: if the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not. The method is suited to detecting plagiarism in long texts relatively quickly. It combines basic string matching with the word frequency method, and measures similarity not by simple word frequencies but by the likelihood that the semantic sequences of the text features overlap.

However, because the text feature database of that method does not store the complete text content, the method cannot give the specific content of the plagiarized text; that is, it cannot locate the specific plagiarized content. In other words, it cannot also provide conclusive plagiarism evidence for the plagiarized texts it detects.

Summary of the invention

In view of the defects or deficiencies of the above prior art, the object of the invention is to provide a method for detecting and locating plagiarism in electronic text content. The method can detect plagiarized text that has been processed by simple means such as word replacement, insertion and deletion, accurately locate the plagiarized content, and provide plagiarism evidence. With this method, electronic texts suspected of plagiarism can be found and the plagiarized content pointed out, which provides technical means and a basis for taking further measures to protect legitimate intellectual property.

To achieve this, the invention adopts the following technical solution:

A method for detecting and locating plagiarism in electronic text content, characterized in that the method uses a computer system to detect whether an electronic text contains plagiarized content and to locate the plagiarized wording accurately, the computer system including at least:

An electronic text entry module, used to submit a detected text to the computer system or to add a new detected text;

A text feature extraction module, used to extract text features and generate an item sequence;

A plagiarism evidence extraction module, used to take each item from the item sequence in turn and map it onto the known item table, generate suspected plagiarism queues, and obtain the plagiarism evidence table;

A text plagiarism judgment module, used to calculate the similarity of the detected text pair and determine whether the detected text contains plagiarized content;

A module for displaying detection results and locating plagiarized content, used to output detection results to the user and to display the specific plagiarized content of the plagiarized text as plagiarism evidence.

The electronic text entry module, the text feature extraction module, the plagiarism evidence extraction module, the text plagiarism judgment module, and the module for displaying detection results and locating plagiarized content are connected in sequence, and the detection and locating process includes the following steps:

Step 1: for a submitted or newly added detected text, extract the features of the detected text according to the text structure information and semantic information, and generate a sequence of items to be detected;

Step 2: process all items in the sequence of items to be detected in turn, and generate suspected plagiarism queues;

Step 3: examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;

Step 4: calculate the text similarity from the evidence table and determine whether plagiarism is present; if the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism, otherwise it is considered not to;

Step 5: for each text pair judged to be plagiarized, take the corresponding plagiarized content from the evidence table and display it as plagiarism evidence.

In the invention, text similarity is expressed jointly by the ratio of identical text to text length (the R value) and the length of the longest identical fragment (the M value). If either of the two is greater than its specified threshold, the text is considered to contain plagiarized content. The plagiarism evidence (the specific plagiarized wording) is the text content corresponding to the relevant identical intervals in the evidence table.

Description of drawings

Fig. 1 is a structural diagram of a preferred embodiment of the invention;

Fig. 2 is a flowchart of processing the input text to generate suspected plagiarism queues;

Fig. 3 is a flowchart of examining a suspected plagiarism queue to obtain plagiarism evidence;

Fig. 4 is a flowchart of judging from the evidence table whether plagiarism is present.

The invention is described further below with reference to the drawings and an embodiment.

Embodiment

The basic idea of the method of the invention is as follows. First, detect whether two texts share a certain amount of identical wording. If there is no identical wording, there can be no plagiarism; if there is, proceed to the next step. Second, detect whether the identical wording appears in the same order in both texts and whether it forms sentences, that is, whether there are identical sentences. If there are no identical sentences, there is no plagiarism; if there are, proceed to the next step. Note that identical sentences need not be absolutely identical, character for character: individual words in them may differ, but the main framework of the sentences should be the same. Finally, if the identical sentences exceed a certain extent, the text can be judged to be plagiarized. The identical sentences are the plagiarism evidence.

The method rests on one fact: a plagiarized text must contain a certain amount of identical wording. If two texts do not share a sufficient amount of identical wording, there should be no plagiarism between them. Conversely, two texts that share a large amount of identical wording are not necessarily plagiarized. But if two texts share a large amount of identical wording and that wording also appears in the same order in both texts, that is, if identical sentences (or paragraphs) exist in both texts and reach a certain length, then plagiarism exists between the two texts, and the identical sentences (or paragraphs) are the plagiarism evidence.

The invention uses a computer system to detect whether an electronic text contains plagiarized content and to locate the specific plagiarized content accurately. The computer system includes at least:

(1) an electronic text entry module, used to submit a detected text to the system or to add a new detected text;

(2) a text feature extraction module, used to extract text features and generate an item sequence;

(3) a plagiarism evidence extraction module, used to generate suspected plagiarism queues and obtain the plagiarism evidence table;

(4) a text plagiarism judgment module, used to calculate the similarity of the detected text pair and determine whether the detected text contains plagiarized content;

(5) a module for displaying detection results and locating plagiarized content, used to output detection results to the user and to display the specific plagiarized content of the plagiarized text (that is, the plagiarism evidence).

The electronic text entry module is connected to the text feature extraction module, the text feature extraction module is connected to the plagiarism evidence extraction module, the plagiarism evidence extraction module is connected to the text plagiarism judgment module, and the text plagiarism judgment module is connected to the module for displaying detection results and locating plagiarized content. The detection and locating process includes the following steps:

(1) extract the features of the detected text according to the text structure information and semantic information, and obtain the sequence of items to be detected;

(2) process all items in the sequence of items to be detected in turn, and obtain suspected plagiarism queues;

(3) examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;

(4) calculate the text similarity from the evidence table and determine whether plagiarism is present. If the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is considered not to.

(5) for each text pair judged to be plagiarized, take the corresponding plagiarism evidence from the evidence table and send it to the module for displaying detection results and locating plagiarized content, which shows the specific plagiarized content.

The detected text may be entered manually by the user, copied from the user's existing texts, downloaded by the user over a network, or obtained automatically from the Internet.

Regardless of the format in which the detected text is stored on the computer (for example an ASCII text file, a Microsoft Word file, an HTML file, a PDF (Portable Document Format) file, a TeX file and so on), what it presents to the user is mainly natural-language content, not graphics, images, video or audio.

The natural-language text may be in a single language such as Chinese, English, Japanese, Korean, French, Spanish, Russian or German, or may be an electronic text that mixes these languages. Texts in different languages differ only in how words are segmented during text preprocessing; all other stages are identical.

The smallest unit processed in the detected electronic text is the item. An item is one or more consecutive characters. The length of an item is the number of consecutive characters and is a manually set system parameter. The minimum length of plagiarism that can be detected is the length of one item.
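As a rough illustration of the item definition above, the following Python sketch cuts a text into fixed-length items and records where each one starts. The source does not say whether consecutive items overlap, so the sliding window with a configurable `step` is an assumption, as are the names `extract_items`, `item_length` and `step`.

```python
def extract_items(text: str, item_length: int = 4, step: int = 1) -> list[tuple[str, int]]:
    """Return (item, start_position) pairs for windows of item_length characters.

    Assumption: items are taken with a sliding window; step=1 gives fully
    overlapping items, step=item_length gives non-overlapping ones.
    """
    if item_length < 1 or step < 1:
        raise ValueError("item_length and step must be positive")
    last_start = len(text) - item_length
    return [(text[i:i + item_length], i) for i in range(0, last_start + 1, step)]


print(extract_items("abcdef", item_length=3))          # overlapping items
print(extract_items("abcdef", item_length=3, step=3))  # non-overlapping items
```

A text shorter than `item_length` yields no items, which matches the statement that the item length is the shortest plagiarism the method can detect.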

Items are organized in the computer as follows. All items are stored in a hash table in which each item is a key; this table is called the known item table. Each item corresponds to a file list that holds all the files, or document numbers, that contain the item; the file list is itself organized as a hash table keyed by document number. Each file (or document number) in the file list corresponds to a queue that stores the positions at which the item occurs in that file, arranged in order.
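The three-level organization described above (item, then document number, then ordered positions) maps naturally onto nested dictionaries. The Python sketch below is one possible reading, not the patented implementation; the name `build_known_item_table` and the representation of a document as a list of items are assumptions.

```python
from collections import defaultdict


def build_known_item_table(documents: dict[str, list[str]]) -> dict[str, dict[str, list[int]]]:
    """Build item -> {document number -> ordered positions of the item}.

    `documents` maps a document number to that document's item sequence.
    """
    table: defaultdict = defaultdict(lambda: defaultdict(list))
    for doc_id, items in documents.items():
        # Positions are appended in reading order, so each list stays sorted.
        for position, item in enumerate(items):
            table[item][doc_id].append(position)
    # Convert to plain dicts so that lookups of unknown items do not create entries.
    return {item: dict(files) for item, files in table.items()}


known = build_known_item_table({"doc1": ["a", "b", "a"], "doc2": ["b", "c"]})
print(known["a"])  # positions of item "a" per document
print(known["b"])
```

Both hash lookups take expected constant time, so mapping an item of the text under test onto the table costs little regardless of how many known texts are stored.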

The suspected plagiarism queue (suspected queue for short) is an ordered sequence made up of several items. The sequence has the following properties:

(1) all items in the sequence occur in the same text (denoted text d);

(2) the order of any two items in the sequence is determined by the order in which they occur in text d;

(3) any two adjacent items in the sequence are close to each other in position in text d.

The plagiarism evidence is a continuous passage in the text. The passage corresponds to a suspected plagiarism queue: all items in the queue appear in the passage, in the same order.

Suspected plagiarism queues are generated as follows:

(1) the text to be detected is cleaned to obtain the sequence of items to be detected;

(2) the items in the sequence to be detected are mapped onto the known item table one by one;

(3) if the set of known texts for the corresponding entry in the known item table is not empty, the item and its position in the known text are put into the suspected plagiarism queue;

(4) if the item newly put into the suspected plagiarism queue is not close in position to the last item of that queue, a new suspected plagiarism queue is started; otherwise the existing queue is continued;

(5) the above steps are repeated until the whole sequence of items to be detected has been processed.
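The queue-building steps above can be sketched in Python as follows. This is a simplified reading under stated assumptions: one open queue is kept per known document, each queue entry records the item together with its position in the text under test, closeness is measured as a position gap of at most `max_gap`, and an item that is too far from the end of the open queue starts a new queue. The names `build_suspected_queues` and `max_gap` are hypothetical.

```python
def build_suspected_queues(items, known_table, max_gap=3):
    """Group items of the text under test into suspected plagiarism queues.

    items:       item sequence of the text under test, in reading order
    known_table: item -> {document number -> positions in that known text}
    Returns document number -> list of queues; a queue is a list of
    (item, position in the text under test) pairs.
    """
    open_queues = {}  # document number -> queue currently being extended
    finished = {}     # document number -> queues already closed
    for pos, item in enumerate(items):
        for doc_id in known_table.get(item, {}):
            queue = open_queues.get(doc_id)
            # Not close to the last item of the open queue: close it, start anew.
            if queue is not None and pos - queue[-1][1] > max_gap:
                finished.setdefault(doc_id, []).append(queue)
                queue = None
            if queue is None:
                queue = []
                open_queues[doc_id] = queue
            queue.append((item, pos))
    for doc_id, queue in open_queues.items():
        finished.setdefault(doc_id, []).append(queue)
    return finished


table = {"a": {"k": [0]}, "b": {"k": [1]}, "z": {"k": [9]}}
print(build_suspected_queues(["a", "b", "x", "x", "x", "x", "x", "z"], table))
```

In the example, "a" and "b" are adjacent and share one queue, while "z" appears six positions later and therefore opens a second queue for the same known document.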

The plagiarism evidence table is generated as follows:

(1) for each item in the suspected plagiarism queue, take out its position queue in the known text;

(2) for each position in the position queue, determine whether it falls inside or outside some identical interval;

(3) if there is no current identical interval, form an identical interval whose start and end positions are both the current position, and store it in the current identical queue;

(4) if the current position is inside an identical interval, go to (7);

(5) if the current position is outside the identical interval but close to its start or end position, extend the identical interval;

(6) if the current position is outside the identical interval and close to neither its start nor its end position, form an identical interval whose start and end positions are both the current position, and store it in the current identical queue;

(7) if an identical interval is long enough, store it directly in the plagiarism evidence table and delete it from the current identical queue;

(8) repeat the above steps until the suspected plagiarism queue has been processed.
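A simplified Python sketch of the interval-growing idea in these steps is given below. It departs from the procedure in two labeled ways: sufficiently long intervals are filtered out once at the end instead of being moved to the evidence table as soon as they qualify, and no merging of adjacent intervals is attempted. The names `find_identical_intervals`, `max_gap` and `min_length`, and the measurement of interval length in item positions, are assumptions.

```python
def find_identical_intervals(queue, known_positions, max_gap=2, min_length=3):
    """Grow identical intervals in a known text from a suspected plagiarism queue.

    queue:           items of the suspected queue, in order
    known_positions: item -> ordered positions of the item in the known text
    Returns (start, end) intervals, inclusive, spanning at least min_length positions.
    """
    intervals = []  # each entry is a mutable [start, end] pair
    for item in queue:
        for pos in known_positions.get(item, []):
            for interval in intervals:
                start, end = interval
                if start <= pos <= end:
                    break  # already covered by this interval
                if 0 < pos - end <= max_gap:
                    interval[1] = pos  # close to the end: extend forwards
                    break
                if 0 < start - pos <= max_gap:
                    interval[0] = pos  # close to the start: extend backwards
                    break
            else:
                # Not inside or near any existing interval: open a new one.
                intervals.append([pos, pos])
    return [(s, e) for s, e in intervals if e - s + 1 >= min_length]


positions = {"a": [10, 40], "b": [11], "c": [12], "q": [30]}
print(find_identical_intervals(["a", "b", "c", "q"], positions))
```

In the example, positions 10, 11 and 12 chain into one interval, while the isolated occurrences at 30 and 40 stay too short to count as evidence.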

The text similarity is calculated and plagiarism is judged as follows:

(1) read in the detected text pair;

(2) look up all identical intervals for this text pair in the evidence table (denoted P);

(3) sum the lengths of all identical intervals in P (denoted S);

(4) find the length of the longest identical interval in P (denoted M);

(5) calculate the ratio of S to the text length (denoted R);

(6) if R is greater than its specified threshold, the text pair contains plagiarism;

(7) if M is greater than its specified threshold, the text pair contains plagiarism;

(8) output all identical intervals of the plagiarized text pair as plagiarism evidence.

The following is a preferred embodiment given by the inventors. It should be noted that the invention is not limited to this embodiment.

Fig. 1 is a structural diagram of a preferred embodiment of the invention.

The computer system in this embodiment includes at least an electronic text entry module 20, a text feature extraction module 30, a plagiarism evidence extraction module 40, a text plagiarism judgment module 50, and a module 60 for displaying detection results and locating plagiarized content. The electronic text entry module 20 is connected to the text feature extraction module 30, the text feature extraction module 30 is connected to the plagiarism evidence extraction module 40, the plagiarism evidence extraction module 40 is connected to the text plagiarism judgment module 50, and the text plagiarism judgment module 50 is connected to the module 60 for displaying detection results and locating plagiarized content.

The detected text may be entered manually by the user, copied from the user's existing texts, downloaded by the user over a network, or obtained automatically from the Internet.

In the electronic text entry module 20, the user 10 enters and submits the collected electronic texts.

In the text feature extraction module 30, text features are extracted from the submitted electronic text and an item sequence is generated.

In the plagiarism evidence extraction module 40, each item is taken in turn from the item sequence generated from the text to be detected and mapped onto the known item table; the suspected plagiarism queues and the plagiarism evidence table are then obtained.

In the text plagiarism judgment module 50, the similarity of the text pair is calculated from the plagiarism evidence table, plagiarism is judged, and the plagiarism evidence is recorded.

Finally, the system reports the detection results to the user through the module 60 for displaying detection results and locating plagiarized content, and displays the specific plagiarized content of the plagiarized text as plagiarism evidence.

In Fig. 1, the text feature extraction module 30 must preprocess the text when extracting text features. Preprocessing includes format conversion, word segmentation, stemming, and removal of high-frequency words. Format conversion converts text in other formats (such as a Microsoft Word file or a PDF (Portable Document Format) file) into a pure ASCII file, so that the converted text contains no non-ASCII characters. Word segmentation splits the text into words, turning it into a long sequence of words rather than a string of characters. During segmentation, punctuation marks, digits and other non-character symbols are removed, and all words are separated by a single uniform symbol (such as a space). Stemming normalizes the different morphological forms of a word to one stem; for example, danced, dancing and dance are all normalized to dance. Removing high-frequency words eliminates words that occur with especially high frequency, including single-letter words, pronouns, prepositions and modal particles, such as a, he, the and of. Finally, the text feature extraction module 30 turns an input text into one long item sequence. The text feature extraction module 30 is also responsible for building the known item table from the known texts.
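The preprocessing chain described above can be sketched for English text in Python. Everything here is illustrative: dropping non-ASCII characters stands in for real format conversion, `STOP_WORDS` contains only the four example words from the text, and `crude_stem` is a toy suffix stripper that maps danced, dancing and dance to the common key "danc" rather than to the dictionary form "dance".

```python
import re

# Only the examples named in the text; a real high-frequency word list would be longer.
STOP_WORDS = {"a", "he", "the", "of"}


def crude_stem(word: str) -> str:
    """Toy stemmer (an assumption, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Format conversion, segmentation, high-frequency word removal, stemming."""
    # Stand-in for format conversion: keep ASCII characters only.
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    # Segmentation: keep alphabetic runs, dropping punctuation and digits.
    words = re.findall(r"[a-z]+", ascii_text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


print(preprocess("He danced, the dancing of a dance."))
```

Because all three word forms collapse to the same key, a plagiarist who only changes tense or inflection still produces the same items as the source text.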

Fig. 2 is a flowchart of processing the input text to generate suspected plagiarism queues.

First, step 201 reads a text to be detected into the computer, denoted d. Step 202 then reads an item from d, denoted t, and records its current position in the text under test, denoted p. Step 203 determines whether t is present in the known item table. If so, step 204 is performed; otherwise the process goes to step 213. Step 204 takes out all known texts that contain t. Step 205 then determines whether all known texts containing t have been processed; if so, the process goes to step 213, otherwise step 206 is performed. Step 206 takes out one unprocessed known text d' that contains t. Step 207 takes out the suspected plagiarism queue for the text under test and this known text, denoted L. Step 208 determines whether the length of L is greater than a specified value T1; if so, the process goes to step 212, otherwise step 209 is performed. Step 209 takes out the last position recorded in L, denoted p'. Step 210 determines whether the difference between positions p and p' is greater than a specified value T2; if so, the process goes to step 212, otherwise step 211 is performed. Step 211 appends t and its position p to the end of the suspected plagiarism queue L. Step 212 examines the current suspected plagiarism queue L to obtain plagiarism evidence; the detailed steps are described with reference to Fig. 3. Step 213 determines whether the text under test d has been read completely. If so, processing of the text under test is finished. Otherwise the process returns to step 202 and the loop continues until all items in the text under test have been processed.

Fig. 3 is a flowchart of examining a suspected plagiarism queue to obtain plagiarism evidence.

First, step 301 takes an item from the suspected plagiarism queue, denoted t. Step 302 takes out the position queue of t in the known file. Step 303 determines whether the position queue has been fully processed. If so, the process goes to step 319; otherwise step 304 is performed. Step 304 takes the next position from the position queue, denoted P. Step 305 determines whether the identical queue has been fully processed. If so, the process goes to step 307; otherwise step 306 is performed.

Step 307 generates a new identical interval whose start and end positions are both P, and also records the position of this item in the text under test; the new interval is inserted at the end of the identical queue, and the process goes to step 318. Step 306 takes the next identical interval from the identical queue, denoted R. Step 308 calculates the difference between position P and the end position of interval R, denoted G. Step 309 determines whether G is greater than the specified value T2. If so, step 310 is performed; otherwise the process goes to step 311.

Step 310 determines whether the length of interval R is greater than a specified value T3. If so, step 312 is performed; otherwise the process goes to step 305. Step 311 determines whether G is greater than 0. If so, step 314 is performed; otherwise the process goes to step 315. In step 312, interval R is a plagiarized passage and is put into the plagiarism evidence table. Step 313 then deletes interval R from the identical queue, and the process goes to step 305. Step 314 changes the end position of interval R to position P, changes the end position of the corresponding identical interval in the text under test to the position of this item, and then goes to step 318. Step 315 moves the interval pointer in the identical queue back one step. Step 316 then determines whether position P is less than the start position of interval R. If so, step 317 is performed; otherwise the process goes to step 318.

Step 317 generates a new identical interval whose start and end positions are both P, and also records the position of this item in the text under test; the new interval is inserted at the end of the identical queue, and step 318 follows. Step 318 marks position P as processed and goes to step 303. Step 319 determines whether all items in the suspected plagiarism queue have been processed. If so, the suspected plagiarism queue is finished: adjacent identical intervals in the plagiarism evidence table are merged into larger identical intervals, the plagiarism evidence is saved, and the process of examining the suspected plagiarism queue ends. Otherwise the process returns to step 301 and the loop continues until all items in the suspected plagiarism queue have been processed.

Fig. 4 is a flowchart of judging from the evidence table whether a text pair contains plagiarism.

First, step 401 reads a text to be detected, denoted d. Step 402 reads a known text, denoted d'. Step 403 calculates the sum of the lengths of all identical intervals between d and d' in the evidence table, denoted s. Step 404 finds the largest identical interval between d and d' in the evidence table, that is, the identical interval of greatest length, denoted M. Step 405 calculates R1, the ratio of s to the length of d. Step 406 calculates R2, the ratio of s to the length of d'. Step 407 takes the larger of R1 and R2, denoted R. Step 408 determines whether R is greater than a specified value T4; if so, the process goes to step 411, otherwise step 409 is performed. Step 409 determines whether M is greater than a specified value T5; if so, the process goes to step 411, otherwise step 410 is performed. Step 410 judges that d does not plagiarize d' and goes to step 413. Step 411 judges that d plagiarizes d'. Step 412 then takes the identical intervals from the evidence table, locates them in d and d' respectively, and outputs the plagiarism evidence. Step 413 determines whether all known texts have been processed. If so, the plagiarism judgment for the text under test d ends. Otherwise the process returns to step 402 and the loop continues until d has been judged against every known text.
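The decision in steps 403 to 411 reduces to a few lines of arithmetic, sketched here in Python. The default threshold values for T4 and T5 are placeholders, since the source gives no numbers, and the name `judge_plagiarism`, the inclusive `(start, end)` interval representation and the use of item positions as the unit of length are assumptions.

```python
def judge_plagiarism(intervals, len_d, len_known, t4=0.3, t5=50):
    """Decide whether text d plagiarizes a known text d'.

    intervals: inclusive (start, end) identical intervals for the pair
    len_d:     length of the text under test d
    len_known: length of the known text d'
    t4, t5:    placeholder thresholds for the ratio R and the longest interval M
    Returns (is_plagiarized, R, M).
    """
    if len_d <= 0 or len_known <= 0:
        raise ValueError("text lengths must be positive")
    if not intervals:
        return False, 0.0, 0
    lengths = [end - start + 1 for start, end in intervals]
    s = sum(lengths)                      # step 403: total identical length
    m = max(lengths)                      # step 404: longest identical interval
    r = max(s / len_d, s / len_known)     # steps 405-407: larger of R1 and R2
    return (r > t4 or m > t5), r, m       # steps 408-411


print(judge_plagiarism([(0, 9), (20, 24)], len_d=100, len_known=30))
print(judge_plagiarism([(0, 9)], len_d=100, len_known=100))
```

Taking the larger of the two ratios means a short known text that was copied almost entirely into a long text under test is still flagged, even though the copied share of the long text is small.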

Claims

1. A method for detecting and locating plagiarized electronic text content, characterized in that the method utilizes a computer system to detect whether the electronic text contains plagiarized content and accurately locates plagiarized words, and the computer system at least includes:

An electronic text entry module, used to submit the detected text to the computer system or add a new detected text;

A text feature extraction module, used to extract text features and generate item sequences;

A plagiarism evidence extraction module, used to sequentially extract each item from the item sequence and map it to the known item table, generate a suspected plagiarism queue, and obtain the plagiarism evidence table;

An electronic text plagiarism judgment module, used to calculate the degree of similarity between the detected texts and determine whether the detected texts contain plagiarized content;

A module for displaying detection results and locating plagiarized content, used to output detection results to users and display the specific plagiarized content of plagiarized texts as evidence of plagiarism;

The electronic text entry module, the text feature extraction module, the plagiarism evidence extraction module, the electronic text plagiarism judgment module, and the module for displaying detection results and locating plagiarized content are connected in sequence, and the detection and locating process includes the following steps:

Step 1: For a submitted or newly added text to be detected, extract the features of the text according to its text structure information and semantic information, and generate a sequence of items to be detected;

Step 2: Sequentially process all items in the sequence of items to be detected to generate a suspected plagiarism queue;

Step 3: Detect all suspected plagiarism queues, obtain evidence of plagiarism from them, and generate an evidence table;

Step 4: Calculate the degree of similarity of the text according to the evidence table to determine whether there is plagiarism. If the degree of similarity is greater than or equal to a certain threshold, it is considered that there is plagiarism in the detected text, otherwise it is considered that there is no plagiarism in the detected text;

Step 5: For the text that is judged to be plagiarized, take out the corresponding plagiarized content from the evidence table and display it as evidence of plagiarism.

2. The method according to claim 1, wherein the detected text is manually entered by the user, is a copy of the user's existing text, is downloaded by the user over a network, or is obtained automatically from the Internet; and, whatever the format in which the detected text is stored in the computer, what it presents is mainly natural language content rather than graphics, images, video or audio information.

3. The method according to claim 2, wherein the natural language text is a single-language text in Chinese, English, Japanese, Korean, French, Spanish, Russian, German or another language, or a text composed of a mixture of the above languages.

4. The method according to claim 1, wherein the minimum unit of the detected text to be processed is an item, an item being one or more consecutive characters, and the items are organized in the computer system as follows: all items are stored in a hash table in which each item is a key; each item corresponds to a file list that stores all files, or file codes, containing the item; the file list is itself organized as a hash table keyed by file code; and each file or file code in the file list corresponds to a queue that stores all positions at which the item occurs in that file, the positions in the queue being arranged in order.
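
The nested organization described in claim 4 can be sketched with ordinary dictionaries. This is a minimal illustration, not the claimed implementation: the class name `ItemTable`, its methods, the use of strings as file codes, and the use of item indices as positions are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


class ItemTable:
    """item -> {file code -> ascending list of positions} (nested hash tables)."""

    def __init__(self) -> None:
        self._table: Dict[str, Dict[str, List[int]]] = defaultdict(dict)

    def add_text(self, file_code: str, items: List[str]) -> None:
        # Positions are item indices (an assumption); visiting them in order
        # keeps every per-file position queue sorted.
        for position, item in enumerate(items):
            self._table[item].setdefault(file_code, []).append(position)

    def lookup(self, item: str) -> Dict[str, List[int]]:
        # Return an empty mapping for unknown items without creating an entry.
        return self._table.get(item, {})


if __name__ == "__main__":
    table = ItemTable()
    table.add_text("doc1", ["a", "b", "a"])
    table.add_text("doc2", ["b"])
    print(table.lookup("a"), table.lookup("b"), table.lookup("z"))
```

With this layout, mapping one item of a text under test to every known text containing it is a single hash lookup, and the sorted position lists are ready for interval detection.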

5. The method according to claim 1, wherein the queue of suspected plagiarism is an ordered sequence composed of a plurality of items, and the ordered sequence has the following characteristics:

1) All items in the ordered sequence appear in the same text;

2) The order of any two items in the ordered sequence is determined by their order of appearance in that text;

3) Any two adjacent items in the ordered sequence have nearby positions in that text.

6. The method according to claim 1, wherein the process of generating a queue of suspected plagiarism is carried out in the following steps:

1) After the electronic text to be detected is cleaned, the sequence of items to be detected is obtained;

2) The items in the sequence of items to be detected are sequentially mapped to the known item table;

3) If the list of known texts corresponding to the item in the known item table is not empty, put the item and its positions in the known text into the suspected plagiarism queue;

4) If the position of the item newly placed in the suspected plagiarism queue is not close to that of the previous item in the queue, start a new suspected plagiarism queue; otherwise, continue with the existing suspected plagiarism queue;

5) Repeat steps 2) to 4) above until the sequence of items to be detected is processed.
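
Steps 2) to 5) of claim 6 can be sketched as follows for a single known text. Several simplifications are assumptions rather than statements of the claimed method: `known_positions` stands in for the known item table restricted to one known text, closeness is measured between item positions in the text under test, and the `max_gap` value that defines "close" is hypothetical.

```python
from typing import Dict, List, Tuple

# (item, position in the text under test, positions in the known text)
Entry = Tuple[str, int, List[int]]


def build_suspect_queues(
    items: List[str], known_positions: Dict[str, List[int]], max_gap: int = 2
) -> List[List[Entry]]:
    """Split matched items into queues of items that lie close together."""
    queues: List[List[Entry]] = []
    last_pos = None
    for pos, item in enumerate(items):
        positions = known_positions.get(item)
        if not positions:
            continue  # the item does not occur in the known text
        if last_pos is None or pos - last_pos > max_gap:
            queues.append([])  # not close to the previous entry: start a new queue
        queues[-1].append((item, pos, positions))
        last_pos = pos
    return queues


if __name__ == "__main__":
    known = {"a": [0], "b": [1], "c": [7], "d": [8]}
    for queue in build_suspect_queues(["a", "b", "x", "x", "x", "c", "d"], known):
        print(queue)
```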

7. The method according to claim 1, characterized in that, the process of obtaining the plagiarism evidence table is carried out in the following steps:

1) For each item in the suspected plagiarism queue, take out its position queue in the known text;

2) For each position in the position queue, determine whether it falls inside or outside an identical interval;

3) If no identical interval exists yet, form an identical interval whose start and end positions are both the current position, and store it in the current identical queue;

4) If the current position is inside an identical interval, go to step 7);

5) If the current position is outside the identical interval but close to its start or end position, expand the identical interval;

6) If the current position is outside the identical interval and not close to its start or end position, form a new identical interval whose start and end positions are both the current position, and store it in the current identical queue;

7) If an identical interval is long enough, store it directly in the plagiarism evidence table and delete it from the current identical queue;

8) Repeat steps 1) to 7) above until the suspected plagiarism queue is processed.
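
The interval bookkeeping in steps 1) to 8) of claim 7 can be sketched as below. It is a simplified reading under stated assumptions: the function name `extract_evidence`, the `near` parameter (what counts as close to an interval boundary), and the `min_len` parameter (what counts as long enough) are hypothetical, and the positions in the text under test that the description also records are omitted.

```python
from typing import List, Tuple


def extract_evidence(
    position_queues: List[List[int]], near: int = 1, min_len: int = 3
) -> Tuple[List[Tuple[int, int]], List[List[int]]]:
    """Return (evidence table, intervals still open in the current identical queue)."""
    current: List[List[int]] = []          # open identical intervals as [start, end]
    evidence: List[Tuple[int, int]] = []   # intervals that became long enough
    for positions in position_queues:      # step 1: one position queue per item
        for p in positions:                # step 2: test each position
            for interval in current:
                start, end = interval
                if start <= p <= end:
                    break                  # step 4: already inside an interval
                if start - near <= p < start:
                    interval[0] = p        # step 5: expand to the left
                    break
                if end < p <= end + near:
                    interval[1] = p        # step 5: expand to the right
                    break
            else:
                current.append([p, p])     # steps 3 and 6: open a new interval
            for interval in list(current):  # step 7: move long intervals to evidence
                if interval[1] - interval[0] + 1 >= min_len:
                    evidence.append((interval[0], interval[1]))
                    current.remove(interval)
    return evidence, current


if __name__ == "__main__":
    print(extract_evidence([[10], [11], [12], [40], [13]]))
```

Because an interval leaves the current queue as soon as it is long enough, a continuing match produces several neighbouring evidence intervals, which is consistent with the description's later merging of adjacent intervals in the evidence table.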

8. The method according to claim 1, wherein said process of judging plagiarism of text comprises the following steps:

1) read in the detected text pair;

2) Look up the detected text pair in the evidence table and denote the set of all its identical intervals as P;

3) Sum the lengths of all identical intervals in P and record the total as S;

4) Find the length of the longest identical interval in P and record it as M;

5) Calculate the ratio of S to text length and record it as R;

6) If R is greater than the specified threshold, the text pair contains plagiarism;

7) If M is greater than the specified threshold, the text pair contains plagiarism;

8) Output all identical intervals of the text pair judged to contain plagiarism as evidence of plagiarism.

9. The method according to claim 1, wherein the output detection result includes the value R, the ratio of the identical text to the text length, and the value M, the length of the longest identical interval.

10. The method according to claim 1, wherein locating the plagiarized text means outputting and presenting to the user the text content corresponding to each identical interval in the evidence table.
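
Locating the plagiarized content as described in claim 10 amounts to mapping each identical interval back to the text it covers. A minimal sketch follows, assuming that positions are character offsets and that intervals are inclusive `(start, end)` pairs; the function name `locate_snippets` is hypothetical.

```python
from typing import List, Tuple


def locate_snippets(text: str, intervals: List[Tuple[int, int]]) -> List[str]:
    """Return the text covered by each inclusive (start, end) interval."""
    return [text[start:end + 1] for start, end in intervals]


if __name__ == "__main__":
    print(locate_snippets("the quick brown fox", [(4, 8), (16, 18)]))
```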