CN101404037A - Method for detecting and positioning electronic text contents plagiary - Google Patents

Method for detecting and positioning electronic text contents plagiary

Info

Publication number
CN101404037A
Authority
CN
China
Prior art keywords
text
detected
evidence
plagiarism
plagiarization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008102323098A
Other languages
Chinese (zh)
Other versions
CN101404037B (en
Inventor
鲍军鹏
冯中慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN2008102323098A priority Critical patent/CN101404037B/en
Publication of CN101404037A publication Critical patent/CN101404037A/en
Application granted granted Critical
Publication of CN101404037B publication Critical patent/CN101404037B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for detecting and locating plagiarized content in electronic text by means of a computer system. The computer system comprises at least an electronic text input module, a text feature extraction module, a plagiarism evidence extraction module, a text plagiarism judgment module, a detection result display module and a plagiarized content locating module. The detection method comprises the following steps: first, features are extracted according to the structural information and semantic information of the text to obtain a sequence of items to be detected; second, all items in the sequence are processed in order to obtain suspected plagiarism queues; third, all suspected plagiarism queues are examined to obtain plagiarism evidence and generate a plagiarism evidence table; finally, the similarity of the texts is calculated from the evidence table, and it is judged whether the text contains plagiarism. If the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not. For a text judged to contain plagiarism, the corresponding plagiarism evidence is taken from the evidence table and passed to the display module, which shows the specific plagiarized content.

Description

A method for detecting and locating plagiarized content in electronic text
Technical field
The invention belongs to the fields of intelligent information processing and computer technology, and relates to a method for detecting whether an electronic text contains plagiarized content, in particular a method for detecting and locating plagiarized content in electronic text. The method can accurately locate the plagiarized passages in a detected electronic text and provide conclusive plagiarism evidence.
Background art
With the rapid development and spread of networks, electronic texts published on the Internet have become a focus of intellectual property protection. Because electronic texts are easy to copy and download, they have become objects that many people study and quote, and cases in which electronic texts are copied at length and the copying is judged to be plagiarism occur from time to time. At present there are two main kinds of protection for electronic texts on the network: "prevention" methods and "detection" methods.
Prevention methods use encryption, watermarks, special carriers and similar means to make the protected content difficult to copy. For example, the IEEE distributes collected papers on disk, and the online articles of Chinese periodicals must be read with special software. Bell Laboratories proposed a "watermark" technique that uses word spacing or encrypted images to identify the authorized user of a document. But there is no impregnable Maginot Line in the world, and no encryption technique is absolutely secure. All of the above methods may be cracked; moreover, there is no technical means of preventing an authorized user from illegally copying and disseminating content by means such as optical character recognition (OCR). Prevention methods therefore cannot completely solve the problem of intellectual property protection.
Detection methods protect intellectual property differently: they are not concerned with how a file is copied. Instead, they first judge whether the current file contains copied or plagiarized content; if illegal copying or plagiarism is found, appropriate measures are then taken against the source of the copy or the plagiarist. The core of a detection method is copy detection technology. Clearly, prevention and detection methods are not opposed to each other; they should complement each other in order to protect intellectual property better.
Text copy detection, also called text plagiarism detection, judges whether the content of one text has been plagiarized or copied from one or more other texts. Plagiarism means not only verbatim copying; it also includes rearranging the original work, replacing words with synonyms, restating it in different words, and so on. There are currently two basic approaches to text copy detection: the "string matching" method and the "word frequency" method.
The string matching method first extracts some feature strings from the text, generally called "fingerprints", and then judges from the proportion of identical fingerprints whether the text contains plagiarism. Examples are the COPS system proposed by Brin, Garcia-Molina and others at Stanford University ([1] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD Annual Conference, San Francisco, CA, May 1995.) and the KOALA system developed by Heintze at Bell Laboratories ([2] N. Heintze. Scalable Document Fingerprinting. In Proceedings of the Second USENIX Workshop on Electronic Commerce, Oakland, California, 18-21 November 1996.).
The word frequency method uses the "bag of words" approach from information retrieval: it first counts the frequency of each word in the text, then applies some metric to the word frequency vectors to obtain the similarity of two texts, and draws a final judgment. Examples are the SCAM prototype proposed by Garcia-Molina, Shivakumar and others at Stanford University ([3] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries (DL '95), Austin, Texas, June 1995.) and the CHECK prototype built by Si, Leong and others at The Hong Kong Polytechnic University ([4] A. Si, H. V. Leong, R. W. H. Lau. CHECK: A Document Plagiarism Detection System. In Proceedings of the ACM Symposium for Applied Computing, pp. 70-77, Feb. 1997.).
The string matching method can determine the copied content precisely, but its precision drops sharply once individual words in a string are changed or deleted. The word frequency method is fairly robust to noise: small-scale word changes do not noticeably affect detection accuracy, and its detection efficiency is relatively high. But when the copied content accounts for only a small proportion of the whole text, the word frequency method has difficulty detecting it, and it almost fails on n-in-1 partial copying (content from n texts merged into one). String matching emphasizes local features; because local features are generally unstable, its noise robustness is poor. The word frequency method mines global features through word frequencies; small local adjustments do not affect the global features, so its resistance to noise is relatively strong. But because it attends only to global features and ignores local ones, it cannot examine two similar (but different) texts in detail, and so has difficulty detecting plagiarism of small amounts of content, such as n-in-1 partial copying.
In 2003 the applicant filed with the Patent Office of the People's Republic of China an application entitled "A method of using a computer program to detect plagiarism of electronic text", for which a patent was granted (patent No. ZL 03134562.X). That method extracts text features according to the structural and semantic information of the text; it then uses the probe method provided in its text plagiarism judgment module to estimate the maximum common semantics between the features of the text under test and the features in a feature library, and gives a similarity measure for the texts; finally it judges on that basis whether plagiarism exists. If the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not. The method is suited to detecting plagiarism in long texts relatively quickly. It combines elementary string matching and the word frequency method appropriately: it measures similarity not by simple word frequency but by the likelihood that the semantic sequences of the text features overlap.
However, because the text feature library of that method does not store the complete text content, the method cannot give the specific content of a plagiarized text; that is, it cannot locate the specific plagiarized content. In other words, it cannot provide conclusive plagiarism evidence for the plagiarized texts it detects.
Summary of the invention
In view of the defects or deficiencies of the prior art described above, the object of the invention is to provide a method for detecting and locating plagiarized content in electronic text. The method can detect plagiarized texts that have been disguised by simple means such as word replacement, insertion and deletion, accurately locate the plagiarized content, and provide plagiarism evidence. With this method, electronic texts suspected of plagiarism can be found and the plagiarized content pointed out, providing the technical means and basis for taking further measures to protect lawful intellectual property.
To achieve the above task, the invention adopts the following technical solution:
A method for detecting and locating plagiarized content in electronic text, characterized in that the method uses a computer system to detect whether an electronic text contains plagiarized content and to accurately locate the plagiarized wording, the computer system comprising at least:
An electronic text input module, for submitting a text to be detected, or adding a new text to be detected, to the computer system;
A text feature extraction module, for extracting text features and generating an item sequence;
A plagiarism evidence extraction module, for taking each item in the item sequence in turn, mapping it onto the known-item table, generating suspected plagiarism queues and obtaining the plagiarism evidence table;
A text plagiarism judgment module, for calculating the similarity of the text pair under test and judging whether the detected text contains plagiarized content;
A module for displaying detection results and locating plagiarized content, for outputting the detection result to the user and displaying the specific plagiarized content of a plagiarized text as plagiarism evidence.
The electronic text input module, the text feature extraction module, the plagiarism evidence extraction module, the text plagiarism judgment module, and the module for displaying detection results and locating plagiarized content are connected in sequence. The detection and locating process comprises the following steps:
Step 1: for a submitted or newly added text to be detected, extract the features of the text according to its structural and semantic information, and generate a sequence of items to be detected;
Step 2: process all items in the sequence of items to be detected in turn, and generate suspected plagiarism queues;
Step 3: examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;
Step 4: calculate the similarity of the texts from the evidence table and judge whether plagiarism exists; if the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism, otherwise it is not;
Step 5: for a text pair judged to involve plagiarism, take the corresponding plagiarized content from the evidence table and display it as plagiarism evidence.
In the invention, text similarity is expressed jointly by the ratio of matching text to text length (the R value) and the length of the longest matching fragment (the M value). If either of the two is greater than its specified threshold, the text is considered to contain plagiarized content. The plagiarism evidence (the specific plagiarized wording) is the text content corresponding to the relevant matching intervals in the evidence table.
Description of drawings
Fig. 1 is a structural diagram of a preferred embodiment of the invention;
Fig. 2 is a flowchart of processing an input text and generating suspected plagiarism queues;
Fig. 3 is a flowchart of examining a suspected plagiarism queue to obtain plagiarism evidence;
Fig. 4 is a flowchart of judging, from the evidence table, whether plagiarism exists.
The invention is further described below with reference to the drawings and embodiments.
Embodiment
The basic idea of the method of the invention is as follows. First, detect whether two texts share a certain amount of identical text. If there is no identical text, there is certainly no plagiarism; if there is, proceed to the next check. Second, detect whether the identical text appears in the same order in both texts and forms sentences, that is, whether there are matching sentences. If there are no matching sentences, there is no plagiarism; if there are, proceed to the next check. Note that matching sentences need not be identical character for character: individual words may differ, but the main framework of the sentence should be the same. Finally, if the matching sentences exceed a certain extent, plagiarism can be judged to exist, and the matching sentences are the plagiarism evidence.
The method rests on one fact: a plagiarized text must contain a certain amount of identical text. If two texts do not share a sufficient amount of identical text, there should be no plagiarism between them. Conversely, two texts that share a large amount of identical text are not necessarily plagiarized. But if two texts share a large amount of identical text and that text appears in the same order in both, that is, the two texts contain the same sentences (or paragraphs) and those sentences (or paragraphs) reach a certain length, then plagiarism exists between the two texts, and the shared sentences (or paragraphs) are the plagiarism evidence.
The invention uses a computer system to detect whether an electronic text contains plagiarized content and to accurately locate the specific plagiarized content. The computer system comprises at least:
(1) an electronic text input module, for submitting a text to be detected, or adding a new text to be detected, to the system;
(2) a text feature extraction module, for extracting text features and generating an item sequence;
(3) a plagiarism evidence extraction module, for generating suspected plagiarism queues and obtaining the plagiarism evidence table;
(4) a text plagiarism judgment module, for calculating the similarity of the text pair under test and judging whether the detected text contains plagiarized content;
(5) a module for displaying detection results and locating plagiarized content, for outputting the detection result to the user and displaying the specific plagiarized content of a plagiarized text (that is, the plagiarism evidence).
The electronic text input module is connected to the text feature extraction module, the text feature extraction module to the plagiarism evidence extraction module, the plagiarism evidence extraction module to the text plagiarism judgment module, and the text plagiarism judgment module to the module for displaying detection results and locating plagiarized content. The detection and locating process comprises the following steps:
(1) extract the features of the text to be detected according to its structural and semantic information, obtaining a sequence of items to be detected;
(2) process all items in the sequence in turn, obtaining suspected plagiarism queues;
(3) examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;
(4) calculate the similarity of the texts from the evidence table and judge whether plagiarism exists. If the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not.
(5) for a text pair judged to involve plagiarism, take the corresponding plagiarism evidence from the evidence table and send it to the module for displaying detection results and locating plagiarized content, which shows the specific plagiarized content.
The text to be detected may be entered manually by the user, be a copy of a text the user already holds, be downloaded by the user over a network, or be obtained automatically from the Internet.
Whatever format the text to be detected is stored in on the computer (for example an ASCII text file, a Microsoft Word file, an HTML file, a PDF (Portable Document Format) file or a TeX file), what it presents to the user is content consisting mainly of natural language, not graphics, images, video or audio.
The natural-language text may be in a single language such as Chinese, English, Japanese, Korean, French, Spanish, Russian or German, or may be an electronic text mixing these languages. Texts in different languages differ only in the word segmentation performed in the preprocessing stage; all other stages are identical.
The smallest unit processed in a detected electronic text is the item. An item is one or more consecutive characters. The length of an item is its number of consecutive characters and is a manually set system parameter. The minimum length of plagiarism that can be detected is the length of one item.
Items are organized in the computer as follows. All items are stored in a hash table in which each item is a key; this table is called the known-item table. Each item corresponds to a file list, which holds the files or document numbers of all documents containing that item; the file list is itself organized as a hash table keyed by document number. Each file (or document number) in the file list corresponds to a queue that stores the positions at which the item occurs in that file, arranged in order.
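The three-level organization just described (item, then document, then ordered positions) can be sketched with nested hash maps. This is a minimal illustration only: the function name `build_known_item_table`, the document identifiers and the choice of whole tokens as items are assumptions made for the example, not details given in the patent.

```python
from collections import defaultdict


def build_known_item_table(known_texts):
    """Build item -> {document id -> ascending list of positions}.

    `known_texts` maps a document id to its item sequence (hypothetical input
    shape; the source only specifies the hash-table layout).
    """
    table = defaultdict(lambda: defaultdict(list))
    for doc_id, items in known_texts.items():
        for position, item in enumerate(items):
            # enumerate yields ascending positions, so each list stays ordered
            table[item][doc_id].append(position)
    return table


known = {"doc1": ["a", "b", "c", "a"], "doc2": ["c", "d"]}
table = build_known_item_table(known)
print(dict(table["a"]))    # positions of item "a" per document
print(sorted(table["c"]))  # documents that contain item "c"
```

Membership should be tested with `item in table` rather than `table[item]`, because indexing a `defaultdict` creates an empty entry as a side effect.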
A suspected plagiarism queue is an ordered sequence made up of several items. The sequence has the following properties:
(1) all items in the sequence occur in the same text (denoted text d);
(2) the order of any two items in the sequence is determined by their order of occurrence in text d;
(3) any two adjacent items in the sequence occur at nearby positions in text d.
Plagiarism evidence is a continuous passage of text. The passage corresponds to a suspected plagiarism queue: all items in the queue appear in the passage, in the same order.
Suspected plagiarism queues are generated as follows:
(1) the text to be detected is cleaned to obtain the sequence of items to be detected;
(2) the items in the sequence are mapped onto the known-item table one after another;
(3) if the list of known texts for the corresponding item in the known-item table is not empty, the item and its position in the known text are put into a suspected plagiarism queue;
(4) if the item newly put into the queue is not close in position to the last item of that queue, a new suspected plagiarism queue is started; otherwise the existing queue is continued;
(5) the above steps are repeated until the whole sequence of items to be detected has been processed.
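The queue-generation steps above can be sketched roughly as follows. The sketch makes several assumptions that the source does not fix: the known-item table is a plain nested dictionary, "close" means a gap of at most `max_gap` positions, and, following the Fig. 2 description, the position recorded with each item is its position in the text under test. The name `suspected_queues` is hypothetical.

```python
def suspected_queues(test_items, table, max_gap=2):
    """Group items of the text under test into suspected plagiarism queues.

    Returns a list of (known document id, queue) pairs, where each queue is a
    list of (item, position in the text under test).
    """
    open_queues = {}  # known document id -> queue currently being extended
    finished = []     # queues closed because the next item was too far away
    for pos, item in enumerate(test_items):
        if item not in table:
            continue  # item unknown: it cannot contribute to any queue
        for doc_id in table[item]:
            queue = open_queues.get(doc_id)
            if queue and pos - queue[-1][1] > max_gap:
                finished.append((doc_id, queue))  # gap too large: close it
                queue = None
            if queue is None:
                queue = []
                open_queues[doc_id] = queue
            queue.append((item, pos))
    finished.extend(open_queues.items())
    return finished


# Toy known-item table: item -> {document id -> positions in that document}
table = {
    "quick": {"d1": [0]}, "fox": {"d1": [2]},
    "lazy": {"d1": [5]}, "dog": {"d1": [6]},
}
test = ["a", "quick", "red", "fox", "x", "y", "z", "w", "lazy", "dog"]
queues = suspected_queues(test, table, max_gap=2)
for doc_id, queue in queues:
    print(doc_id, queue)
```

In this toy input, "quick" and "fox" are two positions apart and share a queue, while "lazy" comes five positions after "fox" and therefore starts a second queue.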
The plagiarism evidence table is generated as follows:
(1) for each item in a suspected plagiarism queue, take its position queue in the known text;
(2) for each position in the position queue, determine whether it falls inside or outside some matching interval;
(3) if there is currently no matching interval, form a matching interval whose start and end are both the current position, and put it in the current matching queue;
(4) if the current position lies inside a matching interval, go to (7);
(5) if the current position lies outside a matching interval but is close to its start or end, extend that interval;
(6) if the current position lies outside the matching intervals and is not close to the start or end of any of them, form a new matching interval whose start and end are both the current position, and put it in the current matching queue;
(7) if a matching interval is long enough, put it directly into the plagiarism evidence table and delete it from the current matching queue;
(8) repeat the above steps until the suspected plagiarism queue has been processed.
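A simplified sketch of how positions could be grown into identical (matching) intervals is given here. It is not the patent's exact procedure: intervals are filtered by length once at the end instead of being moved to the evidence table during the scan, and the parameters `max_gap` and `min_len`, like the function name `matching_intervals`, are assumptions introduced for the example.

```python
def matching_intervals(positions, max_gap=2, min_len=3):
    """Grow [start, end] intervals from known-text positions.

    A position inside an interval is ignored, a position within `max_gap`
    of an interval boundary extends that interval, and any other position
    opens a new single-point interval.
    """
    intervals = []  # plays the role of the current matching queue
    for p in positions:
        for iv in intervals:
            if iv[0] <= p <= iv[1]:
                break                      # already covered
            if 0 < p - iv[1] <= max_gap:
                iv[1] = p                  # extend at the end
                break
            if 0 < iv[0] - p <= max_gap:
                iv[0] = p                  # extend at the start
                break
        else:
            intervals.append([p, p])       # no nearby interval: open a new one
    # Keep only intervals long enough to count as evidence
    return sorted(tuple(iv) for iv in intervals if iv[1] - iv[0] + 1 >= min_len)


evidence = matching_intervals([0, 1, 2, 10, 3, 11, 20], max_gap=2, min_len=3)
print(evidence)
```

With these inputs only the run covering positions 0 to 3 is long enough to be kept; the short runs at 10 to 11 and at 20 are discarded.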
The text similarity is calculated and plagiarism judged as follows:
(1) read in the text pair under test;
(2) look up in the evidence table all matching intervals of this text pair (denoted P);
(3) sum the lengths of all matching intervals in P (denoted S);
(4) find the length of the longest matching interval in P (denoted M);
(5) calculate the ratio of S to the text length (denoted R);
(6) if R is greater than its specified threshold, the text pair contains plagiarism;
(7) if M is greater than its specified threshold, the text pair contains plagiarism;
(8) output all matching intervals of the text pairs found to contain plagiarism as plagiarism evidence.
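The S, M and R computation above is simple enough to state in a few lines. In this sketch the intervals are inclusive (start, end) pairs, and the default threshold values are placeholders chosen for illustration; the source does not give concrete values. The name `judge_pair` is hypothetical.

```python
def judge_pair(intervals, text_length, ratio_threshold=0.3, length_threshold=50):
    """Return (plagiarized, R, M) for one text pair.

    `intervals` are the inclusive (start, end) intervals found for the pair;
    both thresholds are illustrative assumptions.
    """
    lengths = [end - start + 1 for start, end in intervals]
    s = sum(lengths)                  # S: total length of all intervals
    m = max(lengths, default=0)       # M: length of the longest interval
    r = s / text_length if text_length else 0.0   # R: share of the text
    plagiarized = r > ratio_threshold or m > length_threshold
    return plagiarized, r, m


result = judge_pair([(0, 9), (20, 24)], text_length=100,
                    ratio_threshold=0.1, length_threshold=50)
print(result)
```

Either condition alone is sufficient: a text with many short matches can exceed the R threshold, while a text with one long copied passage can exceed the M threshold even when R is small.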
The following is a preferred embodiment given by the inventors. It should be noted that the invention is not limited to this embodiment.
With reference to Fig. 1, Fig. 1 is a structural diagram of a preferred embodiment of the invention.
The computer system in this embodiment comprises at least an electronic text input module 20, a text feature extraction module 30, a plagiarism evidence extraction module 40, a text plagiarism judgment module 50, and a module 60 for displaying detection results and locating plagiarized content. The electronic text input module 20 is connected to the text feature extraction module 30, the text feature extraction module 30 to the plagiarism evidence extraction module 40, the plagiarism evidence extraction module 40 to the text plagiarism judgment module 50, and the text plagiarism judgment module 50 to the module 60 for displaying detection results and locating plagiarized content.
The text to be detected may be entered manually by the user, be a copy of a text the user already holds, be downloaded by the user over a network, or be obtained automatically from the Internet.
In the electronic text input module 20, the user 10 enters and submits the collected electronic texts.
In the text feature extraction module 30, text features are extracted from the submitted electronic text and an item sequence is generated.
In the plagiarism evidence extraction module 40, each item is taken in turn from the item sequence generated from the text under test and mapped onto the known-item table; the suspected plagiarism queues and the plagiarism evidence table are then obtained.
In the text plagiarism judgment module 50, the similarity of the text pair is calculated from the plagiarism evidence table, plagiarism is judged, and the plagiarism evidence is recorded.
Finally, the system reports the detection result to the user through the module 60 for displaying detection results and locating plagiarized content, and displays the specific plagiarized content of a plagiarized text as plagiarism evidence.
In Fig. 1, the text feature extraction module 30 must preprocess the text when extracting text features. Preprocessing comprises format conversion, word segmentation, stemming, removal of high-frequency words and similar operations. Format conversion converts texts in other formats (such as Microsoft Word files or PDF (Portable Document Format) files) entirely into plain ASCII files, so that the converted text contains no non-ASCII characters. Word segmentation cuts the text into words, so that the text becomes a long sequence of words rather than a string of characters. During segmentation, punctuation marks, digits and other non-character symbols are removed, and all words are separated by a uniform symbol (such as a space). Stemming normalizes the different morphological forms of a word to a single stem; for example, danced, dancing and dance are all normalized to dance. Removing high-frequency words means eliminating from the text words that occur with especially high frequency, including single-letter words, pronouns, prepositions and modal particles, such as a, he, the and of. In the end, the text feature extraction module 30 turns an input text into one long sequence of items. The text feature extraction module 30 is also responsible for building the known-item table from the known texts.
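The preprocessing chain for English text (segmentation, stemming, removal of high-frequency words) might look like the following toy sketch. The stop-word set contains only the four examples named in the text, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer; it maps danced, dancing and dance to the common form "danc" rather than to "dance". Both function names are hypothetical.

```python
import re

STOP_WORDS = {"a", "he", "the", "of"}  # only the examples named in the text


def crude_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, keep alphabetic words only, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())  # drops punctuation and digits
    return [crude_stem(w) for w in words if len(w) > 1 and w not in STOP_WORDS]


items = preprocess("He danced, dancing the dance of 2 swans!")
print(items)
```

The regular expression handles ASCII letters only, which matches the format-conversion step described above; texts in languages such as Chinese would need a different segmentation step.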
With reference to Fig. 2, Fig. 2 is a flowchart of processing an input text and generating suspected plagiarism queues.
Step 201 is carried out first: a text to be detected is read into the computer and denoted d. In step 202, an item is read from text d and denoted t, and its current position in the text under test is recorded and denoted p. In step 203, it is judged whether t is present in the known-item table. If so, step 204 is carried out; otherwise the process goes to step 213. In step 204, all known texts containing t are retrieved. In step 205, it is judged whether all known texts containing t have been processed; if so, the process goes to step 213, otherwise step 206 is carried out. In step 206, an unprocessed known text containing t is taken and denoted d'. In step 207, the suspected plagiarism queue for the text under test and this known text is retrieved and denoted L. In step 208, it is judged whether the length of L is greater than a specified value T1; if so, the process goes to step 212, otherwise step 209 is carried out. In step 209, the last position recorded in L is taken and denoted p'. In step 210, it is judged whether the difference between positions p and p' is greater than a specified value T2; if so, the process goes to step 212, otherwise step 211 is carried out. In step 211, t and its position p are appended to the end of suspected plagiarism queue L. In step 212, the current suspected plagiarism queue L is examined to obtain plagiarism evidence; for the detailed steps see the description of Fig. 3. In step 213, it is judged whether text d has been read to the end. If so, processing of the text under test is complete; otherwise the process returns to step 202 and the loop continues until all items in the text under test have been processed.
With reference to Fig. 3, Fig. 3 is a flowchart of examining a suspected plagiarism queue to obtain plagiarism evidence.
Step 301 is carried out first: an item is taken from the suspected plagiarism queue and denoted t. In step 302, the position queue of t in the known file is retrieved. In step 303, it is judged whether the position queue has been fully processed. If so, the process goes to step 319; otherwise step 304 is carried out. In step 304, the next position in the position queue is taken and denoted P. In step 305, it is judged whether the matching queue has been fully processed. If so, the process goes to step 307; otherwise step 306 is carried out.
In step 307, a new matching interval is generated whose start and end positions are both P, and the item's position in the text under test is recorded at the same time; the new interval is inserted at the end of the matching queue, and the process goes to step 318. In step 306, the next matching interval in the matching queue is taken and denoted R. In step 308, the difference between position P and the end position of interval R is calculated and denoted G. In step 309, it is judged whether G is greater than the specified value T2. If so, step 310 is carried out; otherwise the process goes to step 311.
In step 310, it is judged whether the length of interval R is greater than a specified value T3. If so, step 312 is carried out; otherwise the process goes to step 305. In step 311, it is judged whether G is greater than 0. If so, step 314 is carried out; otherwise the process goes to step 315. In step 312, interval R is a plagiarized passage, and R is put into the plagiarism evidence table. In step 313, interval R is deleted from the matching queue, and the process goes to step 305. In step 314, the end position of interval R is changed to position P, and the end position of the matching interval in the text under test is changed to the position of this item; the process then goes to step 318. In step 315, the interval pointer in the matching queue is moved back one step. In step 316, it is judged whether position P is less than the start position of interval R. If so, step 317 is carried out; otherwise the process goes to step 318.
In step 317, a new matching interval is generated whose start and end positions are both P, and the item's position in the text under test is recorded at the same time; the new interval is inserted at the end of the matching queue, and step 318 is carried out. In step 318, position P is marked as processed, and the process goes to step 303. In step 319, it is judged whether all items in the suspected plagiarism queue have been processed. If so, the suspected plagiarism queue has been dealt with: adjacent matching intervals in the plagiarism evidence table are merged into larger matching intervals, the plagiarism evidence is saved, and the examination of the suspected plagiarism queue ends. Otherwise the process returns to step 301 and the loop continues until all items in the suspected plagiarism queue have been processed.
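The final merging of adjacent intervals in the evidence table (step 319) is a standard interval-merge operation. The sketch assumes that "adjacent" means the gap between one interval's end and the next interval's start is at most `max_gap`; the source does not define adjacency precisely, and the name `merge_adjacent` is hypothetical.

```python
def merge_adjacent(intervals, max_gap=1):
    """Merge inclusive (start, end) intervals that touch, overlap, or lie
    within `max_gap` positions of each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


merged = merge_adjacent([(10, 14), (0, 4), (5, 8), (20, 22)])
print(merged)
```

Sorting first means each interval only has to be compared with the most recently merged one, so the whole pass takes O(n log n) time.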
Referring to Fig. 4, Fig. 4 is a flowchart of judging, according to the evidence table, whether a text pair contains plagiarism.
First, step 401 reads a text to be detected, denoted d. Step 402 then reads a known text, denoted d'. Step 403 computes the sum of the lengths of all identical intervals between d and d' in the evidence table, denoted s. Step 404 finds the largest identical interval between d and d' in the evidence table, that is, the identical interval of maximum length, denoted M. Step 405 computes the ratio R1 of s to the length of d, and step 406 computes the ratio R2 of s to the length of d'. Step 407 takes the larger of R1 and R2, denoted R. Step 408 then checks whether R is greater than the specified value T4. If so, the process goes to step 411; otherwise step 409 is performed. Step 409 checks whether the length of M is greater than the specified value T5. If so, the process goes to step 411; otherwise step 410 is performed. In step 410, d is judged not to plagiarize d', and the process goes to step 413. In step 411, d is judged to plagiarize d'. Step 412 then takes the identical intervals from the evidence table, locates them in d and d' respectively, and outputs the plagiarism evidence. Step 413 checks whether all known texts have been processed. If so, the plagiarism judgment for the text under test d ends. Otherwise the process returns to step 402 and the loop continues until d has been judged against every known text.
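The decision made in steps 403 to 411 for a single text pair can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `judge_pair`, the inclusive `(start, end)` interval representation, and the threshold arguments `t4` and `t5` (standing in for T4 and T5, for which the description gives no concrete values) are assumptions.

```python
def judge_pair(len_d, len_d_prime, intervals, t4, t5):
    """Return (plagiarized, R, M) for one pair of texts d and d'.

    intervals: (start, end) identical intervals, inclusive, taken from the
    evidence table for this pair (representation assumed for illustration).
    """
    lengths = [end - start + 1 for start, end in intervals]
    s = sum(lengths)                    # step 403: total identical length
    m = max(lengths, default=0)         # step 404: longest identical interval
    r1 = s / len_d if len_d else 0.0                # step 405
    r2 = s / len_d_prime if len_d_prime else 0.0    # step 406
    r = max(r1, r2)                                 # step 407
    # Steps 408 to 411: plagiarism if either threshold is exceeded.
    return (r > t4 or m > t5), r, m
```

With texts of lengths 100 and 200 and identical intervals `(0, 9)` and `(20, 39)`, the ratio R is 0.3 and the longest interval has length 20, so the pair is flagged when `t5` is 15 even though R stays below a `t4` of 0.5.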

Claims (10)

1. A method for detecting and locating plagiarism in electronic text content, characterized in that the method uses a computer system to detect whether an electronic text contains plagiarized content and to locate the plagiarized passages precisely, the computer system comprising at least:
an electronic text input module, for submitting a text to be detected to the computer system or adding a new text to be detected;
a text feature extraction module, for extracting text features and generating an item sequence;
a plagiarism evidence extraction module, for taking each item of the item sequence in turn, mapping it onto the known item table, generating suspected plagiarism queues, and obtaining the plagiarism evidence table;
a text plagiarism judgment module, for calculating the degree of identity of the text pair under detection and judging whether the text under detection contains plagiarized content;
a detection result display and plagiarism content locating module, for outputting the detection result to the user and displaying the specific plagiarized content of the plagiarizing text as plagiarism evidence;
wherein the electronic text input module, the text feature extraction module, the plagiarism evidence extraction module, the text plagiarism judgment module, and the detection result display and plagiarism content locating module are connected in sequence, and the detection and locating process comprises the following steps:
Step 1: for a submitted or newly added text to be detected, extract the features of the text according to its structural information and semantic information, and generate a sequence of items to be detected;
Step 2: process all items in the sequence of items to be detected in turn, and generate suspected plagiarism queues;
Step 3: examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;
Step 4: calculate the degree of identity of the texts according to the evidence table and judge whether plagiarism exists; if the degree of identity is greater than or equal to a certain threshold, the text under detection is considered to contain plagiarism; otherwise it is considered not to contain plagiarism;
Step 5: for a text judged to contain plagiarism, take the corresponding plagiarized content from the evidence table and display it as plagiarism evidence.
2. the method for claim 1, it is characterized in that, described detected text comes from user's manual entry, perhaps the user has the copy of text now, perhaps the user perhaps obtains by the internet automatically by network download, and no matter detected text with what form stores in computing machine, what it presented is that natural language is main content, is not figure, image, video or audio-frequency information.
3. method as claimed in claim 2, it is characterized in that, described natural language comprises the text that Chinese, English, Japanese, Korea's literary composition, French, Spanish, Russian, German or other single language constitute, the perhaps text that is mixed by above language.
4. the method for claim 1, it is characterized in that, the processed minimum unit of described detected text is an item, described is one or more continuous character, item is arranged in computer system in the following manner: all deposit in the Hash table, each all is a key word, and each all corresponding listed files, what deposit in the listed files is all file or document codes that comprise this, listed files is organized with Hash table, and document code is a key word; The all corresponding formation of each file in the listed files or document code, this position that occurs in this document of storage in the formation, arrange according to orderly fashion the position in the formation.
5. the method for claim 1 is characterized in that, described suspected plagiarism queue is an ordered sequence that is made of a plurality of items, and this ordered sequence has following feature:
1) all items all occur in same piece of writing text in the ordered sequence;
2) sequencing of any two items is determined in proper order by their appearance in one piece of text in the ordered sequence;
3) position of any two adjacencies in one piece of text is close in the ordered sequence.
6. the method for claim 1 is characterized in that, described generation suspected plagiarism queue process is carried out according to the following steps:
1) e-text to be detected is through obtaining to be detected sequence after the data cleansing;
2) item for the treatment of in the detection sequence is mapped on the known terms table successively;
3) if the corresponding known text of respective items is not empty in the known terms table, then this and the position in known text thereof are put into suspected plagiarism queue;
4) not close if newly put into the item of suspected plagiarism queue with the position of the last item of this formation, then generate a new suspected plagiarism queue, otherwise just continue former suspected plagiarism queue;
5) repeat above step 2)~step 4), intact until to be detected series processing.
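Steps 2) to 5) of claim 6 can be sketched in Python for the case of a single known text. The sketch is hedged: the function name `build_suspect_queues`, the `max_dist` closeness tolerance, and the rule that two items are "close" when any pair of their positions lies within `max_dist` are assumptions, because the claim does not define closeness.

```python
def build_suspect_queues(items, known_positions, max_dist=3):
    """Split matched items into suspected plagiarism queues.

    items: item sequence of the text to be detected.
    known_positions: dict item -> ordered positions in ONE known text.
    max_dist: hypothetical closeness tolerance, not from the patent.
    Returns a list of queues; each queue is a list of (item, positions).
    """
    queues = []
    for item in items:
        positions = known_positions.get(item)
        if not positions:
            continue  # the item does not occur in the known text
        close = False
        if queues:
            _, last_positions = queues[-1][-1]
            close = any(abs(p - q) <= max_dist
                        for p in positions for q in last_positions)
        if close:
            queues[-1].append((item, positions))   # continue current queue
        else:
            queues.append([(item, positions)])     # start a new queue
    return queues
```

Items that match far apart in the known text therefore end up in separate queues, which keeps each queue confined to one neighbourhood of the known text.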
7. the method for claim 1 is characterized in that, the described process of plagiarizing the evidence table of obtaining is carried out according to the following steps:
1) in the suspected plagiarism queue each, takes out its position queue in known text;
2) for each position in the position queue, judge its whether drop within certain identical interval or outside;
3) if current identical interval, then be identical interval of start-stop position formation, and deposit current identical formation in the current location;
4) if current location within identical interval, then goes to step 7);
5) if current location outside identical interval, and close with identical interval start-stop position, then expansion should be identical interval;
6) if current location outside identical interval, and all not close with identical interval start-stop position, then be with the current location start-stop position constitute one identical interval, and deposit current identical formation in;
7) if identical interval long enough then directly deposits it in and plagiarizes in the evidence table, and deletes from current identical formation;
8) repeat above-mentioned steps 1)~step 7), handle until suspected plagiarism queue.
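The interval-growing idea of claim 7 can be sketched in simplified form. This is not the claimed procedure step for step: the sketch sorts all matched positions first and filters by length at the end, rather than walking an interval queue with a pointer. The function name `extract_evidence` and the default values of `t2` and `t3` (standing in for the closeness and length thresholds) are assumptions.

```python
def extract_evidence(positions, t2=2, t3=3):
    """Grow identical intervals from matched positions in a known text.

    positions: matched positions, in any order.
    t2: largest gap that still extends an interval (illustrative value).
    t3: an interval is kept as evidence when its length exceeds t3
        (illustrative value).
    """
    intervals = []  # each interval is [start, end], inclusive
    for p in sorted(positions):
        if intervals and p <= intervals[-1][1]:
            continue                      # already inside the last interval
        if intervals and p - intervals[-1][1] <= t2:
            intervals[-1][1] = p          # close to the end: extend it
        else:
            intervals.append([p, p])      # not close to any interval: new one
    # Only sufficiently long intervals count as plagiarism evidence.
    return [(s, e) for s, e in intervals if e - s + 1 > t3]
```

An isolated match such as position 10 in `[0, 1, 2, 3, 4, 10, 20, 21, 22, 23]` forms a one-position interval and is dropped, while the two longer runs are returned as evidence.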
8. the method for claim 1 is characterized in that, described judgement text plagiarization process comprises following steps:
1) it is right to read in detected text;
2) in the evidence table, search this detected text all identical intervals are designated as P;
3) all identical length of an interval degree are designated as S among the accumulative total P;
4) find that the longest identical length of an interval degree is designated as M among the P;
5) calculating S is designated as R with the ratio of text size;
6) if R greater than assign thresholds, then text centering comprises plagiarization;
7) if M greater than assign thresholds, then text centering comprises plagiarization;
8) output has all identical interval conducts of plagiarization text to plagiarize evidence.
9. the method for claim 1 is characterized in that, described output testing result comprises that identical literal accounts for the ratio R value and the maximum identical fragment length M value of text length.
10. the method for claim 1 is characterized in that, it is exactly according to the identical interval in the evidence table that literal is plagiarized in described location, the pairing content of text output in this interval and represent to the user.
CN2008102323098A 2008-11-18 2008-11-18 Method for detecting and positioning electronic text contents plagiary Expired - Fee Related CN101404037B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102323098A CN101404037B (en) 2008-11-18 2008-11-18 Method for detecting and positioning electronic text contents plagiary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008102323098A CN101404037B (en) 2008-11-18 2008-11-18 Method for detecting and positioning electronic text contents plagiary

Publications (2)

Publication Number Publication Date
CN101404037A true CN101404037A (en) 2009-04-08
CN101404037B CN101404037B (en) 2011-05-18

Family

ID=40538049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102323098A Expired - Fee Related CN101404037B (en) 2008-11-18 2008-11-18 Method for detecting and positioning electronic text contents plagiary

Country Status (1)

Country Link
CN (1) CN101404037B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833579A (en) * 2010-05-11 2010-09-15 同方知网(北京)技术有限公司 Method and system for automatically detecting academic misconduct literature
CN102650986A (en) * 2011-02-27 2012-08-29 孙星明 Synonym expansion method and device both used for text duplication detection
CN102779188A (en) * 2012-06-29 2012-11-14 北京奇虎科技有限公司 System and method for duplicated text removal
CN103049467A (en) * 2011-10-12 2013-04-17 杨纯青 Chinese digital anti-plagiarism detection and comparison system and method
CN103324664A (en) * 2013-04-27 2013-09-25 国家电网公司 Document similarity distinguishing method based on Fourier transform
CN103412904A (en) * 2013-07-31 2013-11-27 广联达软件股份有限公司 PDF (portable document format) file comparison method and PDF file comparison system
CN103412905A (en) * 2013-07-31 2013-11-27 广联达软件股份有限公司 PDF (Portable document format) file comparison method and system
CN103678373A (en) * 2012-09-17 2014-03-26 腾讯科技(深圳)有限公司 Method and device for identifying garbage template articles
CN103823862A (en) * 2014-02-24 2014-05-28 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method
CN108446277A (en) * 2018-03-27 2018-08-24 北京大前科技有限责任公司 The method and device of simulation learning
CN110427891A (en) * 2019-08-05 2019-11-08 中国工商银行股份有限公司 The method, apparatus, system and medium of contract for identification
CN110674299A (en) * 2019-09-30 2020-01-10 南京网感至察信息科技有限公司 Detection method for plagiarism in article viewpoint
CN113011194A (en) * 2021-04-15 2021-06-22 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113326688A (en) * 2021-06-16 2021-08-31 黑龙江八一农垦大学 Ideological and political theory word duplication checking processing method and device

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833579B (en) * 2010-05-11 2012-09-05 同方知网(北京)技术有限公司 Method and system for automatically detecting academic misconduct literature
CN101833579A (en) * 2010-05-11 2010-09-15 同方知网(北京)技术有限公司 Method and system for automatically detecting academic misconduct literature
CN102650986A (en) * 2011-02-27 2012-08-29 孙星明 Synonym expansion method and device both used for text duplication detection
CN103049467A (en) * 2011-10-12 2013-04-17 杨纯青 Chinese digital anti-plagiarism detection and comparison system and method
CN102779188A (en) * 2012-06-29 2012-11-14 北京奇虎科技有限公司 System and method for duplicated text removal
CN102779188B (en) * 2012-06-29 2015-11-25 北京奇虎科技有限公司 Duplicated text removal system and method
CN103678373A (en) * 2012-09-17 2014-03-26 腾讯科技(深圳)有限公司 Method and device for identifying garbage template articles
CN103678373B (en) * 2012-09-17 2017-11-17 腾讯科技(深圳)有限公司 A kind of garbage template article recognition methods and equipment
CN103324664A (en) * 2013-04-27 2013-09-25 国家电网公司 Document similarity distinguishing method based on Fourier transform
CN103324664B (en) * 2013-04-27 2016-08-10 国家电网公司 A kind of document similarity method of discrimination based on Fourier transformation
CN103412905A (en) * 2013-07-31 2013-11-27 广联达软件股份有限公司 PDF (Portable document format) file comparison method and system
CN103412904A (en) * 2013-07-31 2013-11-27 广联达软件股份有限公司 PDF (portable document format) file comparison method and PDF file comparison system
CN103823862A (en) * 2014-02-24 2014-05-28 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method
CN103823862B (en) * 2014-02-24 2017-02-15 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method
CN108446277A (en) * 2018-03-27 2018-08-24 北京大前科技有限责任公司 The method and device of simulation learning
CN108446277B (en) * 2018-03-27 2021-08-17 北京大前科技有限责任公司 Method and device for simulating learning
CN110427891A (en) * 2019-08-05 2019-11-08 中国工商银行股份有限公司 The method, apparatus, system and medium of contract for identification
CN110427891B (en) * 2019-08-05 2022-06-10 中国工商银行股份有限公司 Method, apparatus, system and medium for identifying contract
CN110674299A (en) * 2019-09-30 2020-01-10 南京网感至察信息科技有限公司 Detection method for plagiarism in article viewpoint
CN113011194A (en) * 2021-04-15 2021-06-22 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113326688A (en) * 2021-06-16 2021-08-31 黑龙江八一农垦大学 Ideological and political theory word duplication checking processing method and device

Also Published As

Publication number Publication date
CN101404037B (en) 2011-05-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110518

Termination date: 20131118