CN101404037B - Method for detecting and positioning electronic text contents plagiary - Google Patents

Info

Publication number
CN101404037B
Authority
CN
China
Prior art keywords
text
detected
plagiarism
queue
evidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008102323098A
Other languages
Chinese (zh)
Other versions
CN101404037A (en
Inventor
鲍军鹏
冯中慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN2008102323098A priority Critical patent/CN101404037B/en
Publication of CN101404037A publication Critical patent/CN101404037A/en
Application granted granted Critical
Publication of CN101404037B publication Critical patent/CN101404037B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for detecting and locating plagiarized content in electronic text by means of a computer system. The computer system comprises at least an electronic text input module, a text feature extraction module, a plagiarism evidence extraction module, a text plagiarism judgment module, a detection result display module and a plagiarized content locating module. The detection method comprises the following steps. First, features are extracted according to the structural information and the semantic information of the text to obtain a sequence of items to be detected. Second, all items in that sequence are processed in order to obtain suspected plagiarism queues. Third, all suspected plagiarism queues are examined to obtain plagiarism evidence and generate a plagiarism evidence table. Finally, the similarity of the text is calculated from the evidence table and the text is judged for plagiarism: if the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not. For a text judged to contain plagiarism, the corresponding plagiarism evidence is extracted from the evidence table and passed to the display module, which shows the specific plagiarized content.

Description

A method for detecting and locating plagiarized content in electronic text
Technical field
The invention belongs to the fields of intelligent information processing and computer technology. It relates to a method for detecting whether an electronic text contains plagiarized content, and in particular to a method for detecting and locating plagiarized content in electronic text. The method can accurately locate the plagiarized content in the detected electronic text and provide conclusive plagiarism evidence.
Background technology
With the rapid development and spread of networks, electronic texts published on the Internet have become a focus of intellectual property protection. Because electronic texts are easy to copy and download, they are studied and quoted by many people, and cases in which an electronic text is judged to be plagiarized because large parts of it were copied occur from time to time. Current measures for protecting electronic texts on networks fall into two kinds: "prevention" methods and "detection" methods.
A "prevention" method uses encryption, watermarks, special media and similar means to make protected content hard to copy. For example, IEEE distributes its proceedings on disk, and online articles of Chinese journals must be read with special software. Bell Laboratories proposed a "watermark" technique that uses word spacing or encrypted images to identify the authorized user of a document. But there is no impregnable Maginot Line in this world, and no absolutely reliable encryption technique either. All of these methods can be cracked, and there is no technical means of preventing an authorized user from illegally copying and spreading content by means such as optical character recognition (OCR). "Prevention" methods therefore cannot fully solve the problem of intellectual property protection.
A "detection" method protects intellectual property differently: it is not concerned with how a file is copied. It first judges whether the current file contains copied or plagiarized content; if illegal copying or plagiarism is found, appropriate measures are then taken against the source of the copy or the plagiarist. The core of the "detection" method is copy detection technology. Clearly, "prevention" and "detection" methods are not opposed to each other; they should complement each other so that intellectual property is better protected.
Text copy detection, also called text plagiarism detection, judges whether the content of one text has been plagiarized or copied from one or more other texts. Plagiarism means not only verbatim copying; it also includes rearranging the original, replacing words with synonyms, restating the original in different words, and so on. Text copy detection currently relies on two basic approaches: the "string matching" method and the "word frequency" method.
The string matching detection method first extracts some characteristic strings from a text, generally called "fingerprints", and then judges whether the text contains plagiarism according to the proportion of identical fingerprints. Examples include the COPS system proposed by Brin, Garcia-Molina and others at Stanford University ([1] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD Annual Conference, San Francisco, CA, May 1995.) and the KOALA system developed by Heintze at Bell Laboratories ([2] N. Heintze. Scalable Document Fingerprinting. In Proceedings of the Second USENIX Workshop on Electronic Commerce, Oakland, California, 18-21 November 1996.).
The word frequency detection method uses the "bag of words" approach from information retrieval: it first counts how often each word occurs in a text, then applies some metric to the word frequency vectors to obtain the similarity of two texts, and draws the final judgment from it. Examples include the SCAM prototype proposed by Garcia-Molina, Shivakumar and others at Stanford University ([3] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries (DL '95), Austin, Texas, June 1995.) and the CHECK prototype built by Si, Leong and others at The Hong Kong Polytechnic University ([4] A. Si, H. V. Leong, R. W. H. Lau. CHECK: A Document Plagiarism Detection System. In Proceedings of the ACM Symposium for Applied Computing, pp. 70-77, Feb. 1997.).
The string matching method can pinpoint copied content accurately, but its precision drops sharply once individual words in a string have been changed or deleted. The word frequency method is fairly robust to noise: small-scale word changes do not noticeably affect detection accuracy, and detection is relatively efficient. However, when the copied content makes up only a small share of the whole text, the word frequency method has difficulty detecting it, and it almost fails on n-to-1 partial copying. The string matching method emphasizes local features; because local features are generally unstable, its noise robustness is poor. The word frequency method mines global features through word frequencies, and small local adjustments do not affect global features, so its noise resistance is relatively strong. But because it attends only to global features and ignores local ones, it cannot examine two similar (but different) texts in fine detail, and so it has difficulty detecting small-scale plagiarism such as n-to-1 partial copying.
In 2003 the applicant filed with the Patent Office of the People's Republic of China an application entitled "A method for detecting electronic text plagiarism using a computer program", for which a patent has been granted (Patent No. ZL 03134562.X). That method extracts text features according to the structural information and semantic information of the text. It then uses a probe method, set in the plagiarism judgment module, to estimate the maximum common semantics between the features of the text to be detected and the text features in a feature library, and gives a similarity measure for the texts. Finally it judges whether plagiarism is present: if the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not. The method is suited to detecting plagiarism in long texts relatively quickly. It combines elementary string matching with the word frequency method: it measures similarity not by simple word frequencies but by the likelihood that the semantic sequences of the text features overlap.
However, because the text feature library of that method does not store the complete text content, the method cannot give the specific content of a plagiarized text; that is, it cannot locate the plagiarized content. In other words, it cannot provide conclusive plagiarism evidence along with the plagiarized text it detects.
Summary of the invention
In view of the defects and deficiencies of the prior art described above, the object of the invention is to provide a computer system, and a method using it, for detecting and locating plagiarized content in electronic text. The method can detect plagiarized text that has been processed by simple means such as word replacement, insertion and deletion, locate the plagiarized content accurately, and provide plagiarism evidence. With this method, electronic texts suspected of plagiarism can be found and the plagiarized content pointed out, which provides the technical means and basis for taking further measures to protect legitimate intellectual property.
To accomplish this task, the invention adopts the following technical solution:
A computer system for detecting and locating plagiarized content in electronic text, characterized in that the computer system comprises at least:
An electronic text input module, used to submit a detected text, or add a new detected text, to the computer system;
A text feature extraction module, used to extract the features of the submitted or newly added detected text according to text structure information and semantic information, and to generate a sequence of items to be detected;
A plagiarism evidence extraction module, used to take each item of the item sequence in turn and map it onto the known terms table, generate suspected plagiarism queues, examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;
The suspected plagiarism queue is an ordered sequence made up of multiple items, and this ordered sequence has the following features:
A1. All items in the ordered sequence occur in the same known text;
A2. The order of any two items in the ordered sequence is determined by their order in the text to be detected;
A3. Any two adjacent items in the ordered sequence are close to each other in position in the text to be detected;
The process of generating the suspected plagiarism queue is as follows:
1) The text to be detected undergoes data cleansing to obtain the sequence of items to be detected;
2) The items in the sequence to be detected are mapped onto the known terms table in turn;
3) If the known text corresponding to the item in the known terms table is not empty, the item and its positions in the known text are put into the suspected plagiarism queue;
4) If the item newly put into the suspected plagiarism queue is not close, in the text to be detected, to the last item of that queue, a new suspected plagiarism queue is generated; otherwise the existing suspected plagiarism queue is continued;
5) The above steps are repeated until the sequence of items to be detected has been fully processed;
The process of generating the evidence table is as follows:
B1. For each item in the suspected plagiarism queue, take out its position queue in the known text;
B2. For each position in the position queue, judge whether it falls within some identical interval;
B3. If there is currently no identical interval, form an identical interval with the current position as both its start and end position, and store it in the current identical queue;
B4. If the current position lies within an identical interval, go to B7;
B5. If the current position lies outside the identical interval but is close to its start or end position, extend that identical interval;
B6. If the current position lies outside the identical interval and is close to neither its start nor its end position, form a new identical interval with the current position as both its start and end position, and store it in the current identical queue;
B7. If an identical interval is long enough, store it directly in the plagiarism evidence table and delete it from the current identical queue;
B8. Repeat steps B1 to B7 until the suspected plagiarism queue has been fully processed;
A text plagiarism judgment module, used to calculate the similarity between the detected text and the known text and to judge whether the detected text contains plagiarism: if the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not;
A detection result display and plagiarized content locating module, used to output the detection result to the user and to show the specific plagiarized content of a plagiarized text as plagiarism evidence;
The electronic text input module, the text feature extraction module, the plagiarism evidence extraction module, the text plagiarism judgment module, and the detection result display and plagiarized content locating module are connected in sequence.
A method by which the above computer system detects and locates plagiarized content in electronic text, characterized in that it comprises the following steps:
Step 1: submit a detected text, or add a new detected text, to the computer system;
Step 2: for the submitted or newly added detected text, extract the features of the detected text according to text structure information and semantic information, and generate a sequence of items to be detected;
Step 3: take each item of the item sequence in turn and map it onto the known terms table, generate suspected plagiarism queues, examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;
The suspected plagiarism queue is an ordered sequence made up of multiple items, and this ordered sequence has the following features:
A1. All items in the ordered sequence occur in the same known text;
A2. The order of any two items in the ordered sequence is determined by their order in the text to be detected;
A3. Any two adjacent items in the ordered sequence are close to each other in position in the text to be detected;
The process of generating the suspected plagiarism queue is as follows:
1) The text to be detected undergoes data cleansing to obtain the sequence of items to be detected;
2) The items in the sequence to be detected are mapped onto the known terms table in turn;
3) If the known text corresponding to the item in the known terms table is not empty, the item and its positions in the known text are put into the suspected plagiarism queue;
4) If the item newly put into the suspected plagiarism queue is not close, in the text to be detected, to the last item of that queue, a new suspected plagiarism queue is generated; otherwise the existing suspected plagiarism queue is continued;
5) The above steps are repeated until the sequence of items to be detected has been fully processed;
The process of generating the evidence table is as follows:
B1. For each item in the suspected plagiarism queue, take out its position queue in the known text;
B2. For each position in the position queue, judge whether it falls within some identical interval;
B3. If there is currently no identical interval, form an identical interval with the current position as both its start and end position, and store it in the current identical queue;
B4. If the current position lies within an identical interval, go to B7;
B5. If the current position lies outside the identical interval but is close to its start or end position, extend that identical interval;
B6. If the current position lies outside the identical interval and is close to neither its start nor its end position, form a new identical interval with the current position as both its start and end position, and store it in the current identical queue;
B7. If an identical interval is long enough, store it directly in the plagiarism evidence table and delete it from the current identical queue;
B8. Repeat steps B1 to B7 until the suspected plagiarism queue has been fully processed;
Step 4: calculate the similarity between the detected text and the known text and judge whether the detected text contains plagiarism: if the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not;
Step 5: output the detection result to the user and show the specific plagiarized content of the plagiarized text as plagiarism evidence.
In the present invention, the similarity of a text is expressed jointly by the ratio of identical text to text length (the R value) and the length of the longest identical fragment (the M value). If either of the two exceeds its specified threshold, the text is considered to contain plagiarized content. Its plagiarism evidence (the specific plagiarized wording) is the text content corresponding to the relevant identical intervals in the evidence table.
Description of drawings
Fig. 1 is a structural diagram of a preferred embodiment of the present invention;
Fig. 2 is a flow chart of examining the input text and generating suspected plagiarism queues;
Fig. 3 is a flow chart of examining the suspected plagiarism queues and obtaining plagiarism evidence;
Fig. 4 is a flow chart of judging from the evidence table whether plagiarism is present.
The present invention is further described below with reference to the drawings and embodiments.
Embodiment
The basic idea of the method of the invention for detecting and locating plagiarized content in electronic text is as follows. First, detect whether two texts share a certain amount of identical text. If there is no identical text, there can be no plagiarism; if there is, proceed to the next check. Second, detect whether the identical text appears in the same order in the two texts and whether it forms statements, that is, whether there are identical statements. If there are no identical statements, there is no plagiarism; if there are, proceed to the next check. Note that identical statements need not be exactly the same character for character: individual words may differ, but the main framework of the statements should be the same. Finally, if the identical statements exceed a certain limit, plagiarism can be declared, and the identical statements are the plagiarism evidence.
The method rests on one fact: a plagiarized text must contain a certain amount of identical text. If two texts do not share enough identical text, there should be no plagiarism between them. Conversely, two texts that share a large amount of identical text are not necessarily plagiarized. But if two texts share a large amount of identical text, and this identical text also appears in the same order in both texts, that is, the two texts contain the same sentences (or paragraphs) and these reach a certain length,
then plagiarism exists between the two texts, and the shared sentences (or paragraphs) are the plagiarism evidence. The invention uses a computer system to detect whether an electronic text contains plagiarized content and to locate the specific plagiarized content accurately. The computer system comprises at least:
(1) an electronic text input module, used to submit a detected text, or add a new detected text, to the system;
(2) a text feature extraction module, which extracts the features of the submitted or newly added detected text according to text structure information and semantic information and generates a sequence of items to be detected;
(3) a plagiarism evidence extraction module, used to take each item of the item sequence in turn and map it onto the known terms table, generate suspected plagiarism queues, examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;
(4) a text plagiarism judgment module, used to calculate the similarity of the text pair and judge whether the detected text contains plagiarized content; if the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not;
(5) a detection result display and plagiarized content locating module, used to output the detection result to the user and show the specific plagiarized content of the plagiarized text (that is, the plagiarism evidence).
The electronic text input module is connected to the text feature extraction module, the text feature extraction module to the plagiarism evidence extraction module, the plagiarism evidence extraction module to the text plagiarism judgment module, and the text plagiarism judgment module to the detection result display and plagiarized content locating module. The detection and locating process comprises the following steps:
(1) submit a detected text, or add a new detected text, to the computer system;
(2) for the submitted or newly added detected text, extract the features of the detected text according to text structure information and semantic information, and generate a sequence of items to be detected;
(3) take each item of the item sequence in turn and map it onto the known terms table, generate suspected plagiarism queues, examine all suspected plagiarism queues, obtain plagiarism evidence from them, and generate the evidence table;
(4) calculate the similarity between the detected text and the known text and judge whether the detected text contains plagiarism. If the similarity is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise it is not;
(5) output the detection result to the user and show the specific plagiarized content of the plagiarized text as plagiarism evidence.
The detected text may be entered manually by the user, copied from a text the user already has, downloaded by the user over a network, or obtained automatically from the Internet.
Whatever format the detected text is stored in on the computer (for example an ASCII text file, a Microsoft Word file, an HTML file, a PDF (Portable Document Format) file, a TeX file, and so on), what it presents to the user is mainly natural language content rather than graphics, images, video or audio.
The natural language text may be in a single language such as Chinese, English, Japanese, Korean, French, Spanish, Russian or German, or may be an electronic text mixing these languages. Texts in different languages differ only in how words are segmented in the text preprocessing stage; all other steps are identical.
The smallest unit processed in the detected electronic text is the item. An item is one or more consecutive characters. The length of an item is the number of consecutive characters and is a manually set system parameter. The shortest plagiarism that can be detected is one item long.
Items are organized in the computer as follows. All items are stored in a hash table in which each item is a key; this table is called the known terms table. Each item corresponds to a file list, which holds all the files (or document codes) that contain the item; the file list is itself organized as a hash table keyed by document code. Each file (or document code) in the file list corresponds to a queue that stores the positions at which the item occurs in that file, and the positions in the queue are kept in sorted order.
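The nested structure described above (item, then document code, then ordered positions) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `build_known_terms_table`, the document codes `d1` and `d2`, and the use of whole words as items are all hypothetical and not taken from the patent.

```python
from collections import defaultdict


def build_known_terms_table(known_texts):
    """Build item -> {document code -> ordered list of positions}.

    known_texts maps a (hypothetical) document code to the list of items
    already extracted from that known text.
    """
    table = defaultdict(lambda: defaultdict(list))
    for doc_id, items in known_texts.items():
        for position, item in enumerate(items):
            # Positions are appended in increasing order, so each
            # position queue stays sorted without an explicit sort.
            table[item][doc_id].append(position)
    return table


known = {"d1": ["copy", "detect", "text", "copy"], "d2": ["text", "mine"]}
table = build_known_terms_table(known)
print(dict(table["copy"]))  # {'d1': [0, 3]}
print(dict(table["text"]))  # {'d1': [2], 'd2': [0]}
```

Python dictionaries are hash tables, so both levels of lookup (by item, then by document code) match the two hash tables the description calls for.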
The suspected plagiarism queue mentioned above is an ordered sequence made up of multiple items. The sequence has the following features:
(1) all items in the sequence occur in the same known text (denoted text d);
(2) the order of any two items in the sequence is determined by their order of appearance in the text to be detected;
(3) any two adjacent items in the sequence are close to each other in position in the text to be detected.
The plagiarism evidence mentioned above is a continuous passage of a text. This passage corresponds to some suspected plagiarism queue: all items in the queue appear in the passage, in the same order. The process of generating the suspected plagiarism queue is as follows:
(1) the text to be detected undergoes data cleansing to obtain the sequence of items to be detected;
(2) the items in the sequence to be detected are mapped onto the known terms table in turn;
(3) if the known text corresponding to the item in the known terms table is not empty, the item and its positions in the known text are put into the suspected plagiarism queue;
(4) if the item newly put into the suspected plagiarism queue is not close, in the text to be detected, to the last item of that queue, a new suspected plagiarism queue is generated; otherwise the existing suspected plagiarism queue is continued;
(5) the above steps are repeated until the sequence of items to be detected has been fully processed.
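The queue-building steps above can be sketched roughly as follows. The patent does not define "close", so the `max_gap` parameter (maximum distance between positions in the text to be detected) is an assumption, as are the function name and the tuple layout; the known terms table is passed in as a plain dictionary of the form item -> {document code -> positions}.

```python
def build_suspect_queues(detected_items, table, max_gap=2):
    """Return {doc_id: [queue, ...]}.

    Each queue is a list of tuples
    (position in detected text, item, positions in known text).
    max_gap is an assumed definition of "close".
    """
    queues = {}
    for pos, item in enumerate(detected_items):
        # Step (3): only items that occur in some known text are queued.
        for doc_id, known_positions in table.get(item, {}).items():
            doc_queues = queues.setdefault(doc_id, [])
            last_pos = doc_queues[-1][-1][0] if doc_queues else None
            if last_pos is not None and pos - last_pos <= max_gap:
                # Step (4): close to the last item, continue the queue.
                doc_queues[-1].append((pos, item, known_positions))
            else:
                # Step (4): not close (or no queue yet), start a new one.
                doc_queues.append([(pos, item, known_positions)])
    return queues


table = {"a": {"d1": [0]}, "b": {"d1": [1]}, "c": {"d1": [5]}}
detected = ["a", "b", "x", "y", "z", "c"]
queues = build_suspect_queues(detected, table)
print(len(queues["d1"]))  # 2
```

Keeping one list of queues per known document is one way to satisfy the requirement that all items of a queue occur in the same known text.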
The process of generating the evidence table is as follows:
(1) for each item in the suspected plagiarism queue, take out its position queue in the known text;
(2) for each position in the position queue, judge whether it falls within some identical interval;
(3) if there is currently no identical interval, form an identical interval with the current position as both its start and end position, and store it in the current identical queue;
(4) if the current position lies within an identical interval, go to (7);
(5) if the current position lies outside the identical interval but is close to its start or end position, extend that identical interval;
(6) if the current position lies outside the identical interval and is close to neither its start nor its end position, form a new identical interval with the current position as both its start and end position, and store it in the current identical queue;
(7) if an identical interval is long enough, store it directly in the plagiarism evidence table and delete it from the current identical queue;
(8) repeat the above steps until the suspected plagiarism queue has been fully processed.
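A rough sketch of the interval-growing procedure is given below. Several things here are assumptions rather than the patent's definitions: "close" is modelled by `max_gap`, "long enough" by `min_length`, and, as a deliberate simplification, the length check of step (7) is deferred until the whole queue has been processed so that an interval can keep growing. The function name and data layout are hypothetical.

```python
def extract_evidence(queue, max_gap=2, min_length=3):
    """Turn one suspected plagiarism queue into identical intervals.

    queue: list of (position in detected text, item, positions in known text).
    Returns (start, end) position pairs in the known text.
    """
    intervals = []  # plays the role of the "current identical queue"
    for _pos, _item, known_positions in queue:
        for p in known_positions:
            for iv in intervals:
                if iv[0] <= p <= iv[1]:           # step (4): already inside
                    break
                if iv[0] - max_gap <= p < iv[0]:  # step (5): close to the start
                    iv[0] = p
                    break
                if iv[1] < p <= iv[1] + max_gap:  # step (5): close to the end
                    iv[1] = p
                    break
            else:
                # Steps (3) and (6): no interval yet, or none close enough.
                intervals.append([p, p])
    # Simplified step (7): keep only intervals that are long enough.
    return [(s, e) for s, e in intervals if e - s + 1 >= min_length]


queue = [(0, "a", [10]), (1, "b", [11, 40]), (2, "c", [12])]
evidence = extract_evidence(queue)
print(evidence)  # [(10, 12)]
```

In the example, the stray occurrence of "b" at position 40 forms its own one-item interval, which is too short to count as evidence.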
The process of calculating the text similarity and judging text plagiarism is as follows:
(1) read in the text pair formed by the detected text and a known text;
(2) look up in the evidence table all identical intervals of this text pair (denoted P);
(3) sum the lengths of all identical intervals in P (denoted S);
(4) find the length of the longest identical interval in P (denoted M);
(5) calculate the ratio of S to the length of the known text (denoted R);
(6) if R is greater than its specified threshold, the text pair contains plagiarism;
(7) if M is greater than its specified threshold, the text pair contains plagiarism;
(8) for a text pair that contains plagiarism, output all of its identical intervals as plagiarism evidence.
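The computation of S, M and R and the two threshold tests can be sketched as below. The default threshold values are placeholders chosen for illustration only; the patent leaves them unspecified, and the function name and the inclusive (start, end) interval representation are assumptions.

```python
def judge_plagiarism(intervals, known_text_length,
                     r_threshold=0.3, m_threshold=20):
    """Compute S, M and R for one text pair and apply both thresholds.

    intervals: identical intervals P as inclusive (start, end) pairs.
    The threshold defaults are illustrative placeholders.
    """
    lengths = [end - start + 1 for start, end in intervals]
    s = sum(lengths)                # step (3): total identical length S
    m = max(lengths, default=0)     # step (4): longest identical interval M
    r = s / known_text_length if known_text_length else 0.0  # step (5)
    plagiarized = r > r_threshold or m > m_threshold         # steps (6), (7)
    return {"S": s, "M": m, "R": r, "plagiarized": plagiarized}


result = judge_plagiarism([(10, 12), (30, 39)], 100,
                          r_threshold=0.1, m_threshold=20)
print(result)  # {'S': 13, 'M': 10, 'R': 0.13, 'plagiarized': True}
```

Here the pair is flagged because R exceeds its threshold even though M does not, which mirrors the rule that either value alone is sufficient.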
The following is a preferred embodiment given by the inventors. It should be noted that the invention is not limited to this embodiment.
Referring to Fig. 1, Fig. 1 is a structural diagram of a preferred embodiment of the present invention.
The computer system in this embodiment comprises at least an electronic text input module 20, a text feature extraction module 30, a plagiarism evidence extraction module 40, a text plagiarism judgment module 50, and a detection result display and plagiarized content locating module 60. The electronic text input module 20 is connected to the text feature extraction module 30, the text feature extraction module 30 to the plagiarism evidence extraction module 40, the plagiarism evidence extraction module 40 to the text plagiarism judgment module 50, and the text plagiarism judgment module 50 to the detection result display and plagiarized content locating module 60.
The detected text may be entered manually by the user, copied from a text the user already has, downloaded by the user over a network, or obtained automatically from the Internet.
In the electronic text input module 20, the user 10 enters and submits the collected electronic texts.
In the text feature extraction module 30, text features are extracted from the submitted electronic text and an item sequence is generated.
In the plagiarism evidence extraction module 40, each item is taken in turn from the item sequence generated from the text to be detected and mapped onto the known terms table; the suspected plagiarism queues and the plagiarism evidence table are then obtained.
In the text plagiarism judgment module 50, the similarity of the text pair is calculated from the plagiarism evidence table, plagiarism is judged, and the plagiarism evidence is recorded.
Finally, the system reports the detection result to the user through the detection result display and plagiarized content locating module 60 and shows the specific plagiarized content of the plagiarized text as plagiarism evidence.
In Fig. 1, the text feature extraction module 30 must preprocess the text when extracting text features. Text preprocessing includes format conversion, word segmentation (tokenization), stemming, and removal of high-frequency words. Format conversion converts text in other formats (such as Microsoft Word files, PDF (Portable Document Format) files and so on) entirely into a plain ASCII file, so that the converted text contains no non-ASCII characters. Word segmentation cuts the text into words, so that the text becomes a long sequence of words rather than a character string. During segmentation, punctuation marks, digits and other non-letter symbols are removed, and all words are separated by a single uniform symbol (such as a space). Stemming normalizes the different morphological forms of a word to one stem; for example, danced, dancing and dance are all normalized to dance. Removing high-frequency words means eliminating from the text those words with an especially high frequency of occurrence, including single-letter words, pronouns, prepositions, modal particles and so on, such as a, he, the and of. In the end, the text feature extraction module 30 turns an input text into one long item sequence. The text feature extraction module 30 is also responsible for building the known terms table from the known texts.
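The preprocessing and table-building duties of module 30 can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop list, the crude suffix stripper (which yields the stem "danc" rather than "dance") and the names `preprocess` and `build_term_table` are assumptions introduced for illustration, and the table layout (item, then known text, then ordered positions) follows the organization described in claim 4.

```python
import re
from collections import defaultdict

# Hypothetical stop list; the text only names "a, he, the, of" as examples.
STOP_WORDS = {"a", "he", "the", "of"}

def crude_stem(word):
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Turn raw text into an item sequence (a list of stems)."""
    # Format conversion: keep only ASCII characters.
    ascii_text = text.encode("ascii", "ignore").decode("ascii")
    # Segmentation: runs of letters only, so digits and punctuation vanish.
    words = re.findall(r"[a-z]+", ascii_text.lower())
    # Drop single-letter and high-frequency words, then stem what is left.
    return [crude_stem(w) for w in words if len(w) > 1 and w not in STOP_WORDS]

def build_term_table(known_texts):
    """Map item -> {doc_id -> ordered positions of the item in that text}."""
    table = defaultdict(lambda: defaultdict(list))
    for doc_id, text in known_texts.items():
        for pos, item in enumerate(preprocess(text)):
            table[item][doc_id].append(pos)
    return table
```

Positions here are indices into the item sequence, not character offsets; a real system would also need to keep a mapping back to the original text in order to display the plagiarized passage.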
Referring to Fig. 2, Fig. 2 is a flowchart of processing the input text to generate suspected plagiarism queues.
First, in step 201, a text to be detected is read into the computer and denoted d. Next, in step 202, an item is read from text d and denoted t, and its current position in the text to be detected is recorded and denoted p. Then, in step 203, it is judged whether t exists in the known terms table. If so, step 204 is carried out; otherwise the flow goes to step 213. In step 204, all known texts containing t are taken out. In step 205, it is judged whether all known texts containing t have been processed; if so, the flow goes to step 213, otherwise step 206 is carried out. In step 206, an unprocessed known text d' containing t is taken out. Then, in step 207, the suspected plagiarism queue of the text to be detected and this known text is taken out and denoted L. Next, in step 208, it is judged whether the length of L is greater than a specified value T1; if so, the flow goes to step 212, otherwise step 209 is carried out. In step 209, the last position recorded in L is taken out and denoted p'. Then, in step 210, it is judged whether the difference between positions p and p' is greater than a specified value T2. If so, the flow goes to step 212; otherwise step 211 is carried out. In step 211, t and its position p are appended to the end of the suspected plagiarism queue L. In step 212, the current suspected plagiarism queue L is examined to obtain plagiarism evidence; for the detailed steps, see the description of Fig. 3. In step 213, it is judged whether the text to be detected d has been read to the end. If so, processing of the text to be detected is complete. Otherwise, the flow returns to step 202 and the above loop continues until all items in the text to be detected have been processed.
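The loop of Fig. 2 can be sketched in Python as follows. This is a simplified reading rather than the patented implementation. The function name, the default values of T1 and T2, and the shape of `term_table` (item, then known text id, then positions) are assumptions. The text does not state what happens to a queue after step 212; the sketch assumes, in line with item 4) of claim 1, that the full or broken queue is handed off and a fresh queue is started with the current item.

```python
from collections import defaultdict

def build_suspect_queues(items, term_table, t1=1000, t2=3):
    """Return (doc_id, queue) pairs; a queue is a list of (item, position)."""
    open_queues = defaultdict(list)   # doc_id -> queue still being extended
    finished = []                     # queues handed off for evidence detection
    for p, t in enumerate(items):                     # step 202
        if t not in term_table:                       # step 203
            continue
        for doc_id in term_table[t]:                  # steps 204-206
            queue = open_queues[doc_id]               # step 207
            too_long = len(queue) > t1                # step 208
            too_far = bool(queue) and p - queue[-1][1] > t2   # steps 209-210
            if too_long or too_far:
                finished.append((doc_id, queue))      # step 212: hand off
                # Assumption: start a new queue for this known text.
                queue = open_queues[doc_id] = []
            queue.append((t, p))                      # step 211
    # Assumption: queues still open at the end of the text are handed off too.
    finished.extend((d, q) for d, q in open_queues.items() if q)
    return finished
```

With T2 = 3, two matching items that lie more than three positions apart in the text to be detected end up in different queues, which is what keeps each queue a run of closely spaced matches against one known text.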
Referring to Fig. 3, Fig. 3 is a flowchart of examining a suspected plagiarism queue to obtain plagiarism evidence.
First, in step 301, an item is taken from the suspected plagiarism queue and denoted t. Then, in step 302, the position queue of t in the known text is taken out. Next, in step 303, it is judged whether the position queue has been fully processed. If so, the flow goes to step 319; otherwise step 304 is carried out. In step 304, the next position in the position queue is taken out and denoted P. Then, in step 305, it is judged whether the identical queue has been fully processed. If so, the flow goes to step 307; otherwise step 306 is carried out.
In step 307, a new identical interval is generated whose start and end positions are both position P, and the position of this item in the text to be detected is recorded at the same time; the new interval is inserted at the end of the identical queue, and the flow goes to step 318. In step 306, the next identical interval in the identical queue is taken out and denoted R. Then, in step 308, the difference between position P and the end position of interval R is calculated and denoted G. Next, in step 309, it is judged whether G is greater than a specified value T2. If so, step 310 is carried out; otherwise the flow goes to step 311.
In step 310, it is judged whether the length of interval R is greater than a specified value T3. If so, step 312 is carried out; otherwise the flow goes to step 305. In step 311, it is judged whether G is greater than 0. If so, step 314 is carried out; otherwise the flow goes to step 315. In step 312, interval R is a passage of plagiarized text, and R is put into the plagiarism evidence table. Next, in step 313, interval R is deleted from the identical queue, and the flow goes to step 305. In step 314, the end position of interval R is changed to position P, and the end position of this identical interval in the text to be detected is changed to the position of this item; the flow then goes to step 318. In step 315, the interval pointer in the identical queue is moved back one step. Next, in step 316, it is judged whether position P is less than the start position of interval R. If so, step 317 is carried out; otherwise the flow goes to step 318.
In step 317, a new identical interval is generated whose start and end positions are both position P, and the position of this item in the text to be detected is recorded at the same time; the new interval is inserted at the end of the identical queue, and step 318 is then carried out. In step 318, position P is marked as processed, and the flow goes to step 303. In step 319, it is judged whether all items in the suspected plagiarism queue have been processed. If so, the suspected plagiarism queue is finished: adjacent identical intervals in the plagiarism evidence table are merged into larger identical intervals, the plagiarism evidence is saved, and the process of examining the suspected plagiarism queue ends. Otherwise the flow returns to step 301 and the above loop continues until all items in the suspected plagiarism queue have been processed.
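The interval-growing logic of Fig. 3 can be sketched in Python under several stated assumptions; it should be read as an illustration, not as the patented implementation. An interval is represented as `[start, end, test_start, test_end]`, its length is taken to be `end - start + 1`, and the function name and default thresholds are hypothetical. Two points go beyond the text: intervals still open when the queue is exhausted are flushed to the evidence table if they exceed T3, and the final merging of adjacent intervals in step 319 is left out.

```python
def grow_identical_intervals(suspect_queue, known_positions, t2=2, t3=2):
    """suspect_queue: list of (item, position in the text to be detected).
    known_positions: item -> ordered positions in one known text."""
    open_intervals = []   # the "identical queue"
    evidence = []         # the plagiarism evidence table
    for item, p_test in suspect_queue:                    # step 301
        for pos in known_positions.get(item, []):         # steps 302-304
            placed = False
            for iv in list(open_intervals):               # steps 305-306
                gap = pos - iv[1]                         # step 308
                if gap > t2:                              # step 309
                    if iv[1] - iv[0] + 1 > t3:            # step 310
                        evidence.append(tuple(iv))        # step 312
                        open_intervals.remove(iv)         # step 313
                    continue                              # back to step 305
                if gap > 0:                               # step 311
                    iv[1], iv[3] = pos, p_test            # step 314: extend
                    placed = True
                elif pos >= iv[0]:                        # step 316: inside R
                    placed = True
                break                                     # leave the scan
            if not placed:                                # steps 307 and 317
                open_intervals.append([pos, pos, p_test, p_test])
    # Assumption: flush long intervals that are still open at the end.
    evidence.extend(tuple(iv) for iv in open_intervals
                    if iv[1] - iv[0] + 1 > t3)
    return sorted(evidence)
```

For example, four consecutive matches at known-text positions 10 to 13 grow into a single interval, while a lone match far away at position 40 closes that interval and, being too short, never reaches the evidence table.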
Referring to Fig. 4, Fig. 4 is a flowchart of judging from the evidence table whether a text pair contains plagiarism.
First, in step 401, a text to be detected is read and denoted d. Then, in step 402, a known text is read and denoted d'. Next, in step 403, the total length of all identical intervals between d and d' in the evidence table is calculated and denoted s. Then, in step 404, the largest identical interval between d and d' in the evidence table, that is, the identical interval of greatest length, is found, and its length is denoted M. Next, in step 405, the ratio R1 of s to the length of d is calculated. Then, in step 406, the ratio R2 of s to the length of d' is calculated. Next, in step 407, the larger of R1 and R2 is taken and denoted R. Then, in step 408, it is judged whether R is greater than a specified value T4. If so, the flow goes to step 411; otherwise step 409 is carried out. In step 409, it is judged whether M is greater than a specified value T5. If so, the flow goes to step 411; otherwise step 410 is carried out. In step 410, d is judged not to plagiarize d', and the flow goes to step 413. In step 411, d is judged to plagiarize d'. Then, in step 412, the identical intervals are taken from the evidence table and located in d and d' respectively, and the plagiarism evidence is output. Next, in step 413, it is judged whether all known texts have been processed. If so, the plagiarism judgment for the text to be detected d ends. Otherwise the flow returns to step 402 and the above loop continues until d has been judged against all known texts.
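The decision rule of steps 403 to 411 reduces to two threshold tests, which the following Python sketch makes explicit. The function name and the default values of T4 and T5 are assumptions chosen only for illustration; the text does not give concrete thresholds. Lengths are assumed to be measured in the same unit (items) for intervals and texts, and both texts are assumed to be non-empty.

```python
def judge_pair(interval_lengths, len_d, len_known, t4=0.5, t5=50):
    """Return True if text d is judged to plagiarize the known text d'.

    interval_lengths: lengths of the identical intervals of the pair,
    as read from the evidence table.
    """
    s = sum(interval_lengths)                  # step 403: total identical length
    m = max(interval_lengths, default=0)       # step 404: longest interval
    r1 = s / len_d                             # step 405
    r2 = s / len_known                         # step 406
    r = max(r1, r2)                            # step 407
    return r > t4 or m > t5                    # steps 408-411
```

Taking the larger of the two ratios means that a short text copied wholesale from a long one is flagged as readily as a long text that absorbs a short one, and the separate test on M catches a single long copied passage even when the overall ratio is small.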

Claims (6)

1. A method for detecting and locating plagiarized content in electronic text, characterized by comprising the following steps:
Step 1: submitting a detected text, or adding a new detected text, to the computer system;
Step 2: for the submitted detected text or the newly added detected text, extracting features of the detected text according to text structure information and semantic information, and generating an item sequence to be detected;
Step 3: taking each item in the item sequence in turn and mapping it onto the known terms table to generate suspected plagiarism queues, and examining all suspected plagiarism queues to obtain plagiarism evidence and generate an evidence table;
The suspected plagiarism queue is an ordered sequence composed of a plurality of items, and this ordered sequence has the following features:
A1. all items in the ordered sequence occur in the same known text;
A2. the order of any two items in the ordered sequence is determined by their order in the text to be detected;
A3. any two adjacent items in the ordered sequence are close in position in the text to be detected;
The process of generating the suspected plagiarism queue is as follows:
1) the text to be detected undergoes data cleaning to obtain the item sequence to be detected;
2) the items in the item sequence to be detected are mapped onto the known terms table in turn;
3) if the known texts corresponding to the item in the known terms table are not empty, the item and its position in the known text are put into the suspected plagiarism queue;
4) if the item newly put into the suspected plagiarism queue is not close in position, in the text to be detected, to the last item of that queue, a new suspected plagiarism queue is generated; otherwise the original suspected plagiarism queue is continued;
5) steps 2) to 4) are repeated until the item sequence to be detected has been fully processed;
The process of generating the evidence table is as follows:
B1. for each item in the suspected plagiarism queue, take out its position queue in the known text;
B2. for each position in the position queue, judge whether it falls within some identical interval;
B3. if there is currently no identical interval, form an identical interval with the current position as both its start and end position, and store it in the current identical queue;
B4. if the current position is within an identical interval, go to B7;
B5. if the current position is outside the identical interval and close to its start or end position, expand that identical interval;
B6. if the current position is outside the identical interval and not close to either its start or end position, form an identical interval with the current position as both its start and end position, and store it in the current identical queue;
B7. if an identical interval is long enough, store it directly in the plagiarism evidence table and delete it from the current identical queue;
B8. repeat steps B1 to B7 until the suspected plagiarism queue has been fully processed;
Step 4: calculating the identical degree between the detected text and the known text and judging whether the detected text contains plagiarism; if the identical degree is greater than or equal to a certain threshold, the detected text is considered to contain plagiarism; otherwise the detected text is considered not to contain plagiarism;
Step 5: outputting the detection result, and displaying to the user the specific plagiarized content of the plagiarizing text as plagiarism evidence.
2. The method of claim 1, characterized in that the detected text comes from manual entry by the user, from a copy of an existing text held by the user, from a download by the user over a network, or from automatic retrieval over the internet; and, regardless of the format in which the detected text is stored in the computer, its content is mainly natural language, not graphics, images, video or audio information.
3. The method of claim 2, characterized in that the natural language comprises text in Chinese, English, Japanese, Korean, French, Spanish, Russian, German or any other single language, or text in a mixture of the above languages.
4. The method of claim 1, characterized in that the minimum unit in which the detected text is processed is the item, an item being one or more consecutive characters, and items are organized in the computer system as follows: all items are stored in a hash table in which each item is a key and each item corresponds to a file list; the file list stores the files, or the numbers of the documents, that contain the item, and is itself organized as a hash table keyed by document number; each file or document number in the file list corresponds to a queue that stores the positions at which the item occurs in that file, and the positions in the queue are arranged in order.
5. The method of claim 1, characterized in that the process of calculating the identical degree between the detected text and the known text and judging whether the detected text contains plagiarism comprises the following steps:
1) read in a text pair formed by the detected text and a known text;
2) look up in the evidence table all identical intervals of the text pair, denoted P;
3) sum the lengths of all identical intervals in P, denoted S;
4) find the length of the longest identical interval in P, denoted M;
5) calculate the ratio of S to the length of the known text, denoted R;
6) if R is greater than a specified threshold, the text pair contains plagiarism;
7) if M is greater than a specified threshold, the text pair contains plagiarism;
8) output all identical intervals of the text pair that contains plagiarism as plagiarism evidence.
6. The method of claim 1, characterized in that displaying the specific plagiarized content of the plagiarizing text means outputting, according to the identical intervals in the evidence table, the text content corresponding to those intervals and presenting it to the user.
CN2008102323098A 2008-11-18 2008-11-18 Method for detecting and positioning electronic text contents plagiary Expired - Fee Related CN101404037B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102323098A CN101404037B (en) 2008-11-18 2008-11-18 Method for detecting and positioning electronic text contents plagiary

Publications (2)

Publication Number Publication Date
CN101404037A CN101404037A (en) 2009-04-08
CN101404037B true CN101404037B (en) 2011-05-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110518

Termination date: 20131118