TW201316186A - Chinese anti-piracy and plagiarism detecting system and its method - Google Patents

Publication number
TW201316186A
Authority
TW
Taiwan
Prior art keywords
article
comparison
search
plagiarism
component
Prior art date
Application number
TW100136908A
Other languages
Chinese (zh)
Other versions
TWI444838B (en)
Inventor
Chun-Ching Yang
Original Assignee
Chun-Ching Yang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chun-Ching Yang
Priority to TW100136908A (patent TWI444838B)
Priority to CN2012102585167A (patent CN103049467A)
Publication of TW201316186A
Application granted
Publication of TWI444838B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a Chinese anti-piracy and plagiarism detecting system and its method. The system comprises four mechanisms: an article loading mechanism, an article dividing mechanism, a searching and comparing mechanism, and an evaluating and reporting mechanism. These mechanisms run on a central server, a cluster of operating host computers, and a search engine. To use the system, a user first uploads the target article to be checked to the central server via the article loading mechanism, and the central server assigns the target article to an operating host computer. Using the article dividing mechanism and the word/sentence-dividing algorithm defined in the system's program, the operating host computer then divides the target article into sentence units according to the user's initial settings. Next, the searching and comparing mechanism searches for similar web pages or articles. Each web page or article identified is downloaded in full to the operating host computer, where the full text of the target article is compared with its full text. The evaluating and reporting mechanism then presents the comparison result: the matching sentences of the target article and the matching parts of the web pages or articles are both marked, and the sources of those web pages or articles are displayed. Finally, the operating host computer transmits the result to the central server, where the user can read it.

Description

Chinese digital anti-plagiarism detection and comparison system and method

The present invention relates to a Chinese digital anti-plagiarism detection and comparison system and method, and in particular to one that uses a search engine: the article to be checked is split into sentence fragments by a sentence-splitting algorithm, and each fragment is searched separately. When the summary of a search result matches the searched fragment, the web page of that result is downloaded for full-text comparison, so that whether the article was plagiarized from that website can be determined quickly.

In recent years the spread of the Internet has changed how students do their homework. Where students were once encouraged to consult online material, their use of it must now be curbed, because Internet use has indeed caused widespread plagiarism. First, the Internet creates a setting highly conducive to plagiarism, in which people under situational and time pressure readily plagiarize. Second, a "shortcut" mentality pervades society: students care only about finishing assignments efficiently rather than about their quality, and academia likewise values the number of publications over the quality of their content. Furthermore, the deviant behavior of plagiarism has become the norm, ghostwriting services have further degraded the academic environment, and dishonest plagiarism has escalated into serious academic fraud.

In view of the growing seriousness of student plagiarism, especially hodgepodge articles copied from online material or made by reprocessing, re-collaging, and rearranging such material, for-profit companies abroad have developed digital plagiarism detection software as a countermeasure. After long operation and testing, such software has indeed reduced the incidence of plagiarism and deterred students from taking their chances. However, the test reports and literature on most detection systems come almost entirely from English-speaking countries, and the accumulated knowledge is limited to the English environment and cannot be transplanted to Chinese. Chinese differs greatly from English in written structure, word combination, text segmentation, punctuation use, and so on. A digital plagiarism comparison system and method with a Chinese interface therefore still needs to be developed for the Chinese-language education community.

A known Chinese plagiarism comparison system and method is Taiwan Invention Patent Publication No. I262402, "Feature extraction and data decryption methods, and system and method for plagiarized-article search". It performs feature extraction on an article 10 in which a watermark has been embedded to obtain document features 20, enters the resulting terms into a search engine 30, and consults an index database 40 to search the Internet for possibly plagiarized articles 50. Each article 50 found is then compared with the original, and watermark analysis is performed on the sentences obtained from the comparison. Finally, the extracted watermark information 60 is compared with the original watermark, and the comparison result determines whether the found article is plagiarized: if the result exceeds a threshold, the article is deemed a plagiarized article 70.

The feature extraction method of that prior invention performs word segmentation and part-of-speech tagging on the sentences and words obtained from the watermarked article. Then, based on the words and sentence patterns in which the watermark is embedded, it uses a synonym database and a synonymous-sentence-pattern database to generate semantic-level features of the article's text; that is, it extracts the sentences and words in which the watermark is hidden. Using the words and their parts of speech as the basis of the query definition, it searches the synonym database to obtain words that can serve as features of the article. It then uses those words as keywords for a web search through a search engine to find possibly plagiarized articles.

However, this known plagiarized-article search system and method requires a complicated series of steps: embedding a watermark in the article; extracting features from the watermarked article; entering the resulting terms into the search engine 30; consulting the index database 40 to search the Internet for possibly plagiarized articles 50; comparing each found article 50 with the original; performing watermark analysis on the sentences obtained from the comparison; comparing the extracted watermark information 60 with the original watermark; and judging from the comparison result whether the found article is plagiarized. Although it can detect passages in which synonyms or synonymous sentences have been substituted, for ordinary theses or long articles so many complex steps place a heavy load on the server, which slows the comparison and reduces efficiency.

Moreover, the identical passages a comparison system finds may be a bibliography or a "direct quotation" within accepted norms, so even the most precise comparison system can hardly decide by itself whether an article is plagiarized. Finally, the term "plagiarism" still has no objective, clear definition, and the number of identical characters is an important factor in judging plagiarism; yet the watermark scope setting does not let users adjust flexibly according to their own subjective definition or the objective academic environment.

Therefore, to judge more efficiently whether an article is plagiarized, the comparison must first be made faster, and the result of the plagiarism comparison system should serve as an aid to that judgment, followed by manual review for further verification. In addition, users should be able to adjust the definition of the "plagiarized" character count flexibly according to their own academic environment or needs. How to speed up comparison, let reviewers see clearly and quickly the differences between the suspected passages of the compared article and the passages found by search, and allow flexible adjustment of the character-count definition of plagiarism are thus matters the known plagiarized-article search system fails to address.

Accordingly, drawing on years of experience in developing related systems and researching methods, the inventor studied current Chinese digital anti-plagiarism detection and comparison systems and methods and devised the present invention.

The object of the present invention is to provide a Chinese digital anti-plagiarism detection and comparison method. A user uploads the comparison article to a central server for data access. The article is split, according to splitting rules set by the system program, into fragments of a fixed number of characters. A search engine then searches the fragments one by one, yielding multiple web pages or articles that match the fragments. Each matching web page or article is compared in full text with the comparison article, and the parts they share are marked and analyzed to produce a comparison result. The reviewer can manually inspect this result and judge whether the article is plagiarized.

In the Chinese digital anti-plagiarism detection and comparison system of the present invention, the article access mechanism has a homepage component, which provides the user's entry point into the system, as well as an upload component and a database component. When detection of an uploaded comparison article begins, the article splitting mechanism applies a sentence-splitting algorithm: each article is first split at line breaks into paragraphs; each paragraph is then cut, according to a user-defined split length, into fixed-length minimum detection sentences; the punctuation in each minimum detection sentence is deleted; and a user-defined search-comparison length is used as a threshold to select the minimum detection sentences that qualify for search and comparison. The qualifying minimum detection sentences are then registered and numbered for use by the search-comparison mechanism in detecting plagiarism. The search-comparison mechanism has a comparison component, a determination component, and a plagiarism-source comparison component. The comparison and determination components find web pages or articles that partly match the split article; the plagiarism-source comparison component links to and downloads those matching web pages or articles; and the comparison and determination components then compare the comparison article with them in full text. After this full-text comparison and determination, the evaluation-report mechanism analyzes each uploaded comparison article and produces the comparison result.

In the system and method of the present invention, the central server of the system is provided with a plurality of computing hosts. The central server assigns the comparison articles uploaded by users to the computing hosts. Each computing host splits its article, downloads the web pages or articles that partly match the split article, and performs the full-text comparison to obtain the comparison result, which it returns to the central server; the central server then delivers it to the user.

To give the examiner a full understanding of the features of the present invention, it is explained below with reference to the drawings. Figure 1 is a flowchart of the steps of the digital plagiarism comparison method of the invention. First, the user uploads the comparison article 10 to the central server 11 for data access. The comparison article 10 is split by the sentence-splitting algorithm into the split article 12. The search engine 13 is then queried with the minimum detection sentences of the split article 12 one by one, yielding multiple web pages or articles 14 that partly match the split article 12, which are downloaded. The comparison article 10 then undergoes full-text comparison 15 with each matching web page or article 14, and the parts they share are marked and analyzed to produce the comparison result 16. The reviewer can manually inspect the comparison result 16 to judge further whether the article is plagiarized.

The Chinese digital anti-plagiarism detection and comparison system of the invention comprises at least the following. Article access mechanism 2: it has a homepage component 20, which provides the user's entry point into the system, including the website homepage, authentication of user accounts and passwords, and a mechanism for recovering forgotten account passwords. User identities can be divided into system administrator, teacher, student, and visitor to simplify user management and permission settings.

The article access mechanism 2 has an upload component 21, which provides a web page through which the user uploads the comparison article 10 to the central server 11. The upload component 21 checks whether the format of the uploaded comparison article 10 (Word or PDF) meets the system's requirements, stores the uploaded article in the database component 22 of the central server 11, and lets the user re-upload the comparison article 10 within the deadline.
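The format check performed by the upload component 21 could be sketched roughly as follows. This is an illustration only: the function name `is_accepted_upload` and the set of accepted file extensions are assumptions, since the text says only that the format must be Word or PDF.

```python
from pathlib import Path

# Assumed extensions for "Word or PDF"; the source does not list them.
ACCEPTED_EXTENSIONS = {".doc", ".docx", ".pdf"}


def is_accepted_upload(filename: str) -> bool:
    """Return True if the file name carries an accepted Word or PDF extension."""
    return Path(filename).suffix.lower() in ACCEPTED_EXTENSIONS
```

A real upload component would likely inspect the file content as well, because an extension alone does not prove the format.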

The article access mechanism 2 has a database component 22, whose main job is to continuously collect and store digital material of all types from the Internet, enlarging the pool of sources available for comparison.

Article splitting mechanism 3: when detection of an uploaded comparison article 10 begins, the article splitting mechanism 3 applies the sentence-splitting algorithm 30. Each comparison article 10 is first split at line breaks 31 into paragraphs 310. According to a user-defined split length 32, each paragraph 310 is then cut into fixed-length minimum detection sentences 320, and the punctuation in each minimum detection sentence 320 is deleted. A user-defined search-comparison length 33 then serves as a threshold for selecting the minimum detection sentences 320 that qualify for search and comparison: if fewer characters remain in a minimum detection sentence 320 after its punctuation is deleted than the search-comparison length 33, that sentence is not searched or compared. The sentence-splitting algorithm 30 thus avoids both fragments too short to be worth searching and fragments too long to search easily. Finally, the qualifying minimum detection sentences 320 are registered and numbered for use by the search-comparison mechanism 4 in detecting plagiarism.

As an example of the sentence-splitting algorithm 30, suppose line-break splitting 31 yields the following paragraph 310 from a comparison article 10: 『他慢慢蹲下來,好了一點,好了一點。從略微的仰角,他看到街對面有個手拿氣球的奇怪女人正抬頭仰望天空,她像發現幽浮似地,嘴不由自主地張開來。』 ("He slowly squatted down; a little better, a little better. From a slightly upward angle he saw a strange woman across the street, balloon in hand, looking up at the sky; as if she had spotted a UFO, her mouth fell open involuntarily.") With a split length 32 of 15 characters, the paragraph 310 is cut into these fixed-length minimum detection sentences 320: 『(他慢慢蹲下來,好了一點,好了一)(點。從略微的仰角,他看到街對面)(有個手拿氣球的奇怪女人正抬頭仰)(望天空,她像發現幽浮似地,嘴不)(由自主地張開來。)』. Deleting the punctuation from each gives: 『(他慢慢蹲下來好了一點好了一)(點從略微的仰角他看到街對面)(有個手拿氣球的奇怪女人正抬頭仰)(望天空她像發現幽浮似地嘴不)(由自主地張開來)』. With a search-comparison length 33 of 8 characters, the fragment (由自主地張開來) has only 7 characters and falls below the threshold. The minimum detection sentences 320 that are finally numbered and submitted for search and comparison are therefore: 『(他慢慢蹲下來好了一點好了一)(點從略微的仰角他看到街對面)(有個手拿氣球的奇怪女人正抬頭仰)(望天空她像發現幽浮似地嘴不)』, four in all.
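The sentence-splitting algorithm 30 described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function and parameter names are hypothetical, and treating every Unicode character whose category starts with "P" as punctuation is an assumption, since the text does not say which characters count as punctuation.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def split_into_detection_sentences(article: str, split_length: int = 15,
                                   min_search_length: int = 8):
    """Split an article into numbered minimum detection sentences.

    1. Split at line breaks into paragraphs.
    2. Cut each paragraph into chunks of `split_length` characters.
    3. Delete punctuation from each chunk.
    4. Keep only chunks with at least `min_search_length` characters left.
    Returns a list of (number, sentence) pairs, numbered from 1.
    """
    kept = []
    for paragraph in article.splitlines():
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        for start in range(0, len(paragraph), split_length):
            chunk = strip_punctuation(paragraph[start:start + split_length])
            if len(chunk) >= min_search_length:
                kept.append(chunk)
    return list(enumerate(kept, start=1))


if __name__ == "__main__":
    sample = ("他慢慢蹲下來,好了一點,好了一點。從略微的仰角,"
              "他看到街對面有個手拿氣球的奇怪女人正抬頭仰望天空,"
              "她像發現幽浮似地,嘴不由自主地張開來。")
    for number, sentence in split_into_detection_sentences(sample):
        print(number, sentence)
```

Run on the sample paragraph with a split length of 15 and a search-comparison length of 8, the sketch keeps the same four fragments as the worked example and drops the 7-character remainder.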

The search-comparison mechanism 4 has a comparison component 40. The comparison component 40 submits the filtered minimum detection sentences 320 for comparison through an Internet search engine 13 (such as the Google or YAHOO search engine), electronic article databases of various kinds, or other students' articles; it also performs the full-text comparison 15 of the comparison article 10 with a matching web page or article 14.

The search-comparison mechanism 4 has a determination component 41. Based on the results from the comparison component 40, the determination component 41 judges whether each sentence and each paragraph of the comparison article 10 is plagiarized and expresses the likelihood of plagiarism as a percentage; it likewise evaluates the plagiarism ratio found by the full-text comparison 15 of the comparison article 10 with a matching web page or article 14. The search-comparison mechanism 4 also has a plagiarism-source comparison component 42, whose main job is to establish the link between plagiarized text and its source, linking to and downloading the web pages or articles 14 that resemble the comparison article 10 for full-text comparison 15.
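One way the determination component 41 might turn matched passages into a percentage is sketched below. The formula is an assumption, since the text does not define how the percentage is computed: here it is the share of the article's characters covered by matched spans, with overlapping spans counted once. The function name and the half-open `(start, end)` span format are hypothetical.

```python
def plagiarism_percentage(article_length: int, matched_spans) -> float:
    """Percentage of article characters covered by matched (start, end) spans.

    Spans are half-open character ranges; overlaps are counted only once.
    """
    if article_length <= 0:
        return 0.0
    covered = 0
    last_end = 0
    for start, end in sorted(matched_spans):
        start = max(start, last_end)      # skip the part already counted
        end = min(end, article_length)    # clip to the article
        if end > start:
            covered += end - start
            last_end = end
    return 100.0 * covered / article_length
```

For example, spans (0, 10), (5, 15), and (30, 40) in a 100-character article cover 25 distinct characters, giving 25 percent.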

To obtain web pages or articles 14 that resemble the comparison article 10, the invention uses the search engine 13, the comparison component 40, and the determination component 41 of the search-comparison mechanism 4. The minimum detection sentences 320 selected by the sentence-splitting algorithm 30 are searched one by one. For each search result, the proportion of common subsequence between the searched sentence and the result's summary is computed and checked against a set threshold. If the proportion exceeds the threshold, the plagiarism-source comparison component 42 links to and downloads that web page or article, yielding a web page or article 14 that partly matches the split article 12 for the subsequent full-text comparison 15.
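The screening step above can be sketched as follows. Several details are assumptions: the text speaks only of "the proportion of common subsequence", so this sketch uses the longest common subsequence (LCS) and divides by the length of the searched sentence; the threshold of 0.8 and all function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def common_ratio(sentence: str, summary: str) -> float:
    """Share of the searched sentence found, in order, in the result summary."""
    if not sentence:
        return 0.0
    return lcs_length(sentence, summary) / len(sentence)


def should_download(sentence: str, summary: str,
                    threshold: float = 0.8) -> bool:
    """True if the search result is worth downloading for full-text comparison."""
    return common_ratio(sentence, summary) > threshold
```

Screening on the short result summary keeps the expensive full-text comparison for pages that already look like a match.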

For the full-text comparison 15, the comparison component 40 of the search-comparison mechanism 4 concatenates the comparison article 10 and the matching web page or article 14 directly, joined by a meaningless character, processes the result with a suffix array data structure, and then applies data partitioning (DP) to obtain every locally longest passage of the comparison article 10 that also appears in the matching web page or article 14. The determination component 41 of the search-comparison mechanism 4 then evaluates the plagiarism ratio found by the full-text comparison 15. Because the identical passages may be a bibliography or a "direct quotation" within accepted norms, and to ease manual inspection by the reviewer, the search-comparison mechanism 4 highlights the passages shared by the comparison article 10 and the matching web page or article 14 in yellow in both the comparison article 10 and the matching web page or article 14.
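The concatenate-and-sort idea behind the full-text comparison 15 can be illustrated with a simplified sketch. It joins the two texts with a separator character, builds a naive suffix array, and finds only the single longest passage the two texts share; it does not reproduce the step that collects every locally longest passage. The function name, the choice of `"\x00"` as the separator, and the naive construction (fine for short texts, too slow for long ones) are all assumptions.

```python
def longest_common_substring(article: str, page: str, sep: str = "\x00") -> str:
    """Longest passage shared by two texts, found via a naive suffix array.

    Assumes `sep` occurs in neither text.
    """
    text = article + sep + page
    boundary = len(article)  # index of the separator
    # Naive suffix array: suffix start positions in lexicographic order.
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for x, y in zip(suffixes, suffixes[1:]):
        # Only neighbouring suffixes from different texts can give a shared passage.
        if (x < boundary) == (y < boundary):
            continue
        k = 0
        while (x + k < len(text) and y + k < len(text)
               and text[x + k] == text[y + k] and text[x + k] != sep):
            k += 1
        if k > len(best):
            best = text[x:x + k]
    return best
```

The longest shared passage always shows up as the common prefix of two neighbouring suffixes that start in different texts, which is why checking adjacent entries is enough.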

The evaluation-report mechanism 5 is the last stage of the detection work. After the full-text comparison and determination of the comparison article 10 against the matching web pages or articles 14, the evaluation-report mechanism 5 analyzes each uploaded comparison article 10 and produces the comparison result 16.

Figure 2 shows an embodiment of the Chinese digital anti-plagiarism detection and comparison system of the invention. After each user 6 uploads a comparison article 10 to the central server 11 through the article access mechanism 2 for data access, the central server 11 assigns the uploaded comparison articles 10 to the computing hosts 110. A computing host 110 uses the article splitting mechanism 3 to split the comparison article 10 into the split article 12, and the search-comparison mechanism 4 submits the sentences of the split article 12 to the search engine 13 one by one. The web pages or articles 14 that partly match the split article 12 are downloaded to the computing host 110 for the full-text comparison 15 with the comparison article 10. The evaluation-report mechanism 5 analyzes each uploaded comparison article 10, and once the comparison result 16 is produced it is returned to the central server 11, where the user 6 can view it.
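How the central server 11 might hand uploaded articles to the computing hosts 110 can be sketched as below. The round-robin policy is an assumption chosen for simplicity; the text does not say how articles are assigned. The function name and the identifiers in the example are hypothetical.

```python
from itertools import cycle


def assign_articles(article_ids, host_names):
    """Assign articles to computing hosts in round-robin order.

    Returns a dict mapping each host name to its list of article ids.
    """
    if not host_names:
        raise ValueError("at least one computing host is required")
    assignment = {host: [] for host in host_names}
    for article_id, host in zip(article_ids, cycle(host_names)):
        assignment[host].append(article_id)
    return assignment
```

With three articles and two hosts, the first host receives the first and third articles and the second host receives the second. A production scheduler would more likely weigh each host's current load.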

The division of labor between the central server 11 and the computing hosts 110 thus raises the search and comparison speed of the system, and the article splitting mechanism 3 and search-comparison mechanism 4 speed up the comparison of the comparison article 10. The passages shared by the comparison article 10 and the matching web page or article 14 are highlighted in yellow in both the article and the web page, so the user 6 can quickly obtain the comparison result 16 and use it for manual inspection to judge whether the article is plagiarized. This constitutes the present invention.

10 ... Comparison article

11 ... Central server

12 ... Split article

13 ... Search engine

14 ... Matching web page or article

15 ... Full-text comparison

16 ... Comparison result

2 ... Article access mechanism

20 ... Homepage component

21 ... Upload component

22 ... Database component

3 ... Article splitting mechanism

30 ... Sentence-splitting algorithm

310 ... Paragraph

32 ... Split length

320 ... Minimum detection sentence

33 ... Search-comparison length

4 ... Search-comparison mechanism

40 ... Comparison component

41 ... Determination component

42 ... Plagiarism-source comparison component

5 ... Evaluation-report mechanism

6 ... User

110 ... Computing host

Figure 1 is a flowchart of the steps of the Chinese digital anti-plagiarism detection and comparison method of the present invention.

Figure 2 shows an embodiment of the present invention.

Claims (8)

1. A Chinese digital anti-plagiarism detection and comparison method, comprising the following steps: uploading a comparison article to a central server for data access; disassembling the comparison article with a word and sentence disassembly algorithm to obtain a disassembled article; submitting the minimum detection sentences of the disassembled article to a search engine one sentence at a time to obtain multiple web pages or articles that are partly similar to the disassembled article, and downloading those similar web pages or articles; and performing a full-text comparison between the comparison article and the similar web pages or articles, marking and analyzing the portions of the comparison article that resemble them, to obtain a comparison result that a reviewer can inspect manually to judge whether the comparison article is plagiarized.
2. A Chinese digital anti-plagiarism detection and comparison system, comprising at least:

an article access mechanism, which has a home page component that gives users an entry point into the Chinese digital anti-plagiarism detection and comparison system; an upload component that provides a web page through which users upload comparison articles to the central server; and a database component that continuously collects and stores digital data of all types from the Internet, enlarging the pool of sources available for comparison;

an article disassembly mechanism, which, when detection of an uploaded comparison article begins, applies a word and sentence disassembly algorithm: each comparison article is first split at line breaks into a plurality of paragraphs; each paragraph is then split, according to a custom disassembly word count, into fixed-length minimum detection sentences; the punctuation marks in each minimum detection sentence are deleted; a custom search comparison word count is then used as a threshold to select the minimum detection sentences that satisfy the search comparison condition; and finally each qualifying minimum detection sentence is assigned a registration number for use by the search comparison mechanism in detecting plagiarism;

a search comparison mechanism, which has a comparison component that submits the selected minimum detection sentences for comparison through Internet search engines, electronic article databases, or other students' articles, or that performs a full-text comparison between the comparison article and similar web pages or articles; a determination component that, based on the results from the comparison component, judges whether each sentence and each paragraph of the comparison article is plagiarized and expresses the likelihood of plagiarism as a percentage, and that likewise evaluates the proportion of plagiarism found by the full-text comparison; and a plagiarism source comparison component that establishes the links between plagiarized text and its sources, so that web pages or articles similar to the comparison article can be linked and downloaded for full-text comparison; and

an evaluation reporting mechanism, which is the last stage of the detection work: after the full-text comparison and determination have been carried out on the comparison article, it analyzes each uploaded comparison article and produces the comparison result.
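The disassembly steps recited in the claim above (split at line breaks, cut each paragraph into fixed-length units, delete punctuation, filter by a minimum length, number the survivors) can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function name `disassemble`, the parameter names `chunk_chars` and `min_search_chars`, their default values, and the reading of the search comparison condition as "at least `min_search_chars` characters remain after punctuation is removed" are all assumptions.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    # Unicode general categories beginning with "P" cover ASCII punctuation
    # as well as full-width CJK marks such as "，" and "。".
    return unicodedata.category(ch).startswith("P")


def disassemble(article: str, chunk_chars: int = 20, min_search_chars: int = 10):
    """Return a list of (registration_number, minimum_detection_sentence)."""
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    numbered = []
    # Step 1: split the article into paragraphs at line breaks.
    for paragraph in article.splitlines():
        # Step 2: cut each paragraph into fixed-length units.
        for start in range(0, len(paragraph), chunk_chars):
            chunk = paragraph[start:start + chunk_chars]
            # Step 3: delete punctuation marks from the unit.
            sentence = "".join(ch for ch in chunk if not is_punctuation(ch))
            # Step 4 (assumed condition): keep units that are still long enough.
            if len(sentence) >= min_search_chars:
                # Step 5: assign a sequential registration number.
                numbered.append((len(numbered) + 1, sentence))
    return numbered
```

Because punctuation is removed after the fixed-length cut, a unit that was mostly punctuation falls below the threshold and is never sent to the search engine, which is one plausible reason for the two separate word counts.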
3. The Chinese digital anti-plagiarism detection and comparison system according to claim 2, wherein the home page component includes the website home page, authentication of user accounts and passwords, and a mechanism for recovering forgotten account passwords; and wherein user identities can be classified as system administrator, teacher, student, or visitor, to facilitate user management and the setting of usage permissions.

4. The Chinese digital anti-plagiarism detection and comparison system according to claim 2, wherein the upload component checks whether the format of an uploaded comparison article meets the system requirements, stores the uploaded comparison article in the database component of the central server, and oversees users' re-uploading of the comparison article within the deadline.

5. The Chinese digital anti-plagiarism detection and comparison system according to claim 1 or 2, wherein the web pages or articles similar to the comparison article are obtained by using the search engine, the comparison component, and the determination component of the search comparison mechanism: the minimum detection sentences selected by the word and sentence disassembly algorithm are searched and compared one sentence at a time; for each search result, the proportion of the common subsequence between the searched sentence and the summary of the search result is calculated; a threshold is set; and if the proportion of the common subsequence exceeds the threshold, the plagiarism source comparison component links to and downloads that web page or article, thereby obtaining the web pages or articles partly similar to the disassembled article for the subsequent full-text comparison.

6. The Chinese digital anti-plagiarism detection and comparison system according to claim 1 or 2, wherein the full-text comparison uses the comparison component of the search comparison mechanism: the comparison article and a similar web page or article are concatenated directly with a meaningless character between them and processed with a suffix array data structure; then, using the technique of data partitioning (DP), all sentences that are locally longest in the comparison article and also appear in the similar web page or article are obtained; and the determination component of the search comparison mechanism then evaluates the proportion of plagiarism found by the full-text comparison.

7. The Chinese digital anti-plagiarism detection and comparison system and method according to claim 1 or 2, wherein the search comparison mechanism highlights in yellow the portions that are identical between the comparison article and the similar web pages or articles, in the comparison article and in the similar web pages or articles respectively.
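The two comparison stages recited in the claims above can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the claimed implementation: all function names are hypothetical, the common-subsequence ratio is assumed to be taken relative to the length of the searched sentence, the 0.8 threshold is arbitrary, the suffix array is built naively by sorting suffixes (far too slow for long documents), and a plain left-to-right scan stands in for the data partitioning (DP) step, whose details the claims do not give.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def should_download(sentence: str, summary: str, threshold: float = 0.8) -> bool:
    """Screening stage: is the search-result summary close enough to the sentence?"""
    if not sentence:
        return False
    # Assumption: the ratio is measured against the searched sentence's length.
    return lcs_length(sentence, summary) / len(sentence) > threshold


def longest_shared_passages(article: str, source: str, min_len: int = 5, sep: str = "\x00"):
    """Full-text stage: locally longest article substrings that occur in source."""
    if sep in article or sep in source:
        raise ValueError("separator must not occur in either text")
    # Join both texts with a meaningless character, then sort all suffixes.
    text = article + sep + source
    n_article = len(article)
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])

    def common_prefix(i: int, j: int) -> int:
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k

    # For each article position, the longest match lies with the nearest
    # source suffix just before or just after it in sorted order.
    best = [0] * n_article
    for order in (suffix_array, suffix_array[::-1]):
        nearest_source = None
        for pos in order:
            if pos > n_article:
                nearest_source = pos
            elif pos < n_article and nearest_source is not None:
                best[pos] = max(best[pos], common_prefix(pos, nearest_source))

    # Keep matches that are long enough and not contained in the match
    # starting one character earlier.
    passages = []
    for i, length in enumerate(best):
        covered_by_previous = i > 0 and best[i - 1] > length
        if length >= min_len and not covered_by_previous:
            passages.append((i, article[i:i + length]))
    return passages
```

The character offsets returned by `longest_shared_passages` are what a front end would need in order to highlight the matching spans in the comparison article.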
8. The Chinese digital anti-plagiarism detection and comparison system and method according to claim 1 or 2, wherein the central server is provided with a plurality of computing hosts; the central server assigns the comparison articles uploaded by users to the computing hosts; each computing host performs the article disassembly, downloads the web pages or articles partly similar to the comparison article, and performs the full-text comparison to obtain the comparison result; and the comparison result is returned to the central server and then delivered to the user by the central server.
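The division of labor between the central server and the computing hosts described above can be illustrated with a small Python sketch. It is only an analogy under assumptions: worker threads stand in for separate computing hosts, and the names `detect` and `dispatch` and the shape of the returned result are hypothetical placeholders, not part of the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def detect(article: str) -> dict:
    """Placeholder for the per-host work: disassembly, search, full-text comparison."""
    return {"article": article, "matches": []}


def dispatch(articles, num_hosts: int = 4):
    """Central-server role: hand each uploaded article to a host, gather the results."""
    with ThreadPoolExecutor(max_workers=num_hosts) as hosts:
        # Executor.map returns results in the order the articles were submitted,
        # so each result can be routed back to the user who uploaded it.
        return list(hosts.map(detect, articles))
```

In a real deployment the hosts would be separate machines reached over a network, so the central server would also have to handle host failures and retries, which this sketch leaves out.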
TW100136908A 2011-10-12 2011-10-12 Chinese anti-piracy and plagiarism detecting system and its method TWI444838B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW100136908A TWI444838B (en) 2011-10-12 2011-10-12 Chinese anti-piracy and plagiarism detecting system and its method
CN2012102585167A CN103049467A (en) 2011-10-12 2012-07-24 Chinese digital anti-plagiarism detection and comparison system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW100136908A TWI444838B (en) 2011-10-12 2011-10-12 Chinese anti-piracy and plagiarism detecting system and its method

Publications (2)

Publication Number Publication Date
TW201316186A true TW201316186A (en) 2013-04-16
TWI444838B TWI444838B (en) 2014-07-11

Family

ID=48062110

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100136908A TWI444838B (en) 2011-10-12 2011-10-12 Chinese anti-piracy and plagiarism detecting system and its method

Country Status (2)

Country Link
CN (1) CN103049467A (en)
TW (1) TWI444838B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345466B (en) * 2013-07-12 2016-09-07 唐煜舟 A kind of scientific paper information detecting method based on internet free information
CN103412905A (en) * 2013-07-31 2013-11-27 广联达软件股份有限公司 PDF (Portable document format) file comparison method and system
CN103412904A (en) * 2013-07-31 2013-11-27 广联达软件股份有限公司 PDF (portable document format) file comparison method and PDF file comparison system
CN106649871B (en) * 2017-01-03 2019-10-25 广州爱九游信息技术有限公司 Detection method, device and the calculating equipment of article multiplicity
CN109710834B (en) * 2018-11-16 2020-01-10 北京字节跳动网络技术有限公司 Similar webpage detection method and device, storage medium and electronic equipment
CN111611787A (en) * 2019-02-25 2020-09-01 中国海洋大学 Plagiarism evaluation method, system and auxiliary writing system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100406671B1 (en) * 2000-07-24 2003-11-21 주식회사 유니마이다스 Method of searching for piracy and steal on a piece of writing
US7503035B2 (en) * 2003-11-25 2009-03-10 Software Analysis And Forensic Engineering Corp. Software tool for detecting plagiarism in computer source code
CN101334789A (en) * 2008-08-04 2008-12-31 福建师范大学 Device for identifying document plagiarism by search engine
CN101404037B (en) * 2008-11-18 2011-05-18 西安交通大学 Method for detecting and positioning electronic text contents plagiary
CN101957809A (en) * 2010-10-14 2011-01-26 传神联合(北京)信息技术有限公司 Anti-plagiarism method

Also Published As

Publication number Publication date
TWI444838B (en) 2014-07-11
CN103049467A (en) 2013-04-17

Similar Documents

Publication Publication Date Title
Ali et al. Overview and comparison of plagiarism detection tools.
Saier et al. unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata
CN109299865B (en) Psychological evaluation system and method based on semantic analysis and information data processing terminal
White et al. Sentence-based natural language plagiarism detection
US9514417B2 (en) Cloud-based plagiarism detection system performing predicting based on classified feature vectors
TWI444838B (en) Chinese anti-piracy and plagiarism detecting system and its method
CN105701076B (en) A kind of paper plagiarizes detection method and system
Ali et al. Survey of plagiarism detection methods
El Moatez Billah Nagoudi et al. 2L-APD: A two-level plagiarism detection system for Arabic documents
Stamatatos et al. Plagiarism and authorship analysis: introduction to the special issue
Brugman et al. Nederlab: Towards a single portal and research environment for diachronic Dutch text corpora
KR102280490B1 (en) Training data construction method for automatically generating training data for artificial intelligence model for counseling intention classification
Elhadi et al. Use of text syntactical structures in detection of document duplicates
CN105701085B (en) A kind of network duplicate checking method and system
Alohaly et al. Better privacy indicators: a new approach to quantification of privacy policies
Jiffriya et al. Plagiarism detection tools and techniques: A comprehensive survey
Panchenko et al. Detection of child sexual abuse media on p2p networks: Normalization and classification of associated filenames
Pera et al. SimPaD: A word-similarity sentence-based plagiarism detection tool on Web documents
Ruppert et al. LawStats–Large-Scale German Court Decision Evaluation Using Web Service Classifiers
CN105701086B (en) A kind of sliding window document detection method and system
Gienapp et al. A large dataset of scientific text reuse in Open-Access publications
Lommatzsch et al. NewsImages: addressing the depiction gap with an online news dataset for text-image rematching
Nystrom et al. The future of digital legal history: no magic, no silver bullets
Williams et al. Classifying and ranking search engine results as potential sources of plagiarism
Sweidan et al. Autoregressive Feature Extraction with Topic Modeling for Aspect-based Sentiment Analysis of Arabic as a Low-resource Language

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees