TW201316186A - Chinese anti-piracy and plagiarism detecting system and its method - Google Patents
- Publication number
- TW201316186A (application TW100136908A)
- Authority
- TW
- Taiwan
- Prior art keywords
- article
- comparison
- search
- plagiarism
- component
- Prior art date
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
The present invention relates to a Chinese digital anti-plagiarism detection and comparison system and method. In particular, it makes use of a search engine: the article to be checked is split into sentences by a sentence-splitting algorithm, and each sentence is searched separately. When the summary of a search result matches the searched sentence, the corresponding web page is downloaded for full-text comparison, so that it can be quickly determined whether the article was plagiarized from that website.
In recent years the growth of the Internet has changed how students do their homework. Where students were once encouraged to consult online material, educators must now guard against its misuse, and Internet use has indeed led to widespread plagiarism. First, the Internet creates a setting that invites plagiarism, so that under situational and time pressure plagiarism comes easily. Second, a "shortcut" mentality pervades society: students care only about finishing assignments efficiently rather than about their quality, and academia likewise emphasizes publication volume over content quality. Third, the deviant behavior of plagiarism has become the norm, ghostwriting services have further degraded the academic environment, and dishonest plagiarism has worsened into the serious offense of academic fraud.
Because student plagiarism is increasingly serious, especially copying from online material or reworking it into patchwork articles that are re-pasted and rearranged, for-profit companies abroad have developed digital plagiarism-detection software. After long operation and testing, such software does reduce the incidence of plagiarism and deters students from taking chances. However, the test reports and literature for most detection systems come almost entirely from English-speaking countries, so the accumulated knowledge is limited to English and cannot be transplanted directly to Chinese. Chinese differs greatly from English in writing structure, word formation, word segmentation, and the use of punctuation, so a digital plagiarism comparison system and method for Chinese still needs to be developed for use in Chinese-language education.
A known Chinese plagiarism comparison system and method is Taiwan Invention Patent Publication No. I262402, "Feature extraction, data decryption method, and system and method for plagiarized-article search". It performs feature extraction on an article 10 in which a watermark has been embedded to obtain document features 20, feeds the obtained terms into a search engine 30, and checks an index database 40 to find articles 50 on the Internet that may be plagiarized. The retrieved articles 50 are then compared with the original, and watermark analysis is performed on the sentences obtained from the comparison. Finally, the extracted watermark information 60 is compared with the original watermark, and the result determines whether the retrieved article is plagiarized: if the comparison result exceeds a threshold, it is a plagiarized article 70.
The feature extraction method of that prior invention applies word segmentation and part-of-speech tagging to the sentences and terms taken from the watermarked article. Then, based on the words and sentence patterns in which the watermark is embedded, it uses a synonym database and a synonymous-sentence-pattern database to generate semantic-level features of the text; that is, it extracts the sentences and terms that carry the watermark. Using the terms and their parts of speech as the query definition, it searches the synonym database to obtain terms that can serve as features of the article. These terms are then used as keywords for a web search through a search engine to retrieve possibly plagiarized articles.
However, this prior system and method requires many cumbersome steps: embedding a watermark in the article, extracting features from the watermarked article, feeding the obtained terms into the search engine 30, checking the index database 40 for possibly plagiarized articles 50 on the Internet, comparing the retrieved articles 50 with the original, performing watermark analysis on the sentences obtained from the comparison, comparing the extracted watermark information 60 with the original watermark, and judging from the result whether the retrieved article is plagiarized. Although it can detect passages altered with synonyms and synonymous sentences, for ordinary papers or long articles this many complex steps place a heavy load on the server, which slows comparison and reduces efficiency.
Second, because the identical passages found by a comparison system may be a "bibliography" or a "direct quotation" that conforms to citation norms, even the most precise comparison system can hardly judge directly whether an article is plagiarized. Finally, the term "plagiarism" still lacks an objective and clear definition, and the number of identical characters is an important factor in judging it; yet the watermark range setting does not let users adjust flexibly according to their own subjective definition or the objective academic environment.
Therefore, to judge more efficiently whether an article is plagiarized, the comparison must first be made faster; the output of the plagiarism comparison system then serves as supporting evidence, and a final manual review makes the determination. In addition, users should be able to adjust the character-count definition of "plagiarism" to suit their own academic environment or needs. How to speed up comparison, how to let reviewers see clearly and quickly the differences between the suspected passages and the retrieved passages, and how to make the character-count definition adjustable are all matters that the prior plagiarized-article search system fails to address.
Accordingly, drawing on years of experience in developing related systems and researching related methods, the inventor studied current Chinese digital anti-plagiarism detection and comparison systems and methods and arrived at the present invention.
An object of the present invention is to provide a Chinese digital anti-plagiarism detection and comparison method. A user uploads the article to be checked to a central server for storage. The article is split according to splitting rules set in the system program, yielding split sentences of a fixed number of characters. A search engine then searches the split sentences one by one, returning multiple web pages or articles similar to them. These similar pages or articles are compared in full with the article, and the passages they share are marked and analyzed to produce a comparison result, which a reviewer can inspect manually to judge whether the article is plagiarized.
In the Chinese digital anti-plagiarism detection and comparison system of the present invention, the article access mechanism has a homepage element that gives users an entry point to the system, together with an upload element and a database element. When detection of an uploaded article begins, the article splitting mechanism applies the sentence-splitting algorithm: each article is first split at line breaks into multiple paragraphs; each paragraph is then cut, according to a user-defined split length, into fixed-length minimum detection sentences; the punctuation in each minimum detection sentence is removed; and a user-defined search length is used as a threshold to select the minimum detection sentences that qualify for searching. The qualifying minimum detection sentences are then registered and numbered for use by the search comparison mechanism. The search comparison mechanism has a comparison element, a judgment element, and a plagiarism-source comparison element. The comparison element and judgment element identify web pages or articles similar to parts of the split article; the plagiarism-source comparison element links to and downloads them; and the comparison element and judgment element then compare the article in full with each similar page or article. After this full-text comparison and judgment, the evaluation report mechanism analyzes each uploaded article and produces a comparison result.
In the system and method of the present invention, the central server is provided with a plurality of computing hosts. The central server assigns the articles uploaded by users to the computing hosts, which split the articles, download the web pages or articles similar to the split parts, and perform the full-text comparison. The comparison results are returned to the central server, which sends them on to the users.
So that the examiner may fully understand the features of the present invention, it is described below with reference to the drawings. FIG. 1 is a flowchart of the steps of the digital plagiarism comparison method of the present invention. First, the user uploads the article 10 to the central server 11 for storage. The article 10 is split by the sentence-splitting algorithm into the split article 12. The search engine 13 searches the minimum detection sentences of the split article 12 one by one, returning multiple web pages or articles 14 similar to parts of the split article 12. These similar web pages or articles 14 are downloaded and subjected to a full-text comparison 15 with the article 10, in which the passages shared by the article 10 and the similar web pages or articles 14 are marked and analyzed to produce the comparison result 16. A reviewer can inspect the comparison result 16 manually to judge further whether the article is plagiarized.
The Chinese digital anti-plagiarism detection and comparison system of the present invention comprises at least the following. The article access mechanism 2 has a homepage element 20 that gives users an entry point to the system, including the website homepage, authentication of user accounts and passwords, and a mechanism for recovering forgotten account passwords. Users can be classified as system administrators, teachers, students, or visitors, which simplifies user management and the assignment of permissions.
The article access mechanism 2 has an upload element 21 that provides a web page through which users upload the article 10 to the central server 11. The upload element 21 checks whether the format of the uploaded article 10 (Word or PDF) meets the system requirements, stores the uploaded article 10 in the database element 22 of the central server 11, and tracks the deadline so that the user can re-upload the article 10 before it expires.
The article access mechanism 2 has a database element 22, which continuously collects and stores digital material of all types from the Internet to enlarge the pool of sources available for comparison.
Article splitting mechanism 3: when detection of an uploaded article 10 begins, the article splitting mechanism 3 applies the sentence-splitting algorithm 30. Each article 10 is first split at line breaks 31 into multiple paragraphs 310. Each paragraph 310 is then cut, according to the user-defined split length 32, into fixed-length minimum detection sentences 320, and the punctuation in each minimum detection sentence 320 is removed. The user-defined search length 33 then serves as a threshold for selecting the minimum detection sentences 320 that qualify for searching: if fewer characters than the search length 33 remain in a minimum detection sentence 320 after its punctuation is removed, that sentence is not searched. The sentence-splitting algorithm 30 thus avoids both sentences too short to be worth searching and sentences too long to search easily. Finally, the qualifying minimum detection sentences 320 are registered and numbered for use by the search comparison mechanism 4 in detecting plagiarism.
As an example of the sentence-splitting algorithm 30, suppose a paragraph 310 obtained by splitting an article 10 at line breaks 31 is: 『他慢慢蹲下來,好了一點,好了一點。從略微的仰角,他看到街對面有個手拿氣球的奇怪女人正抬頭仰望天空,她像發現幽浮似地,嘴不由自主地張開來。』 ("He slowly crouched down; a little better, a little better. From a slight upward angle he saw a strange woman across the street holding a balloon and looking up at the sky; as if she had spotted a UFO, her mouth fell open involuntarily."). If the split length 32 is 15 characters, the paragraph 310 is cut into the fixed-length minimum detection sentences 320: 『(他慢慢蹲下來,好了一點,好了一)(點。從略微的仰角,他看到街對面)(有個手拿氣球的奇怪女人正抬頭仰)(望天空,她像發現幽浮似地,嘴不)(由自主地張開來。)』. Removing the punctuation from each minimum detection sentence 320 gives: 『(他慢慢蹲下來好了一點好了一)(點從略微的仰角他看到街對面)(有個手拿氣球的奇怪女人正抬頭仰)(望天空她像發現幽浮似地嘴不)(由自主地張開來)』. If the search length 33 is 8 characters, the sentence 「(由自主地張開來)」 has only 7 characters and fails the threshold, so the minimum detection sentences 320 finally registered, numbered, and submitted for searching are: 『(他慢慢蹲下來好了一點好了一)(點從略微的仰角他看到街對面)(有個手拿氣球的奇怪女人正抬頭仰)(望天空她像發現幽浮似地嘴不)』, 4 sentences in total.
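The splitting steps described above (split at line breaks, cut into fixed-length units, strip punctuation, filter by a minimum length) can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function names `split_article` and `is_punctuation` are hypothetical, and treating every character in a Unicode "P" category as punctuation is an assumption, since the text does not say which characters count as punctuation.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    # Assumption: any Unicode punctuation category ("P*") counts, which covers
    # both ASCII marks and CJK marks such as "。" and ",".
    return unicodedata.category(ch).startswith("P")


def split_article(text: str, split_len: int = 15, search_len: int = 8):
    """Return numbered minimum detection sentences for one article.

    split_len  -- the user-defined split length (15 in the worked example)
    search_len -- the user-defined search length used as a threshold (8 in the example)
    """
    units = []
    for paragraph in text.splitlines():          # step 1: split at line breaks
        for start in range(0, len(paragraph), split_len):
            chunk = paragraph[start:start + split_len]   # step 2: fixed-length cut
            stripped = "".join(ch for ch in chunk if not is_punctuation(ch))  # step 3
            if len(stripped) >= search_len:      # step 4: drop units that are too short
                units.append(stripped)
    # step 5: register the surviving units with sequential numbers
    return list(enumerate(units, start=1))
```

Applied to the example paragraph with a split length of 15 and a search length of 8, this sketch yields the same four numbered sentences as the worked example; the fifth unit is dropped because only 7 characters remain after its punctuation is removed.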
The search comparison mechanism 4 has a comparison element 40. The comparison element 40 submits the filtered minimum detection sentences 320 for comparison through an Internet search engine 13 (such as Google or Yahoo), electronic article databases, or other students' articles, and also performs the full-text comparison 15 between the article 10 and the similar web pages or articles 14.
The search comparison mechanism 4 has a judgment element 41. Based on the results from the comparison element 40, the judgment element 41 judges whether each sentence and each paragraph of the article 10 is plagiarized and expresses the likelihood of plagiarism as a percentage; it likewise evaluates the proportion of plagiarism found by the full-text comparison 15 between the article 10 and the similar web pages or articles 14. The search comparison mechanism 4 also has a plagiarism-source comparison element 42, which establishes the link between plagiarized text and its source, linking to and downloading the web pages or articles 14 similar to the article 10 for the full-text comparison 15.
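The text does not give a formula for the percentage reported by the judgment element. One plausible reading, shown here purely as an assumption, is the share of the article's characters covered by matched passages. The function name `plagiarism_percentage` and the `(start, length)` span format are hypothetical.

```python
def plagiarism_percentage(text_len: int, spans) -> float:
    """Percentage of characters covered by matched spans.

    spans is an iterable of (start, length) pairs; overlapping spans are
    counted only once. This coverage measure is an assumption, not a formula
    stated in the source.
    """
    if text_len <= 0:
        return 0.0
    covered = 0
    cursor = 0  # end of the region already counted
    for start, length in sorted(spans):
        end = min(start + length, text_len)
        start = max(start, cursor)   # skip the part that overlaps earlier spans
        if end > start:
            covered += end - start
            cursor = end
    return 100.0 * covered / text_len
```

The same function could be applied per sentence, per paragraph, or to the whole article by passing the corresponding length and spans.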
To obtain the web pages or articles 14 similar to the article 10, the present invention uses the search engine 13, comparison element 40, and judgment element 41 of the search comparison mechanism 4. Each minimum detection sentence 320 selected by the sentence-splitting algorithm 30 is searched separately. For each search result, the proportion of the common subsequence between the searched sentence and the summary of the search result is computed and compared with a set threshold. If the proportion exceeds the threshold, the plagiarism-source comparison element 42 links to and downloads that web page or article, yielding a web page or article 14 similar to part of the split article 12 for the subsequent full-text comparison 15.
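The common-subsequence check can be illustrated with a standard longest-common-subsequence computation. This is a sketch under assumptions: the source does not say how the proportion is normalized or what threshold is used, so dividing by the sentence length and the default threshold of 0.8 are both hypothetical, as are the function names.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def snippet_matches(sentence: str, snippet: str, threshold: float = 0.8) -> bool:
    """Decide whether a search-result summary is close enough to download the page.

    Assumption: the proportion is the LCS length divided by the length of the
    searched sentence; the threshold value is illustrative only.
    """
    if not sentence:
        return False
    return lcs_length(sentence, snippet) / len(sentence) >= threshold
```

A subsequence tolerates small insertions in the summary (for example an ellipsis or highlighting markers added by the search engine), which may be why a subsequence rather than an exact substring match is used at this stage.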
For the full-text comparison 15, the comparison element 40 of the search comparison mechanism 4 concatenates the article 10 and the similar web page or article 14 directly, joined by a meaningless character, processes the result with a suffix array data structure, and then applies the data partitioning (DP) technique to obtain every sentence that is locally longest in the article 10 and also appears in the similar web page or article 14. The judgment element 41 then evaluates the proportion of plagiarism found by the full-text comparison 15. Because identical passages may be a "bibliography" or a properly formatted "direct quotation", and to make manual review easier, the search comparison mechanism 4 highlights the passages shared by the article 10 and the similar web page or article 14 in yellow in both documents.
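The suffix-array step can be sketched as below. The sketch joins the two texts with a separator character, sorts the suffixes, and uses the longest common prefixes of neighboring suffixes to find, for each position in the article, the longest passage that also occurs in the source; it then keeps only the locally longest passages. Several points are assumptions: the naive suffix sort (adequate only for short texts), the `"\x00"` separator (assumed absent from both texts), the `min_len` cutoff, and all function names. The "data partitioning" step mentioned in the text is not specified, so it is not modeled here.

```python
def build_suffix_array(s: str):
    # Naive construction: sort suffix start positions by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])


def common_prefix_len(s: str, i: int, j: int) -> int:
    n = 0
    while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n]:
        n += 1
    return n


def longest_matches(article: str, source: str, min_len: int = 5, sentinel: str = "\x00"):
    """Return (start, text) for each locally longest article passage found in source."""
    s = article + sentinel + source
    n = len(article)
    sa = build_suffix_array(s)
    # lcp[k] = common prefix length of the suffixes at sa[k-1] and sa[k]
    lcp = [0] + [common_prefix_len(s, sa[k - 1], sa[k]) for k in range(1, len(sa))]
    best = [0] * n   # best[i] = longest prefix of article[i:] occurring in source
    inf = len(s)

    # Forward pass: compare each article suffix with the nearest source suffix before it.
    running, seen = inf, False
    for k in range(len(sa)):
        running = min(running, lcp[k])
        if sa[k] > n:                       # suffix starts inside the source text
            running, seen = inf, True
        elif sa[k] < n and seen:            # suffix starts inside the article
            best[sa[k]] = max(best[sa[k]], running)

    # Backward pass: same comparison with the nearest source suffix after it.
    running, seen = inf, False
    for k in range(len(sa) - 1, -1, -1):
        if sa[k] > n:
            running, seen = inf, True
        elif sa[k] < n and seen:
            best[sa[k]] = max(best[sa[k]], running)
        running = min(running, lcp[k])

    # Keep a match only if it is not contained in a longer match starting one
    # character earlier (that is, it is locally longest).
    return [
        (i, article[i:i + best[i]])
        for i in range(n)
        if best[i] >= min_len and (i == 0 or best[i - 1] <= best[i])
    ]
```

The returned start offsets and lengths are what a highlighting step would need in order to mark the shared passages in the article.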
The evaluation report mechanism 5 is the last stage of detection. After the full-text comparison and judgment of the article 10 against the similar web pages or articles 14, the evaluation report mechanism 5 analyzes each uploaded article 10 and produces the comparison result 16.
FIG. 2 shows an embodiment of the Chinese digital anti-plagiarism detection and comparison system of the present invention. Each user 6 uploads an article 10 to the central server 11 through the article access mechanism 2 for storage. The central server 11 assigns the articles 10 uploaded by the users 6 to the computing hosts 110. A computing host 110 splits the article 10 with the article splitting mechanism 3 to obtain the split article 12. Through the search comparison mechanism 4, the search engine 13 searches the sentences of the split article 12 one by one, and the web pages or articles 14 similar to parts of the split article 12 are downloaded to the computing host 110 for the full-text comparison 15 with the article 10. The evaluation report mechanism 5 analyzes each uploaded article 10, and once the comparison result 16 is produced it is returned to the central server 11, where the user 6 can view it.
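The division of labor between the central server and the computing hosts can be sketched with a worker pool. This is a simplified illustration only: a thread pool stands in for separate computing hosts, and `analyze_article` is a hypothetical placeholder for the split, search, download, and full-text comparison pipeline, which the sketch does not implement.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def analyze_article(article: str) -> dict:
    # Hypothetical placeholder for the work one computing host performs:
    # split the article, search each unit, download similar pages, compare in full.
    return {"length": len(article), "matches": []}


def dispatch(articles: Dict[str, str], num_hosts: int = 4) -> Dict[str, dict]:
    """Assign each user's article to a worker and collect the results per user."""
    with ThreadPoolExecutor(max_workers=num_hosts) as pool:
        futures = {user: pool.submit(analyze_article, text)
                   for user, text in articles.items()}
        # The central server waits for each result and keys it by user.
        return {user: future.result() for user, future in futures.items()}
```

In a real deployment the workers would be separate machines reached over the network; the pool is used here only to show the fan-out and collect pattern.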
The division of labor between the central server 11 and the computing hosts 110 thus raises the search and comparison speed of the system; the article splitting mechanism 3 and search comparison mechanism 4 speed up comparison of the article 10; and the passages shared by the article 10 and the similar web pages 14 are highlighted in yellow in both the article and the web pages. The user 6 can therefore quickly obtain the comparison result 16 for the article 10 and use it in a manual review to judge whether the article is plagiarized. The foregoing constitutes the present invention.
10 ... article to be checked
11 ... central server
12 ... split article
13 ... search engine
14 ... similar web page or article
15 ... full-text comparison
16 ... comparison result
2 ... article access mechanism
20 ... homepage element
21 ... upload element
22 ... database element
3 ... article splitting mechanism
30 ... sentence-splitting algorithm
310 ... paragraph
32 ... split length
320 ... minimum detection sentence
33 ... search length
4 ... search comparison mechanism
40 ... comparison element
41 ... judgment element
42 ... plagiarism-source comparison element
5 ... evaluation report mechanism
6 ... user
110 ... computing host
FIG. 1 is a flowchart of the steps of the Chinese digital anti-plagiarism detection and comparison method of the present invention.
FIG. 2 shows an embodiment of the present invention.
Claims (8)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW100136908A TWI444838B (en) | 2011-10-12 | 2011-10-12 | Chinese anti-piracy and plagiarism detecting system and its method |
CN2012102585167A CN103049467A (en) | 2011-10-12 | 2012-07-24 | Chinese digital anti-plagiarism detection and comparison system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW100136908A TWI444838B (en) | 2011-10-12 | 2011-10-12 | Chinese anti-piracy and plagiarism detecting system and its method |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201316186A (en) | 2013-04-16 |
TWI444838B (en) | 2014-07-11 |
Family
ID=48062110
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW100136908A TWI444838B (en) | 2011-10-12 | 2011-10-12 | Chinese anti-piracy and plagiarism detecting system and its method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN103049467A (en) |
TW (1) | TWI444838B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103345466B (en) * | 2013-07-12 | 2016-09-07 | 唐煜舟 | A kind of scientific paper information detecting method based on internet free information |
CN103412905A (en) * | 2013-07-31 | 2013-11-27 | 广联达软件股份有限公司 | PDF (Portable document format) file comparison method and system |
CN103412904A (en) * | 2013-07-31 | 2013-11-27 | 广联达软件股份有限公司 | PDF (portable document format) file comparison method and PDF file comparison system |
CN106649871B (en) * | 2017-01-03 | 2019-10-25 | 广州爱九游信息技术有限公司 | Detection method, device and the calculating equipment of article multiplicity |
CN109710834B (en) * | 2018-11-16 | 2020-01-10 | 北京字节跳动网络技术有限公司 | Similar webpage detection method and device, storage medium and electronic equipment |
CN111611787A (en) * | 2019-02-25 | 2020-09-01 | 中国海洋大学 | Plagiarism evaluation method, system and auxiliary writing system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100406671B1 (en) * | 2000-07-24 | 2003-11-21 | 주식회사 유니마이다스 | Method of searching for piracy and steal on a piece of writing |
US7503035B2 (en) * | 2003-11-25 | 2009-03-10 | Software Analysis And Forensic Engineering Corp. | Software tool for detecting plagiarism in computer source code |
CN101334789A (en) * | 2008-08-04 | 2008-12-31 | 福建师范大学 | Device for identifying document plagiarism by search engine |
CN101404037B (en) * | 2008-11-18 | 2011-05-18 | 西安交通大学 | Method for detecting and positioning electronic text contents plagiary |
CN101957809A (en) * | 2010-10-14 | 2011-01-26 | 传神联合(北京)信息技术有限公司 | Anti-plagiarism method |
-
2011
- 2011-10-12 TW TW100136908A patent/TWI444838B/en not_active IP Right Cessation
-
2012
- 2012-07-24 CN CN2012102585167A patent/CN103049467A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
TWI444838B (en) | 2014-07-11 |
CN103049467A (en) | 2013-04-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ali et al. | Overview and comparison of plagiarism detection tools. | |
Saier et al. | unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata | |
CN109299865B (en) | Psychological evaluation system and method based on semantic analysis and information data processing terminal | |
White et al. | Sentence-based natural language plagiarism detection | |
US9514417B2 (en) | Cloud-based plagiarism detection system performing predicting based on classified feature vectors | |
TWI444838B (en) | Chinese anti-piracy and plagiarism detecting system and its method | |
CN105701076B (en) | A kind of paper plagiarizes detection method and system | |
Ali et al. | Survey of plagiarism detection methods | |
El Moatez Billah Nagoudi et al. | 2L-APD: A two-level plagiarism detection system for Arabic documents | |
Stamatatos et al. | Plagiarism and authorship analysis: introduction to the special issue | |
Brugman et al. | Nederlab: Towards a single portal and research environment for diachronic Dutch text corpora | |
KR102280490B1 (en) | Training data construction method for automatically generating training data for artificial intelligence model for counseling intention classification | |
Elhadi et al. | Use of text syntactical structures in detection of document duplicates | |
CN105701085B (en) | A kind of network duplicate checking method and system | |
Alohaly et al. | Better privacy indicators: a new approach to quantification of privacy policies | |
Jiffriya et al. | Plagiarism detection tools and techniques: A comprehensive survey | |
Panchenko et al. | Detection of child sexual abuse media on p2p networks: Normalization and classification of associated filenames | |
Pera et al. | SimPaD: A word-similarity sentence-based plagiarism detection tool on Web documents | |
Ruppert et al. | LawStats–Large-Scale German Court Decision Evaluation Using Web Service Classifiers | |
CN105701086B (en) | A kind of sliding window document detection method and system | |
Gienapp et al. | A large dataset of scientific text reuse in Open-Access publications | |
Lommatzsch et al. | NewsImages: addressing the depiction gap with an online news dataset for text-image rematching | |
Nystrom et al. | The future of digital legal history: no magic, no silver bullets | |
Williams et al. | Classifying and ranking search engine results as potential sources of plagiarism | |
Sweidan et al. | Autoregressive Feature Extraction with Topic Modeling for Aspect-based Sentiment Analysis of Arabic as a Low-resource Language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |