TWI594135B

TWI594135B - Plagiarism detecting method of information in english

Info

Publication number: TWI594135B
Application number: TW102102093A
Authority: TW
Inventors: 蘇嘉穎; 王惠嘉; 劉繼仁; 羅鄉儀; 林柏安
Original assignee: 國立成功大學
Priority date: 2013-01-18
Filing date: 2013-01-18
Publication date: 2017-08-01
Also published as: TW201430591A

Description

Plagiarism detection method for English materials

The present invention relates to a plagiarism detection method for English documents.

As demand for advanced degrees in Taiwan has grown and higher education has become widespread, the volume of student papers and reports has risen sharply. Most major academic research is now published in English, yet writing in English is not easy for Taiwanese students. Because students are also under time pressure to pass examinations or obtain degrees, plagiarism is becoming more common. From an academic standpoint, plagiarism is disrespectful to the original author and does nothing to advance professional knowledge.

However, current plagiarism detection software (for example Turnitin, CopyCatch, or EVE2) mostly compares a submission against thesis databases or web resources, and does so purely on the basis of single words. Synonyms therefore often go undetected, which lowers the accuracy of plagiarism detection.

In view of the above, an object of the present invention is to provide a plagiarism detection method for English documents that not only uses syntactic structure analysis to extract the words and phrases in each sentence, but also uses the phrase as the unit of comparison for plagiarism detection, supplemented by semantics, so as to improve on the accuracy of the single-word comparison used in the prior art.

To achieve the above object, a plagiarism detection method for English documents according to the present invention is implemented on a computer and comprises a syntax processing procedure, a phrase recognition procedure, and a similarity comparison procedure. The syntax processing procedure includes a syntactic analysis step, which parses a source document and a user document separately to obtain source word data and source phrase data, and user word data and user phrase data, respectively. The phrase recognition procedure includes a set generation step, which compares the source word data with the user word data and the source phrase data with the user phrase data to generate aggregated word data, and compares the source phrase data with the user phrase data to generate aggregated phrase data. The similarity comparison procedure performs a semantic similarity comparison on the aggregated word data and the aggregated phrase data, and uses their semantic similarity and order similarity to calculate the degree to which the user document plagiarizes the source document.

In one embodiment, the source word data and the user word data each contain at least one word, and the source phrase data and the user phrase data each contain at least one phrase.

In one embodiment, the syntactic analysis step uses a natural language processing technique.

In one embodiment, the syntax processing procedure further includes a sentence segmentation step, a complete plagiarism detection step, and a lemmatization step.

In one embodiment, in the syntactic analysis step, the source document and the user document are compared against a dictionary to perform part-of-speech tagging and phrase extraction.

In one embodiment, the words in the aggregated word data correspond to words in the dictionary.

In one embodiment, the phrase recognition procedure uses the phrase as the unit of comparison.

In one embodiment, the aggregated word data contains all lemmatized words from the source document and the user document, and the aggregated phrase data contains the phrases that are identical in the source document and the user document.

In one embodiment, the semantic similarity and the order similarity are calculated with the phrase as the unit of comparison.

In one embodiment, the similarity comparison procedure separately calculates the word-to-word, word-to-phrase, and phrase-to-phrase semantic similarities within the aggregated word data and the aggregated phrase data.

In one embodiment, the word-to-word semantic similarity is calculated separately according to the part of speech of the words.

In one embodiment, when the word-to-word semantic similarity is calculated, all words in the aggregated word data are compared with one another to find a maximum word-to-word similarity.

In one embodiment, the word-to-phrase and phrase-to-phrase semantic similarities are each calculated on a word basis.

In one embodiment, when the word-to-phrase semantic similarity is calculated, each word in the aggregated word data is compared with each word in the aggregated phrase data to find a maximum word-to-phrase similarity.

In one embodiment, when the phrase-to-phrase semantic similarity is calculated, all words in the aggregated phrase data are compared with one another to find a maximum phrase-to-phrase similarity.

In one embodiment, the semantic similarity and the order similarity are calculated from the maximum word-to-word similarity, the maximum word-to-phrase similarity, and the maximum phrase-to-phrase similarity, respectively.

In one embodiment, when the degree of plagiarism exceeds a threshold, the user document is deemed to contain plagiarism.

In summary, the plagiarism detection method for English documents according to the present invention comprises a syntax processing procedure, a phrase recognition procedure, and a similarity comparison procedure. The source document and the user document are parsed separately, the source word data is compared with the user word data and the source phrase data with the user phrase data, and a semantic similarity comparison is then performed, from which the degree to which the user document plagiarizes the source document can be calculated. Compared with the prior art, the present invention therefore not only uses syntactic structure analysis to extract the words and phrases in each sentence, but also uses the phrase as the unit of comparison for plagiarism detection, supplemented by semantics, and so improves on the accuracy of the single-word comparison used in the prior art.

P01‧‧‧Sentence segmentation step

P02‧‧‧Complete plagiarism detection step

P03‧‧‧Syntactic analysis step

P04‧‧‧Lemmatization step

S01‧‧‧Syntax processing procedure

S02‧‧‧Phrase recognition procedure

S03‧‧‧Similarity comparison procedure

FIG. 1 is a flowchart of the procedures of a plagiarism detection method for English documents according to a preferred embodiment of the present invention.

FIG. 2 is a flowchart of the steps of the syntax processing procedure in FIG. 1.

A plagiarism detection method for English documents according to a preferred embodiment of the present invention is described below with reference to the drawings, in which identical elements are denoted by identical reference numerals.

Please refer to FIG. 1, which is a flowchart of the procedures of a plagiarism detection method for English documents according to a preferred embodiment of the present invention.

The plagiarism detection method for English documents of the present invention may be implemented on a computer (for example, but not limited to, as a software program). The method comprises a syntax processing procedure S01, a phrase recognition procedure S02, and a similarity comparison procedure S03.

First, the syntax processing procedure S01 performs syntax processing on a source document and a user document. The source document may comprise a plurality of documents, each containing at least one sentence, and the user document may likewise contain at least one sentence. Here, the source document refers to a document in a database, which may be drawn from, for example but not limited to, a thesis database, web resources, or English documents obtained by other means. The user document refers to the document written by the user that is to be checked for plagiarism of the source document.

The syntax processing procedure S01 splits the content of the source document and of the user document into sentences, extracts all words and phrases in each sentence through syntactic structure analysis, and performs part-of-speech tagging and lemmatization. The present invention uses natural language processing (NLP) techniques for syntax processing and analysis, carried out with the Stanford Parser, a grammar parsing tool developed by The Stanford Natural Language Processing Group at Stanford University.

Please refer to FIG. 2, which is a flowchart of the steps of the syntax processing procedure S01 in FIG. 1. The syntax processing procedure S01 may include a sentence segmentation step P01, a complete plagiarism detection step P02, a syntactic analysis step P03, and a lemmatization step P04.

First, the sentence segmentation step P01 is performed.

In the sentence segmentation step P01, the Stanford Parser is used to split the source document and the user document into individual sentences, and grammatical structures such as phrases, word relations, and clauses are identified from the parse tree.

Next, the complete plagiarism detection step P02 is performed.

In the complete plagiarism detection step P02, sentences in the user document that are copied verbatim from the source document are detected. In other words, the sentences of the user document are compared with those of the source document to identify the fully copied portions first, which reduces the complexity of subsequent calculation and processing.
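The exact-match pass of step P02 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `normalize` and `find_exact_copies` are hypothetical, and ignoring letter case and extra whitespace is an assumption that the text does not specify.

```python
def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace (an assumed notion of 'identical')."""
    return " ".join(sentence.lower().split())


def find_exact_copies(source_sentences, user_sentences):
    """Return the user sentences that also occur verbatim in the source."""
    source_set = {normalize(s) for s in source_sentences}
    return [s for s in user_sentences if normalize(s) in source_set]


source = ["John has turned on his new radio.", "It rained all day."]
user = ["It  rained all day.", "John switched the new radio on."]
copied = find_exact_copies(source, user)
```

Sentences returned here could be reported as fully copied and excluded from the costlier semantic comparison that follows.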

Next, the syntactic analysis step P03 is performed.

In the syntactic analysis step P03, the source document and the user document are parsed separately to obtain source word data and source phrase data, and user word data and user phrase data, respectively. Syntactic analysis (also called parsing) uses a computer to analyze the grammatical rules and structure of a sentence and expands the result as a tree. The Stanford Parser is also used to compare the source document and the user document against an English lexical database (for example WordNet) in order to tag the part of speech of each word and to extract phrases. In other words, the syntactic analysis step P03 takes the sentences that remain after sentence segmentation and complete plagiarism detection, and uses the tree structure and dependency relations produced by the Stanford Parser, with WordNet as an auxiliary tool, to extract all words and phrases from the source document and the user document. For part-of-speech tagging, the tags produced by the Stanford Parser are converted into a form readable by WordNet, which covers four parts of speech: nouns, verbs, adjectives, and adverbs. For phrase extraction, the phrase serves as the unit of comparison for plagiarism detection, supplemented by semantics, and all phrases are extracted using the tree structure and the dependency relations produced by the Stanford Parser.

Finally, the lemmatization step P04 is performed.

In the lemmatization step P04, the words and phrases extracted in step P03 are lemmatized using WordNet, so that the same word is not misjudged because of differences in number or tense. After these four steps, the syntax processing procedure S01 is complete, yielding the source word data and source phrase data, and the user word data and user phrase data. Here, the source word data and the user word data may each contain at least one word, and the source phrase data and the user phrase data each contain at least one phrase (that is, a meaningful combination of English words).

The syntax processing procedure S01 is illustrated below with actual English sentences. To simplify the description, the source document is taken to be a single sentence, for example "John has turned on his new radio.", and the user document is likewise a single sentence, for example "John switched the new radio on.".

After part-of-speech tagging and syntactic analysis with the Stanford Parser, and extraction of all words and phrases, the results are: S₁₁={John,has,turned on,his new radio}, tree_p₁₁={turned on,his new radio}, T₁₁={John,has}; and S₁={John,switched on,the new radio}, tree_p₁={the new radio}, dep_p₁={switched on}, T₁={John}. Here, S_ni is the i-th sentence of the n-th source document, tree_p_ni is the set of phrases extracted from the tree structure of the i-th sentence of the n-th source document, and T_ni is the set of words of the i-th sentence of the n-th source document. S_j is the j-th sentence of the user document, tree_p_j is the set of phrases extracted from the tree structure of the j-th sentence of the user document, dep_p_j is the set of phrases extracted from the typed dependencies of the j-th sentence of the user document, and T_j is the set of words of the j-th sentence of the user document.

Lemmatization then gives: S₁₁={John,have,turn on,his new radio}, tree_p₁₁={turn on,his new radio}, T₁₁={John,have}; and S₁={John,switch_on,the new radio}, tree_p₁={the new radio}, dep_p₁={switch_on}, T₁={John}. By definition here, the source word data is T₁₁, the source phrase data is tree_p₁₁, the user word data is T₁, and the user phrase data is tree_p₁ together with dep_p₁.

Next, referring again to FIG. 1, the phrase recognition procedure S02 is performed. The phrase recognition procedure S02 includes a set generation step, which compares the source word data with the user word data and the source phrase data with the user phrase data to generate aggregated word data, and also compares the source phrase data with the user phrase data to generate aggregated phrase data. The phrase recognition procedure S02 likewise uses the phrase as the unit of comparison.

In the preceding procedure, all phrases in the user document were extracted as tree_p₁ and dep_p₁. The phrases in dep_p₁ have already been recognized through WordNet, whereas those in tree_p₁ are taken purely from the tree structure and are not yet meaningful for plagiarism comparison. The sentences of the source document and the user document must therefore be compared to filter out the meaningful phrases, after which the word sets and phrase sets of the two sentences are combined to produce the aggregated word data and the aggregated phrase data.

The process is as follows. Qualified phrases are first selected by comparing the sentences and by querying a machine-readable dictionary (namely WordNet). Phrases that do not qualify are decomposed, and the resulting sub-phrases are screened again, until every sub-phrase has either passed screening or been decomposed into single words. Specifically, phrases of the source document and the user document that cannot be paired (that is, are not identical) are decomposed. Words obtained from the decomposition that can be recognized are placed in the aggregated word data. Sub-phrases obtained from the decomposition are paired again: those that pair successfully are placed in the aggregated phrase data, and those that do not are looked up in WordNet. A sub-phrase found in WordNet is placed in the aggregated word data; one that is not found is decomposed further. The phrase recognition procedure S02 is complete when none of the phrases in the aggregated phrase data can be found in WordNet. The aggregated word data contains all lemmatized words of the source document and the user document, and the aggregated phrase data contains at least the phrases that are identical in the source document and the user document. The words in the aggregated word data correspond to words in the WordNet dictionary, where "correspond" means that the word can be found in the WordNet dictionary and is a recognizable or meaningful word.
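The pair, look up, decompose loop described above can be sketched in Python. This is only an illustration under several assumptions that the text does not confirm: a plain set named `lexicon` stands in for the WordNet lookup, a phrase is decomposed by splitting off its first word, every single word left over is accepted into the word data, and the names `multiword_suffixes` and `recognize_phrases` are hypothetical.

```python
def multiword_suffixes(phrases):
    """All suffixes of two or more words, e.g. 'the new radio' also yields 'new radio'."""
    out = set()
    for phrase in phrases:
        tokens = phrase.split()
        for start in range(len(tokens) - 1):
            out.add(" ".join(tokens[start:]))
    return out


def recognize_phrases(source_words, source_phrases, user_words, user_phrases, lexicon):
    """Return (aggregated word data, aggregated phrase data) as ordered lists."""
    words, phrases = [], []

    def process(phrase, other_side):
        tokens = phrase.split()
        if len(tokens) == 1:
            words.append(phrase)                 # decomposed down to a single word
        elif phrase in other_side:
            phrases.append(phrase)               # pairs with the other sentence
        elif phrase in lexicon:
            words.append("_".join(tokens))       # dictionary entry, treated as one word
        else:
            words.append(tokens[0])              # assumed decomposition rule
            process(" ".join(tokens[1:]), other_side)

    source_side = multiword_suffixes(source_phrases)
    user_side = multiword_suffixes(user_phrases)
    words.extend(source_words)
    for p in source_phrases:
        process(p, user_side)
    words.extend(user_words)
    for p in user_phrases:
        process(p, source_side)
    return words, phrases


words, phrases = recognize_phrases(
    ["John", "have"], ["turn on", "his new radio"],
    ["John"], ["switch on", "the new radio"],
    lexicon={"turn on", "switch on"},  # hypothetical stand-in for WordNet
)
```

With the lemmatized example sentences as input, this sketch places "new radio" in the phrase data once per sentence and moves "turn_on", "switch_on", "his", and "the" into the word data.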

Continuing with the same English sentences for the source document and the user document, the syntax processing procedure S01 gives: S₁₁={John,have,turn on,his new radio}, tree_p₁₁={turn on,his new radio}, T₁₁={John,have}; and S₁={John,switch_on,the new radio}, tree_p₁={the new radio}, dep_p₁={switch_on}, T₁={John}.

After the comparison, screening, and decomposition of the phrase recognition procedure S02, the results are: S₁₁={John,have,turn_on,his,new radio}, S₁={John,switch_on,the,new radio}, aggregated word data={John,have,turn_on,his,John,switch_on,the}, and aggregated phrase data={new radio,new radio}.

Finally, referring again to FIG. 1, the similarity comparison procedure S03 is performed.

The similarity comparison procedure S03 performs a semantic similarity comparison based on the aggregated word data and the aggregated phrase data, and uses their semantic similarity and order similarity to calculate the degree to which the user document plagiarizes the source document. The semantic similarity and the order similarity are calculated with the phrase as the unit of comparison. The word-to-word, word-to-phrase, and phrase-to-phrase semantic similarities within the aggregated word data and the aggregated phrase data are calculated separately. The semantic similarity and the order similarity are in turn calculated from the maximum word-to-word similarity, the maximum word-to-phrase similarity, and the maximum phrase-to-phrase similarity, each of which is described below.

The semantic similarity between one word w and another is calculated according to the words' parts of speech. All words w in the aggregated word data are compared with one another to find a maximum word-to-word similarity. In WordNet, nouns and verbs are organized into a hierarchy through hypernym and hyponym relations, whereas adjectives and adverbs cannot be represented hierarchically. When calculating the semantic similarity between words, the parts of speech of the two words must therefore be determined first, and a different calculation method applied accordingly. If both words are nouns or verbs, a path-based measure is used; if both are adjectives or adverbs, a gloss-based measure is used.
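The choice of measure by part of speech can be sketched as a small Python dispatcher. This is a hedged illustration: the function name `word_similarity`, the WordNet-style tags "n", "v", "a", and "r", and the two callables passed in are all hypothetical, and returning 0 for words whose parts of speech differ is an assumption rather than a rule stated in the text.

```python
def word_similarity(w1, pos1, w2, pos2, path_based, gloss_based):
    """Pick a similarity measure from the parts of speech of the two words."""
    if pos1 != pos2:
        return 0.0                      # assumption: mixed parts of speech score 0
    if pos1 in ("n", "v"):
        return path_based(w1, w2)       # nouns and verbs form a hierarchy
    if pos1 in ("a", "r"):
        return gloss_based(w1, w2)      # adjectives and adverbs have no hierarchy
    return 0.0


# Placeholder measures that return fixed values, used only to show the dispatch.
noun_score = word_similarity("radio", "n", "receiver", "n", lambda a, b: 0.5, lambda a, b: 0.9)
adj_score = word_similarity("new", "a", "fresh", "a", lambda a, b: 0.5, lambda a, b: 0.9)
mixed_score = word_similarity("John", "n", "new", "a", lambda a, b: 0.5, lambda a, b: 0.9)
```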

The present invention builds on the hierarchical structure of nouns and verbs in WordNet and combines the PATH (Rada et al., 1989) and WUP (Wu & Palmer, 1994) methods to calculate the semantic similarity of nouns or verbs, while adjectives or adverbs use VECTOR (Patwardhan, 2003). The full text of the PATH, WUP, and VECTOR papers is accordingly incorporated into this disclosure. The details and equations are as follows.

The PATH method calculates similarity from the length of the shortest path between the two words in WordNet, as shown in Equation 3-1:
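Equation 3-1 itself is not reproduced in this text. As a hedged illustration only, a path-based measure is commonly written as the reciprocal of the shortest path length, where len(w₁,w₂) is assumed to count the nodes on the shortest path between the two words, so that identical words score 1. Whether Equation 3-1 uses exactly this normalization is an assumption.

```latex
\mathrm{sim}_{\mathrm{PATH}}(w_1, w_2) = \frac{1}{\mathrm{len}(w_1, w_2)}
```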

The WUP method calculates word similarity from the depths of the two words and the depth of their lowest common subsumer (LCS, their nearest common parent node) in WordNet, as shown in Equation 3-2:
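The Wu and Palmer measure is usually stated as twice the depth of the LCS divided by the sum of the depths of the two words. The Python sketch below applies that usual form to a tiny hand-made hierarchy. The hierarchy `PARENT`, the function names, and the convention that the root has depth 1 are all hypothetical choices for illustration, and Equation 3-2 may differ in detail.

```python
# Hypothetical toy is-a hierarchy: each node maps to its parent (root maps to None).
PARENT = {
    "entity": None,
    "device": "entity",
    "person": "entity",
    "radio": "device",
    "television": "device",
}


def ancestors(node):
    """Chain from the node up to the root, starting with the node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Number of nodes from the root down to this node (root has depth 1)."""
    return len(ancestors(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of both a and b."""
    above_b = set(ancestors(b))
    for node in ancestors(a):
        if node in above_b:
            return node
    raise ValueError("nodes do not share a root")


def wup_similarity(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


close_pair = wup_similarity("radio", "television")  # LCS is "device"
far_pair = wup_similarity("radio", "person")        # LCS is "entity"
```

In this toy hierarchy the two devices score higher than a device and a person, because their common parent sits deeper in the tree.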

The VECTOR method builds vectors v1 and v2 from the glosses of the two words w₁ and w₂ in WordNet and then uses the cosine to calculate the similarity of the words, as shown in Equation 3-3:
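Equation 3-3 is not reproduced in this text. The cosine of two vectors has a standard definition, so, assuming v1 and v2 are the gloss vectors of w₁ and w₂ as described, the measure can be written as follows. How the gloss vectors themselves are built is not specified here.

```latex
\mathrm{sim}_{\mathrm{VECTOR}}(w_1, w_2) = \cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
```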

Overall, therefore, the semantic similarity between words (referred to below as cases A, B, and C) can be calculated as shown in Equations 3-4 to 3-6, where len(w₁) is the length of the word w₁ in the aggregated word data:

The word-to-phrase semantic similarity is likewise calculated on a word basis. Each word in the aggregated word data is compared with each word in the aggregated phrase data to find a maximum word-to-phrase similarity. Because a phrase p in the aggregated phrase data arises in the phrase recognition procedure S02 purely from the comparison of the two sentences and was not found in WordNet, every word of every phrase p in the aggregated phrase data must be compared one by one to find the maximum similarity.

Let the phrase p={p_t₁,p_t₂,...,p_t_N,...}, where p_t_N is the N-th word of the phrase p. The similarity between a word and a phrase (referred to below as cases D and E) can then be calculated as shown in Equations 3-7 to 3-10:

$$\mathrm{sim}(w, p) = \max_{N}\ \mathrm{sim}(w,\ p\_t_N) \tag{3-7}$$
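Equation 3-7 can be sketched in Python as a maximum over the words of the phrase. The function name `word_phrase_similarity`, the `word_sim` callable, and the toy score table are hypothetical. Only Equation 3-7 is modeled; Equations 3-8 to 3-10 are not reproduced in this text and are not covered.

```python
def word_phrase_similarity(word, phrase_words, word_sim):
    """Largest similarity between `word` and any word of the phrase (Equation 3-7)."""
    return max((word_sim(word, t) for t in phrase_words), default=0.0)


# Hypothetical scores, used only to exercise the function.
TOY_SCORES = {("John", "new"): 0.0, ("John", "radio"): 0.2}


def toy_sim(a, b):
    return TOY_SCORES.get((a, b), 0.0)


score = word_phrase_similarity("John", ["new", "radio"], toy_sim)
```

With these made-up scores, the word-to-phrase similarity of "John" and "new radio" reduces to the similarity of "John" and "radio".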

The phrase-to-phrase semantic similarity is also calculated on a word basis. All words in the aggregated phrase data are compared with one another to find a maximum phrase-to-phrase similarity. Because the phrases p₁ and p₂ in the aggregated phrase data arise in the phrase recognition procedure S02 purely from the comparison of the two sentences and were not found in WordNet, every word of every phrase in the aggregated phrase data must be compared one by one to find the maximum similarity.

Let the phrase p₁={p₁_t₁,p₁_t₂,...,p₁_t_M,...}, where p₁_t_M is the M-th word of the phrase p₁, and the phrase p₂={p₂_t₁,p₂_t₂,...,p₂_t_N,...}, where p₂_t_N is the N-th word of the phrase p₂. The similarity between a phrase and a phrase (referred to below as case F) is then calculated as shown in Equations 3-11 to 3-14, where len(p₁) is the length of the phrase p₁:
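The pairwise search described for case F can be sketched in Python as a maximum over all word pairs drawn from the two phrases. This is a simplification: Equations 3-11 to 3-14 are not reproduced in this text, and any role that len(p₁) plays in them is not modeled. The names `phrase_phrase_similarity` and `exact_match` are hypothetical.

```python
from itertools import product


def phrase_phrase_similarity(phrase1_words, phrase2_words, word_sim):
    """Largest similarity over every pair of words taken from the two phrases."""
    pairs = product(phrase1_words, phrase2_words)
    return max((word_sim(a, b) for a, b in pairs), default=0.0)


def exact_match(a, b):
    """Placeholder word measure: 1.0 for identical words, otherwise 0.0."""
    return 1.0 if a == b else 0.0


same_head = phrase_phrase_similarity(["new", "radio"], ["old", "radio"], exact_match)
no_overlap = phrase_phrase_similarity(["new", "radio"], ["old", "car"], exact_match)
```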

Next, the aggregated word data and the aggregated phrase data are sorted in order and duplicate items are removed to form a union U, which is then compared with the source document and the user document for semantics and for order.

Let the union be U={JTP₁,JTP₂,...,JTP_f,...,JTP_F}, S_ni={TP₁,TP₂,...,TP_g,...,TP_G}, and S_j={TP'₁,TP'₂,...,TP'_h,...,TP'_H}. Following the sentence similarity calculation defined by Li et al. (2006), which accounts for both synonym substitution and changes in word order, the semantic similarity and the order similarity are combined to calculate the overall similarity of S_ni and S_j. The calculation proceeds as follows. Step 1, calculating the sentence semantic similarity: every JTP_f in the union U is compared with every TP_g in the sentence S_ni:

$$sem_1 = \{\max_{g}\ \mathrm{sim}(JTP_1, TP_g),\ \max_{g}\ \mathrm{sim}(JTP_2, TP_g),\ \ldots,\ \max_{g}\ \mathrm{sim}(JTP_f, TP_g)\} \tag{3-15}$$

Every JTP_f in the union U is compared with every TP'_h in the sentence S_j:

$$sem_2 = \{\max_{h}\ \mathrm{sim}(JTP_1, TP'_h),\ \max_{h}\ \mathrm{sim}(JTP_2, TP'_h),\ \ldots,\ \max_{h}\ \mathrm{sim}(JTP_f, TP'_h)\} \tag{3-16}$$
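Equations 3-15 and 3-16 build one vector per sentence, with one entry per item of the union U. A minimal Python sketch follows. The function name `semantic_vector` is hypothetical, and the `exact_match` measure is only a placeholder for the word and phrase similarities described earlier, so the resulting values show the mechanics rather than real semantic scores.

```python
def semantic_vector(union, sentence, sim):
    """For each item of the union, keep its best similarity to any unit of the sentence."""
    return [max((sim(item, unit) for unit in sentence), default=0.0) for item in union]


def exact_match(a, b):
    """Placeholder measure: 1.0 for identical units, otherwise 0.0."""
    return 1.0 if a == b else 0.0


U = ["John", "have", "turn_on", "his", "new radio", "switch_on", "the"]
S_source = ["John", "have", "turn_on", "his", "new radio"]
S_user = ["John", "switch_on", "the", "new radio"]

sem1 = semantic_vector(U, S_source, exact_match)
sem2 = semantic_vector(U, S_user, exact_match)
```

A real similarity measure would give "turn_on" and "switch_on" a non-zero score against each other, which is what lets the method catch synonym substitution.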

Finally, the semantic similarity of the two sentences is calculated using cosine similarity:
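Equation 3-17 is not reproduced in this text. Since the step is stated to be a cosine similarity of sem₁ and sem₂, it can be sketched in Python with the standard cosine definition. The function name `cosine_similarity` and the choice to return 0 when either vector is all zeros are assumptions, and the two input vectors are illustrative values only.

```python
import math


def cosine_similarity(v1, v2):
    """Standard cosine of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: no overlap at all counts as no similarity
    return dot / (norm1 * norm2)


# Illustrative vectors over a seven-item union (not values from the patent).
sem1 = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
sem2 = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
sim_sem = cosine_similarity(sem1, sem2)
```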

Step 2, calculating the sentence order similarity: while sem₁ and sem₂ are being calculated, the g and h that produce each maximum similarity are recorded; where all the similarities are 0, the order value is 0. This yields order₁ and order₂, and the order similarity of the two sentences is:
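Equation 3-18 is not reproduced in this text. Because the method is stated to follow Li et al. (2006), the sketch below assumes their word order similarity, one minus the norm of the difference of the two order vectors divided by the norm of their sum. The function names, the 1-based positions, the placeholder `exact_match` measure, and the value returned when both vectors are all zeros are assumptions.

```python
import math


def order_vector(union, sentence, sim):
    """Record, per union item, the 1-based position of its best match (0 if none)."""
    order = []
    for item in union:
        scores = [sim(item, unit) for unit in sentence]
        best = max(scores, default=0.0)
        order.append(scores.index(best) + 1 if best > 0 else 0)
    return order


def order_similarity(order1, order2):
    """Assumed form from Li et al. (2006): 1 - |r1 - r2| / |r1 + r2|."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(order1, order2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(order1, order2)))
    if total == 0:
        return 0.0  # assumption: nothing matched in either sentence
    return 1.0 - diff / total


def exact_match(a, b):
    """Placeholder measure: 1.0 for identical units, otherwise 0.0."""
    return 1.0 if a == b else 0.0


U = ["John", "have", "turn_on", "his", "new radio", "switch_on", "the"]
order2 = order_vector(U, ["John", "switch_on", "the", "new radio"], exact_match)
```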

Finally, Step 3, calculating the overall sentence similarity: the two similarity measures above are combined with different weights to give the overall similarity of the sentences:

$$\mathrm{Sim}(S_{ni}, S_j) = \alpha\,\mathrm{Sim}_{sem} + (1-\alpha)\,\mathrm{Sim}_{order} \tag{3-19}$$

where α ≤ 1. Li et al. (2006) hold that α should be greater than 0.5, so α may lie between 0.5 and 1 (0.5 ≤ α ≤ 1) and can be set by the user as needed.

When the degree of plagiarism is higher than a threshold $\tau$, the present invention determines that the user file contains plagiarism. In other words, if $Sim(S_{ni},S_j)>\tau$, the user file is regarded as having plagiarized the source file.

The process of the similarity comparison procedure S03 is described below, again using the example above.

After the preceding procedures S01 and S02, $S_{11}$ = {John, have, turn_on, his, new radio}, $S_1$ = {John, switch_on, the, new radio}, the collective word data = {John, have, turn_on, his, John, switch_on, the}, the collective phrase data = {new radio, new radio}, and the union $U$ = {John, have, turn_on, his, new radio, switch_on, the}.

Therefore, all possible comparison cases between words and phrases are shown in the following table:

Comparing the sentences of the source file with the contents of the union $U$ gives rise to six comparison cases (cases A to F). Cases A, B and C are word-to-word comparisons, cases D and E are word-to-phrase comparisons, and case F is a phrase-to-phrase comparison.

If a similarity of 0 cannot be assigned directly on the basis of part of speech, the case must be handled individually. Taking case D as an example, Sim(John, new radio) = max(Sim(John, new), Sim(John, radio)) = max(0, Sim(John, radio)) = Sim(John, radio).
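The case D rule can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the part-of-speech table `POS`, the score table `SAME_POS_SCORES`, the value 0.1 for Sim(John, radio), and the function names are all hypothetical, and a real system would obtain word similarity from a lexical resource.

```python
from typing import Callable

# Hypothetical part-of-speech tags and scores, for illustration only.
POS = {"John": "noun", "new": "adjective", "radio": "noun"}
SAME_POS_SCORES = {frozenset(("John", "radio")): 0.1}


def word_sim(a: str, b: str) -> float:
    """Toy word-to-word similarity: 0 when the parts of speech differ."""
    if a == b:
        return 1.0
    if POS.get(a) != POS.get(b):
        return 0.0
    return SAME_POS_SCORES.get(frozenset((a, b)), 0.0)


def word_phrase_sim(word: str, phrase: str,
                    sim: Callable[[str, str], float] = word_sim) -> float:
    """Word-to-phrase similarity: the maximum over the words of the phrase."""
    return max(sim(word, part) for part in phrase.split())


case_d = word_phrase_sim("John", "new radio")
```

Here Sim(John, new) is 0 because the tags differ, so `case_d` reduces to the assumed Sim(John, radio), mirroring the derivation in the text.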

The source file sentence is then compared with the union $U$, giving the following comparison table:

Here, the largest number in each column is recorded, giving $sem_1$ = {1, 1, 2, 1, 2, 1, 0}, and the row position at which each column maximum occurs is recorded, giving $order_1$ = {1, 2, 3, 4, 5, 3, 0}.

Next, the user file sentence is compared with the union $U$, giving the following comparison table:

Here again, the largest number in each column is recorded. Assuming Sim(have, switch_on) = 0.2 (another value may be chosen), this gives $sem_2$ = {1, 0.2, 1, 0, 2, 2, 1}; the row position at which each column maximum occurs is recorded, giving $order_2$ = {1, 2, 2, 0, 4, 2, 3}.
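The construction of the semantic and order vectors can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: `sem_and_order`, `toy_sim` and the `TOY_SCORES` table are hypothetical names, and the scores are simply chosen to match the values quoted in the example (including the assumed Sim(have, switch_on) = 0.2), not derived from any real similarity measure.

```python
from typing import Callable, List, Tuple


def sem_and_order(union: List[str], sentence: List[str],
                  sim: Callable[[str, str], float]
                  ) -> Tuple[List[float], List[int]]:
    """For each element of the union, record the best similarity against
    the sentence and the 1-based position that produced it (0 if none)."""
    sem, order = [], []
    for term in union:
        best_score, best_pos = 0.0, 0
        for pos, candidate in enumerate(sentence, start=1):
            score = sim(term, candidate)
            if score > best_score:  # keep the first maximum on ties
                best_score, best_pos = score, pos
        sem.append(best_score)
        order.append(best_pos)
    return sem, order


# Hypothetical scores chosen to match the worked example.
TOY_SCORES = {
    ("John", "John"): 1.0,
    ("have", "switch_on"): 0.2,
    ("turn_on", "switch_on"): 1.0,
    ("new radio", "new radio"): 2.0,
    ("switch_on", "switch_on"): 2.0,
    ("the", "the"): 1.0,
}


def toy_sim(a: str, b: str) -> float:
    return TOY_SCORES.get((a, b), 0.0)


union = ["John", "have", "turn_on", "his", "new radio", "switch_on", "the"]
s1 = ["John", "switch_on", "the", "new radio"]
sem2, order2 = sem_and_order(union, s1, toy_sim)
```

With these assumed scores, `sem2` and `order2` agree with the vectors stated for the user file sentence. Taking the first maximum on ties is a design choice of this sketch; the source does not specify tie-breaking.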

Finally, the overall similarity of the sentences is calculated, giving:

Overall similarity $Sim(S_{11},S_1)=\alpha\times 0.79931+(1-\alpha)\times 0.60114$

Assuming $\alpha=0.6$ (another value may be used) and a threshold $\tau$ of, for example, 0.7 (another value may be set as required), $Sim(S_{11},S_1)=0.6\times 0.79931+0.4\times 0.60114=0.720042>\tau\ (0.7)$. The user file is therefore determined to contain plagiarism relative to the source file.
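The arithmetic of this example can be checked with a short Python sketch. The function names are hypothetical. The semantic similarity is the cosine similarity named in the text; the order similarity is taken here in the form given by Li et al. (2006), one minus the norm of the difference of the order vectors divided by the norm of their sum, which is an assumption of this sketch that happens to reproduce the 0.60114 quoted above.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def order_similarity(r1: Sequence[float], r2: Sequence[float]) -> float:
    """1 - ||r1 - r2|| / ||r1 + r2|| (form assumed from Li et al., 2006)."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    # Assumption: if nothing matched in either sentence, report 0.
    return 1.0 - diff / total if total else 0.0


def overall_similarity(sim_sem: float, sim_order: float, alpha: float) -> float:
    """Weighted combination of semantic and order similarity."""
    return alpha * sim_sem + (1.0 - alpha) * sim_order


# Vectors from the worked example.
sem1, sem2 = [1, 1, 2, 1, 2, 1, 0], [1, 0.2, 1, 0, 2, 2, 1]
order1, order2 = [1, 2, 3, 4, 5, 3, 0], [1, 2, 2, 0, 4, 2, 3]

sim_sem = cosine_similarity(sem1, sem2)
sim_order = order_similarity(order1, order2)
overall = overall_similarity(sim_sem, sim_order, alpha=0.6)
is_plagiarism = overall > 0.7  # threshold tau from the example
```

Rounded to five places, `sim_sem` and `sim_order` match the two values in the text, and `overall` is approximately 0.72, above the example threshold.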

In summary, the plagiarism detection method for English materials according to the present invention includes a grammar processing procedure, a phrase recognition procedure, and a similarity comparison procedure. The source file and the user file are each subjected to grammar analysis; the source word data is compared with the user word data and the source phrase data with the user phrase data; and a similarity comparison is then performed in terms of semantics, thereby calculating the degree of plagiarism of the user file relative to the source file. Compared with the prior art, the present invention therefore not only uses grammatical structure analysis to extract the words and phrases in a sentence, but also takes the phrase as the comparison unit for plagiarism detection, supplemented by semantics, and so improves on the accuracy of prior-art comparison based on single words.

The above is illustrative only and not restrictive. Any equivalent modification or alteration that does not depart from the spirit and scope of the present invention shall be included in the scope of the appended claims.

S01‧‧‧Grammar processing procedure

S02‧‧‧Phrase recognition procedure

S03‧‧‧Similarity comparison procedure

Claims

A plagiarism detection method for English materials, implemented on a computer, the plagiarism detection method comprising: a grammar processing procedure, comprising a grammar analysis step of performing grammar analysis on a source file and a user file respectively, to obtain source word data and source phrase data, and user word data and user phrase data, respectively; a phrase recognition procedure, comprising a set generation step of comparing the source word data with the user word data and the source phrase data with the user phrase data to generate collective word data, and comparing the source phrase data with the user phrase data to generate collective phrase data, wherein the words in the collective word data include all root-restored words in the source file and the user file, and the phrases in the collective phrase data include the phrases that are the same in the source file and the user file; and a similarity comparison procedure of performing a similarity comparison in terms of semantics and order on the basis of the collective word data and the collective phrase data, wherein the degree of plagiarism of the user file relative to the source file is calculated by analyzing the semantic similarity and the order similarity of the collective word data and the collective phrase data with respect to the source file and the user file.

The plagiarism detection method for English materials as described in claim 1, wherein the source word data and the user word data each comprise at least one word, and the source phrase data and the user phrase data each comprise at least one phrase.

The plagiarism detection method for English materials as described in claim 1, wherein the grammar analysis step uses a natural language processing technique.

The plagiarism detection method for English materials as described in claim 1, wherein the grammar processing procedure further comprises a sentence segmentation step, a complete-plagiarism detection step, and a root restoration step.

The plagiarism detection method for English materials as described in claim 1, wherein, in the grammar analysis step, the source file and the user file are compared against a dictionary to perform part-of-speech tagging and phrase extraction.

The plagiarism detection method for English materials as described in claim 5, wherein the words in the collective word data correspond to words in the dictionary.

The plagiarism detection method for English materials as described in claim 1, wherein, in the phrase recognition procedure, the phrase is used as the comparison unit.

The plagiarism detection method for English materials as described in claim 1, wherein, in calculating the semantic similarity and the order similarity, the phrase is used as the comparison unit.

The plagiarism detection method for English materials as described in claim 7, wherein, in the similarity comparison procedure, the semantic similarity of word to word, of word to phrase, and of phrase to phrase is calculated.

The plagiarism detection method for English materials as described in claim 9, wherein the semantic similarity of word to word is calculated according to the part of speech of the words.

The plagiarism detection method for English materials as described in claim 9, wherein, in calculating the semantic similarity of word to word, all words in the collective word data are compared with one another to find the maximum word-to-word similarity.

The plagiarism detection method for English materials as described in the scope of the patent application, wherein the semantic similarity of word to phrase or of phrase to phrase is calculated on the basis of the individual words.

The plagiarism detection method for English materials as described in claim 11, wherein, in calculating the semantic similarity of word to phrase, each word in the collective word data is compared with each word of the phrases in the collective phrase data to find the maximum word-to-phrase similarity.

The plagiarism detection method for English materials as described in claim 13, wherein, in calculating the semantic similarity of phrase to phrase, all phrases in the collective phrase data are compared with one another to find the maximum phrase-to-phrase similarity.

The plagiarism detection method for English materials as described in claim 14, wherein the semantic similarity or the order similarity is calculated on the basis of the maximum word-to-word similarity, the maximum word-to-phrase similarity, and the maximum phrase-to-phrase similarity.

The plagiarism detection method for English materials as described in claim 1, wherein, when the degree of plagiarism is higher than a threshold, the user file is determined to contain plagiarism.