TWI594135B - Plagiarism detecting method of information in english - Google Patents

Plagiarism detecting method of information in English

Info

Publication number
TWI594135B
Authority
TW
Taiwan
Prior art keywords
phrase
word
words
data
similarity
Prior art date
Application number
TW102102093A
Other languages
Chinese (zh)
Other versions
TW201430591A (en
Inventor
蘇嘉穎
王惠嘉
劉繼仁
羅鄉儀
林柏安
Original Assignee
國立成功大學
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 國立成功大學 filed Critical 國立成功大學
Priority to TW102102093A priority Critical patent/TWI594135B/en
Publication of TW201430591A publication Critical patent/TW201430591A/en
Application granted granted Critical
Publication of TWI594135B publication Critical patent/TWI594135B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Description

Plagiarism detection method for English-language documents

The present invention relates to a plagiarism detection method for English-language documents.

As demand for higher education in Taiwan has grown and such education has become widespread, the volume of theses and reports produced by students has increased substantially. Most major academic research results are now published in English, yet writing in English is not easy for students in Taiwan. In addition, as students try to save time in order to pass examinations or obtain degrees, plagiarism has become increasingly common. From the standpoint of academic ethics, plagiarism is not only disrespectful to the original author but also hinders the further development of professional knowledge.

However, existing plagiarism detection software (for example Turnitin, CopyCatch, or EVE2) mostly compares a submission against thesis databases or web resources, and performs the comparison on the basis of single words only. Synonyms therefore often go undetected, which lowers the accuracy of plagiarism detection.

In view of the above, an object of the invention is to provide a plagiarism detection method for English-language documents that not only uses syntactic structure to analyze and extract the words and phrases in a sentence, but also uses the phrase as the unit of comparison for plagiarism detection, supplemented by semantics, so as to improve on the accuracy of prior-art comparison based on single words.

To achieve this object, a plagiarism detection method for English-language documents according to the invention is implemented on a computer and includes a syntax processing procedure, a phrase recognition procedure, and a similarity comparison procedure. The syntax processing procedure includes a syntactic analysis step that parses a source document and a user document separately to obtain source word data and source phrase data, and user word data and user phrase data. The phrase recognition procedure includes a set generation step that compares the source word data with the user word data and the source phrase data with the user phrase data to generate aggregate word data, and compares the source phrase data with the user phrase data to generate aggregate phrase data. The similarity comparison procedure performs a semantic similarity comparison based on the aggregate word data and the aggregate phrase data, and computes the degree of plagiarism of the user document relative to the source document from the semantic similarity and the order similarity of the aggregate word data and the aggregate phrase data.

In one embodiment, the source word data and the user word data each contain at least one word, and the source phrase data and the user phrase data each contain at least one phrase.

In one embodiment, the syntactic analysis step uses a natural language processing technique.

In one embodiment, the syntax processing procedure further includes a sentence segmentation step, a complete plagiarism detection step, and a root restoration step.

In one embodiment, in the syntactic analysis step, the source document and the user document are compared against a dictionary to perform part-of-speech tagging and phrase extraction.

In one embodiment, the words in the aggregate word data correspond to words in the dictionary.

In one embodiment, the phrase recognition procedure uses the phrase as the unit of comparison.

In one embodiment, the words in the aggregate word data include all root-restored words from the source document and the user document, and the phrases in the aggregate phrase data include the phrases that are identical in the source document and the user document.

In one embodiment, the semantic similarity and the order similarity are computed with the phrase as the unit of comparison.

In one embodiment, the similarity comparison procedure separately computes the word-to-word, word-to-phrase, and phrase-to-phrase semantic similarity within the aggregate word data and the aggregate phrase data.

In one embodiment, the word-to-word semantic similarity is computed separately according to the part of speech of the words.

In one embodiment, when computing the word-to-word semantic similarity, all words in the aggregate word data are compared with one another to find a maximum word-to-word similarity.

In one embodiment, the word-to-phrase and phrase-to-phrase semantic similarities are each computed on a word basis.

In one embodiment, when computing the word-to-phrase semantic similarity, each word in the aggregate word data is compared with each word in the aggregate phrase data to find a maximum word-to-phrase similarity.

In one embodiment, when computing the phrase-to-phrase semantic similarity, all words in the aggregate phrase data are compared with one another to find a maximum phrase-to-phrase similarity.

In one embodiment, the semantic similarity or the order similarity is computed from the maximum word-to-word similarity, the maximum word-to-phrase similarity, and the maximum phrase-to-phrase similarity, respectively.

In one embodiment, when the degree of plagiarism exceeds a threshold, the user document is deemed to involve plagiarism.

In summary, the plagiarism detection method for English-language documents according to the invention includes a syntax processing procedure, a phrase recognition procedure, and a similarity comparison procedure. The source document and the user document are parsed separately; the source word data is compared with the user word data, and the source phrase data with the user phrase data; a similarity comparison is then made from the semantic standpoint, from which the degree of plagiarism of the user document relative to the source document is computed. Compared with the prior art, the invention therefore not only uses syntactic structure to analyze and extract the words and phrases in a sentence, but also uses the phrase as the unit of comparison for plagiarism detection, supplemented by semantics, and so improves on the accuracy of prior-art comparison based on single words.

P01‧‧‧Sentence segmentation step

P02‧‧‧Complete plagiarism detection step

P03‧‧‧Syntactic analysis step

P04‧‧‧Root restoration step

S01‧‧‧Syntax processing procedure

S02‧‧‧Phrase recognition procedure

S03‧‧‧Similarity comparison procedure

FIG. 1 is a flowchart of the procedures of a plagiarism detection method for English-language documents according to a preferred embodiment of the invention.

FIG. 2 is a flowchart of the steps of the syntax processing procedure of FIG. 1.

A plagiarism detection method for English-language documents according to a preferred embodiment of the invention is described below with reference to the drawings, in which identical elements are denoted by identical reference numerals.

Refer to FIG. 1, which is a flowchart of the procedures of a plagiarism detection method for English-language documents according to a preferred embodiment of the invention.

The method can be implemented on a computer (for example, but not limited to, as a software program). It includes a syntax processing procedure S01, a phrase recognition procedure S02, and a similarity comparison procedure S03.

First, the syntax processing procedure S01 performs syntax processing on a source document and a user document. The source document may comprise a plurality of documents, each containing at least one sentence, and the user document may likewise contain at least one sentence. Here, the source document refers to documents held in a database, which may come from, for example but not limited to, a thesis database, web resources, or English-language documents obtained by other means. The user document is the document written by the user that is to be checked for plagiarism of the source document.

The syntax processing procedure S01 splits the contents of the source document and of the user document into sentences, analyzes their syntactic structure to extract all the words and phrases in each sentence, and performs processing such as part-of-speech tagging and root restoration. The invention uses natural language processing (NLP) techniques for syntax processing and analysis, carried out with the Stanford Parser, a grammar parsing tool developed by The Stanford Natural Language Processing Group at Stanford University.

Refer to FIG. 2, which is a flowchart of the steps of the syntax processing procedure S01 of FIG. 1. The syntax processing procedure S01 may include a sentence segmentation step P01, a complete plagiarism detection step P02, a syntactic analysis step P03, and a root restoration step P04.

First, the sentence segmentation step P01 is performed.

In the sentence segmentation step P01, the source document and the user document are each split into individual sentences using the Stanford Parser, and grammatical structures such as phrases, word relationships, and clauses are identified from the parse tree.
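As a rough illustration of step P01 only, sentence splitting can be approximated with a regular expression. The description uses the Stanford Parser for this; the regular expression and the function name `split_sentences` below are assumptions of this sketch and do not reproduce the parser's behavior (abbreviations such as "Dr." would be split incorrectly).

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop the empty string produced when the input is empty.
    return [p for p in parts if p]


sentences = split_sentences(
    "John has turned on his new radio. John switched the new radio on."
)
```

Each returned string can then be handed to a parser to obtain the tree structure from which phrases are read.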

Next, the complete plagiarism detection step P02 is performed.

The complete plagiarism detection step P02 detects sentences in the user document that are copied verbatim from the source document. In other words, the sentences of the user document are compared with the sentences of the source document so that fully copied passages are found first, which reduces the complexity of the subsequent computation and processing.
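The verbatim check of step P02 could be sketched as a set lookup over normalized sentences. How two sentences are judged identical is not specified in the description, so the normalization used here (lower-casing and collapsing whitespace) and the names `normalize` and `exact_matches` are assumptions.

```python
def normalize(sentence: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(sentence.lower().split())


def exact_matches(source_sentences, user_sentences):
    """Return the user sentences that also occur verbatim in the source."""
    source_set = {normalize(s) for s in source_sentences}
    return [s for s in user_sentences if normalize(s) in source_set]


copied = exact_matches(
    ["John has turned on his new radio.", "It was loud."],
    ["It  was loud.", "John switched the new radio on."],
)
```

Sentences returned here can be reported immediately, and only the remaining sentences need the more expensive semantic comparison.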

Next, the syntactic analysis step P03 is performed.

In the syntactic analysis step P03, the source document and the user document are parsed separately to obtain source word data and source phrase data, and user word data and user phrase data. Syntactic analysis (parsing) uses a computer to analyze the grammatical rules and structure of a sentence and expands the result as a tree. The Stanford Parser is also used to compare the source document and the user document against an English lexical database (for example WordNet) in order to tag the part of speech of each word and to extract phrases. In other words, step P03 takes the sentences that have already been segmented and checked for complete plagiarism and, using the tree structure and dependency relations produced by the Stanford Parser with WordNet as an auxiliary tool, extracts all the words and phrases in the source document and the user document. For part-of-speech tagging, the tags produced by the Stanford Parser are converted into a form readable by WordNet, which covers four parts of speech: nouns, verbs, adjectives, and adverbs. For phrase extraction, the phrase is used as the unit of comparison for plagiarism detection, supplemented by semantics, and all phrases are extracted using the tree structure and the dependency relations produced by the Stanford Parser.
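The tag conversion mentioned above can be illustrated with a small mapping. The Stanford Parser emits Penn Treebank tags for English, and WordNet conventionally labels its four parts of speech `n`, `v`, `a`, and `r`; the function name `to_wordnet_pos` and the prefix-based rule are assumptions of this sketch, not the conversion actually used in the described method.

```python
from typing import Optional


def to_wordnet_pos(penn_tag: str) -> Optional[str]:
    """Map a Penn Treebank tag to a WordNet part-of-speech letter, or None."""
    prefixes = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}
    for prefix, pos in prefixes.items():
        if penn_tag.startswith(prefix):
            return pos
    return None  # e.g. determiners and pronouns have no WordNet category


tags = [to_wordnet_pos(t) for t in ["NNP", "VBZ", "JJ", "RB", "DT"]]
```

Tokens that map to `None` cannot be looked up in WordNet and would have to be handled separately.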

Finally, the root restoration step P04 is performed.

In the root restoration step P04, the words and phrases extracted in step P03 are reduced to their root forms using WordNet, so that the same word is not misjudged because of differences in number or tense. After these four steps the syntax processing procedure S01 is complete, yielding the source word data and source phrase data, and the user word data and user phrase data. The source word data and the user word data may each contain at least one word, and the source phrase data and the user phrase data each contain at least one phrase (that is, a meaningful combination of English words).
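The effect of root restoration can be shown with a toy lookup table. The description performs this step with WordNet; the table `LEMMAS` and the helper names below are hypothetical stand-ins written only for this example.

```python
# Hypothetical hand-written table; a real system would query WordNet instead.
LEMMAS = {"has": "have", "turned": "turn", "switched": "switch", "radios": "radio"}


def lemmatize(token: str) -> str:
    """Return the root form if the token is in the table, else the token itself."""
    return LEMMAS.get(token.lower(), token)


def lemmatize_unit(unit: str) -> str:
    """Apply root restoration to every token of a word or phrase."""
    return " ".join(lemmatize(t) for t in unit.split())


restored = [lemmatize_unit(u) for u in ["John", "has", "turned on", "his new radio"]]
```

After this step "has" and "have", or "turned on" and "turn on", compare as the same unit.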

The syntax processing procedure S01 is illustrated below with actual English sentences. To simplify the explanation, the source document is taken to be a single sentence, for example "John has turned on his new radio.", and the user document is likewise a single sentence, for example "John switched the new radio on."

After part-of-speech tagging and parsing with the Stanford Parser and extraction of all words and phrases, the results are: S_11 = {John, has, turned on, his new radio}, tree_p_11 = {turned on, his new radio}, T_11 = {John, has}; and S_1 = {John, switched on, the new radio}, tree_p_1 = {the new radio}, dep_p_1 = {switched on}, T_1 = {John}. Here S_ni is the i-th sentence of the n-th source document, tree_p_ni is the set of phrases extracted from the tree structure of that sentence, and T_ni is the set of words in that sentence. S_j is the j-th sentence of the user document, tree_p_j is the set of phrases extracted from the tree structure of that sentence, dep_p_j is the set of phrases extracted from its typed dependencies, and T_j is the set of words in that sentence.

Root restoration then gives: S_11 = {John, have, turn on, his new radio}, tree_p_11 = {turn on, his new radio}, T_11 = {John, have}; and S_1 = {John, switch_on, the new radio}, tree_p_1 = {the new radio}, dep_p_1 = {switch_on}, T_1 = {John}. By definition, the source word data is T_11, the source phrase data is tree_p_11, the user word data is T_1, and the user phrase data is tree_p_1 together with dep_p_1.

Next, referring again to FIG. 1, the phrase recognition procedure S02 is performed. The phrase recognition procedure S02 includes a set generation step, which compares the source word data with the user word data and the source phrase data with the user phrase data to generate aggregate word data, and also compares the source phrase data with the user phrase data to generate aggregate phrase data. The phrase remains the unit of comparison in the phrase recognition procedure S02.

In the previous procedure, all phrases in the user document were extracted as tree_p_1 and dep_p_1. The phrases in dep_p_1 have already been recognized by WordNet, but those in tree_p_1 were simply taken from the tree structure and are not in themselves meaningful for plagiarism comparison. Meaningful phrases must therefore be selected by comparing the sentences of the source document and the user document, after which the word sets and phrase sets of the two sentences are combined to produce the aggregate word data and the aggregate phrase data.

The process is as follows. Qualified phrases are first selected by comparing the sentences and by looking phrases up in a machine-readable dictionary (here, WordNet). Phrases that do not qualify are decomposed, and the resulting sub-phrases are screened again, until every decomposed phrase has either passed screening or been broken down into single words. Specifically, phrases in the source document and the user document that cannot be paired (that is, are not identical) are decomposed. Recognizable words obtained from the decomposition are placed in the aggregate word data, and the sub-phrases obtained are paired again. Sub-phrases that pair successfully are placed in the aggregate phrase data; those that do not are looked up in WordNet. If found, they are added to the aggregate word data; if not, they are decomposed further. The phrase recognition procedure S02 is complete when none of the phrases in the aggregate phrase data can be found in WordNet. The aggregate word data contains all root-restored words from the source document and the user document, and the aggregate phrase data contains at least the phrases that are identical in the two documents. The words in the aggregate word data correspond to words in the WordNet dictionary, meaning that they can be found in WordNet and are recognizable or meaningful.
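One possible reading of the screening and decomposition process, simplified to a single pass, is sketched below. Several points are assumptions of this sketch rather than statements from the description: a "paired" sub-phrase is taken to be the longest run of at least two consecutive tokens shared with a phrase of the other sentence, `DICTIONARY` is a hypothetical stand-in for WordNet lookups, and all function names are invented for illustration.

```python
DICTIONARY = {"turn on", "switch on"}  # hypothetical stand-in for WordNet


def tokens_of(phrase):
    """Split a phrase into tokens, treating '_' like a space."""
    return phrase.replace("_", " ").split()


def common_run(a, b):
    """Longest contiguous run of tokens shared by token lists a and b."""
    best = []
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > len(best):
                best = a[i:i + k]
    return best


def recognize(phrases, other_phrases, words):
    """Split one sentence's phrases into (aggregate words, aggregate phrases)."""
    words, kept = list(words), []
    for phrase in phrases:
        toks = tokens_of(phrase)
        runs = [common_run(toks, tokens_of(o)) for o in other_phrases]
        run = max(runs, key=len, default=[])
        if len(run) >= 2:
            # Paired sub-phrase: keep it, and treat the leftover tokens as words.
            kept.append(" ".join(run))
            start = next(i for i in range(len(toks)) if toks[i:i + len(run)] == run)
            words.extend(toks[:start] + toks[start + len(run):])
        elif " ".join(toks) in DICTIONARY:
            words.append("_".join(toks))  # dictionary entry acts as one word
        else:
            words.extend(toks)  # fully decomposed into single words
    return words, kept


src_words, src_phrases = recognize(
    ["turn on", "his new radio"], ["the new radio", "switch_on"], ["John", "have"]
)
usr_words, usr_phrases = recognize(
    ["the new radio", "switch_on"], ["turn on", "his new radio"], ["John"]
)
```

With the example sentences this yields the words John, have, turn_on, his for the source and John, the, switch_on for the user document, with "new radio" as the shared phrase on both sides. The iterative re-screening described in the text is not modeled.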

Continuing with the English sentences of the source document and the user document above, the syntax processing procedure S01 gives: S_11 = {John, have, turn on, his new radio}, tree_p_11 = {turn on, his new radio}, T_11 = {John, have}; and S_1 = {John, switch_on, the new radio}, tree_p_1 = {the new radio}, dep_p_1 = {switch_on}, T_1 = {John}.

After the comparison, screening, and decomposition of the phrase recognition procedure S02, the results are: S_11 = {John, have, turn_on, his, new radio}, S_1 = {John, switch_on, the, new radio}, aggregate word data = {John, have, turn_on, his, John, switch_on, the}, and aggregate phrase data = {new radio, new radio}.

Finally, referring again to FIG. 1, the similarity comparison procedure S03 is performed.

The similarity comparison procedure S03 performs a semantic similarity comparison based on the aggregate word data and the aggregate phrase data, and computes the degree of plagiarism of the user document relative to the source document from the semantic similarity and the order similarity of the aggregate word data and the aggregate phrase data. The phrase is the unit of comparison in computing both the semantic similarity and the order similarity. The word-to-word, word-to-phrase, and phrase-to-phrase semantic similarities within the aggregate word data and the aggregate phrase data are computed separately. The semantic similarity or the order similarity is computed from the maximum word-to-word similarity, the maximum word-to-phrase similarity, and the maximum phrase-to-phrase similarity, respectively, each of which is described below.

The semantic similarity between a word w and another word w is computed according to the part of speech of the words. All words w in the aggregate word data are compared with one another to find a maximum word-to-word similarity. In WordNet, nouns and verbs are organized into a hierarchy by hypernym and hyponym relations, whereas adjectives and adverbs cannot be represented hierarchically. When computing the semantic similarity between two words, their parts of speech must therefore be determined first, and a different similarity measure is used accordingly. If the two words are nouns or verbs, a path-based measure is used; if they are adjectives or adverbs, a gloss-based measure is used.

The invention builds on the hierarchy of nouns and verbs in WordNet and combines the PATH (Rada et al., 1989) and WUP (Wu & Palmer, 1994) methods to compute the semantic similarity of nouns or verbs, while VECTOR (Patwardhan, 2003) is used to compute the semantic similarity of adjectives or adverbs. The full text of the PATH, WUP, and VECTOR papers is accordingly incorporated into the disclosure of the invention. The details and formulas are as follows.

The PATH method computes similarity from the length of the shortest path between two words in WordNet, as shown in Formula 3-1:
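Formula 3-1 itself is not reproduced in this text. For orientation, the path measure as it is commonly stated in the WordNet similarity literature is the reciprocal of the shortest path length, counted in nodes; whether Formula 3-1 uses exactly this normalization is an assumption.

```latex
\mathrm{sim}_{\mathrm{PATH}}(w_1, w_2) = \frac{1}{\mathrm{len}(w_1, w_2)}
```

Here $\mathrm{len}(w_1, w_2)$ denotes the number of nodes on the shortest path between $w_1$ and $w_2$, so identical words score 1 and the score falls as the words move apart in the hierarchy.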

The WUP method computes word similarity from the depths of the two words in WordNet and the depth of their least common subsumer (LCS), that is, their lowest common parent node, as shown in Formula 3-2:
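Formula 3-2 is likewise not reproduced in this text. The measure of Wu and Palmer is usually written as follows; treating this as the content of Formula 3-2 is an assumption.

```latex
\mathrm{sim}_{\mathrm{WUP}}(w_1, w_2) =
  \frac{2 \cdot \mathrm{depth}\bigl(\mathrm{LCS}(w_1, w_2)\bigr)}
       {\mathrm{depth}(w_1) + \mathrm{depth}(w_2)}
```

The deeper the common parent lies relative to the two words, the closer the score is to 1.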

The VECTOR method builds vectors v1 and v2 from the WordNet glosses of the two words w1 and w2 and computes the word similarity as their cosine, as shown in Formula 3-3:
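The three measures can be tried out on toy data. The sketch below assumes the commonly stated forms of the measures (path similarity as 1 over the node count of the shortest path, Wu-Palmer as twice the LCS depth over the sum of depths, and a cosine over gloss vectors); the taxonomy `PARENT` and the glosses in `GLOSS` are hypothetical and are not WordNet data. The published VECTOR measure uses second-order co-occurrence vectors, which this sketch replaces with plain word counts.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); not real WordNet data.
PARENT = {"entity": None, "device": "entity", "radio": "device",
          "television": "device", "animal": "entity", "dog": "animal"}

# Hypothetical glosses written for this example only.
GLOSS = {"quick": "moving fast or doing something in a short time",
         "rapid": "done or occurring in a short time or at a fast rate"}


def ancestors(node):
    """Chain from the node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    return len(ancestors(node))  # the root has depth 1


def lcs(a, b):
    """Deepest node that is an ancestor of both a and b."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)


def path_sim(a, b):
    c = lcs(a, b)
    edges = (depth(a) - depth(c)) + (depth(b) - depth(c))
    return 1.0 / (edges + 1)  # edges + 1 = nodes on the shortest path


def wup_sim(a, b):
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


def vector_sim(a, b):
    """Cosine between bag-of-words vectors of the two glosses."""
    va, vb = Counter(GLOSS[a].split()), Counter(GLOSS[b].split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

In this toy hierarchy "radio" and "television" share the parent "device", so both hierarchy-based scores are higher for that pair than for "radio" and "dog", whose only common ancestor is the root.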

Overall, therefore, the semantic similarity between two words (hereinafter Cases A, B, and C) can be computed as shown in Formulas 3-4 to 3-6, where len(w1) is the length of the word w1 in the aggregate word data:

The word-to-phrase semantic similarity is also computed on a word basis. Each word in the aggregate word data is compared with each word in the aggregate phrase data to find a maximum word-to-phrase similarity. Because a phrase p in the aggregate phrase data arises purely from the comparison of two sentences in the phrase recognition procedure S02 and cannot be found in WordNet, every word of every phrase p in the aggregate phrase data must be compared one by one to find the maximum similarity among them.

Let phrase p = {p_t_1, p_t_2, ..., p_t_N, ...}, where p_t_N is the N-th word of phrase p. The similarity between a word and a phrase (hereinafter Cases D and E) can then be computed as shown in Formulas 3-7 to 3-10:

$$\mathrm{sim}(w, p) = \max_{N} \, \mathrm{sim}(w, p\_t_N) \tag{3-7}$$
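Formula 3-7 amounts to taking the best word-level score over the tokens of the phrase, which can be sketched as follows. The word-level function `toy_sim` and the score stored in `RELATED` are hypothetical placeholders for whichever word measure is used.

```python
# Hypothetical word-level scores; a real system would query WordNet measures.
RELATED = {frozenset(("wireless", "radio")): 0.7}


def toy_sim(a: str, b: str) -> float:
    """Placeholder word similarity: 1.0 for identical words, else a table lookup."""
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


def word_phrase_sim(word, phrase_tokens, word_sim=toy_sim):
    """sim(w, p) = max over the tokens p_t_N of sim(w, p_t_N)."""
    return max((word_sim(word, t) for t in phrase_tokens), default=0.0)


score = word_phrase_sim("wireless", ["new", "radio"])
```

The `default=0.0` covers the degenerate case of an empty phrase, which the description does not discuss.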

The phrase-to-phrase semantic similarity is likewise computed on a word basis. All words in the aggregate phrase data are compared with one another to find a maximum phrase-to-phrase similarity. Because the phrases p1 and p2 in the aggregate phrase data arise purely from the comparison of two sentences in the phrase recognition procedure S02 and cannot be found in WordNet, all the words of all the phrases in the aggregate phrase data must be compared one by one to find the maximum similarity among them.

Let phrase p1 = {p1_t_1, p1_t_2, ..., p1_t_M, ...}, where p1_t_M is the M-th word of phrase p1, and phrase p2 = {p2_t_1, p2_t_2, ..., p2_t_N, ...}, where p2_t_N is the N-th word of phrase p2. The phrase-to-phrase similarity (hereinafter Case F) is then computed as shown in Formulas 3-11 to 3-14, where len(p1) is the length of phrase p1:

Next, the aggregate word data and the aggregate phrase data are sorted in order and duplicate items are removed to form a union U, which is then compared with the source document and the user document for semantic similarity and for order.

Let the union be U = {JTP_1, JTP_2, ..., JTP_f, ..., JTP_F}, let S_ni = {TP_1, TP_2, ..., TP_g, ..., TP_G}, and let S_j = {TP'_1, TP'_2, ..., TP'_h, ..., TP'_H}. Following the sentence similarity method defined by Li et al. (2006), which takes into account both synonym substitution and changes in word order, the overall similarity of S_ni and S_j is computed by combining semantic similarity and order similarity. The steps are as follows.

Step 1: compute the sentence semantic similarity. Every JTP_f in the union U is compared with every TP_g in the sentence S_ni:

$$sem_1 = \{\max_g \mathrm{sim}(JTP_1, TP_g),\ \max_g \mathrm{sim}(JTP_2, TP_g),\ \ldots,\ \max_g \mathrm{sim}(JTP_f, TP_g),\ \ldots\} \tag{3-15}$$

Every JTP_f in the union U is compared with every TP'_h in the sentence S_j:

$$sem_2 = \{\max_h \mathrm{sim}(JTP_1, TP'_h),\ \max_h \mathrm{sim}(JTP_2, TP'_h),\ \ldots,\ \max_h \mathrm{sim}(JTP_f, TP'_h),\ \ldots\} \tag{3-16}$$
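Formulas 3-15 and 3-16 can be sketched as a single helper applied once per sentence. The unit-level similarity `toy_sim` and the score of 0.8 for the pair turn_on / switch_on are hypothetical; in the described method these values would come from the word and phrase measures above.

```python
# Hypothetical score for the one non-identical pair in the running example.
RELATED = {frozenset(("turn_on", "switch_on")): 0.8}


def toy_sim(a: str, b: str) -> float:
    """Placeholder similarity between two units (words or phrases)."""
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


def semantic_vector(union, sentence_units, sim=toy_sim):
    """For each unit JTP_f of the union, keep its best score against the sentence."""
    return [max((sim(u, t) for t in sentence_units), default=0.0) for u in union]


U = ["John", "have", "turn_on", "his", "new radio", "switch_on", "the"]
S11 = ["John", "have", "turn_on", "his", "new radio"]
S1 = ["John", "switch_on", "the", "new radio"]

sem1 = semantic_vector(U, S11)
sem2 = semantic_vector(U, S1)
```

Both vectors have one entry per element of U, so they can be compared directly in the next step.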

Finally, the semantic similarity of the two sentences is computed as the cosine similarity of the two vectors:
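The formula is not reproduced in this text. Applying the standard definition of cosine similarity to the vectors $sem_1$ and $sem_2$ defined in Formulas 3-15 and 3-16 gives the following; the symbol $Sim_{sem}$ is taken from Formula 3-19.

```latex
Sim_{sem} = \frac{sem_1 \cdot sem_2}{\lVert sem_1 \rVert \, \lVert sem_2 \rVert}
```

The numerator is the dot product of the two vectors and the denominator the product of their Euclidean norms, so the value lies between 0 and 1 when all entries are non-negative.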

Step 2: compute the sentence order similarity. While sem_1 and sem_2 are being computed, the indices g and h that produce each maximum similarity are recorded; if all the similarities for an item are 0, its order value is 0. This yields the vectors order_1 and order_2. Finally, the order similarity of the two sentences is:
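The order-similarity formula is not reproduced in this text. Since the description follows Li et al. (2006), the sketch below assumes their word-order measure, 1 - ||r1 - r2|| / ||r1 + r2||, applied to order_1 and order_2. That choice, the 1-based positions, the placeholder `toy_sim`, and its 0.8 score are all assumptions.

```python
import math

# Hypothetical score for the one non-identical pair in the running example.
RELATED = {frozenset(("turn_on", "switch_on")): 0.8}


def toy_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


def order_vector(union, sentence_units, sim=toy_sim):
    """1-based position of the best-matching unit, or 0 if nothing matches."""
    vec = []
    for u in union:
        scores = [sim(u, t) for t in sentence_units]
        best = max(scores, default=0.0)
        vec.append(scores.index(best) + 1 if best > 0 else 0)
    return vec


def order_similarity(r1, r2):
    """Assumed form after Li et al. (2006): 1 - ||r1 - r2|| / ||r1 + r2||."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    # Returning 0.0 for two all-zero vectors is a convention of this sketch.
    return 1.0 - diff / total if total else 0.0


U = ["John", "have", "turn_on", "his", "new radio", "switch_on", "the"]
order1 = order_vector(U, ["John", "have", "turn_on", "his", "new radio"])
order2 = order_vector(U, ["John", "switch_on", "the", "new radio"])
sim_order = order_similarity(order1, order2)
```

Identical order vectors give a similarity of 1, and the value falls as matching units appear at more distant positions in the two sentences.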

Finally, the third step: calculate the overall sentence similarity. Different weights are assigned and the two similarity formulas above are combined:

$$Sim(S_{ni}, S_j) = \alpha\, Sim_{sem} + (1 - \alpha)\, Sim_{order} \tag{3-19}$$

where $\alpha \le 1$. Li et al. (2006) hold that α should be greater than 0.5, so it may lie between 0.5 and 1 ($0.5 \le \alpha \le 1$); users can set it according to their needs.

When the degree of plagiarism exceeds a threshold τ, the invention determines that the user document involves plagiarism. In other words, if $Sim(S_{ni}, S_j) > \tau$, the user document is regarded as having plagiarized the source document.
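The weighted combination and the threshold test can be sketched as below. The function names are hypothetical, and the defaults α = 0.6 and τ = 0.7 are only the illustrative values used in the worked example of this description; both are meant to be user-configurable.

```python
def overall_similarity(sim_sem, sim_order, alpha=0.6):
    """Weighted sum of equation (3-19); alpha is the weight on semantics."""
    if not 0.5 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0.5, 1]")
    return alpha * sim_sem + (1.0 - alpha) * sim_order


def is_plagiarised(sim_sem, sim_order, alpha=0.6, tau=0.7):
    """Return True when the overall similarity strictly exceeds tau."""
    return overall_similarity(sim_sem, sim_order, alpha) > tau


print(is_plagiarised(0.79931, 0.60114))
```

The comparison is strict, matching the condition Sim > τ stated above.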

The example above is used again below to illustrate the similarity comparison procedure S03.

After procedures S01 and S02, $S_{11}$ = {John, have, turn_on, his, new radio}, $S_1$ = {John, switch_on, the, new radio}, aggregate word data = {John, have, turn_on, his, John, switch_on, the}, aggregate phrase data = {new radio, new radio}, and union U = {John, have, turn_on, his, new radio, switch_on, the}.

Accordingly, all possible comparison cases between words and phrases are as shown in the following table:

Comparing the sentences of the source document against the contents of the union U yields six comparison cases (cases A to F). Cases A, B, and C are word-to-word comparisons, cases D and E are word-to-phrase comparisons, and case F is a phrase-to-phrase comparison.

If a score of 0 cannot be assigned directly on the basis of part of speech, the case must be handled individually. Taking case D as an example, Sim(John, new radio) = max(Sim(John, new), Sim(John, radio)) = max(0, Sim(John, radio)) = Sim(John, radio).
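Case D can be sketched as taking the best word-level score across the words of the phrase. This is an illustration under stated assumptions: `word_phrase_similarity`, `toy_word_sim`, and `TOY_SCORES` are hypothetical names, phrases are assumed to be space-separated, and the score 0.1 for (John, radio) is an arbitrary placeholder rather than a value from the description.

```python
def word_phrase_similarity(word, phrase, word_sim):
    """Best similarity between `word` and any single word of `phrase`."""
    return max(word_sim(word, part) for part in phrase.split())


# Hypothetical scores for illustration only. Sim(John, new) = 0 follows the
# example in the text; 0.1 for (John, radio) is a placeholder.
TOY_SCORES = {("John", "new"): 0.0, ("John", "radio"): 0.1}


def toy_word_sim(a, b):
    # Any pair not listed in the toy table scores 0.
    return TOY_SCORES.get((a, b), 0.0)


score = word_phrase_similarity("John", "new radio", toy_word_sim)
print(score)
```

In practice `word_sim` would be whatever word-level semantic measure the method uses; passing it in as a parameter keeps the phrase handling independent of that choice.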

The source document sentence is then compared with the union U, giving the following comparison table:

Here, the largest value in each column is recorded, giving $sem_1$ = {1, 1, 2, 1, 2, 1, 0}, and the row position at which each column maximum occurs is recorded, giving $order_1$ = {1, 2, 3, 4, 5, 3, 0}.

Next, the user document sentence is compared with the union U, giving the following comparison table:

Here again the largest value in each column is recorded. Assuming Sim(have, switch_on) = 0.2 (another value may be chosen), this gives $sem_2$ = {1, 0.2, 1, 0, 2, 2, 1}; the row position at which each column maximum occurs is recorded, giving $order_2$ = {1, 2, 2, 0, 4, 2, 3}.
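The column-maximum and row-position bookkeeping can be sketched as follows. The comparison tables themselves are not reproduced in this text, so the scoring function `sim` below is a hypothetical stand-in chosen only so that the output matches the four vectors stated above: identical words score 1, identical phrases score 2, turn_on and switch_on are treated as synonyms scoring 1, and Sim(have, switch_on) = 0.2.

```python
# Assumed scoring rules (not taken from the missing tables).
PHRASES = {"turn_on", "switch_on", "new radio"}
RELATED = {
    frozenset(("turn_on", "switch_on")): 1.0,
    frozenset(("have", "switch_on")): 0.2,
}


def sim(a, b):
    if a == b:
        return 2.0 if a in PHRASES else 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


def match_vectors(union, sentence):
    """For each union item, record its best score and where it was found."""
    sem, order = [], []
    for item in union:
        scores = [sim(item, token) for token in sentence]
        best = max(scores)
        sem.append(best)
        # Positions are 1-based; 0 means the item matched nothing.
        order.append(scores.index(best) + 1 if best > 0 else 0)
    return sem, order


union = ["John", "have", "turn_on", "his", "new radio", "switch_on", "the"]
sem1, order1 = match_vectors(union, ["John", "have", "turn_on", "his", "new radio"])
sem2, order2 = match_vectors(union, ["John", "switch_on", "the", "new radio"])
print(sem1, order1)
print(sem2, order2)
```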

Finally, the overall similarity of the sentences is calculated:

Overall similarity $Sim(S_{11}, S_1) = \alpha \times 0.79931 + (1 - \alpha) \times 0.60114$
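The two component values can be checked numerically from the vectors given above. This sketch assumes the semantic component is the cosine similarity of the sem vectors, as the description states, and that the order component follows the form 1 - ||r1 - r2|| / ||r1 + r2|| from Li et al. (2006); the function names and the zero-vector fallbacks are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: return 0 when either vector is all zeros.
    return dot / norm if norm else 0.0


def order_similarity(r1, r2):
    """1 - ||r1 - r2|| / ||r1 + r2|| over two position vectors."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    # Assumption: return 0 when both vectors are all zeros.
    return 1.0 - diff / total if total else 0.0


sem1, sem2 = [1, 1, 2, 1, 2, 1, 0], [1, 0.2, 1, 0, 2, 2, 1]
order1, order2 = [1, 2, 3, 4, 5, 3, 0], [1, 2, 2, 0, 4, 2, 3]
sim_sem = cosine_similarity(sem1, sem2)
sim_order = order_similarity(order1, order2)
print(round(sim_sem, 5), round(sim_order, 5))
```

Both results agree with the figures 0.79931 and 0.60114 to five decimal places, which supports the assumed formulas.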

Assuming α = 0.6 (other values are possible) and a threshold τ of, for example, 0.7 (which can be customised as needed), $Sim(S_{11}, S_1) = 0.6 \times 0.79931 + 0.4 \times 0.60114 = 0.720042 > \tau$ (0.7). The user document is therefore judged to have plagiarized the source document.

In summary, the plagiarism detection method for English materials according to the invention includes a syntax processing procedure, a phrase recognition procedure, and a similarity comparison procedure. The source document and the user document are each parsed; the source word data is compared with the user word data, and the source phrase data with the user phrase data; a similarity comparison is then performed in terms of semantics, from which the degree of plagiarism of the user document relative to the source document is calculated. Compared with the prior art, the invention therefore not only uses syntactic structure analysis to extract the words and phrases in a sentence, but also takes the phrase as the unit of comparison for plagiarism detection, supplemented by semantics, and so improves on the accuracy of conventional techniques that compare single words.

The foregoing is illustrative only and not restrictive. Any equivalent modification or alteration that does not depart from the spirit and scope of the invention shall fall within the scope of the appended claims.

S01‧‧‧Syntax processing procedure

S02‧‧‧Phrase recognition procedure

S03‧‧‧Similarity comparison procedure

Claims (16)

1. A plagiarism detection method for English materials, implemented on a computer, the plagiarism detection method comprising: a syntax processing procedure, including a parsing step that parses a source document and a user document respectively to obtain source word data and source phrase data, and user word data and user phrase data, respectively; a phrase recognition procedure, including a set generation step that compares the source word data with the user word data and compares the source phrase data with the user phrase data to generate aggregate word data, and compares the source phrase data with the user phrase data to generate aggregate phrase data, wherein the words in the aggregate word data include all words of the source document and the user document after stemming, and the phrases in the aggregate phrase data include the phrases that are the same in the source document and the user document; and a similarity comparison procedure, which performs a similarity comparison in terms of semantics and order according to the aggregate word data and the aggregate phrase data, so as to calculate the degree of plagiarism of the user document relative to the source document by analyzing the semantic similarity and the order similarity between the aggregate word data and the aggregate phrase data, on the one hand, and the source document and the user document, on the other.

2. The plagiarism detection method for English materials according to claim 1, wherein the source word data or the user word data each include at least one word, and the source phrase data or the user phrase data each include at least one phrase.

3. The plagiarism detection method for English materials according to claim 1, wherein the parsing step uses a natural language processing technique.

4. The plagiarism detection method for English materials according to claim 1, wherein the syntax processing procedure further includes a sentence segmentation step, a complete plagiarism detection step, and a stemming step.

5. The plagiarism detection method for English materials according to claim 1, wherein in the parsing step the source document and the user document are compared against a dictionary to perform part-of-speech tagging and phrase extraction.

6. The plagiarism detection method for English materials according to claim 5, wherein the words in the aggregate word data correspond to words in the dictionary.

7. The plagiarism detection method for English materials according to claim 1, wherein in the phrase recognition procedure the phrase is used as the unit of comparison.

8. The plagiarism detection method for English materials according to claim 1, wherein in the calculation of the semantic similarity and the order similarity the phrase is used as the unit of comparison.

9. The plagiarism detection method for English materials according to claim 7, wherein in the similarity comparison procedure the word-to-word, word-to-phrase, and phrase-to-phrase semantic similarities within the aggregate word data and the aggregate phrase data are calculated separately.

10. The plagiarism detection method for English materials according to claim 9, wherein the word-to-word semantic similarity is calculated separately according to the part of speech of the words.

11. The plagiarism detection method for English materials according to claim 9, wherein in calculating the word-to-word semantic similarity all words in the aggregate word data are compared with one another to find a maximum word-to-word similarity.

12. The plagiarism detection method for English materials according to claim 9, wherein the word-to-phrase or phrase-to-phrase semantic similarity is in each case calculated on a word basis.

13. The plagiarism detection method for English materials according to claim 11, wherein in calculating the word-to-phrase semantic similarity each word in the aggregate word data is compared with each word in the aggregate phrase data to find a maximum word-to-phrase similarity.

14. The plagiarism detection method for English materials according to claim 13, wherein in calculating the phrase-to-phrase semantic similarity all words in the aggregate phrase data are compared with one another to find a maximum phrase-to-phrase similarity.

15. The plagiarism detection method for English materials according to claim 14, wherein the semantic similarity or the order similarity is calculated according to the maximum word-to-word similarity, the maximum word-to-phrase similarity, and the maximum phrase-to-phrase similarity, respectively.

16. The plagiarism detection method for English materials according to claim 1, wherein when the degree of plagiarism is higher than a threshold, the user document is determined to involve plagiarism.
TW102102093A 2013-01-18 2013-01-18 Plagiarism detecting method of information in english TWI594135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW102102093A TWI594135B (en) 2013-01-18 2013-01-18 Plagiarism detecting method of information in english

Publications (2)

Publication Number Publication Date
TW201430591A TW201430591A (en) 2014-08-01
TWI594135B true TWI594135B (en) 2017-08-01

Family

ID=51796893

Family Applications (1)

Application Number Title Priority Date Filing Date
TW102102093A TWI594135B (en) 2013-01-18 2013-01-18 Plagiarism detecting method of information in english

Country Status (1)

Country Link
TW (1) TWI594135B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI719537B (en) * 2019-07-16 2021-02-21 國立清華大學 Text comparison method, system and computer program product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200519638A (en) * 2003-12-11 2005-06-16 Inst Information Industry Method for feature extraction and data decoding and method and system for searching piratic articles
TWI368144B (en) * 2008-04-11 2012-07-11 Univ Hong Kong Chinese Systems and methods for checking similarity of files

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees