TW202314581A

TW202314581A - Method and system of screening for text data relevance

Info

Publication number: TW202314581A
Application number: TW110135727A
Authority: TW
Inventors: 邱方孝
Original assignee: 飛資得資訊股份有限公司
Priority date: 2021-09-23
Filing date: 2021-09-23
Publication date: 2023-04-01
Also published as: TWI813028B

Abstract

A method and system of screening for text data relevance perform word segmentation, screening, relevance processing, and integration on multiple reference texts (such as theses) and build a relevance index file based on adjacent screened words. This makes it possible to quickly compile brief information about the reference texts and, using the relevance index file, to further analyze the originality of a text to be compared.

Description

Method and system for filtering and linking text data

The present invention relates to a method and system for screening and associating text data, and in particular to a method and system that, based on adjacent screened words, can quickly organize and analyze text data and can analyze the originality of a text to be compared against reference texts.

In recent years, incidents of thesis plagiarism have emerged one after another, and the public has begun to doubt the originality of academic papers. Many plagiarism-detection systems for papers and articles are already on the market, but most of them run their comparisons from a position of suspicion toward the copyright holders who publish research papers, which is unfair to those copyright holders. In addition, some institutions even require copyright holders to submit plagiarism-comparison results first, and to show that the similarity is below a certain percentage, before the paper may be published. Copyright holders must therefore use this procedure to prove that their documents do not plagiarize others, an approach that treats them with distrust and is highly inappropriate. The inventor believes the problem should be approached from the opposite direction: provide copyright holders with a tool that positively measures the originality of a paper as a reference for its publication, and let publishing institutions set an originality ratio as a reference for managing paper quality.

Regarding plagiarism-comparison systems: in academic research the problem of thesis plagiarism has grown more serious in recent years, and as the issue has continued to draw attention, plagiarism detection has become increasingly important. Plagiarism mainly falls into the following categories:
1. Copy/paste/clone plagiarism: copying and pasting, or copying fragments, without any modification.
2. Paraphrasing plagiarism: copying a passage while switching vocabulary or rewriting the sentence structure or grammatical style.
3. Metaphor plagiarism: presenting someone else's ideas in a clearer, better way.
4. Idea plagiarism: borrowing an idea or solution from another source and presenting it as one's own research.
5. Self/recycled plagiarism: republishing one's own previously published article as a new research result.
6. Citation plagiarism: citing the proper source, but describing it in words and sentences, or even structure and grammar, that closely resemble the original.

Among these categories, unmodified copy-and-paste or fragment plagiarism and paraphrasing receive the most attention. Both can be plainly identified by comparing the paper with the plagiarized source, which is why they draw the most criticism.

Scholars were already studying copy detection in digital documents in 1995. With advances in natural language processing and hardware, many new methods have appeared in recent years. In the field of plagiarism detection they fall into several main groups:

1. Character-based methods: This is the most widely used approach to plagiarism detection in papers. The paper to be checked is compared against an existing paper database, and matching strings are used to determine the proportion of plagiarism, so the system can tell its user which paragraphs and sentences were plagiarized. Shrestha and Solorio (2013) built n-grams from stop words, named entities, and all words, and detected plagiarism by checking whether any article in the text database had an excessively high n-gram match with the paper being examined. Nguyen et al. (2016) proposed detecting plagiarism in Vietnamese articles using substring n-grams. Methods of this type have three drawbacks: (1) when the paper contains text that is absent from the paper database, no similar sentences can be matched and the plagiarism goes undetected; (2) a user can evade detection by changing words or swapping word order, so similar phrases are not found; (3) because the method compares strings, an overly long input string tends to dilute the input paper and lowers the measured plagiarism similarity.

2. Vector-based methods: This approach extracts lexical and syntactic features and represents them as vectors rather than strings. Similarity between papers is usually measured with the Jaccard coefficient, Dice coefficient, overlap coefficient, or cosine similarity. Mahdavi et al. used a vector space model to detect plagiarism in Persian articles, converting the articles to TF-IDF representations and comparing their similarity. Jiffriya et al. (2013) converted articles into vectors, clustered them with the K-means algorithm, and then performed tri-gram-based plagiarism detection on the clustered articles. The drawback is that the importance of a word in an article is measured by term frequency: an important word may not occur often enough, which leads to poor comparison results, and the calculation reflects neither positional information nor the importance of a word in its context.

3. Syntax-based methods: This approach detects plagiarism using syntactic features such as parts of speech, sentence dependency trees, and words in different statements, using parts of speech to represent word structure and compute similarity. It can find passages with similar sentence structure, but it cannot find plagiarism that paraphrases a passage, substitutes vocabulary, or transforms the sentence structure. Syntax-based methods have several drawbacks: (1) Chinese grammar is considerably more complex than English grammar, so detecting plagiarism in a Chinese system by grammar gives very poor comparison results; (2) detecting plagiarized content through syntactic features finds text that shares those features without being plagiarized, so text that is merely syntactically identical is misjudged.

4. Semantic-based methods: This approach has the system understand the semantics of a paragraph and converts articles into vectors. It can detect reordering and active/passive changes, but it cannot locate the plagiarized paragraphs and sentences. Torres (2009) proposed building a dictionary to assist plagiarism detection, and Resnik (1999) used external resources to support semantics-based detection. Semantic detection finds papers with similar meaning but cannot identify the plagiarized passages and vocabulary, so the plagiarism cannot be verified.

In view of this, the inventor, through careful thought, active research, many years of experience in researching related products, and continual testing and improvement, finally developed the present invention.

An object of the present invention is to provide a method for screening and associating text data that can quickly compile brief information about the text data.

The method by which the present invention achieves this object comprises the following steps: S11. performing word segmentation on a piece of text data, based on a word-segmentation vocabulary, to generate segmentation information; S12. screening the segmentation information to generate screened segmentation information, which contains two or more screened words; S13. performing relevance processing on the screened segmentation information to generate a plurality of relevance sequences, each composed of two or more adjacent screened words.

Preferably, a step S110 may be performed before step S11. Step S110: collect the author-defined keywords in the text data to build a professional keyword vocabulary, and import that vocabulary into the word-segmentation vocabulary, so as to obtain relevance sequences that are closer to the original meaning of the text data.

Preferably, in step S12, synonym processing may be performed after the screening and before the subsequent steps. Synonym processing checks the screened words for synonymy and converts synonymous characters and words into a standard form.
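The synonym processing described for step S12 can be pictured as a table lookup applied to the screened words. The following is only a minimal sketch under stated assumptions: the table `SYNONYMS`, its entries, and the function name `normalize_synonyms` are hypothetical and do not come from the source.

```python
# Hypothetical synonym table mapping each variant to a standard form.
# The entries are illustrative only; the source does not list its table.
SYNONYMS = {
    "automobile": "car",
    "auto": "car",
}


def normalize_synonyms(screened_words, synonyms=SYNONYMS):
    """Replace each screened word by its standard form, if one is defined."""
    return [synonyms.get(word, word) for word in screened_words]


if __name__ == "__main__":
    print(normalize_synonyms(["auto", "engine", "automobile"]))
```

Normalizing before the relevance step means that two texts using different variants of the same term would still yield identical sequences.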

A further object of the present invention is to provide a system for screening and associating text data that can quickly compile brief information about the text data.

The structure by which the present invention achieves this object comprises: a storage module for storing a word-segmentation vocabulary; a word segmentation module for performing word segmentation on a piece of text data to generate segmentation information; a screening module for screening the segmentation information to generate screened segmentation information; and a relevance processing module for performing relevance processing on the screened segmentation information to generate a plurality of relevance sequences.

Yet another object of the present invention is to provide a method that can quickly compile brief information about multiple reference texts, and

data; step S28: establish an intersection sequence by sorting all reference relevance sequences that are identical to the relevance sequences to be compared; step S29: analyze each reference text that shares a relevance sequence with the text to be compared, so as to analyze the originality of the text to be compared.

Preferably, in step S23, synonym processing may be performed after the screening and before the subsequent steps, which improves the relevance comparison.

A further object of the present invention is to provide a system for screening and associating text data that can quickly compile brief information about multiple reference texts and integrate it, making it easier to analyze the originality of a text to be compared.

The structure by which the present invention achieves this object comprises: a storage module for storing a word-segmentation vocabulary and a reference collection; a word segmentation module for performing word segmentation on each reference text in the reference collection to generate reference segmentation information for each; a screening module for screening the reference segmentation information to generate reference screened segmentation information for each; a relevance processing module for performing relevance processing on the reference screened segmentation information to generate a plurality of reference relevance sequences for each; and an integration module for integrating all of the reference relevance sequences to build a relevance index file.

Preferably, the word segmentation module, screening module, and relevance processing module also perform word segmentation, screening, and relevance processing on a text to be compared, generating a plurality of relevance sequences to be compared, and the system further comprises: a comparison module that compares each of these relevance sequences with the relevance index file to find every reference text having a reference relevance sequence identical to one of them; an intersection module that sorts all reference relevance sequences identical to the relevance sequences to be compared, thereby building an intersection sequence; and an analysis module that analyzes each reference text sharing a relevance sequence with the text to be compared.

To achieve the above and other objects, the technical means, elements, and effects adopted by the present invention are described below through a preferred embodiment with reference to the drawings.

100, 100a: system for screening and associating text data

1, 1a: storage module

2, 2a: word segmentation module

3, 3a: screening module

4, 4a: relevance processing module

5a: integration module

6a: comparison module

7a: intersection module

8a: analysis module

9, 9a: word segmentation system

[Fig. 1] is a flowchart of the method for screening and associating text data according to the first embodiment of the present invention.

[Fig. 2] is a block diagram of one specific embodiment of a system that can automatically execute the method of the first embodiment.

[Fig. 3] is a flowchart of the method for screening and associating text data according to the second embodiment of the present invention.

[Fig. 4] is a block diagram of one specific embodiment of a system that can automatically execute the method of the second embodiment.

Figs. 1 and 2 show the first embodiment of the present invention. As shown in Fig. 1, the method for screening and associating text data comprises the following steps: S11. performing word segmentation on a piece of text data, based on a word-segmentation vocabulary, to generate segmentation information; S12. screening the segmentation information to generate screened segmentation information, which contains two or more screened words; S13. performing relevance processing on the screened segmentation information to generate a plurality of relevance sequences, each composed of two or more adjacent screened words. With this method, brief information about the text data can be compiled quickly. The steps are described in detail below.
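Steps S12 and S13 can be illustrated with a short sketch that starts from an already segmented list of words. It rests on assumptions not stated in the source: screening is modeled here as dropping words found in a hypothetical `drop` set, and each sequence is taken to be exactly two adjacent screened words (the source requires two or more). The function names are illustrative only.

```python
def screen(tokens, drop):
    """S12 (assumed rule): keep only the tokens that are not in `drop`."""
    return [t for t in tokens if t not in drop]


def relevance_sequences(screened, n=2):
    """S13: every run of n adjacent screened words, in document order."""
    return [tuple(screened[i:i + n]) for i in range(len(screened) - n + 1)]


if __name__ == "__main__":
    tokens = ["the", "system", "screens", "the", "text", "data"]
    screened = screen(tokens, drop={"the"})
    for seq in relevance_sequences(screened):
        print(seq)
```

Adjacency is evaluated after screening, so words that were separated by a discarded word in the original text become neighbors in the resulting sequences.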

Step S11 performs word segmentation on a piece of text data, based on a word-segmentation vocabulary, to generate segmentation information.
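The source does not name the matching algorithm used in step S11. As a minimal sketch, assuming forward maximum matching (a common dictionary-based technique, substituted here purely for illustration), vocabulary-driven segmentation could look like the following. The function name `segment` and the sample vocabulary are hypothetical.

```python
def segment(text, vocabulary):
    """Greedy forward maximum matching against a word vocabulary.

    At each position the longest vocabulary word is taken; a character
    that starts no vocabulary word is emitted as a single-character token.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    vocab = {"文字", "資料", "篩選", "文字資料"}
    print(segment("文字資料篩選", vocab))
```

Because the longest entry wins, adding a multi-character term to the vocabulary changes how the surrounding text is split, which is why the content of the vocabulary matters to every later step.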

The text data may be any published text, such as doctoral and master's theses, academic papers, general articles, or sentences. For long texts such as theses, the thesis may be treated directly as one piece of text data, or it may be divided into segments to form multiple pieces of text data. There are many ways to segment; examples follow. Segmentation may be based on symbols such as line breaks, consecutive spaces, the exclamation mark (！), the semicolon (：), the tilde (~), the question mark (？), the comma (，), and the period (。), dividing one piece of text data into several at those symbols, provided each piece is no shorter than a suitable length. Segmentation may also follow the chapters and sections of the text. It may further be combined with the word-segmentation vocabulary, for example by treating a predetermined number of screened words (ten, twenty, and so on) as one segment.
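The symbol-based segmentation described above can be sketched as follows. This is an illustration under assumptions: the delimiter set copies the symbols listed in the paragraph, the minimum length `min_len` is an arbitrary placeholder for the "suitable length", and the function name `split_text` is hypothetical.

```python
# Delimiters taken from the symbols listed in the text; two consecutive
# spaces are also treated as a break point.
DELIMITERS = set("\n！：~？，。")


def split_text(text, min_len=10):
    """Cut `text` at a delimiter once the current piece has at least min_len characters."""
    parts, buf = [], ""
    for ch in text:
        buf += ch
        at_break = ch in DELIMITERS or buf.endswith("  ")
        if at_break and len(buf.strip()) >= min_len:
            parts.append(buf.strip())
            buf = ""
    if buf.strip():
        parts.append(buf.strip())  # trailing remainder, possibly shorter than min_len
    return parts


if __name__ == "__main__":
    for part in split_text("短句，這是一個比較長的句子。結尾", min_len=5):
        print(part)
```

In this sketch a delimiter that arrives before the minimum length is reached is simply skipped, so short clauses are merged into the following piece.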

Word segmentation converts the text data into segmentation information according to the words recorded in the word-segmentation vocabulary. The words in the vocabulary can be classified by part of speech, for example common noun (Na), foreign word (FW), active transitive verb (VC), active intransitive verb (VA), place word (Nc), proper noun (Nb), stative causative verb (VHC), colon

names of techniques, and so on. Importing these author-defined keywords into the word-segmentation vocabulary before performing word segmentation and the subsequent steps yields relevance sequences that are closer to the original meaning of the text data.

Fig. 2 shows one specific embodiment of a system for screening and associating text data that can automatically execute the method of the first embodiment. As shown in Fig. 2, the present invention provides a system 100 for screening and associating text data, comprising: a storage module 1 for storing a word-segmentation vocabulary; a word segmentation module 2 for performing word segmentation on a piece of text data to generate segmentation information; a screening module 3 for screening the segmentation information to generate screened segmentation information; and a relevance processing module 4 for performing relevance processing on the screened segmentation information to generate a plurality of relevance sequences. The storage module 1, word segmentation module 2, screening module 3, and relevance processing module 4 may be set up on one or more computers and/or cloud servers. When the system 100 is set up on a cloud server, a corresponding web page may be provided; after entering text data, the user obtains a plurality of relevance sequences (not shown in the figures).

Figs. 3 and 4 show the second embodiment of the present invention. As shown in Figs. 3 and 4, the method for screening and associating text data comprises the following steps: S21. building a reference collection from two or more reference texts; S22. performing word segmentation on the reference texts, based on a word-segmentation vocabulary, to generate reference segmentation information for each; S23. screening the reference segmentation information to generate reference screened segmentation information for each, each containing two or more screened words; S24. performing relevance processing on the reference screened segmentation information to generate a plurality of reference relevance sequences for each, each sequence composed of two or more adjacent screened words; S25. integrating all of the reference relevance sequences to build a relevance index file. With this method, brief information about multiple reference texts can be compiled quickly and integrated, making it easier to analyze the originality of a text to be compared.

Step S21 builds a reference collection from two or more reference texts. The reference collection may contain all kinds of text data, for example some or all of the theses in the National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文知識加值系統). In addition, separate reference collections may be built for different scopes, such as electronics, mechanical engineering, or texts from the last 10 years. The reference texts and the text to be compared in the second embodiment are the same kind of text data as in the first embodiment: any published text, such as doctoral and master's theses, academic papers, general articles, or sentences. The only difference is that in the second embodiment the text to be compared must be compared and analyzed against each reference text in turn, so different names are used to tell them apart.

Steps S22 to S24 perform word segmentation, screening, and relevance processing on each reference text in the reference collection, producing for each the reference segmentation information, the reference screened segmentation information, and a plurality of reference relevance sequences.

Step S25 integrates all of the reference relevance sequences to build a relevance index file. The integrated relevance index file can be conveniently compared with a text to be compared, which in turn makes it easier to analyze the originality of that text.

As shown in Fig. 3, a step S220 may be performed before step S22. Step S220: collect some or all of the author-defined keywords in the reference texts and the text to be compared to build a professional keyword vocabulary, and import it into the word-segmentation vocabulary, so as to obtain a relevance index file that is closer to the original meaning of the text data. Duplicate removal may be added when compiling the professional keyword vocabulary to improve processing efficiency.
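Step S220 amounts to collecting keywords, removing duplicates, and merging the result into the segmentation vocabulary. The sketch below assumes each text is represented as a dictionary with a hypothetical `"keywords"` field; that layout and both function names are illustrative and not taken from the source.

```python
def build_keyword_vocabulary(documents):
    """Collect author-defined keywords in first-seen order, dropping duplicates."""
    seen = {}
    for doc in documents:
        for keyword in doc.get("keywords", []):
            keyword = keyword.strip()
            if keyword:
                seen.setdefault(keyword, None)
    return list(seen)


def merge_into_vocabulary(vocabulary, keywords):
    """Return the segmentation vocabulary extended with the keyword vocabulary."""
    return set(vocabulary) | set(keywords)


if __name__ == "__main__":
    docs = [
        {"keywords": ["plagiarism detection", "n-gram"]},
        {"keywords": ["n-gram ", " inverted index"]},
    ]
    keywords = build_keyword_vocabulary(docs)
    print(keywords)
    print(sorted(merge_into_vocabulary({"text", "data"}, keywords)))
```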

The second embodiment of the present invention can quickly compile brief information about the reference texts and further integrate it, making it convenient for users to run a comparative analysis with a text to be compared, for example analyzing the originality of that text through steps S26 to S29 below.

Step S26 performs word segmentation, screening, and relevance processing on a text to be compared to generate a plurality of relevance sequences to be compared. Steps S22 to S24 and step S26 use the same processing as steps S11 to S13, so the reference relevance sequences and the relevance sequences to be compared have corresponding forms and are easy to compare. In addition, in S12, S23, and/or S26, synonym processing may be performed after the screening and before the subsequent steps. Synonym processing checks the screened words for synonymy and converts some or all synonymous characters and words (except certain special words that are unsuitable for synonym processing) into a standard form, which improves the relevance comparison. For example, changing both "冷氣" and "空調" to "

Step S22 performs word segmentation.

Step S23 performs screening. The screened words can be numbered in order; for example, the first retained screened word of ID1 is denoted ID1tp1.

Step S24 performs relevance processing. The reference relevance sequences can be numbered in order; for example, the first reference relevance sequence of ID1 is denoted ID1S1.

Step S25 builds the relevance index file. Each piece of reference screened segmentation information can be regarded as an index (also called Index or Key) of the relevance index file, and its number can serve as the data (Data) of the relevance index file. When the relevance index file is built, any piece of reference screened segmentation information may be identical to another (for example, ID1S2 and ID2S1). One index may therefore correspond to many different data entries; the number of entries is large, and the total stored data length grows as more reference texts are added.
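The key/data layout of step S25 resembles an inverted index. The sketch below assumes that the key is a relevance sequence (as the example numbers ID1S2 and ID2S1 and the lookup in step S27 suggest) and that the data is the list of sequence numbers in the ID…S… style of step S24. The in-memory dictionary and the function name `build_relevance_index` are illustrative assumptions; the source does not specify a storage format.

```python
from collections import defaultdict


def build_relevance_index(reference_sequences):
    """Map each relevance sequence (key) to the numbers of all its occurrences (data).

    reference_sequences: {"ID1": [sequence, ...], "ID2": [...], ...}
    """
    index = defaultdict(list)
    for doc_id, sequences in reference_sequences.items():
        for k, sequence in enumerate(sequences, start=1):
            index[sequence].append(f"{doc_id}S{k}")
    return dict(index)


if __name__ == "__main__":
    references = {
        "ID1": [("a", "b"), ("b", "c")],
        "ID2": [("b", "c"), ("c", "d")],
    }
    for key, data in build_relevance_index(references).items():
        print(key, data)
```

Here the sequence ("b", "c") is both the second sequence of ID1 and the first of ID2, so a single key holds the two entries ID1S2 and ID2S1.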

步驟S26對待比對文字資料進行斷詞…等處理，可將待比對文字資料記為IDx。 In step S26, word segmentation, etc. are performed on the text data to be compared, and the text data to be compared can be recorded as IDx.

Step S27: using the correlation sequence information to be compared as the index, search the correlation index file and read all data having the same index.

Step S28 establishes an intersection sequence: all reference correlation sequence information identical to the correlation sequence information to be compared is arranged in order (that is, sorted).
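Steps S27 and S28 might look like the following sketch. It assumes an in-memory index whose keys are correlation sequences and whose data are labels of the form ID1S2; the function names `lookup` and `intersection_sequence` and the sort order (by document, then by sequence number) are illustrative choices, not details given in the description.

```python
def lookup(index, query_sequences):
    """S27: read every index entry whose key equals a query sequence."""
    hits = []
    for seq in query_sequences:
        hits.extend(index.get(seq, []))
    return hits


def intersection_sequence(hits):
    """S28: sort the matched labels by document, then by sequence number."""
    def sort_key(label):
        doc_id, _, number = label.rpartition("S")
        return (doc_id, int(number))
    return sorted(hits, key=sort_key)


# Hypothetical index contents and query (the sequences of document IDx).
index = {
    ("text", "data"): ["ID1S1"],
    ("data", "screening"): ["ID1S2", "ID2S1"],
    ("screening", "method"): ["ID2S2"],
}
query_sequences = [("data", "screening"), ("text", "data"), ("new", "words")]
hits = lookup(index, query_sequences)
intersection = intersection_sequence(hits)
print(intersection)
```

Sorting groups the matches per reference document, which makes it straightforward to count how many sequences the text to be compared shares with each one.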

Step S29 analyzes each reference text data that shares correlation sequence information with the text data to be compared, producing an originality analysis result of the text data to be compared relative to each reference text data. Many comparison methods are available. For example, statistical analysis can estimate a similarity reference ratio between the text data to be compared and each reference text data in the reference set information, using conventional measures such as the Dice coefficient. A simpler, more general method may also be used for a quick analysis.
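For reference, the Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). The sketch below applies it to sets of correlation sequences. Treating each document as a set of sequences and reporting originality as one minus the similarity are assumptions made for illustration; the description names the Dice coefficient only as one possible measure.

```python
def dice_coefficient(a, b):
    """Dice coefficient of two collections, compared as sets: 2|A∩B| / (|A| + |B|)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty inputs
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical correlation sequences of the text to be compared and one reference.
query = [("text", "data"), ("data", "screening"),
         ("screening", "method"), ("method", "system")]
reference = [("data", "screening"), ("screening", "method"), ("index", "file")]

similarity = dice_coefficient(query, reference)  # 2 * 2 / (4 + 3)
originality = 1 - similarity  # assumed definition, not taken from the description
print(f"similarity={similarity:.3f}, originality={originality:.3f}")
```

Running the same calculation against every reference document returned by the index lookup would give one similarity ratio per reference.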

With the above method for screening and associating text data, the correlation between the text data to be compared and each reference text data can be analyzed quickly, and the originality of the text data to be compared can then be analyzed further.

FIG. 4 shows one specific embodiment of a system for screening and associating text data that can automatically execute the method of the second embodiment. As shown in FIG. 4, the present invention provides a system 100a for screening and associating text data, comprising: a storage module 1a for storing a word segmentation vocabulary database and a reference set information; a word segmentation processing module 2a for performing word segmentation processing on each reference text data of the reference set information to generate a respective reference word segmentation information; a screening processing module 3a for performing screening processing on the reference word segmentation information to generate a respective reference screened word segmentation information; a correlation processing module 4a for performing correlation processing on the reference screened word segmentation information to generate a respective plurality of reference correlation sequence information; and an integration module 5a for integrating all the reference correlation sequence information to create a correlation index file.

In addition, the word segmentation processing module 2a, the screening processing module 3a, and the correlation processing module 4a may further perform word segmentation processing, screening processing, and correlation processing on a text data to be compared to generate a plurality of correlation sequence information to be compared, and the system 100a further comprises: a comparison module 6a, which compares each correlation sequence information to be compared against the correlation index file to find each reference text data whose reference correlation sequence information is identical to the correlation sequence information to be compared; an intersection module 7a, which arranges in order all reference correlation sequence information identical to the correlation sequence information to be compared, thereby establishing an intersection sequence; and an analysis module 8a, which analyzes each reference text data that shares correlation sequence information with the text data to be compared.

The storage module 1a, word segmentation processing module 2a, screening processing module 3a, correlation processing module 4a, integration module 5a, comparison module 6a, intersection module 7a, and analysis module 8a may be built on one or more computers and/or cloud servers. When the system 100a is built on a cloud server, a corresponding web page may be provided so that a user obtains the originality analysis result after entering the text data to be compared (not shown in the figures).

In addition, the parts related to word segmentation processing described above, such as steps S11 and S22 and the word segmentation vocabulary database, may use conventional word segmentation systems 9, 9a, such as CKIP developed by Academia Sinica in Taiwan or Jieba (結巴), whose source code is publicly available, thereby saving cost.

As mentioned above, the text data may be any published text data. For lengthy text data such as a thesis, the thesis may be treated directly as one text data, or it may be divided into segments to form multiple text data. The multiple text data formed by segmentation may additionally be linked to one another so that a consolidated originality analysis result can be produced. For example, a thesis is numbered IDa1, and its segments (for example, divided by chapter) are numbered IDa2 to IDan; that is, not only the thesis as a whole but also each of its segments (each chapter) is treated as one text data. After analysis, one obtains the originality analysis result of the text data to be compared relative to the thesis, and also relative to each segment (each chapter) of the thesis.
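The numbering scheme for a segmented thesis can be sketched as below. Only the IDa1 and IDa2 to IDan pattern comes from the description; the function name `register_segments`, the `parent` mapping used to link segments back to the whole thesis, and the sample chapter texts are hypothetical.

```python
def register_segments(prefix, full_text, chapters):
    """Label the whole thesis <prefix>1 and its chapters <prefix>2 .. <prefix>n."""
    documents = {f"{prefix}1": full_text}
    parent = {}
    for k, chapter in enumerate(chapters, start=2):
        segment_id = f"{prefix}{k}"
        documents[segment_id] = chapter
        parent[segment_id] = f"{prefix}1"  # link each segment back to the whole thesis
    return documents, parent


chapters = ["Chapter 1 text", "Chapter 2 text", "Chapter 3 text"]
documents, parent = register_segments("IDa", " ".join(chapters), chapters)
print(list(documents))
```

Each entry in `documents` can then pass through the same segmentation, screening, and correlation steps, so results are available per chapter as well as for the thesis as a whole.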

The above embodiments of the present invention are provided only for ease of description and shall not limit the meaning of the present invention; all variations and designs made in accordance with the listed claims shall fall within the patent scope of the present invention.

Claims

A method for screening and associating text data, comprising the following steps:

S11. performing word segmentation processing on a text data, based on a word segmentation vocabulary database, to generate a word segmentation information;

S12. performing screening processing on the word segmentation information to generate a screened word segmentation information, the screened word segmentation information having two or more screened words; and

S13. performing correlation processing on the screened word segmentation information to generate a plurality of correlation sequence information, each correlation sequence information consisting of two or more consecutive screened words.

The method for screening and associating text data of claim 1, wherein a step S110 may be performed before step S11; step S110 is: collecting the author-defined keywords in the text data to build a professional keyword vocabulary database, and importing the professional keyword vocabulary database into the word segmentation vocabulary database.

The method for screening and associating text data of claim 1, wherein in step S12, synonym processing may be performed after the screening processing and before the subsequent steps; the synonym processing is: checking the screened words produced by the screening processing for synonyms, and converting synonymous characters and synonymous words into standard text.

A system for screening and associating text data, comprising:

a storage module for storing a word segmentation vocabulary database;

a word segmentation processing module for performing word segmentation processing on a text data to generate a word segmentation information;

a screening processing module for performing screening processing on the word segmentation information to generate a screened word segmentation information; and

a correlation processing module for performing correlation processing on the screened word segmentation information to generate a plurality of correlation sequence information.

A method for screening and associating text data, comprising the following steps:

S21. creating a reference set information having two or more reference text data;

S22. performing word segmentation processing on each reference text data, based on a word segmentation vocabulary database, to generate a respective reference word segmentation information;

S23. performing screening processing on the reference word segmentation information to generate a respective reference screened word segmentation information, each reference screened word segmentation information having two or more screened words;

S24. performing correlation processing on the reference screened word segmentation information to generate a respective plurality of reference correlation sequence information, each reference correlation sequence information consisting of two or more adjacent screened words; and

S25. integrating all the reference correlation sequence information to create a correlation index file.

The method for screening and associating text data of claim 5, wherein a step S220 is performed before step S22; step S220 is: collecting the author-defined keywords of some or all of the reference text data and of the text data to be compared to build a professional keyword vocabulary database, and importing the professional keyword vocabulary database into the word segmentation vocabulary database.

The method for screening and associating text data of claim 5, wherein steps S26 to S29 are performed after step S25; step S26 is: performing word segmentation processing, screening processing, and correlation processing on a text data to be compared to generate a plurality of correlation sequence information to be compared; step S27 is: comparing each correlation sequence information to be compared against the correlation index file to find each reference text data whose reference correlation sequence information is identical to the correlation sequence information to be compared; step S28 is: establishing an intersection sequence by arranging in order all reference correlation sequence information identical to the correlation sequence information to be compared; step S29 is: analyzing each reference text data that shares correlation sequence information with the text data to be compared.

The method for screening and associating text data of claim 5, wherein in step S23, synonym processing may be performed after the screening processing and before the subsequent steps.

A system for screening and associating text data, comprising:

a storage module for storing a word segmentation vocabulary database and a reference set information;

a word segmentation processing module for performing word segmentation processing on each reference text data of the reference set information to generate a respective reference word segmentation information;

a screening processing module for performing screening processing on the reference word segmentation information to generate a respective reference screened word segmentation information;

a correlation processing module for performing correlation processing on the reference screened word segmentation information to generate a respective plurality of reference correlation sequence information; and

an integration module for integrating all the reference correlation sequence information to create a correlation index file.

The system for screening and associating text data of claim 9, wherein the word segmentation processing module, the screening processing module, and the correlation processing module perform word segmentation processing, screening processing, and correlation processing on a text data to be compared to generate a plurality of correlation sequence information to be compared, and the system further comprises: a comparison module, which compares each correlation sequence information to be compared against the correlation index file to find each reference text data whose reference correlation sequence information is identical to the correlation sequence information to be compared; an intersection module, which arranges in order all reference correlation sequence information identical to the correlation sequence information to be compared, thereby establishing an intersection sequence; and an analysis module, which analyzes each reference text data that shares correlation sequence information with the text data to be compared.