TWM623980U - System of screening for text data relevance - Google Patents


Info

Publication number
TWM623980U
Authority
TW
Taiwan
Prior art keywords
text data
information
screening
comparison
word segmentation
Prior art date
Application number
TW110211297U
Other languages
Chinese (zh)
Inventor
邱方孝
Original Assignee
飛資得資訊股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 飛資得資訊股份有限公司
Priority to TW110211297U
Publication of TWM623980U

Abstract

A screening and association system for text data. After word segmentation, screening, association processing and integration are performed on multiple reference text data (for example, papers), the system forms an association index file based on adjacent screened words, so that brief information about the reference text data can be compiled quickly. The association index file can further be used to analyze the originality of a piece of text data to be compared.

Description

Screening and Association System for Text Data

The present creation relates to a screening and association system for text data, and in particular to one that works on the basis of adjacent screened words, can quickly organize and analyze text data, and can analyze the originality of text data to be compared against reference text data.

In recent years, cases of thesis plagiarism have emerged one after another, and the public has begun to doubt the originality of papers. Many systems for detecting plagiarism in papers and articles are already on the market, but most of them run their comparison from a position of suspicion toward the copyright holder who publishes a research paper, which is unfair to the copyright holder. In addition, some institutions even require the copyright holder to submit a plagiarism comparison result first, and allow the paper to be published only if the degree of similarity is below a certain percentage. The copyright holder therefore has to use this method to prove that the document does not plagiarize others. Such a practice treats the copyright holder with distrust and is highly inappropriate. The creator believes the problem should be approached from the opposite direction: give copyright holders a positive tool that measures the originality of their papers as a reference for publication. The publishing institution can also set an originality ratio as a reference for managing the quality of papers.

Regarding plagiarism comparison systems: in academic research, the problem of plagiarized papers has become increasingly serious in recent years, and as the issue continues to draw attention, plagiarism detection is being taken more and more seriously. Plagiarism falls mainly into the following types:
1. Copy/paste/clone plagiarism: copying and pasting text, or fragments of it, without any modification.
2. Paraphrasing plagiarism: copying a paragraph while switching vocabulary or rewriting the sentence structure or grammatical style.
3. Metaphor plagiarism: presenting someone else's idea in a clearer or better way of expression.
4. Idea plagiarism: borrowing an idea or solution from another source and presenting it as one's own research paper.
5. Self/recycled plagiarism: republishing one's own previously published article as a new research result.
6. Citation plagiarism: citing the appropriate source, but describing it in words and sentences, or even structure and grammar, that closely resemble the original content.

Among these types, copy/paste/clone plagiarism and paraphrasing plagiarism attract the most attention. Both can be plainly seen by comparing the paper with the plagiarized source material, which is why these two are the most widely criticized.

Scholars were studying this problem as early as 1995, in a paper on copy detection in digital documents. With the evolution of natural language processing and hardware, many new methods have been introduced in recent years. In the field of plagiarism detection they fall mainly into the following categories:

1. Character-based methods. This is the most widely used approach to plagiarism detection in papers. The paper to be compared is checked against an existing database of papers, and matching strings are located to determine what proportion of the paper is plagiarized, so the system can also tell the user which paragraphs and sentences are plagiarized. Shrestha and Solorio (2013) used n-grams of stop words, named entities and all words, and detected plagiarism by checking whether the paper under examination matched any article in the text database too closely at the n-gram level. Nguyen et al. (2016) proposed detecting plagiarism in Vietnamese articles using substring n-grams. Methods of this kind have three drawbacks. First, if the paper contains text that is absent from the paper database, no similar sentences are matched and the plagiarism goes undetected. Second, a user can evade detection by changing words or swapping word order, so that similar phrases are not found. Third, because strings are compared, an overly long input tends to dilute the matches, which lowers the measured similarity.

2. Vector-based methods. These methods extract lexical and syntactic features and represent them as vectors rather than strings. The similarity between papers is usually measured with the Jaccard coefficient, the Dice coefficient, the overlap coefficient or cosine similarity. Mahdavi et al. detected plagiarism in Persian articles with a vector space model, converting the articles to TF-IDF representations and comparing their similarity. Jiffriya et al. (2013) converted articles into vectors, clustered them with the K-means algorithm, and then performed trigram-based plagiarism detection within the clusters. The drawback is that the importance of a word is measured by its frequency: important words sometimes do not occur often enough, which degrades the comparison result, and the calculation cannot reflect positional information or the importance of a word in its context.

3. Syntax-based methods. These methods detect plagiarism using syntactic features such as parts of speech, sentence dependency trees and the use of words in different statements; parts of speech are used to represent the word structure and to compute similarity. They can find passages with similar sentence structure, but cannot find plagiarism that paraphrases paragraphs, substitutes vocabulary or restructures sentences. They have two drawbacks. First, Chinese grammar is far more complex than English grammar, so detecting plagiarism in Chinese papers by syntax gives very poor comparison results. Second, because detection relies on syntactic features, it can flag text that has the same syntax but is not plagiarized, which leads to misjudgment.

4. Semantic-based methods. These methods let the system understand the meaning of paragraphs and convert the article into vectors. They can detect reordering and switches between active and passive voice, but cannot locate the plagiarized paragraphs and sentences. Torres (2009) proposed building a dictionary to assist plagiarism detection, and Resnik (1999) used external resources to help detect plagiarism semantically. Semantic approaches find papers with similar meaning, but because they cannot identify the plagiarized passages and words, the plagiarism cannot be verified.
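For reference, the four similarity measures named under the vector-based methods can be written down in a few lines. This is only an illustrative sketch of the standard textbook definitions, not code from any of the cited systems; the function names and the choice of sets and `Counter` objects as inputs are assumptions made here.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """2|A ∩ B| / (|A| + |B|)"""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

All four scores depend only on which terms occur and how often, which is the limitation the text points out: none of them sees word position or context.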

In view of this, the creator, after careful thought and active research, drawing on many years of experience in researching related products, and through continual testing and improvement, finally developed the present creation.

An object of the present creation is to provide a screening and association method for text data that can quickly compile brief information about text data.

The method by which the present creation achieves the above object includes the following steps:
S11. Perform word segmentation on a piece of text data, based on a word segmentation lexicon, to produce segmentation information.
S12. Screen the segmentation information to produce screened segmentation information, which contains two or more screened words.
S13. Perform association processing on the screened segmentation information to produce a plurality of association sequences (association sequence information), each consisting of two or more adjacent screened words.

Preferably, a step S110 is performed before step S11. Step S110: collect the author-defined keywords in the text data to build a specialized keyword lexicon, and import that lexicon into the word segmentation lexicon, so that the resulting association sequences are closer to the intended meaning of the text data.

Preferably, in step S12, synonym processing is performed after the screening and before the subsequent steps. Synonym processing checks the screened words for synonymous characters and words and converts them into a standard form.

Another object of the present creation is to provide a screening and association system for text data that can quickly compile brief information about text data.

The structure by which the present creation achieves the above object includes: a storage module for storing a word segmentation lexicon; a word segmentation module for performing word segmentation on a piece of text data to produce segmentation information; a screening module for screening the segmentation information to produce screened segmentation information; and an association processing module for performing association processing on the screened segmentation information to produce a plurality of association sequences.

A further object of the present creation is to provide a screening and association method for text data that can quickly compile brief information about multiple reference text data and integrate that information, making it easy to analyze the originality of text data to be compared.

The method by which the present creation achieves the above object includes the following steps:
S21. Build reference collection information from two or more reference text data.
S22. Perform word segmentation on each reference text data, based on a word segmentation lexicon, to produce reference segmentation information for each.
S23. Screen each reference segmentation information to produce reference screened segmentation information for each, every one of which contains two or more screened words.
S24. Perform association processing on each reference screened segmentation information to produce a plurality of reference association sequences for each, every sequence consisting of two or more adjacent screened words.
S25. Integrate all of the reference association sequences to build an association index file.

Preferably, a step S220 is performed before step S22. Step S220: collect some or all of the author-defined keywords in the reference text data and in the text data to be compared to build a specialized keyword lexicon, and import that lexicon into the word segmentation lexicon, so that the resulting association index file is closer to the intended meaning of the text data.

Preferably, steps S26 to S29 are performed after step S25.
Step S26: perform word segmentation, screening and association processing on a piece of text data to be compared, producing a plurality of to-be-compared association sequences.
Step S27: compare each to-be-compared association sequence against the association index file to find every reference text data that has a reference association sequence identical to a to-be-compared association sequence.
Step S28: establish an intersection sequence by arranging in order all reference association sequences that are identical to to-be-compared association sequences.
Step S29: analyze each reference text data that shares association sequences with the text data to be compared, thereby analyzing the originality of the text data to be compared.

Preferably, in step S23, synonym processing is performed after the screening and before the subsequent steps, which improves the effectiveness of the association comparison.

Yet another object of the present creation is to provide a screening and association system for text data that can quickly compile brief information about multiple reference text data and integrate that information, making it easy to analyze the originality of text data to be compared.

The structure by which the present creation achieves the above object includes: a storage module for storing a word segmentation lexicon and reference collection information; a word segmentation module for performing word segmentation on each reference text data in the reference collection information to produce reference segmentation information for each; a screening module for screening each reference segmentation information to produce reference screened segmentation information for each; an association processing module for performing association processing on each reference screened segmentation information to produce a plurality of reference association sequences for each; and an integration module for integrating all of the reference association sequences to build an association index file.

Preferably, the word segmentation module, the screening module and the association processing module perform word segmentation, screening and association processing on a piece of text data to be compared to produce a plurality of to-be-compared association sequences, and the screening and association system for text data further includes: a comparison module, which compares each to-be-compared association sequence against the association index file to find every reference text data that has a reference association sequence identical to a to-be-compared association sequence; an intersection module, which arranges in order all reference association sequences that are identical to to-be-compared association sequences, thereby establishing an intersection sequence; and an analysis module, which analyzes each reference text data that shares association sequences with the text data to be compared.

The technical means, components and effects adopted by the present creation to achieve the above and other objects are described below by way of a preferred embodiment, with reference to the drawings.

100, 100a: screening and association system for text data

1, 1a: storage module

2, 2a: word segmentation module

3, 3a: screening module

4, 4a: association processing module

5a: integration module

6a: comparison module

7a: intersection module

8a: analysis module

9, 9a: word segmentation system

[FIG. 1] is a flowchart of the screening and association method for text data according to the first embodiment of the present creation.

[FIG. 2] is a block diagram of one specific embodiment of the present creation that can automatically execute the method of the first embodiment.

[FIG. 3] is a flowchart of the screening and association method for text data according to the second embodiment of the present creation.

[FIG. 4] is a block diagram of one specific embodiment of the present creation that can automatically execute the method of the second embodiment.

FIGS. 1 and 2 show the first embodiment of the present creation. As shown in FIG. 1, the screening and association method for text data includes the following steps: S11. perform word segmentation on a piece of text data, based on a word segmentation lexicon, to produce segmentation information; S12. screen the segmentation information to produce screened segmentation information, which contains two or more screened words; S13. perform association processing on the screened segmentation information to produce a plurality of association sequences, each consisting of two or more adjacent screened words. With this method, brief information about the text data can be compiled quickly. The steps are described in detail below.

Step S11 performs word segmentation on a piece of text data, based on a word segmentation lexicon, to produce segmentation information.

The text data can be any published text, such as a master's or doctoral thesis, an academic paper, a general article or a sentence. For long texts such as theses, the whole thesis can be treated as a single piece of text data, or it can be split into sections to form multiple pieces of text data. There are many ways to split it; some examples follow. The text can be split at symbols such as line breaks, consecutive spaces, exclamation marks (!), colons (:), tildes (~), question marks (?), commas (,) and full stops (。), using as a dividing point any such symbol at which the current piece has reached at least a suitable minimum length. The text can also be split by its chapters and sections. Splitting can also be combined with the word segmentation lexicon, so that each piece contains a predetermined number of screened words, for example ten or twenty.
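The punctuation-based splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions: the delimiter set, the `min_len` threshold and the function name `split_text` are chosen here for the example, runs of consecutive spaces are not handled, and a short trailing remainder is simply attached to the previous piece. The source does not specify any of these details.

```python
# Single-character delimiters, half-width and full-width (assumed set).
DELIMITERS = set("\n!！:：~～?？,，。")


def split_text(text: str, min_len: int = 20) -> list[str]:
    """Split text at a delimiter once the current piece has at least min_len characters."""
    pieces: list[str] = []
    current: list[str] = []
    for ch in text:
        current.append(ch)
        if ch in DELIMITERS and len(current) >= min_len:
            pieces.append("".join(current))
            current = []
    if current:
        tail = "".join(current)
        if pieces and len(tail) < min_len:
            pieces[-1] += tail  # keep a short remainder with the previous piece
        else:
            pieces.append(tail)
    return pieces
```

Joining the returned pieces reproduces the input exactly, so no text is lost by the split.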

Word segmentation converts the text data into segmentation information according to the words recorded in the word segmentation lexicon. The words in the lexicon can be classified by part of speech, for example common noun (Na), foreign word (FW), active transitive verb (VC), active intransitive verb (VA), place word (Nc), proper noun (Nb), stative causative verb (VHC), colon (COLONCATEGORY) and so on.
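To make lexicon-driven segmentation concrete, the sketch below uses forward maximum matching over a small dictionary that maps each word to a part-of-speech tag. The source does not say which segmentation algorithm is used, so forward maximum matching is an assumption, as are the function name `segment`, the sample lexicon, the `DE` tag and the `UNK` tag for characters not in the lexicon.

```python
def segment(text: str, lexicon: dict[str, str]) -> list[tuple[str, str]]:
    """Greedy longest-match segmentation; returns (word, tag) pairs."""
    max_len = max(map(len, lexicon), default=1)
    tokens: list[tuple[str, str]] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon:
                tokens.append((word, lexicon[word]))
                break
        else:
            # No lexicon entry starts here: emit one unknown character.
            size = 1
            tokens.append((text[i], "UNK"))
        i += size
    return tokens


# Hypothetical lexicon: 論文 (paper), 抄襲 (plagiarize), 的 (particle),
# 偵測 (detect), 系統 (system).
SAMPLE_LEXICON = {"論文": "Na", "抄襲": "VC", "的": "DE", "偵測": "VC", "系統": "Na"}
tokens = segment("論文抄襲的偵測系統", SAMPLE_LEXICON)
```

With this lexicon the sample sentence is cut into five tagged words, and the tags are what the screening step works on.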

Step S12 screens the segmentation information to produce screened segmentation information, which contains two or more screened words. Screening keeps the words whose parts of speech carry meaning and removes the rest; for example, it keeps common nouns (Na), foreign words (FW), active transitive verbs (VC), active intransitive verbs (VA), place words (Nc), proper nouns (Nb), stative causative verbs (VHC) and so on. All words that remain after screening are collectively called screened words.
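The screening step amounts to a filter on part-of-speech tags. A minimal sketch, assuming the segmentation information is a list of (word, tag) pairs and taking the kept tags from the examples in the text; the list in the source is open-ended, so `KEPT_TAGS` is illustrative and the names are hypothetical.

```python
# Tags named in the text as examples of parts of speech to keep.
KEPT_TAGS = {"Na", "FW", "VC", "VA", "Nc", "Nb", "VHC"}


def screen(tokens: list[tuple[str, str]]) -> list[str]:
    """Keep only words whose tag is in KEPT_TAGS, preserving their order."""
    return [word for word, tag in tokens if tag in KEPT_TAGS]


screened = screen([("論文", "Na"), ("抄襲", "VC"), ("的", "DE"),
                   ("偵測", "VC"), ("系統", "Na"), (":", "COLONCATEGORY")])
```

Order is preserved on purpose, because the next step depends on which screened words end up adjacent to each other.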

Step S13 performs association processing on the screened segmentation information to produce a plurality of association sequences, each consisting of two or more adjacent screened words. Combining two or more adjacent screened words makes it possible, to a certain extent, to tell apart texts from the same field that differ in their technical features, and in particular to distinguish texts whose keywords are largely the same.
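Association processing, as described, groups adjacent screened words, which is a sliding window over the screened word list. The sketch below assumes a fixed window size `n` (the source only requires two or more words) and represents each association sequence as a tuple; both choices and the function name are assumptions.

```python
def association_sequences(words: list[str], n: int = 2) -> list[tuple[str, ...]]:
    """Return every run of n adjacent screened words, in document order."""
    if n < 2:
        raise ValueError("an association sequence needs at least two words")
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


pairs = association_sequences(["論文", "抄襲", "偵測", "系統"])
```

Two texts that share the keywords 論文 and 系統 but connect them through different neighbors produce different pairs, which is how the sequences separate texts with mostly identical keywords.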

The first embodiment of the present creation quickly compiles screened segmentation information that is closer to the intended meaning of the text data than keywords are. Whether it is used to analyze someone else's text data or one's own, it achieves the object of quickly compiling brief information about the text data, which in turn makes the text data easier to analyze and use.

As shown in FIG. 1, step S110 can be performed before step S11. Step S110: collect the author-defined keywords in the text data to build a specialized keyword lexicon, and import that lexicon into the word segmentation lexicon. Text data such as theses generally carry keywords defined by the author, which include, for example, proper names and scientific and technical terms. Importing these author-defined keywords into the word segmentation lexicon before word segmentation and the subsequent steps yields association sequences that are closer to the intended meaning of the text data.
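Step S110 can be pictured as merging a keyword list into the lexicon before segmentation runs. In this sketch the lexicon is a word-to-tag dictionary and imported keywords receive the proper noun tag `Nb` by default; the tag choice, the rule that existing entries keep their tag, and the function name are all assumptions, since the source does not say how imported keywords are tagged.

```python
def import_keywords(lexicon: dict[str, str], keywords: list[str],
                    tag: str = "Nb") -> dict[str, str]:
    """Return a copy of the lexicon extended with the author-defined keywords."""
    merged = dict(lexicon)
    for keyword in keywords:
        keyword = keyword.strip()
        if keyword:
            merged.setdefault(keyword, tag)  # existing entries keep their tag
    return merged


# Hypothetical base lexicon and author keywords.
base = {"斷詞": "Na", "處理": "VC"}
extended = import_keywords(base, ["斷詞處理", " 關聯性索引檔 ", "處理", ""])
```

Once 斷詞處理 is an entry of its own, a longest-match segmenter keeps it as one word instead of splitting it into 斷詞 and 處理, so the term survives intact into the association sequences.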

FIG. 2 shows one specific embodiment of a screening and association system for text data that can automatically execute the method of the first embodiment. As shown in FIG. 2, the present creation provides a screening and association system 100 for text data, which includes: a storage module 1 for storing a word segmentation lexicon; a word segmentation module 2 for performing word segmentation on a piece of text data to produce segmentation information; a screening module 3 for screening the segmentation information to produce screened segmentation information; and an association processing module 4 for performing association processing on the screened segmentation information to produce a plurality of association sequences. The storage module 1, word segmentation module 2, screening module 3 and association processing module 4 can be implemented on one or more computers and/or cloud servers. When the system 100 is implemented on a cloud server, a corresponding web page can be provided (not shown in the figure), so that a user obtains the association sequences after entering the text data.

FIGS. 3 and 4 show the second embodiment of the present creation. As shown in FIGS. 3 and 4, the screening and association method for text data includes the following steps: S21. build reference collection information from two or more reference text data; S22. perform word segmentation on each reference text data, based on a word segmentation lexicon, to produce reference segmentation information for each; S23. screen each reference segmentation information to produce reference screened segmentation information for each, every one of which contains two or more screened words; S24. perform association processing on each reference screened segmentation information to produce a plurality of reference association sequences for each, every sequence consisting of two or more adjacent screened words; S25. integrate all of the reference association sequences to build an association index file. With this method, brief information about multiple reference text data can be compiled quickly and integrated, which makes it easy to analyze the originality of text data to be compared.

Step S21 builds reference collection information from two or more reference text data. The reference collection information can contain all kinds of text data, for example some or all of the theses in Taiwan's National Digital Library of Theses and Dissertations (臺灣博碩士論文知識加值系統). Separate reference collections can also be built for different scopes, for example electronics, mechanical engineering, or text data from the last ten years. The reference text data and the text data to be compared in the second embodiment are the same kind of material as the text data of the first embodiment: any published text, such as a master's or doctoral thesis, an academic paper, a general article or a sentence. The only difference is that in the second embodiment the text data to be compared must be compared and analyzed against each reference text data one by one, so different names are used to tell them apart.

Steps S22 to S24 perform word segmentation, screening, and association processing on each piece of reference text data in the reference set information, generating for each piece its reference segmentation information, its reference screened segmentation information, and multiple pieces of reference association sequence information.

Step S25 integrates all of the reference association sequence information to build an association index file. The integrated association index file can be readily compared with the text data to be compared, which makes it convenient to analyze the originality of that text data.

As shown in Figure 3, step S220 may be performed before step S22. In step S220, some or all of the author-defined keywords in the reference text data and the text data to be compared are collected to build a professional keyword lexicon, and the professional keyword lexicon is imported into the word segmentation lexicon. This yields an association index file that is closer to the intended meaning of the text data. In addition, duplicates can be removed while the professional keyword lexicon is compiled, to improve processing efficiency.
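Step S220 can be sketched as follows. This is a minimal illustration, assuming the lexicons are held as in-memory sets; the function name `build_segmentation_lexicon` and the sample keywords are hypothetical and do not come from the original text.

```python
def build_segmentation_lexicon(base_lexicon, author_keywords_per_document):
    """Merge author-defined keywords into the word segmentation lexicon.

    Holding the keywords in a set removes duplicates automatically,
    which corresponds to the optional de-duplication in step S220.
    """
    keyword_lexicon = set()
    for keywords in author_keywords_per_document:
        # Strip stray whitespace and skip empty entries.
        keyword_lexicon.update(k.strip() for k in keywords if k.strip())
    return set(base_lexicon) | keyword_lexicon


# Hypothetical sample data: a tiny base lexicon and keywords from two papers.
lexicon = build_segmentation_lexicon(
    {"文字", "資料"},
    [["關聯性索引檔", "斷詞"], ["斷詞", " 原創性 "]],
)
```

The keyword "斷詞" appears in both papers but is stored only once in the merged lexicon.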

The second embodiment of the present creation can quickly compile brief information on the reference text data and can further integrate the brief information of all reference text data, so that a user can conveniently compare and analyze a piece of text data to be compared. For example, the originality of the text data to be compared can be analyzed through the following steps S26 to S29.

Step S26 performs word segmentation, screening, and association processing on a piece of text data to be compared to generate multiple pieces of to-be-compared association sequence information. The processing in steps S22 to S24 and step S26 is the same as in steps S11 to S13, so the resulting reference association sequence information and to-be-compared association sequence information have corresponding forms and are easy to compare. In addition, in S12, S23, and/or S26, synonym processing may be performed after screening and before the subsequent steps. Synonym processing checks the screened segmented words for synonyms and converts some or all synonymous characters and words (except certain special words unsuited to synonym processing) into a standard form, which improves the association comparison. For example, "冷氣" and "空調" (both meaning air conditioning) are all changed to "冷氣". Furthermore, each piece of reference or to-be-compared association sequence information may consist of two or more adjacent screened segmented words. The more screened segmented words a sequence contains, the more readily it reflects the concept of its corresponding text data, but the constraint may also become so strict that no reference text data similar to the text data to be compared can be found. Therefore, sequences are basically composed of two adjacent screened segmented words; when, for example, the reference set information contains a very large number of pieces of reference text data, three or more adjacent screened segmented words may be used to speed up the analysis.
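The synonym processing described above can be sketched as a simple table lookup. This is only an illustration: the synonym table, the `protected` parameter for special words, and the function name are assumptions, not details given in the original text.

```python
# Hypothetical synonym table: variant form -> standard form.
SYNONYMS = {"空調": "冷氣"}


def normalize_synonyms(tokens, synonyms=SYNONYMS, protected=frozenset()):
    """Replace each screened word with its standard form.

    Words listed in `protected` stand for the special words that are
    unsuited to synonym processing and are therefore left unchanged.
    """
    return [t if t in protected else synonyms.get(t, t) for t in tokens]


normalized = normalize_synonyms(["空調", "維修", "冷氣"])
untouched = normalize_synonyms(["空調"], protected={"空調"})
```

After normalization, both "空調" and "冷氣" compare as the same screened word, so sequences built from either variant match each other.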

Step S27 compares each piece of to-be-compared association sequence information with the association index file and finds every piece of reference text data that has reference association sequence information identical to the to-be-compared association sequence information. With the above method for screening and associating text data, the association between the text data to be compared and each piece of reference text data can be analyzed quickly, which makes it convenient to analyze the originality of the text data to be compared. In addition, the association index file has a simple format, so new reference association sequence information can be added easily. This overcomes a shortcoming of conventional inverted databases, which require frequent system reorganization when data is added.

The following examples outline how word segmentation and the other processing steps are carried out. The numbering in the examples is provided only for ease of description and does not limit the meaning of the present creation. Step S21 establishes the reference set information. Each piece of reference text data can be numbered in sequence; for example, the piece of reference text data numbered 1 is denoted ID1. The reference set information is the set that stores ID1, ID2, ..., IDn.

Figure 110211297-A0101-12-0015-1

Step S22 performs word segmentation.

Figure 110211297-A0101-12-0015-2
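The text does not state which segmentation algorithm is used, so the following is only a stand-in: a forward maximum matching segmenter driven by a lexicon. The function name `segment` and the sample lexicon are hypothetical; a production system would more likely delegate this step to an existing word segmentation system.

```python
def segment(text, lexicon):
    """Forward maximum matching: at each position take the longest lexicon word.

    Characters that start no lexicon word are emitted as single-character tokens.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical lexicon and input sentence.
tokens = segment("文字資料之篩選", {"文字", "資料", "文字資料", "篩選"})
```

Because the longest match wins, "文字資料" is kept as one word rather than being split into "文字" and "資料", which is the effect an imported professional keyword lexicon would have on specialist terms.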

Step S23 performs screening. Each screened segmented word can be numbered in sequence; for example, the first retained screened segmented word of ID1 is denoted ID1tp1.

Figure 110211297-A0101-12-0015-17
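The numbering scheme for screened words can be illustrated as below. The screening rule itself is not spelled out in this section, so the sketch assumes that screening simply drops words found in a stop list; the function name `screen` and the stop list are hypothetical.

```python
def screen(tokens, stop_words, doc_id):
    """Drop stop words and label the retained words ID<doc>tp1, ID<doc>tp2, ...

    Assumption: screening is modeled here as stop-word removal only.
    """
    kept = [t for t in tokens if t not in stop_words]
    return {f"ID{doc_id}tp{k}": word for k, word in enumerate(kept, start=1)}


# Hypothetical segmented input for reference text ID1.
screened = screen(["文字資料", "之", "篩選"], stop_words={"之"}, doc_id=1)
```

Python dictionaries preserve insertion order, so the labels also record the order in which the retained words appeared in the text.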

Step S24 performs association processing. Each piece of reference association sequence information can be numbered in sequence; for example, the first piece of reference association sequence information of ID1 is denoted ID1S1.

Figure 110211297-A0101-12-0016-4
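Forming sequences from adjacent screened words amounts to taking a sliding window over the screened word list. The sketch below assumes a window of `n` words (two by default, as the description suggests); the function name `label_sequences` and the sample words are hypothetical.

```python
def label_sequences(screened_words, doc_id, n=2):
    """Build sequences of n adjacent screened words, labeled ID<doc>S1, ID<doc>S2, ...

    Returns an empty dict when fewer than n screened words are available.
    """
    starts = range(len(screened_words) - n + 1)
    return {
        f"ID{doc_id}S{k}": tuple(screened_words[i:i + n])
        for k, i in enumerate(starts, start=1)
    }


# Hypothetical screened words of reference text ID1.
sequences = label_sequences(["文字資料", "篩選", "關聯", "系統"], doc_id=1)
```

Raising `n` to 3 yields fewer, more specific sequences, which reflects the trade-off between precision and the risk of finding no match.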

Step S25 builds the association index file. Each piece of reference screened segmentation information can be regarded as an index (also called an Index or Key) of the association index file, and the number of that reference screened segmentation information can serve as the data (Data) of the association index file. When the association index file is built, any piece of reference screened segmentation information may be identical to another (for example, ID1S2 and ID2S1). Therefore, one index can correspond to multiple different data entries, the number of data entries can be large, and the total length of stored data grows as more reference text data is added.

Figure 110211297-A0101-12-0016-18

Figure 110211297-A0101-12-0017-7
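One way to picture the association index file is as a mapping from a sequence of adjacent words to the numbers of every sequence with that content. The sketch below assumes, consistent with the ID1S2/ID2S1 example and with step S27, that the key is the word sequence and the data is the list of sequence numbers; the in-memory `defaultdict` and the name `add_document` are illustrative choices, not the file format of the original.

```python
from collections import defaultdict


def add_document(index, labelled_sequences):
    """Append one document's sequence numbers under their word-sequence keys."""
    for seq_id, words in labelled_sequences.items():
        index[words].append(seq_id)


index = defaultdict(list)
# Hypothetical sequences of two reference texts; ID1S2 and ID2S1 are identical.
add_document(index, {"ID1S1": ("文字資料", "篩選"), "ID1S2": ("篩選", "關聯")})
add_document(index, {"ID2S1": ("篩選", "關聯"), "ID2S2": ("關聯", "分析")})
```

Adding a new reference text only appends entries to existing or new keys, so nothing already stored needs to be reorganized.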

Step S26 performs word segmentation and the other processing steps on the text data to be compared, which can be denoted IDx.

Figure 110211297-A0101-12-0017-8

Step S27: use each piece of to-be-compared association sequence information as an index to search the association index file and read all data stored under the same index.

Figure 110211297-A0101-12-0018-9
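The lookup in step S27 can be sketched as a dictionary search. The index layout (word sequence mapped to a list of sequence numbers), the function name `find_matches`, and the sample data are assumptions made for illustration.

```python
def find_matches(index, query_sequences):
    """Return, per query sequence, the reference sequence numbers with identical words."""
    matches = {}
    for query_id, words in query_sequences.items():
        if words in index:
            matches[query_id] = list(index[words])
    return matches


# Hypothetical index and the sequences of the text to be compared (IDx).
index = {
    ("文字資料", "篩選"): ["ID1S1"],
    ("篩選", "關聯"): ["ID1S2", "ID2S1"],
}
query = {"IDxS1": ("篩選", "關聯"), "IDxS2": ("關聯", "方法")}
matches = find_matches(index, query)
```

Query sequences that occur in no reference text, such as IDxS2 here, simply produce no entry.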

Step S28 builds the intersection sequence: all reference association sequence information identical to the to-be-compared association sequence information is arranged in order (that is, sorted).

Figure 110211297-A0101-12-0018-10
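The exact ordering used for the intersection sequence is shown only in the figure, so the following sketch assumes a plausible one: matched sequence numbers are sorted first by document number and then by position within the document. The function name and sample matches are hypothetical.

```python
import re

# Matches labels of the form ID<document number>S<sequence number>.
_SEQ_ID = re.compile(r"ID(\d+)S(\d+)")


def intersection_sequence(matches):
    """Collect all matched reference sequence numbers and sort them numerically."""
    seq_ids = {seq_id for ids in matches.values() for seq_id in ids}

    def sort_key(seq_id):
        doc, pos = _SEQ_ID.fullmatch(seq_id).groups()
        return int(doc), int(pos)

    return sorted(seq_ids, key=sort_key)


# Hypothetical lookup results for two query sequences.
ordered = intersection_sequence(
    {"IDxS1": ["ID10S1", "ID2S3"], "IDxS4": ["ID2S1", "ID2S3"]}
)
```

Sorting on numeric keys places ID2 before ID10, which a plain string sort would not, and groups the hits of each reference text together for the analysis step.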

Step S29 analyzes each piece of reference text data that shares association sequence information with the text data to be compared, to produce an originality analysis result for the text data to be compared relative to each piece of reference text data. Many comparison methods are available. For example, a statistical method can be used to compute a reference similarity ratio between the text data to be compared and each piece of reference text data in the reference set information, using a commonly known measure such as the Dice coefficient. Alternatively, a simple, easy-to-understand summary method can be used for a basic analysis.

Figure 110211297-A0101-12-0018-19

Figure 110211297-A0101-12-0019-12
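As one illustration of the analysis, the Dice coefficient of two sets A and B is 2|A∩B| / (|A| + |B|). The sketch below applies it to the sets of association sequences of the text to be compared and of one reference text; treating the sequences as sets (ignoring repeats) and the function name are assumptions.

```python
def dice_coefficient(query_sequences, reference_sequences):
    """Dice coefficient 2|A∩B| / (|A| + |B|) over two collections of sequences."""
    a, b = set(query_sequences), set(reference_sequences)
    if not a and not b:
        return 0.0  # Convention chosen here for two empty inputs.
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical sequences: two of three are shared between the texts.
query = [("a", "b"), ("b", "c"), ("c", "d")]
reference = [("b", "c"), ("c", "d"), ("d", "e")]
score = dice_coefficient(query, reference)
```

A score near 1 means the two texts share almost all adjacent-word sequences, while a score near 0 suggests little overlap with that reference text.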

With the above method for screening and associating text data, the association between the text data to be compared and each piece of reference text data can be analyzed quickly, and the originality of the text data to be compared can be analyzed further.

Figure 4 shows one specific embodiment of a screening and association system for text data that can automatically execute the method of the second embodiment. As shown in Figure 4, the present creation provides a screening and association system for text data 100a, which includes: a storage module 1a for storing a word segmentation lexicon and reference set information; a word segmentation processing module 2a for performing word segmentation on each piece of reference text data in the reference set information to generate a piece of reference segmentation information for each; a screening processing module 3a for screening each piece of reference segmentation information to generate a piece of reference screened segmentation information for each; an association processing module 4a for performing association processing on each piece of reference screened segmentation information to generate multiple pieces of reference association sequence information for each; and an integration module 5a for integrating all of the reference association sequence information to build an association index file. In addition, the word segmentation processing module 2a, the screening processing module 3a, and the association processing module 4a can further perform word segmentation, screening, and association processing on a piece of text data to be compared to generate multiple pieces of to-be-compared association sequence information, and the screening and association system for text data 100a further includes: a comparison module 6a, which compares each piece of to-be-compared association sequence information with the association index file and finds every piece of reference text data that has reference association sequence information identical to the to-be-compared association sequence information; an intersection module 7a, which arranges in order all reference association sequence information identical to the to-be-compared association sequence information to build an intersection sequence; and an analysis module 8a, which analyzes each piece of reference text data that shares association sequence information with the text data to be compared.

The storage module 1a, word segmentation processing module 2a, screening processing module 3a, association processing module 4a, integration module 5a, comparison module 6a, intersection module 7a, and analysis module 8a can be implemented on one or more computers and/or cloud servers. When the screening and association system for text data 100a is implemented on a cloud server, a corresponding web page can be provided so that a user obtains the originality analysis result after entering the text data to be compared (not shown in the figures).

In addition, the parts related to word segmentation described above, such as steps S11 and S22 and the word segmentation lexicon, can use a known word segmentation system 9, 9a, such as CKIP, developed by Academia Sinica in Taiwan, or Jieba (結巴), whose source code is public, to save cost.

As stated above, the text data can be any published text data. For long text data such as a thesis, the thesis can be treated directly as one piece of text data, or it can be divided into sections to form multiple pieces of text data. The pieces formed by such division can additionally be associated with one another so that a consolidated originality analysis result can be produced. For example, a thesis is numbered IDa1, and its sections after division (for example, by chapter) are numbered IDa2 to IDan. That is, not only is the thesis as a whole treated as one piece of text data, but each of its sections (each chapter) is also treated as one piece of text data. After analysis, one therefore obtains not only the originality analysis result of the text data to be compared relative to the thesis, but also its originality analysis result relative to each section (each chapter) of the thesis.
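The numbering of a thesis and its sections can be sketched as follows. The function name `register_paper`, the dictionary layout, and the sample chapter texts are hypothetical; only the IDa1 and IDa2 to IDan numbering pattern comes from the description.

```python
def register_paper(paper_key, chapters):
    """Register a thesis as ID<key>1 and its chapters as ID<key>2 ... ID<key>n."""
    # The whole thesis is the first record.
    records = {f"ID{paper_key}1": "\n".join(chapters)}
    # Each chapter becomes its own piece of text data.
    for k, chapter in enumerate(chapters, start=2):
        records[f"ID{paper_key}{k}"] = chapter
    return records


# Hypothetical two-chapter thesis.
records = register_paper("a", ["第一章 緒論", "第二章 文獻探討"])
```

Because every record goes through the same segmentation, screening, and association steps, a query can be scored against the whole thesis and against each chapter separately.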

The embodiments described above are provided only for ease of description and do not limit the meaning of the present creation; all modifications and designs made in accordance with the listed claims shall fall within the patent scope of the present creation.

100a: Screening and association system for text data

1a: Storage module

2a: Word segmentation processing module

3a: Screening processing module

4a: Association processing module

5a: Integration module

6a: Comparison module

7a: Intersection module

8a: Analysis module

9a: Word segmentation system

Claims (3)

1. A screening and association system for text data, comprising: a storage module for storing a word segmentation lexicon; a word segmentation processing module for performing word segmentation on a piece of text data to generate segmentation information; a screening processing module for screening the segmentation information to generate screened segmentation information; and an association processing module for performing association processing on the screened segmentation information to generate multiple pieces of association sequence information.

2. A screening and association system for text data, comprising: a storage module for storing a word segmentation lexicon and reference set information; a word segmentation processing module for performing word segmentation on each piece of reference text data in the reference set information to generate a piece of reference segmentation information for each; a screening processing module for screening each piece of reference segmentation information to generate a piece of reference screened segmentation information for each; an association processing module for performing association processing on each piece of reference screened segmentation information to generate multiple pieces of reference association sequence information for each; and an integration module for integrating all of the reference association sequence information to build an association index file.

3. The screening and association system for text data of claim 2, wherein the word segmentation processing module, the screening processing module, and the association processing module perform word segmentation, screening, and association processing on a piece of text data to be compared to generate multiple pieces of to-be-compared association sequence information, and the screening and association system for text data further comprises: a comparison module, which compares each piece of to-be-compared association sequence information with the association index file and finds every piece of reference text data that has reference association sequence information identical to the to-be-compared association sequence information; an intersection module, which arranges in order all reference association sequence information identical to the to-be-compared association sequence information to build an intersection sequence; and an analysis module, which analyzes each piece of reference text data that shares association sequence information with the text data to be compared.
TW110211297U 2021-09-23 2021-09-23 System of screening for text data relevance TWM623980U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW110211297U TWM623980U (en) 2021-09-23 2021-09-23 System of screening for text data relevance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW110211297U TWM623980U (en) 2021-09-23 2021-09-23 System of screening for text data relevance

Publications (1)

Publication Number Publication Date
TWM623980U true TWM623980U (en) 2022-03-01

Family

ID=81747428

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110211297U TWM623980U (en) 2021-09-23 2021-09-23 System of screening for text data relevance

Country Status (1)

Country Link
TW (1) TWM623980U (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI813028B (en) * 2021-09-23 2023-08-21 飛資得資訊股份有限公司 Method and system of screening for text data relevance
