TW201516713A - File classification method based on group characteristic values - Google Patents

File classification method based on group characteristic values

Info

Publication number
TW201516713A
Authority
TW
Taiwan
Prior art keywords
file
words
word
built
chinese
Prior art date
Application number
TW102137282A
Other languages
Chinese (zh)
Inventor
Ping-Yen Hsieh
Ming-Che Chang
Hua-Chou Chiu
Keh-Hwa Shyu
Original Assignee
Chunghwa Telecom Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chunghwa Telecom Co Ltd
Priority to TW102137282A
Publication of TW201516713A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention is a file classification method based on group characteristic values. The method preprocesses a file by normalizing its text and filtering stop words; counts the occurrences in the file of each entry in a pre-built Chinese character library, Chinese lexicon, and English lexicon; separately counts the occurrences of words that do not appear in the pre-built libraries; and takes the most frequent of those words as keywords. The features from these two sources are given different weights according to their importance, and the similarity to each group characteristic value is computed with a preset similarity formula. If the similarity exceeds a threshold, the file is regarded as similar to that group and is assigned to the same class, which assists file classification.

Description

File classification method based on group characteristic values

The present invention relates to a file classification method, and in particular to a file classification method based on group characteristic values. It is used to classify files and can be combined with data protection systems, plagiarism detection systems for papers, and the like. As data security grows ever more important, classifying files by content analysis in order to detect confidential files and keep them from leaking has made confidentiality protection in data protection systems one of the main technical fields.

The technical fields involved in the present invention are mainly file classification and the confidentiality protection that extends from it. For file classification, the method commonly used to build such systems and solve the related problems is cluster analysis, which is mainly divided into top-down divisive clustering and bottom-up agglomerative clustering.

The top-down divisive approach starts with all files in a single class and then splits it step by step. Its problem is that the number of partitions must be decided in advance, which does not meet our needs when neither the number of file types nor the number of files is known beforehand.

The agglomerative approach first computes the distance between every pair of files, then merges them pairwise into larger groups until all files are in one group or the similarity between groups falls below a specified threshold. Completing this full process consumes a great deal of time and memory, so the approach falls short in practicality.

In cluster analysis, the key step is choosing how to compute distance. The common measures fall roughly into the following:

Manhattan distance

Euclidean distance

Hamming distance

Cosine similarity

The Manhattan distance is the sum of the absolute differences between two points along each axis of a standard coordinate system; the Euclidean distance is the straight-line distance between two points in Euclidean space; the Hamming distance is the number of positions at which two strings differ. Although each of these yields a distance between files, file feature values are mostly multi-dimensional and differ greatly in magnitude, so these measures are poorly suited to file classification.
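The three measures just described can be sketched in a few lines of Python. This is only an illustration of the definitions; the function names are chosen here and do not come from the patent.

```python
import math

def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(s, t):
    # Number of positions at which two equal-length strings differ.
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

# Two word-count vectors with the same proportions but different lengths
# of text still end up far apart under these measures.
short_doc = [1, 2]
long_doc = [10, 20]
print(manhattan(short_doc, long_doc), euclidean(short_doc, long_doc))
```

The last lines hint at why the text calls these measures poorly suited: a long file and a short file on the same topic have counts of very different magnitude, so their distance is large even though their word proportions match.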

Cosine similarity measures the similarity between two vectors by the cosine of the angle between them in an inner product space. It can be used to compare vectors of any dimension and is used especially often in high-dimensional positive spaces. In the preferred embodiment we therefore adopt cosine similarity as the distance measure for file classification.
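As a minimal sketch of the measure described above, cosine similarity can be written as follows. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Vectors with the same proportions score close to 1 regardless of magnitude.
print(cosine_similarity([1, 2, 0], [10, 20, 0]))
# Vectors with no shared non-zero dimension score 0.
print(cosine_similarity([1, 0], [0, 1]))
```

Because only the angle matters, word-count vectors from files of very different length remain comparable, which is the property the text relies on.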

Computing the distance generally involves segmenting the text, extracting words, and counting word frequencies. There are two common ways to do this: one uses the file content itself as the basis for word extraction when no pre-built data exists; the other uses pre-built character and word libraries as the reference for word extraction.

When no pre-built data exists and the file content is the basis for word extraction, the keywords of the text can be obtained, but the high time complexity tends to make the process take too long. How the extracted keywords should then be used in the similarity calculation is also an open question.

If pre-built libraries are the basis for word extraction, the high time complexity is avoided, but frequencies can be counted only for words that are in the libraries. The keywords of a document may be missed, leading to misclassification.

The conventional approaches above thus still have many shortcomings; they are not well designed and need improvement.

In view of the shortcomings of the conventional approaches above, the inventors sought to improve on them and, after years of dedicated research, completed the present invention.

To solve these problems, this patent proposes an approach that combines the advantages of both: at low time cost, it extracts keywords from the text and uses them together with the contents of the pre-built libraries, producing more accurate results.

In this patent, when the similarity exceeds a certain threshold, the files are regarded as similar and placed in the same class. This avoids much of the computational complexity of comparing large numbers of files with one another, improves the practicality of the system, and completes file classification more efficiently.

The purpose of the invention is to establish a file classification method based on group characteristic values. The method improves the accuracy and efficiency of existing file classification systems, reduces the computational complexity of comparing large numbers of files, improves practicality, and completes file classification more efficiently. A data protection system can then be built on it to keep confidential files from leaking.

A file classification method based on group characteristic values that achieves the above purpose computes the similarity between a file and the files already classified into each group, and uses that similarity as the basis for classification. The content analysis method first normalizes the words of a file, converting English to a single case and deleting punctuation marks, and then filters stop words, removing function words, conjunctions, meaningless words, and words that need not be compared. Word segmentation and extraction can then proceed: the pre-built Chinese character library, Chinese lexicon, and English lexicon are used to count the occurrences of each entry in the file, while words not found in the pre-built libraries are treated as keywords and counted separately. To keep the extraction of key Chinese words from taking too long, the pre-built Chinese character library is also used as a screening aid. Once the frequencies of the library entries and the keywords have been obtained, a vector is built from them, the keywords are given a higher weight according to their importance, and cosine similarity is used to compute the similarity between the file and the characteristic value of each class. A value greater than 1 is treated as 1, and if the value exceeds a threshold the file is classified into that class.

The file classification method based on group characteristic values of the present invention comprises at least the following steps: (1) a file feature extraction step that preprocesses the file by performing word normalization and word segmentation in order, and builds a stop word library to filter out stop words; (2) according to a word extraction strategy, obtaining the occurrence counts of the pre-built Chinese lexicon entries, the pre-built English lexicon entries, the key Chinese words, and the key English words; (3) a file feature vector extraction step that builds a file vector from the word frequencies, containing the occurrence counts of the pre-built Chinese lexicon entries, the pre-built English lexicon entries, the high-frequency key Chinese words, and the high-frequency key English words; (4) a file classification step that, based on the file vector, computes the similarity between the file to be classified and a group characteristic value; and (5) after repeating the above to obtain the similarity to every class, classifying the file into the class to which it belongs.

In more detail, the file feature extraction flow may further comprise: (1) preprocessing the file, beginning with word normalization, which includes converting the English in the file to a single case and deleting punctuation marks; (2) next, performing word segmentation and building a stop word library to filter out stop words, where the stop words include function words, conjunctions, meaningless words, and words that need not be compared; and (3) finally, obtaining the occurrence counts of the pre-built Chinese lexicon entries, the pre-built English lexicon entries, the key Chinese words, and the key English words, where the pre-built Chinese character library, Chinese lexicon, and English lexicon are used to count the occurrences of each entry in the file, and words not found in the pre-built libraries are called key Chinese words and key English words.

In addition, the file classification flow may be: (1) extracting the group characteristic value by taking the keyword frequencies of all files that belong to the class and averaging them to obtain the mean file vector of that class; (2) next, computing similarity from the file vectors, for which the cosine similarity method may be used; and (3) after the similarity to every class has been obtained, assigning the file to a new class if no value reaches a specified threshold, and otherwise assigning it to the class with the highest similarity, which completes the classification.

Compared with other conventional techniques, the file classification method based on group characteristic values provided by the present invention has the following advantages:

1. The file classification method based on group characteristic values of the present invention was developed using statistical analysis; it can process unstructured text and is not affected by word order.

2. The file classification method based on group characteristic values of the present invention proposes using pre-built libraries and in-text keywords together for word segmentation and extraction. It combines the advantages of both approaches: at low time cost it pairs in-text keywords with the contents of the pre-built libraries and weights them differently according to their importance, producing more accurate results.

3. The file classification method based on group characteristic values of the present invention extracts the keywords of the file to be classified and thereby obtains representative features. This both extends the vocabulary of the known classes after classification is completed and improves classification accuracy.

110‧‧‧Word normalization

120‧‧‧Word segmentation

130‧‧‧Stop word filtering

140‧‧‧Pre-built Chinese character library occurrence counts

150‧‧‧High-frequency Chinese characters

160‧‧‧Pre-built Chinese lexicon occurrence counts

170‧‧‧Pre-built English lexicon occurrence counts

180‧‧‧Key Chinese word occurrence counts

190‧‧‧Key English word occurrence counts

210‧‧‧Pre-built Chinese lexicon occurrence counts

220‧‧‧Key Chinese word occurrence counts

230‧‧‧Pre-built English lexicon occurrence counts

240‧‧‧Key English word occurrence counts

250‧‧‧High-frequency key Chinese word occurrence counts

260‧‧‧High-frequency key English word occurrence counts

270‧‧‧File vector

310‧‧‧Unclassified file vector

320‧‧‧Mean file vector of a specific class

330‧‧‧Class similarity

340‧‧‧Classification result

350‧‧‧Mean file vector of the assigned class

The technical content, purpose, and effects of the invention can be further understood from the detailed description and the accompanying drawings, in which: FIG. 1 is a flowchart of file feature extraction in the file classification method based on group characteristic values of the present invention.

FIG. 2 is a flowchart of file feature vector extraction in the file classification method based on group characteristic values of the present invention.

FIG. 3 is a flowchart of file classification in the file classification method based on group characteristic values of the present invention.

FIG. 4 shows the file classification method based on group characteristic values of the present invention.

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and an embodiment. The specific embodiment described here only explains the invention and does not limit it.

The invention is further described below with reference to the drawings. FIG. 1 is a flowchart of file feature extraction in the file classification method based on group characteristic values of the present invention. As shown in FIG. 1, once a file to be analyzed enters the system, it goes through word normalization 110: Chinese characters are left as they are, English is converted to lowercase, and full-width and half-width punctuation marks, which carry no word meaning, are deleted. A blank is placed between every Chinese character and English word so that word segmentation 120 can split on blanks, separating out every Chinese character and English word. Stop word filtering 130 then removes function words, conjunctions, meaningless words, and words that need not be compared, so that the file features are not affected by words unrelated to the meaning of the text.
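The preprocessing steps 110 to 130 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the stop word set is a made-up sample, Chinese characters are detected with the CJK Unified Ideographs range U+4E00 to U+9FFF, and every character that is neither a Chinese character nor an ASCII letter or digit is treated as punctuation.

```python
# Hypothetical sample; a real stop word library would be far larger.
STOP_WORDS = {"the", "a", "of", "and", "的", "了", "而"}

def is_chinese_char(ch):
    # Assumption: only the basic CJK Unified Ideographs block is considered.
    return "\u4e00" <= ch <= "\u9fff"

def normalize(text):
    # Step 110: lowercase English, drop punctuation, and surround each
    # Chinese character with blanks so it becomes its own token.
    pieces = []
    for ch in text.lower():
        if is_chinese_char(ch):
            pieces.append(f" {ch} ")
        elif ch.isascii() and ch.isalnum():
            pieces.append(ch)
        else:
            pieces.append(" ")
    return "".join(pieces)

def segment(text):
    # Step 120: split the normalized text on blanks.
    return normalize(text).split()

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    # Step 130: remove tokens found in the stop word library.
    return [t for t in tokens if t not in stop_words]

tokens = filter_stop_words(segment("The 機密文件, Report!"))
print(tokens)  # ['機', '密', '文', '件', 'report']
```

Splitting Chinese into single characters at this stage matches the text above, which says segmentation separates out every Chinese character and English word.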

Word extraction and frequency counting can now begin. The method of the invention combines the advantages of the two common conventional approaches, word extraction without pre-built libraries and word extraction with them, and removes the drawbacks of both.

To retain the advantage in running time, the invention takes the pre-built library approach as its basis. To solve the problem of missing keywords, it supplements this with the approach that uses no pre-built library: after the frequencies of the library entries have been counted, words that do not appear in the libraries are extracted and their frequencies counted.

With the overall concept established, the details are as follows. The invention supports both Chinese and English. When extracting file features, Chinese is handled in units of two-character words, for two reasons: first, Chinese generally expresses meaning in words rather than in single characters; second, a variable word length would sharply increase time complexity. English has no such problem, so English is handled in units of words.

Under this setting, extracting key Chinese words may involve so many distinct two-character words that the time complexity of the comparison becomes too high and overall performance suffers, which would defeat the purpose of using pre-built libraries. The invention therefore exploits the fact that the final keywords occur frequently: it first uses the pre-built Chinese character library to pick out the high-frequency Chinese characters. Any two-character word that does not appear in the pre-built Chinese lexicon and whose first character is a high-frequency Chinese character is recorded as a key Chinese word, which improves performance.

With the algorithm design explained, the flow continues. After stop word filtering 130, the occurrences of each character in the pre-built Chinese character library are counted as described above, giving the pre-built Chinese character library occurrence counts 140. A fixed number N is set, and the N most frequent characters are designated high-frequency Chinese characters 150 in preparation for extracting key Chinese words. In parallel, the pre-built Chinese lexicon occurrence counts 160 and the pre-built English lexicon occurrence counts 170 are computed, and key Chinese words and key English words are then extracted. For Chinese two-character words, the condition check for key Chinese word occurrence counts 180 is performed on the basis of the high-frequency Chinese characters.

If a two-character word does not appear in the pre-built Chinese lexicon and its first character is a high-frequency Chinese character, it is recorded as a key Chinese word and its occurrences are accumulated. English words receive no special treatment; only the condition check for key English word occurrence counts 190 is performed: if an English word does not appear in the pre-built English lexicon, it is recorded as a key English word and its occurrences are accumulated. This completes the file feature extraction flow of the invention.
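The counting rules of steps 140 to 190 can be sketched as below. Several details are assumptions made for this illustration: two-character words are formed from every pair of adjacent Chinese character tokens (an overlapping sliding window), tokens that became adjacent only because a stop word was removed are still paired, and the function and parameter names are hypothetical.

```python
from collections import Counter

def is_chinese_char(token):
    return len(token) == 1 and "\u4e00" <= token <= "\u9fff"

def extract_features(tokens, zh_chars, zh_words, en_words, n):
    """tokens: single Chinese characters and English words, in text order.
    zh_chars, zh_words, en_words: the pre-built libraries, as sets.
    n: how many of the most frequent characters count as high-frequency."""
    # 140: occurrences of characters found in the pre-built character library.
    char_counts = Counter(t for t in tokens if is_chinese_char(t) and t in zh_chars)
    # 150: the N most frequent characters.
    high_freq_chars = {ch for ch, _ in char_counts.most_common(n)}

    zh_lib, zh_key = Counter(), Counter()
    for first, second in zip(tokens, tokens[1:]):
        if not (is_chinese_char(first) and is_chinese_char(second)):
            continue
        word = first + second
        if word in zh_words:
            zh_lib[word] += 1          # 160: lexicon entry
        elif first in high_freq_chars:
            zh_key[word] += 1          # 180: key Chinese word

    en_lib, en_key = Counter(), Counter()
    for t in tokens:
        if is_chinese_char(t):
            continue
        if t in en_words:
            en_lib[t] += 1             # 170: lexicon entry
        else:
            en_key[t] += 1             # 190: key English word
    return zh_lib, zh_key, en_lib, en_key

tokens = ["機", "密", "文", "件", "機", "密", "report", "leak", "report"]
zh_lib, zh_key, en_lib, en_key = extract_features(
    tokens, zh_chars={"機", "密", "文", "件"}, zh_words={"文件"},
    en_words={"report"}, n=2)
print(zh_lib, zh_key, en_lib, en_key)
```

Two-character words whose first character is not high-frequency are skipped without being counted, which is where the time saving described in the text comes from.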

FIG. 2 is a flowchart of file feature vector extraction in the file classification method based on group characteristic values of the present invention. As shown in FIG. 2, four sets of known word frequency data are available: the pre-built Chinese lexicon occurrence counts 210, the key Chinese word occurrence counts 220, the pre-built English lexicon occurrence counts 230, and the key English word occurrence counts 240. Not all keywords are used in the similarity calculation, because keywords that occur only a few times would skew the similarity.

The invention therefore selects high-frequency keywords as parameters of the similarity calculation, in a manner similar to the above. A fixed number M is set. Taking the M-th highest occurrence count among the pre-built Chinese lexicon entries as the baseline, the key Chinese words whose counts exceed it are recorded as high-frequency key Chinese word occurrence counts 250. Likewise, taking the M-th highest occurrence count among the pre-built English lexicon entries as the baseline, the key English words whose counts exceed it are recorded as high-frequency key English word occurrence counts 260. Finally, the pre-built Chinese lexicon occurrence counts 210, the pre-built English lexicon occurrence counts 230, the high-frequency key Chinese word occurrence counts 250, and the high-frequency key English word occurrence counts 260 make up the file vector 270.
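The selection of high-frequency keywords and the assembly of the file vector 270 can be sketched as follows. The representation of the vector as a dictionary with a `db` part and a `kw` part, and the fallback to the lowest available count when a lexicon has fewer than M entries in the file, are assumptions of this sketch rather than details given in the text.

```python
def mth_highest(counts, m):
    # The M-th highest occurrence count, used as the baseline.
    values = sorted(counts.values(), reverse=True)
    if not values:
        return 0
    # Assumption: with fewer than M entries, fall back to the lowest count.
    return values[min(m, len(values)) - 1]

def build_file_vector(zh_lib, zh_key, en_lib, en_key, m):
    zh_base = mth_highest(zh_lib, m)
    en_base = mth_highest(en_lib, m)
    # 250 and 260: keep only keywords whose count exceeds the baseline.
    hf_zh = {w: c for w, c in zh_key.items() if c > zh_base}
    hf_en = {w: c for w, c in en_key.items() if c > en_base}
    # 270: lexicon counts (210, 230) plus high-frequency keywords (250, 260).
    return {"db": {**zh_lib, **en_lib}, "kw": {**hf_zh, **hf_en}}

vector = build_file_vector(
    zh_lib={"文件": 3, "系統": 1}, zh_key={"機密": 2, "密文": 1},
    en_lib={"report": 2}, en_key={"leak": 3, "x": 1}, m=2)
print(vector)
```

Keeping the lexicon part and the keyword part separate makes it straightforward to weight them differently when the similarity is computed.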

FIG. 3 is a flowchart of file classification in the file classification method based on group characteristic values of the present invention. As shown in FIG. 3, the unclassified file vector 310 is obtained for the file to be classified by following the flow of FIG. 2, and the mean file vector 320 of a specific class is taken, which is the average of the file vectors of the files in that class. In line with the keyword extraction concept proposed in this patent, the vocabulary of a known class is extended after classification is completed.

The mean file vector of a specific class therefore also contains the keywords of each of its files, which improves classification accuracy. Once the mean file vectors of all classes have been gathered, the similarity between the unclassified file vector 310 and each mean file vector 320 is computed, giving a class similarity 330 for each class. If no class similarity reaches a specified threshold, the file is assigned to a new class; otherwise it is assigned to the class with the highest similarity, giving the classification result 340. After classification, the mean file vector 350 of the class to which the file was assigned is recomputed for use in the next classification.
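The classification flow 310 to 350 can be sketched as follows. To stay self-contained the sketch uses flat word-count dictionaries and plain cosine similarity without keyword weighting; the class naming scheme and the choice to recompute each mean on demand from the stored member vectors are assumptions of this illustration.

```python
import math

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_vector(vectors):
    # 320 / 350: average the file vectors of one class; a word missing
    # from a file counts as 0, so keywords of any member extend the class.
    keys = set().union(*vectors)
    return {k: sum(v.get(k, 0) for v in vectors) / len(vectors) for k in keys}

def classify(doc_vector, classes, threshold):
    """classes maps a class name to the list of its member vectors.
    The file is added to the chosen class, whose name is returned."""
    best_name, best_sim = None, -1.0
    for name, members in classes.items():
        sim = cosine(doc_vector, mean_vector(members))   # 330
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_name is None or best_sim < threshold:
        best_name = f"class_{len(classes) + 1}"          # hypothetical naming
        classes[best_name] = []
    classes[best_name].append(doc_vector)                # 340, feeds 350
    return best_name

classes = {"finance": [{"budget": 2, "report": 1}]}
print(classify({"budget": 4, "report": 2}, classes, threshold=0.8))
print(classify({"malware": 3}, classes, threshold=0.8))
```

Each file is compared with one mean vector per class rather than with every other file, which is how the method avoids the pairwise cost of agglomerative clustering discussed earlier.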

The similarity is computed mainly by comparing vector cosine angles. The similarity formula for file D and class E is: [formula not reproduced in this text] where F_db denotes the file's pre-built Chinese lexicon occurrence counts 210 and pre-built English lexicon occurrence counts 230, F_kw denotes the file's high-frequency key Chinese word occurrence counts 250 and high-frequency key English word occurrence counts 260, and α expresses the importance of the keywords relative to the pre-built library entries in this algorithm. Because the keywords are high-frequency words that do not appear in the pre-built libraries, they express the characteristics of the file in a specific domain and serve as indicators, so α is generally set to a value greater than 1 to emphasize their importance. The class similarity 330 is then computed with this formula. Its value must lie between 0 and 1; a value greater than 1 is taken as 1. This similarity is used to assist file classification.
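The formula itself is not present in this text, so the following is only one plausible reading, offered as an assumption: a cosine-style ratio in which the keyword part of the dot product is multiplied by α while the norms are left unweighted. That reading is consistent with the statement that the value can exceed 1 and is then taken as 1, but the patent's actual formula may differ. The `db` and `kw` keys and the default `alpha` are hypothetical.

```python
import math

def dot(a, b):
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def norm(vec):
    # Unweighted length over both the lexicon part and the keyword part.
    return math.sqrt(dot(vec["db"], vec["db"]) + dot(vec["kw"], vec["kw"]))

def weighted_similarity(doc, cls, alpha=2.0):
    """doc and cls are dicts with a 'db' part (F_db) and a 'kw' part (F_kw).
    Assumed form: (F_db.F_db' + alpha * F_kw.F_kw') / (|D| * |E|), capped at 1."""
    numerator = dot(doc["db"], cls["db"]) + alpha * dot(doc["kw"], cls["kw"])
    denominator = norm(doc) * norm(cls)
    if denominator == 0:
        return 0.0
    return min(1.0, numerator / denominator)

doc = {"db": {"文件": 1}, "kw": {"機密": 1}}
print(weighted_similarity(doc, doc, alpha=2.0))   # raw value 1.5, capped
print(weighted_similarity(doc, {"db": {"系統": 1}, "kw": {}}))
```

With `alpha=1` this reduces to ordinary cosine similarity over the combined vector, so the weight only ever raises the score of files that share keywords with the class.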

With the above flow, the similarity between a file and each class is obtained and the classification of the file is known. Data protection systems, file classification systems, and the like can be built on this to help keep confidential files from leaking.

Please refer to FIG. 4, which shows the file classification method based on group characteristic values of the present invention. The detailed description above specifically describes one feasible embodiment of the invention. That embodiment is not intended to limit the patent scope of the invention, and any equivalent implementation or modification that does not depart from the spirit of the invention shall fall within the patent scope of this case.

In summary, this case is innovative in its technical concept and provides the effects described above, which conventional methods do not. It fully meets the statutory requirements of novelty and inventive step for an invention patent. The application is filed in accordance with the law, and approval of this invention patent application is respectfully requested.

Claims (7)

1. A file classification method based on group characteristic values, comprising at least the following steps:
A. a file feature extraction step, in which the file is preprocessed by performing, in order, word normalization and word segmentation, and by building a stop-word library to filter out stop words;
B. obtaining, according to a word extraction strategy, the occurrence counts of pre-built Chinese lexicon words, pre-built English word library words, key Chinese words, and key English words;
C. a file feature vector extraction step, in which file vector information is built from the frequency of each word, the vector comprising the occurrence counts of pre-built Chinese lexicon words, pre-built English word library words, high-frequency key Chinese words, and high-frequency key English words;
D. a file classification step, in which the similarity between the file to be classified and a group characteristic value is computed on the basis of the file vector; and
E. after the foregoing process has been repeated to obtain the similarity to every category, assigning the file to be classified to the category to which it belongs.

2. The file classification method based on group characteristic values of claim 1, wherein the file feature extraction process further comprises:
A1. preprocessing the file, beginning with word normalization, wherein the normalization includes unifying the letter case of English text in the file to be analyzed and deleting all punctuation marks;
A2. next, performing word segmentation and building a stop-word library to filter out stop words, wherein the stop words include function words, conjunctions, meaningless words, and vocabulary that need not be compared; and
A3. finally, obtaining the occurrence counts of pre-built Chinese lexicon words, pre-built English word library words, key Chinese words, and key English words, wherein the pre-built Chinese character library, Chinese lexicon, and English word library are used to count the occurrences of each word in the file, and words that do not appear in the pre-built libraries are referred to as key Chinese words and key English words.

3. The file classification method based on group characteristic values of claim 2, wherein, to keep the extraction of key Chinese words from taking too long, words are extracted in units of two-character words, with the pre-built Chinese character library used as a screening aid.

4. The file classification method based on group characteristic values of claim 1, wherein the file feature vector extraction process separately counts the occurrences of key Chinese words and key English words, takes those with a high occurrence frequency, and combines them with the occurrence frequencies of the pre-built library words to form the vector information.

5. The file classification method based on group characteristic values of claim 4, wherein, to prevent key words with few occurrences from skewing the similarity, the process of extracting high-frequency key words uses the pre-built word libraries as a screening aid.

6. The file classification method based on group characteristic values of claim 1, wherein the file classification process is:
D1. the group characteristic value is extracted from the key-word frequencies of all files belonging to a category, which are averaged to obtain the mean file vector of that category;
D2. the similarity is then computed from the file vectors, for which the cosine similarity method may be used; and
D3. after the similarity to every category has been obtained, if no value reaches a specified threshold, the file to be classified is placed in a new category; otherwise, it is assigned to the category with the highest similarity, completing the classification.

7. The file classification method based on group characteristic values of claim 6, wherein the similarity computation further comprises a step of increasing the weights of key words according to their importance.
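The claimed flow (normalize and segment, count library words and high-frequency key words, average each category into a group vector, then compare by cosine similarity against a threshold) can be sketched roughly as follows. This is only an illustrative sketch, not the patentee's implementation: the stop-word set, the tiny lexicon, the `TOP_K`, `KEY_WEIGHT`, and `THRESHOLD` values, and all function names are assumptions made for the example, and Chinese segmentation is reduced to overlapping two-character units.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for the stop-word library and pre-built word libraries.
STOP_WORDS = {"the", "a", "of", "and", "is"}
LEXICON = {"network", "server", "網路", "伺服"}
TOP_K = 3         # number of high-frequency key words kept (assumed value)
KEY_WEIGHT = 2.0  # extra weight given to key words (assumed value)
THRESHOLD = 0.5   # minimum similarity to join an existing category (assumed)


def tokenize(text):
    """Steps A1-A2: unify case, drop punctuation, segment, filter stop words."""
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)           # English words
    for run in re.findall(r"[\u4e00-\u9fff]+", text):  # runs of Chinese characters
        # Two-character units, a crude stand-in for real word segmentation.
        tokens += [run[i:i + 2] for i in range(len(run) - 1)]
    return [t for t in tokens if t not in STOP_WORDS]


def feature_vector(text):
    """Steps B-C: counts of library words plus weighted high-frequency key words."""
    counts = Counter(tokenize(text))
    vec = {w: float(c) for w, c in counts.items() if w in LEXICON}
    key_words = Counter({w: c for w, c in counts.items() if w not in LEXICON})
    for word, count in key_words.most_common(TOP_K):
        vec[word] = count * KEY_WEIGHT
    return vec


def centroid(vectors):
    """Step D1: mean vector of all files in one category."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds values key by key
    return {w: s / len(vectors) for w, s in total.items()}


def cosine(a, b):
    """Step D2: cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, groups):
    """Steps D3/E: best category, or None when no similarity reaches THRESHOLD.

    `groups` maps a category name to the feature vectors of its member files.
    """
    vec = feature_vector(text)
    scores = {name: cosine(vec, centroid(members))
              for name, members in groups.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < THRESHOLD:
        return None, scores  # the caller would open a new category here
    return best, scores
```

A caller would build `groups` from files that are already classified, call `classify` on each new file, and create a new category whenever the returned label is `None`. Keeping only the top few key words is one simple way to limit the influence of rare words; a minimum-count cutoff would serve the same purpose.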
TW102137282A 2013-10-16 2013-10-16 File classification method based on group characteristic values TW201516713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW102137282A TW201516713A (en) 2013-10-16 2013-10-16 File classification method based on group characteristic values

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW102137282A TW201516713A (en) 2013-10-16 2013-10-16 File classification method based on group characteristic values

Publications (1)

Publication Number Publication Date
TW201516713A true TW201516713A (en) 2015-05-01

Family

ID=53720347

Family Applications (1)

Application Number Title Priority Date Filing Date
TW102137282A TW201516713A (en) 2013-10-16 2013-10-16 File classification method based on group characteristic values

Country Status (1)

Country Link
TW (1) TW201516713A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684121A (en) * 2018-12-20 2019-04-26 鸿秦(北京)科技有限公司 A kind of file access pattern method and system
TWI686716B (en) * 2016-07-25 2020-03-01 斯庫林集團股份有限公司 Text exploration method, computer-readable recording medium and text exploration device recorded with text exploration program

Similar Documents

Publication Publication Date Title
TWI735543B (en) Method and device for webpage text classification, method and device for webpage text recognition
WO2021227831A1 (en) Method and apparatus for detecting subject of cyber threat intelligence, and computer storage medium
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
El et al. Authorship analysis studies: A survey
Vani et al. Using K-means cluster based techniques in external plagiarism detection
Nguyen et al. Sentiment classification on polarity reviews: an empirical study using rating-based features
Stolerman et al. Breaking the closed-world assumption in stylometric authorship attribution
WO2021121279A1 (en) Text document categorization using rules and document fingerprints
TW201324199A (en) Content analysis method based on similarity matching
Al-Yahya Stylometric analysis of classical Arabic texts for genre detection
Gupta et al. Plagiarism detection in text documents using sentence bounded stop word n-grams
CN113282717B (en) Method and device for extracting entity relationship in text, electronic equipment and storage medium
TW201516713A (en) File classification method based on group characteristic values
CN110489759B (en) Text feature weighting and short text similarity calculation method, system and medium based on word frequency
CN114461763B (en) Network security event extraction method based on burst word clustering
Saini et al. Intrinsic plagiarism detection system using stylometric features and DBSCAN
Joby et al. Accessing accurate documents by mining auxiliary document information
CN115687960A (en) Text clustering method for open source security information
Zhang et al. Effective and fast near duplicate detection via signature-based compression metrics
Bloothooft et al. Learning name variants from inexact high-confidence matches
Magdy et al. Arabic cross-document person name normalization
Shang Research on Chinese New Word Discovery Algorithm Based on Mutual Information
CN109063117B (en) Network security blog classification method and system based on feature extraction
KR20170067558A (en) A malicious comments detection technique on the Internet using support vector machine
Alexakis et al. Evaluation of Content Fusion Algorithms for Large and Heterogeneous Datasets