TW201113719A

TW201113719A - Characteristic value comparison based content analysis method

Info

Publication number: TW201113719A
Application number: TW98134710A
Authority: TW
Inventors: ming-zhe Zhang; ke-hua Xu; bao-zhong Zhang; Can-Xiong Liu
Original assignee: Chunghwa Telecom Co Ltd
Priority date: 2009-10-14
Filing date: 2009-10-14
Publication date: 2011-04-16

Abstract

A content analysis method based on characteristic value comparison is disclosed. The method comprises the following steps: first, calculating characteristic values of a confidential file to be protected and constructing a corresponding characteristic data file; then, at the content analysis stage of a confidentiality protection operation, using that characteristic data to compare fixed-length data contents of a suspicious file. If any data matches the characteristic data, the suspicious file is judged to contain confidential information and the protection system performs the protective action specified by policy; otherwise, the suspicious file is judged not to contain confidential information. The invention improves the partial document matching technology used in the field of confidentiality protection, enhances matching efficiency and precision for large documents by partitioning the data space, and reconciles the two competing requirements of confidentiality identification rate and system efficiency by adjusting the comparison parameters according to the length of the target file.

Description

VI. Description of the Invention:

[Technical Field of the Invention]

The present invention relates to a content analysis method based on characteristic value comparison, and in particular to a method that uses partial document matching to provide fast and accurate content analysis. By partitioning the data space of the characteristic values, the invention mitigates the performance problems that general document matching techniques can hardly avoid when analyzing large volumes of data, and also improves the accuracy with which confidential data is identified.

[Prior Art]

Conventional confidentiality protection systems that use partial document matching for content analysis face a dilemma between confidentiality identification rate and system performance.
A more precise identification function requires more data comparisons, which degrades system performance; conversely, pursuing execution performance in content analysis makes it difficult to maintain identification accuracy. Moreover, because partial document matching deals with a problem whose complexity grows exponentially, system performance deteriorates rapidly as the sample volume or the target document length increases. Because confidentiality protection technology is now so widely applied, conventional approaches struggle to meet the requirements for both identification rate and performance. The conventional approach therefore still has many shortcomings, is not a sound design, and needs improvement. In view of these shortcomings, the inventors sought to improve on it and, after research, completed the present content analysis method based on characteristic value comparison.

[Summary of the Invention]

The object of the present invention is to provide a content analysis method that is both fast and accurate. The method improves matching performance and correctness for large documents by partitioning the characteristic value data space, and adjusts the comparison parameters according to the length of the target file, thereby reconciling the competing requirements of confidentiality identification rate and system performance. In addition, because of the way character codes are distributed, the invention greatly reduces the number of comparisons when Chinese and English files are compared against each other, improving both the performance and the correctness of content analysis.

A content analysis method based on characteristic value comparison that achieves the above object consists mainly of a document characteristic construction process and a document characteristic comparison process. When the characteristic values of a confidential document are constructed, each characteristic value is stored in the characteristic file that corresponds to the character-code statistic computed for its data block. Later, at the document comparison stage, comparison is performed partition by partition according to the character-code statistic of each data block. In content analysis over large volumes of data, and especially when Chinese and English files are compared against each other, this greatly reduces the number of comparisons and improves system performance. Because the character-code statistic acts as an additional discriminator during characteristic value comparison, the probability of hash collisions among characteristic values is further reduced, which raises the accuracy of content analysis. In the document characteristic comparison, the invention also adjusts the overlap parameter of the data comparison according to the length of the target document, so that comparisons over small amounts of data achieve a more precise identification rate, while comparisons over large amounts of data gain execution performance.

[Embodiment]

Refer to Fig. 1, a schematic flow diagram of the content analysis method based on characteristic value comparison. First, the document characteristic construction function (2) calculates the characteristic values of the confidential document to be protected and constructs the corresponding document characteristic file (3). At the content analysis stage of the confidentiality protection operation, after the characteristic values are loaded (4) into the corresponding data structure, the document characteristic comparison function (5) uses this characteristic data to compare data contents of a specified length against the target document (6). If any data matches the characteristic data, the file contains confidential information and the protection system can perform the protective action specified by policy; otherwise, the document is judged not to contain confidential information.
Refer to Fig. 2, the document characteristic construction flowchart of the content analysis method based on characteristic value comparison. The steps are:

a. Set the read end point of the data, then read the confidential document sequentially.

b. Exclude whitespace characters to form a fixed-length data block, and compute the character-code statistic. Whitespace characters here means space, tab, line breaks and similar characters, which are unrelated to the confidential content of the document. By the time a full block length has been read, its character-code statistic has also been computed. This value summarizes the distribution of the code values of the characters in the data block and is computed as follows:

    Character code value      Character-code statistic
    less than 64              unchanged
    64 to 123                 add 1
    124 to 190                add 2
    greater than 190          add 3

The code value thresholds above are chosen according to the ranges over which Chinese and English character codes are distributed, so that Chinese and English documents fall into different partitions.

c. Once the data block has been read, compute its characteristic value with a hash algorithm. The CRC40 algorithm may be used for the hash computation, to satisfy the system's dual requirements of accuracy and performance.

d. Write the CRC40 hash value to one of several characteristic files according to the character-code statistic. Taking a system with a block length of 64 bytes as an example, the following partitioning can be used:

    IF statistic < 56 THEN write the CRC40 value to characteristic file 1;
    ELSE IF statistic > 55 AND < 60 THEN write the CRC40 value to characteristic file 2;
    ELSE IF statistic > 59 AND < 90 THEN write the CRC40 value to characteristic file 3;
    ELSE IF statistic > 89 AND < 105 THEN write the CRC40 value to characteristic file 4;
    ELSE IF statistic > 104 AND < 110 THEN write the CRC40 value to characteristic file 5;
    ELSE IF statistic > 109 AND < 115 THEN write the CRC40 value to characteristic file 6;
    ELSE IF statistic > 114 AND < 120 THEN write the CRC40 value to characteristic file 7;
    ELSE IF statistic > 119 AND < 130 THEN write the CRC40 value to characteristic file 8;
    ELSE IF statistic > 129 THEN write the CRC40 value to characteristic file 9;

These thresholds reflect the fact that Chinese character codes are distributed over a wider range and that the invention is applied mainly in a Chinese-language environment. The characteristic files are therefore partitioned finely over the Chinese code range and more coarsely over the English code range. When the system uses a data block length other than 64 bytes, the thresholds will naturally differ because the character-code statistic changes. Likewise, if the system requires a finer or coarser partitioning of the English or Chinese code range, the thresholds and the number of characteristic files will differ.

e. After the characteristic value has been written to its partition file, reset the character-code statistic to zero and set the next read position according to the system's default step parameter. The step parameter determines how much successive sampled data blocks overlap. More overlap yields a higher confidentiality identification rate but reduces system performance.

f. Repeat steps b. to e. until the read end point is reached.
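The construction steps above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the statistic is accumulated per byte, whitespace is stripped before blocks are formed, the read end point is taken as the stripped length minus the block length, the step value of 16 is hypothetical, and `zlib.crc32` stands in for the CRC40 hash named in the text. In-memory sets stand in for the nine characteristic (feature) files.

```python
import zlib
from collections import defaultdict

BLOCK_LEN = 64            # block length in bytes, as in the 64-byte example
WHITESPACE = b" \t\r\n"   # characters excluded before blocks are formed

# Exclusive upper bounds of the statistic for files 1..8; file 9 takes the rest.
BOUNDS = (56, 60, 90, 105, 110, 115, 120, 130)


def code_statistic(block: bytes) -> int:
    """Accumulate the character-code statistic (assumed to be per byte)."""
    total = 0
    for b in block:
        if b < 64:
            continue          # code value below 64: statistic unchanged
        elif b <= 123:
            total += 1        # 64 to 123: add 1
        elif b <= 190:
            total += 2        # 124 to 190: add 2
        else:
            total += 3        # above 190: add 3
    return total


def partition(stat: int) -> int:
    """Map a statistic to a characteristic file number from 1 to 9."""
    for number, bound in enumerate(BOUNDS, start=1):
        if stat < bound:
            return number
    return 9


def block_hash(block: bytes) -> int:
    """Stand-in hash: CRC-32 from zlib, not the CRC40 named in the text."""
    return zlib.crc32(block)


def build_characteristics(data: bytes, step: int = 16) -> dict[int, set[int]]:
    """Return {file number: set of block hashes} for a confidential document."""
    payload = bytes(b for b in data if b not in WHITESPACE)
    files: dict[int, set[int]] = defaultdict(set)
    end = len(payload) - BLOCK_LEN        # assumed read end point
    pos = 0
    while pos <= end:
        block = payload[pos:pos + BLOCK_LEN]
        files[partition(code_statistic(block))].add(block_hash(block))
        pos += step                       # a smaller step means more overlap
    return dict(files)
```

Under these assumptions, a 64-byte block of plain lowercase English letters has a statistic of 64 and lands in file 3, while a block made up of bytes above 190 reaches 192 and lands in file 9, which is how the partitioning keeps English and Chinese blocks apart.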
Refer to Fig. 3, the document characteristic comparison flowchart of the content analysis method based on characteristic value comparison. The steps are:

a. Set the read end point of the target document, that is, the position at the end of the document minus the data block length.

b. Set the overlap parameter according to the length of the target document. When the document is smaller than a preset value, use the maximum overlap parameter value, so that the most precise comparison is performed without affecting system performance.

c. Perform the characteristic construction steps on the target document: read it sequentially, exclude whitespace characters, compute the character-code statistic, and compute the characteristic hash value once a full block length has been reached.

d. Set the characteristic data comparison region, and its end point, according to the character-code statistic.

e. Compare the characteristic values in that region sequentially. If an identical CRC40 hash value is found in the characteristic data region, the document is judged to contain confidential information and a positive result is returned. Otherwise, reset the character-code statistic, set the next read position according to the overlap parameter, and check whether the end position has been reached. If the read end point has been reached and no identical hash value has been found, the document is judged not to contain confidential information and a negative (FALSE) result is returned; otherwise, continue with the comparison of the next data block.

In this way, partitioning the characteristic value data space improves comparison performance for large volumes of data. Because the character-code statistic acts as an additional discriminator during characteristic value comparison, hash collisions are further reduced, which raises the accuracy of content analysis.

Compared with other conventional techniques, the content analysis method based on characteristic value comparison of the present invention has the following advantages:

1. The invention improves the performance of document content analysis. In particular, when Chinese and English files are compared against each other, it greatly reduces the number of comparisons and raises execution efficiency.

2. The invention improves the accuracy with which confidential data is identified and reduces the cases in which a document is wrongly judged to contain confidential information.

3. The invention achieves a precise confidentiality identification rate in comparisons over small amounts of data, and improves execution performance in comparisons over large amounts of data.

The detailed description above is a specific description of one feasible embodiment of the present invention. The embodiment is not intended to limit the patent scope of the invention, and any equivalent implementation or modification that does not depart from the technical spirit of the invention shall fall within the patent scope of this case.

In summary, this case is not only innovative in its technical concept but also improves on conventional approaches in the respects described above. It should fully satisfy the statutory requirements of novelty and inventive step for an invention patent. The application is therefore filed in accordance with the law, and its approval is respectfully requested.

[Brief Description of the Drawings]

Fig. 1 is a schematic flow diagram of the content analysis method based on characteristic value comparison of the present invention; Fig. 2 is the document characteristic construction flowchart of the method; and Fig. 3 is the document characteristic comparison flowchart of the method.
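The comparison flow of Fig. 3 can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: `zlib.crc32` stands in for CRC40, a set lookup replaces the sequential scan of the characteristic data region, the overlap is expressed through a step (overlap equals block length minus step), and the values in `choose_step` are hypothetical. The helper names are invented for this sketch.

```python
import zlib

BLOCK_LEN = 64
BOUNDS = (56, 60, 90, 105, 110, 115, 120, 130)  # limits for files 1..8


def strip_ws(data: bytes) -> bytes:
    """Drop space, tab and line-break bytes before blocks are formed."""
    return bytes(b for b in data if b not in b" \t\r\n")


def code_statistic(block: bytes) -> int:
    """Per-byte statistic: +0 below 64, +1 to 123, +2 to 190, +3 above."""
    return sum(0 if b < 64 else 1 if b <= 123 else 2 if b <= 190 else 3
               for b in block)


def partition(stat: int) -> int:
    """Characteristic file number (1 to 9) for a given statistic."""
    return next((n for n, bound in enumerate(BOUNDS, 1) if stat < bound), 9)


def index_blocks(data: bytes, step: int) -> dict[int, set[int]]:
    """Build the partitioned hash index of a confidential document."""
    payload = strip_ws(data)
    index: dict[int, set[int]] = {}
    for pos in range(0, len(payload) - BLOCK_LEN + 1, step):
        block = payload[pos:pos + BLOCK_LEN]
        key = partition(code_statistic(block))
        index.setdefault(key, set()).add(zlib.crc32(block))
    return index


def choose_step(length: int, small_limit: int = 4096) -> int:
    """Hypothetical rule: maximum overlap (step 1) for short targets."""
    return 1 if length < small_limit else 16


def contains_confidential(target: bytes, index: dict[int, set[int]]) -> bool:
    """Return True as soon as one target block matches within its partition."""
    payload = strip_ws(target)
    step = choose_step(len(payload))
    end = len(payload) - BLOCK_LEN            # read end point (step a.)
    pos = 0
    while pos <= end:
        block = payload[pos:pos + BLOCK_LEN]
        # Only the partition selected by the statistic is searched (step d.).
        candidates = index.get(partition(code_statistic(block)), ())
        if zlib.crc32(block) in candidates:
            return True
        pos += step
    return False


secret = b"The quick brown fox jumps over the lazy dog. " * 10
index = index_blocks(secret, step=16)
leak = b"intro text " + secret[:200] + b" trailing"
print(contains_confidential(leak, index))
print(contains_confidential(b"0123456789" * 50, index))
```

In this sketch a block of digits has a statistic of 0 and maps to file 1, which the English sample never populates, so the second call is rejected without a single hash comparison. That is the saving the partitioning is meant to provide.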

Claims

VII. Scope of the Patent Application:

1. A content analysis method based on characteristic value comparison, comprising at least: a document characteristic construction process, in which a characteristic construction function calculates characteristic values and constructs the corresponding document characteristic file; a characteristic value loading process, in which the characteristic values of each partition are loaded into the associated data structure; and a document characteristic comparison process, in which a document characteristic comparison function uses the loaded document characteristic data to compare contents of a specified length against a target document, to determine whether the document contains confidential data.

2. The content analysis method based on characteristic value comparison according to claim 1, wherein the document characteristic construction process comprises the following steps: a. setting the read end point of the data and reading the contents of the confidential document sequentially; b. excluding whitespace characters to form a fixed-length data block; c. calculating the character-code statistic of the data block; d. calculating the characteristic value of the data block using a hash algorithm; e. writing the hash value to the corresponding characteristic file according to the character-code statistic; f. resetting the character-code statistic and setting the next read position; g. repeating steps b. to f. until the read end point is reached.

3. The content analysis method based on characteristic value comparison according to claim 2, wherein the characteristic files are partitioned finely over the Chinese code range and more coarsely over the English code range.

4. The content analysis method based on characteristic value comparison according to claim 2, wherein the calculation and application of the character-code statistic comprise the following steps: a. excluding whitespace characters to form a fixed-length data block; b. accumulating the character-code statistic according to the range in which each character's code value falls; c. at the document characteristic construction stage, writing the characteristic values to partitioned files according to the character-code statistic; d. at the document characteristic comparison stage, comparing the characteristic values partition by partition according to the character-code statistic.

5. The content analysis method based on characteristic value comparison according to claim 4, wherein the code value range conditions are selected according to the ranges over which Chinese and English character codes are distributed.

6. The content analysis method based on characteristic value comparison according to claim [...], wherein the document characteristic comparison process comprises the following steps: a. setting the read end point of the target document; b. setting the data overlap parameter according to the length of the target document; c. reading the contents of the target document sequentially; d. excluding whitespace characters to form a fixed-length data block; e. calculating the character-code statistic of the data block; f. calculating the characteristic value of the data block using a hash algorithm; g. setting the characteristic data comparison region and its end point according to the character-code statistic; h. comparing the characteristic values sequentially within the partition to determine whether the document contains confidential information; i. resetting the character-code statistic and setting the next read position according to the overlap parameter; j. repeating steps c. to i. until the end point is reached or confidential information is found.

7. The content analysis method based on characteristic value comparison according to claim 6, wherein the data overlap parameter takes its maximum value when the length of the target document is smaller than a preset value, so that the most precise comparison is performed without affecting system performance.

8. The content analysis method based on characteristic value comparison according to claim 6, wherein the calculation and application of the character-code statistic comprise the following steps: a. excluding whitespace characters to form a fixed-length data block; b. accumulating the character-code statistic according to the range in which each character's code value falls; c. at the document characteristic construction stage, writing the characteristic values to partitioned files according to the character-code statistic; d. at the document characteristic comparison stage, comparing the characteristic values partition by partition according to the character-code statistic.

9. The content analysis method based on characteristic value comparison according to claim 8, wherein the code value range conditions are selected according to the ranges over which Chinese and English character codes are distributed.