TW201113719A - Characteristic value comparison based content analysis method - Google Patents

Characteristic value comparison based content analysis method

Info

Publication number
TW201113719A
Authority
TW
Taiwan
Prior art keywords
file
value
comparison
feature
data
Application number
TW98134710A
Other languages
Chinese (zh)
Inventor
Ming-Zhe Zhang
Ke-Hua Xu
Bao-Zhong Zhang
Can-Xiong Liu
Original Assignee
Chunghwa Telecom Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2009-10-14
Filing date
2009-10-14
Publication date
2011-04-16
Application filed by Chunghwa Telecom Co Ltd
Priority to TW98134710A
Publication of TW201113719A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A content analysis method based on characteristic value comparison is disclosed. First, the characteristic values of a confidential file to be protected are calculated and a corresponding feature file is constructed. Later, at the content analysis stage of a confidentiality protection operation, this feature data is used to compare fixed-length data contents of a suspicious file. If any data matches in the comparison, the suspicious file is judged to contain confidential information and the protection system performs the protective action specified by policy; otherwise the suspicious file is judged not to contain confidential information. The invention improves the partial document matching technique used in the confidentiality protection field: it raises the comparison efficiency and precision for large documents by partitioning the data space, and it reconciles the two competing requirements of confidentiality identification rate and system efficiency by adjusting the comparison parameters according to the length of the target file.

Description

VI. Description of the Invention:

[Technical Field]
The present invention relates to a content analysis method based on feature value (characteristic value) comparison, and in particular to a content analysis method that uses partial document matching to deliver fast and accurate results. By partitioning the data space of the feature values, the invention mitigates the performance problem that ordinary document matching techniques can hardly avoid when analyzing large volumes of data, and it also raises the accuracy with which confidential data is recognized.

[Prior Art]
Conventional confidentiality protection systems use partial document matching for content analysis and face a dilemma between the confidential data recognition rate and system performance. A more accurate recognition function generates more data comparisons and therefore degrades system performance; conversely, pursuing the execution performance of content analysis makes it hard to maintain recognition accuracy. Moreover, because partial document matching deals with complexity that grows exponentially, system performance deteriorates rapidly as the number of confidential samples or the length of the target file increases. A content analysis method that meets the requirements for both recognition rate and performance is therefore needed.
The conventional approach thus still has many shortcomings; it is not a sound design and needs improvement.
In view of these shortcomings, the inventors set out to improve on the conventional approach and, after extended research, completed the present content analysis method based on feature value comparison.

[Summary of the Invention]
The object of the present invention is to provide a content analysis method that is both fast and accurate. The method improves the comparison performance and correctness for large files by partitioning the feature value data space, and it adjusts the data comparison parameters according to the length of the target file, thereby reconciling the competing requirements of confidential data recognition rate and system performance. Because of the way character codes are distributed, the invention reduces the number of comparisons especially sharply when Chinese and English files are compared against each other, which improves both the performance and the correctness of content analysis.
A content analysis method based on feature value comparison that achieves the above object consists mainly of a file feature building flow and a file feature comparison flow. When the feature values of a confidential file are built, each feature value is stored in the feature file that corresponds to the code statistic computed for its data block. At the file comparison stage, the comparison is then partitioned by the code statistic of each data block. In content analysis of large data volumes, and especially when Chinese and English files are compared against each other, this greatly reduces the number of comparisons and improves system performance. Because the code statistic adds a further distinction to the feature value comparison, the probability of collisions between feature hash values is also reduced, which raises the correctness of content analysis.
In addition, during file feature comparison the invention adjusts the overlap parameter of the data comparison according to the length of the target file, so that comparisons on small amounts of data achieve a more accurate recognition rate while comparisons on large amounts of data run with better performance.

[Embodiment]
Please refer to Figure 1, a schematic flow diagram of the content analysis method based on feature value comparison. For the confidential file (1) to be protected, the file feature building function (2) first computes the feature values and builds the corresponding file feature file (3). At the content analysis stage of a confidentiality protection operation, once the feature values have been loaded (4) into the corresponding data structures, the file feature comparison function (5) uses this feature data to compare data contents of a specified length in the target file (6).
If any data matches during the feature comparison, the file is judged to contain confidential information and the protection system can carry out the protective action specified by policy; otherwise the file is judged to contain no confidential information.
Please refer to Figure 2, the flowchart of the file feature building flow of the content analysis method based on feature value comparison. The steps are:
a. Set the read end point of the data, then read the confidential file sequentially.
b. Exclude blank characters to form a fixed-length data block and compute the code statistic. Blank characters here means space, tab, line break and similar characters, which are unrelated to the confidentiality of the file. By the time a full block length has been read, its code statistic has also been computed. This value summarizes the distribution of the internal code values of the characters in the data block and is computed as follows:

  Character code value      Code statistic
  less than 64              unchanged
  64 to 123                 add 1
  124 to 190                add 2
  greater than 190          add 3

  The code value conditions above are chosen according to the ranges in which Chinese and English character codes are distributed, so that Chinese and English files fall into different partitions.
c. After the data block has been read, compute its feature value with a hash algorithm. The CRC40 algorithm can be used for the hash computation to satisfy the system's requirements for both correctness and performance.
d. Write the CRC40 hash value to a different feature file depending on the code statistic. Taking a system with a block length of 64 bytes as an example, the following partitioning can be used:
  IF code statistic < 56 THEN write the CRC40 value to feature file 1;
  ELSE IF code statistic > 55 and < 60 THEN write the CRC40 value to feature file 2;
  ELSE IF code statistic > 59 and < 90 THEN write the CRC40 value to feature file 3;
  ELSE IF code statistic > 89 and < 105 THEN write the CRC40 value to feature file 4;
  ELSE IF code statistic > 104 and < 110 THEN write the CRC40 value to feature file 5;
  ELSE IF code statistic > 109 and < 115 THEN write the CRC40 value to feature file 6;
  ELSE IF code statistic > 114 and < 120 THEN write the CRC40 value to feature file 7;
  ELSE IF code statistic > 119 and < 130 THEN write the CRC40 value to feature file 8;
  ELSE IF code statistic > 129 THEN write the CRC40 value to feature file 9;
  These condition values were chosen because Chinese character codes are spread over a wider range and the invention is applied mainly in a Chinese-language environment, so the feature files are cut more finely in the Chinese code range and more coarsely in the English code range. When the system uses a data block length other than 64 bytes, the condition values naturally differ because the code statistics change. Likewise, if the system calls for a finer or coarser partitioning of the English or Chinese code range, the condition values and the number of feature files change accordingly.
e. After the feature value has been written to its partition, reset the code statistic to zero and set the next read position according to the system's default step parameter. The step parameter determines how much consecutive sampled data blocks overlap: a higher overlap yields a higher confidential data recognition rate but slows the system down.
f. Repeat steps b. to e. until the read end point is reached.
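The building flow in steps a. to f. can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not in the patent text: `zlib.crc32` stands in for the CRC40 hash (the Python standard library has no CRC40), blank characters are stripped from the whole input up front instead of being skipped while reading, the step parameter is applied to the stripped bytes, and the nine feature files are modeled as an in-memory dictionary of sets. All function and constant names are hypothetical.

```python
import zlib

BLOCK_LEN = 64                       # block length from the 64-byte example
WHITESPACE = frozenset(b" \t\r\n")   # assumed set of blank characters
# Exclusive upper bounds of the code statistic for feature files 1..8;
# anything at or above the last bound goes to feature file 9.
PARTITION_BOUNDS = (56, 60, 90, 105, 110, 115, 120, 130)


def code_weight(byte: int) -> int:
    """Contribution of one byte to the code statistic (table in step b)."""
    if byte < 64:
        return 0
    if byte <= 123:
        return 1
    if byte <= 190:
        return 2
    return 3


def partition_of(stat: int) -> int:
    """Map a code statistic to a feature file number 1..9 (rules in step d)."""
    for index, bound in enumerate(PARTITION_BOUNDS, start=1):
        if stat < bound:
            return index
    return len(PARTITION_BOUNDS) + 1


def block_hash(block: bytes) -> int:
    """Stand-in hash: CRC-32, NOT the CRC40 named in the text."""
    return zlib.crc32(block)


def iter_blocks(data: bytes, step: int):
    """Yield (code statistic, block) for each fixed-length block."""
    payload = bytes(b for b in data if b not in WHITESPACE)
    # The last block must still fit, so stop at len - BLOCK_LEN.
    for start in range(0, len(payload) - BLOCK_LEN + 1, step):
        block = payload[start:start + BLOCK_LEN]
        yield sum(code_weight(b) for b in block), block


def build_feature_files(data: bytes, step: int = BLOCK_LEN) -> dict:
    """Return {feature file number: set of block hashes}."""
    files = {}
    for stat, block in iter_blocks(data, step):
        files.setdefault(partition_of(stat), set()).add(block_hash(block))
    return files
```

With these assumptions, a block of 64 ASCII letters has a code statistic of 64 and lands in feature file 3, while a block dominated by bytes above 190 lands in one of the higher-numbered files, which is the separation between English and Chinese content that the partitioning aims for.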
Please refer to Figure 3, the flowchart of the file feature comparison flow of the content analysis method based on feature value comparison. The steps are:
a. Set the read end point of the target file, that is, the position at the end of the file minus the data block length.
b. Set the data overlap parameter according to the length of the target file. When the file is smaller than a preset value, use the maximum overlap parameter value so that the most accurate comparison is performed without affecting system performance.
c. Apply the same processing to the target file as in the feature building steps: read sequentially while excluding blank characters, compute the code statistic, and compute the hash value once a full block length has been read.
d. Use the code statistic to select the feature data comparison zone and its end point.
e. Compare the feature values sequentially. If the same CRC40 hash value is found in the feature data zone, the file is judged to contain confidential information and TRUE is returned. Otherwise, reset the code statistic, set the next read position according to the overlap parameter, and check whether the end position has been reached. If the comparison has reached the read end point without finding an identical hash value, the file is judged to contain no confidential information and FALSE is returned; otherwise continue with the comparison of the next data block.
By partitioning the feature value data space in this way, the invention improves comparison performance on large data volumes. Because the code statistic adds a further distinction to the feature value comparison, hash collisions also become less frequent, which raises the correctness of content analysis.
Compared with other conventional techniques, the content analysis method based on feature value comparison of the present invention has the following advantages:
1. The invention improves the performance of file content analysis; in particular, when Chinese and English files are compared against each other it greatly reduces the number of comparisons and raises execution efficiency.
2. The invention raises the accuracy of confidential data recognition and reduces cases in which a file is wrongly judged to contain confidential information.
3. The invention achieves an accurate recognition rate in comparisons on small amounts of data and improves execution performance in comparisons on large amounts of data.
The detailed description above is a specific description of one feasible embodiment of the invention. The embodiment is not intended to limit the patent scope of the invention; any equivalent implementation or modification that does not depart from the spirit of the invention shall fall within the patent scope of this case.
In summary, this case is not only innovative in its technical concept but also provides the effects described above over conventional approaches, and it fully satisfies the statutory requirements of novelty and inventive step for an invention patent. The application is therefore filed in accordance with the law, and approval of this invention patent application is respectfully requested.

[Brief Description of the Drawings]
Figure 1 is a schematic flow diagram of the content analysis method based on feature value comparison of the present invention;
Figure 2 is the flowchart of the file feature building flow of the method; and
Figure 3 is the flowchart of the file feature comparison flow of the method.
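The comparison flow of Figure 3 can be sketched in Python in the same spirit. This is a hedged illustration, not the patented implementation: `zlib.crc32` again stands in for CRC40, the feature data is held as a dictionary of sets so the sequential comparison inside a zone is replaced by a set lookup, blank characters are stripped up front, and `SMALL_FILE_LIMIT` and `COARSE_STEP` are hypothetical values for the preset file length and the reduced overlap, neither of which the text specifies.

```python
import zlib

BLOCK_LEN = 64
WHITESPACE = frozenset(b" \t\r\n")   # assumed set of blank characters
PARTITION_BOUNDS = (56, 60, 90, 105, 110, 115, 120, 130)
SMALL_FILE_LIMIT = 4096   # hypothetical "preset value" for small files
COARSE_STEP = 16          # hypothetical step (less overlap) for large files


def code_statistic(block: bytes) -> int:
    """Sum of per-byte weights: <64 -> 0, 64..123 -> 1, 124..190 -> 2, >190 -> 3."""
    return sum(0 if b < 64 else 1 if b <= 123 else 2 if b <= 190 else 3
               for b in block)


def partition_of(stat: int) -> int:
    """Feature zone number 1..9 for a code statistic."""
    for index, bound in enumerate(PARTITION_BOUNDS, start=1):
        if stat < bound:
            return index
    return len(PARTITION_BOUNDS) + 1


def block_hash(block: bytes) -> int:
    """Stand-in hash: CRC-32, NOT the CRC40 named in the text."""
    return zlib.crc32(block)


def windows(data: bytes, step: int):
    """Yield fixed-length blocks of the blank-stripped data, `step` bytes apart."""
    payload = bytes(b for b in data if b not in WHITESPACE)
    # Read end point: end of data minus the block length (step a).
    for start in range(0, len(payload) - BLOCK_LEN + 1, step):
        yield payload[start:start + BLOCK_LEN]


def build_features(confidential: bytes, step: int = BLOCK_LEN) -> dict:
    """Partitioned feature data: {zone number: set of block hashes}."""
    features = {}
    for block in windows(confidential, step):
        zone = partition_of(code_statistic(block))
        features.setdefault(zone, set()).add(block_hash(block))
    return features


def choose_step(target_len: int) -> int:
    """Step b: small files get the maximum overlap (a step of one byte)."""
    return 1 if target_len < SMALL_FILE_LIMIT else COARSE_STEP


def contains_confidential(target: bytes, features: dict) -> bool:
    """Steps c to e: return True as soon as a block hash matches in its zone."""
    for block in windows(target, choose_step(len(target))):
        zone = features.get(partition_of(code_statistic(block)), ())
        if block_hash(block) in zone:
            return True
    return False
```

In this sketch the confidential file is sampled in non-overlapping blocks while a small target file is scanned one byte at a time, so a confidential block copied into the target is found regardless of its offset. With the coarser step used for large files, a copied block may be missed, which is the trade-off between recognition rate and performance that the overlap parameter controls.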

Claims (9)

VII. Scope of Patent Application:
1. A content analysis method based on feature value comparison, comprising at least:
a file feature building flow, in which a file feature building function performs partitioned computation of feature values for a confidential file and builds the corresponding file feature files;
a feature value loading flow, in which the feature values of each partition are loaded into the data structures to which they belong; and
a file feature comparison flow, in which a file feature comparison function uses the loaded file feature data to compare data contents of a specified length in a target file, in order to determine whether the file contains confidential data.
2. The content analysis method based on feature value comparison of claim 1, wherein the processing steps of the file feature building flow comprise:
a. setting the read end point of the data and reading the contents of the confidential file sequentially;
b. excluding blank characters and forming a fixed-length data block;
c. computing the code statistic of the data block;
d. computing the feature value of the data block with a hash algorithm;
e. writing the hash value to a different feature file depending on the code statistic;
f. resetting the code statistic and setting the next read position;
g. repeating steps b. to f. until the read end point is reached.
3. The content analysis method based on feature value comparison of claim 2, wherein the feature files are partitioned with finer cuts in the Chinese code range and wider cuts in the English code range.
4. The content analysis method based on feature value comparison of claim 2, wherein the processing steps for computing and applying the code statistic comprise:
a. excluding blank characters and forming a fixed-length data block;
b. accumulating the code statistic according to the range of each character's internal code value;
c. at the file feature building stage, writing each feature value to the file of its partition according to the code statistic;
d. at the file feature comparison stage, comparing feature values by partition according to the code statistic.
5. The content analysis method based on feature value comparison of claim 4, wherein the code value range conditions are selected according to the ranges in which Chinese and English character codes are distributed.
6. The content analysis method based on feature value comparison of claim 1, wherein the processing steps of the file feature comparison flow comprise:
a. setting the read end point of the target file;
b. setting the data overlap parameter according to the length of the target file;
c. reading the contents of the target file sequentially;
d. excluding blank characters and forming a fixed-length data block;
e. computing the code statistic of the data block;
f. computing the feature value of the data block with a hash algorithm;
g. selecting the feature data comparison zone and its end point according to the code statistic;
h. comparing the feature values sequentially within the partition to determine whether the file contains confidential data;
i. resetting the code statistic and setting the next read position according to the data overlap parameter;
j. repeating steps c. to i. until the read end point is reached or confidential data is found.
7. The content analysis method based on feature value comparison of claim 6, wherein the data overlap parameter takes its maximum value when the length of the target file is less than a preset value, so that the most accurate comparison is performed without affecting system performance.
8. The content analysis method based on feature value comparison of claim 6, wherein the processing steps for computing and applying the code statistic comprise:
a. excluding blank characters and forming a fixed-length data block;
b. accumulating the code statistic according to the range of each character's internal code value;
c. at the file feature building stage, writing each feature value to the file of its partition according to the code statistic;
d. at the file feature comparison stage, comparing feature values by partition according to the code statistic.
9. The content analysis method based on feature value comparison of claim 8, wherein the code value range conditions are selected according to the ranges in which Chinese and English character codes are distributed.

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW98134710A TW201113719A (en) 2009-10-14 2009-10-14 Characteristic value comparison based content analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW98134710A TW201113719A (en) 2009-10-14 2009-10-14 Characteristic value comparison based content analysis method

Publications (1)

Publication Number Publication Date
TW201113719A true TW201113719A (en) 2011-04-16

Family

ID=44909748

Family Applications (1)

Application Number Title Priority Date Filing Date
TW98134710A TW201113719A (en) 2009-10-14 2009-10-14 Characteristic value comparison based content analysis method

Country Status (1)

Country Link
TW (1) TW201113719A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI453621B (en) * 2011-10-31 2014-09-21 Chunghwa Telecom Co Ltd A decentralized environmental information inquiry system based on user privacy
TWI474202B (en) * 2012-01-20 2015-02-21 Htc Corp Methods for parsing content of document, handheld electronic apparatus and computer program product thereof
TWI484357B (en) * 2011-12-02 2015-05-11 Inst Information Industry Quantitative-type data analysis method and quantitative-type data analysis device
US9686310B2 (en) 2012-10-17 2017-06-20 Tencent Technology (Shenzhen) Company Limited Method and apparatus for repairing a file

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI453621B (en) * 2011-10-31 2014-09-21 Chunghwa Telecom Co Ltd A decentralized environmental information inquiry system based on user privacy
TWI484357B (en) * 2011-12-02 2015-05-11 Inst Information Industry Quantitative-type data analysis method and quantitative-type data analysis device
TWI474202B (en) * 2012-01-20 2015-02-21 Htc Corp Methods for parsing content of document, handheld electronic apparatus and computer program product thereof
US9218083B2 (en) 2012-01-20 2015-12-22 Htc Corporation Methods for parsing content of document, handheld electronic apparatus and computer-readable medium thereof
US9686310B2 (en) 2012-10-17 2017-06-20 Tencent Technology (Shenzhen) Company Limited Method and apparatus for repairing a file

Similar Documents

Publication Publication Date Title
WO2019184217A1 (en) Hotspot event classification method and apparatus, and storage medium
WO2015184992A1 (en) Method for recognizing duplicate image, and image search and deduplication method and device thereof
CN110647505B (en) Computer-assisted secret point marking method based on fingerprint characteristics
WO2017118356A1 (en) Text image processing method and apparatus
WO2020057413A1 (en) Junk text identification method and device, computing device and readable storage medium
CN111651990B (en) Entity identification method, computing device and readable storage medium
US20110276523A1 (en) Measuring document similarity by inferring evolution of documents through reuse of passage sequences
CN112380825B (en) PDF document cross-page table merging method and device, electronic equipment and storage medium
TW201113719A (en) Characteristic value comparison based content analysis method
CN108319518B (en) File fragment classification method and device based on recurrent neural network
CN111866605B (en) Video auditing method and server
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
WO2015085805A1 (en) Method and apparatus for determining core word of image cluster description text
WO2021164515A1 (en) Detection method and apparatus for tampered image
WO2022160736A1 (en) Image annotation method and apparatus, electronic device, storage medium and program
CN107085568A (en) A kind of text similarity method of discrimination and device
CN111783077A (en) TrueCrypt encryption software password recovery method, encrypted data evidence obtaining system and storage medium
CN114998918A (en) System and method for extracting text from portable document format data
JPWO2014174599A1 (en) Computer, recording medium and data retrieval method
CN112687266A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
WO2021051602A1 (en) Lip password-based face recognition method and system, device, and storage medium
WO2024088269A1 (en) Character recognition method and apparatus, and electronic device and storage medium
EP3703061A1 (en) Image retrieval
CN111783088A (en) Malicious code family clustering method and device and computer equipment
CN111159996B (en) Short text set similarity comparison method and system based on text fingerprint algorithm