TWI548985B

TWI548985B - Identification system of network log format and its method

Info

Publication number: TWI548985B
Application number: TW104115713A
Authority: TW
Inventors: Chien Chih Chen; Hui Ching Huang; Tzung Han Jeng; Shun Te Liu; Kuo Sen Chou
Original assignee: Chunghwa Telecom Co Ltd
Priority date: 2015-05-18
Filing date: 2015-05-18
Publication date: 2016-09-11
Also published as: TW201642132A

Description

Network log format identification system and method thereof

The present invention relates to a method for identifying network log formats.

A network log is a record of events or operations generated by a device on a network. Devices that serve different purposes or are of different models produce log files in different formats. Each log file format usually has a corresponding log parser that parses the contents of the log.

When log files in different formats are processed at the same time, the following approaches are currently the most common:

1. Manually specifying the format of the log file and its corresponding log parser.

2. Using specific keywords contained in the log file content to determine the log file format and its corresponding log parser.

3. Using regular expressions to extract specific fields of the log file whose content follows a specific format.

When a new log format is added, manual specification incurs the cost of human intervention, and keyword or regular-expression matching also incurs a cost, because the corresponding keywords and regular expressions have to be produced. With advances in big data analytics, more and more applications derive their solutions by integrating logs from different devices. Under this demand, the cost of maintaining the handling of different log formats rises accordingly.

In view of the shortcomings of the conventional approaches described above, the inventors sought to improve on them and, after years of dedicated research, succeeded in developing the present network log format identification system and method.

Before network log data can be processed, the format of the log must be identified so that the corresponding log parser can be applied. The purpose of the present network log format identification system and method is to tell the log parser which log format is to be processed through an automated process, without manual specification.

To achieve the above object, a log format identification system comprises the following two modules. The first is the log format identification module, which identifies raw log files of unknown format. After a raw log file of unknown format is input, a subset of its records is sampled. The sampled records are converted into biological-sequence strings according to a log format feature encoding table. The log format feature encoding table assigns a representative symbol to each delimiter and punctuation mark, while field contents are handled according to their data type, with one symbol representing each data type.

The converted sequences are used to search the log format template database by biological sequence alignment to find the most similar log format. If all sampled records match the same log format, a single log format name is output; if they match several formats, a list of log format names sorted by alignment score is output.

The second is the log format template creation module, which builds the log format template database and adjusts the weights of the scoring mechanism used in sequence alignment. Raw log files of known format are collected. Each record in a file is converted into a biological-sequence string according to the log format feature encoding table and stored as a pairing of log format name and sequence string, called a log format template. A filtering step retains the qualifying templates and stores them in the log format template database. Within the database, one log format may correspond to multiple sequence strings, so one log format may be stored as multiple templates. Once the log format template database has been built, the scoring matrix for sequence alignment is adjusted according to the distribution of the template sequences in the database.

To achieve the above object, a log format identification method comprises: first building a log format template sequence database, in which a user collects raw log files of known format and builds the log format template database through the log format template creation module; then adjusting the scoring matrix for sequence alignment according to the contents of the log format template database, where any log format added later likewise updates the log format template database and the sequence-alignment scoring matrix through the log format template creation module; after the log format template database has been built, the user inputting a raw log file of unknown log format; the log format identification module first sampling the contents of the raw log file and encoding the sampled records into biological-sequence strings; then searching the sequences in the log format template database with those strings to find the most similar sequence; and outputting the log format name corresponding to that sequence as the log format of the raw log file.

To use the invention, the log format template sequence database must be built first. The user collects raw log files of known format and builds the log format template database through the log format template creation module. The scoring matrix for sequence alignment is then adjusted according to the contents of the log format template database. If a new log format is added later, the log format template database and the sequence-alignment scoring matrix are likewise updated through the log format template creation module.

After the log format template database has been built, the user inputs a raw log file of unknown log format. The log format identification module first samples the contents of the raw log file and encodes the sampled records into biological-sequence strings. It then searches the sequences in the log format template database with these strings to find the most similar sequence, and the log format name corresponding to that sequence is output as the log format of the raw log file.

Compared with conventional techniques, the network log format identification system and method provided by the invention have the following advantages:

1. The invention provides a mechanism for converting logs into biological-sequence strings. A log is converted into a biological-sequence string according to the log format feature encoding table and stored as a pairing of log format name and sequence string, creating a log format template for future log comparisons.

2. The invention uses sequence alignment techniques from bioinformatics. All sequences can be indexed for fast lookup to produce the log format template database, and this index effectively improves the efficiency of alignment.

3. The invention provides a mechanism for creating log format templates. The user only needs to specify the log format name for a raw log file, and the features of that log format are added to the log format template database for later queries. As the log format database grows, the ability to identify different log formats, and the accuracy of doing so, improve as well.

4. The invention provides a log identification mechanism. For raw log files of unknown format, the log format can be identified automatically and quickly, which minimizes the maintenance cost of manual intervention in determining log formats.

A‧‧‧Raw log file of known format

B‧‧‧Raw log file of unknown format

100‧‧‧Computer system

110‧‧‧Log format template creation module

111, 122, 520‧‧‧Log encoding module

112‧‧‧Template generation module

113, 542‧‧‧Weight adjustment module

120‧‧‧Log format identification module

121‧‧‧File sampling module

123‧‧‧Sequence alignment module

130, 541‧‧‧Log format template database

140, 550‧‧‧Log format name

510‧‧‧Raw log record

530‧‧‧Encoded sequence

540‧‧‧Sequence alignment module

S601~S605‧‧‧Log format identification process

S701~S704‧‧‧Log format template database creation process

The technical content, purpose, and effects of the invention can be further understood from the detailed description and the accompanying drawings, which are as follows:

FIG. 1 is a schematic diagram of the system architecture of the network log format identification system and method of the invention.

FIG. 2 is an example of converting a raw log record using the log format feature encoding table.

FIG. 3 shows BLOSUM 62, the default scoring matrix of BLAST.

FIG. 4 shows the symbol-to-symbol alignment for the log format template database.

FIG. 5 is a diagram of the relationships among the internal components of the log format template creation module and the log format identification module.

FIG. 6 is a flowchart of log format identification.

FIG. 7 is a flowchart of building the log format template database.

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the invention and are not intended to limit it.

The invention is further described below with reference to the drawings. FIG. 1 is a schematic diagram of the system architecture of the network log format identification system and method of the invention, which comprises two main modules deployed on a computer system 100: a log format template creation module 110 and a log format identification module 120. The log format template creation module 110 collects training data to generate a log format template database for use by the log format identification module. The log format identification module 120 is used mainly during the collection of network logs: it identifies the log format name of a file of unknown log format to aid the analysis and processing of subsequent data.

As shown in FIG. 1, the object of the invention is to provide a log format identification system that builds log format templates automatically and quickly and identifies log formats accurately. The user collects raw log files A of known format as training data. The log encoding module 111 in the log format template creation module 110 converts each record in a file into a biological-sequence string according to the log format feature encoding table (Table 1). The template generation module 112 then stores each string as a pairing of log format name and sequence string, building the log format template database 130. Finally, the weight adjustment module 113 adjusts the weights of the sequence-alignment scoring matrix according to the content distribution of the log format template database 130. Within the log format template database 130, one log format may correspond to multiple sequence strings, so one log format may be stored as multiple templates. If a new log format is added later, the log format template database 130 and the sequence-alignment scoring matrix are likewise updated through the log format template creation module 110.

During the collection of network logs, after the user inputs a raw log file B of unknown format, the file sampling module 121 in the log format identification module 120 samples the contents of the raw log file. The log encoding module 122 then converts the sampled records into biological-sequence strings according to the log format feature encoding table. The table assigns a representative symbol to each delimiter and punctuation mark, while field contents are handled according to their data type, with one symbol representing each data type. After conversion, the sequence alignment module 123 searches the sequences in the log format template database 130 with these strings to find the most similar sequence, and the corresponding log format name 140 is output as the log format of the raw log file. If all sampled records match the same log format, a single log format name is output; if they match several formats, a list of log format names sorted by alignment score is output.

The specific process is as follows. The user collects raw log files A of known format as training data, and the log format template creation module 110 builds the log format template database 130 from them.

For a raw log file A of known format, the log encoding module 111 in the log format template creation module 110 converts each record in the file into a biological-sequence string according to the log format feature encoding table (see Table 1). The steps are as follows:

1. For content enclosed in quotation marks ("), remove any delimiters and punctuation marks it contains.

2. Scan the raw log record from the first character to the last. A character that matches a delimiter or punctuation mark in the log format feature encoding table is replaced by its representative symbol. If the character is not a delimiter or punctuation mark in the table, keep reading until the next character is a delimiter or punctuation mark, then determine the data type of the string just read: if it consists entirely of digits it is represented by I, if it contains no digits it is represented by S, and otherwise it is represented by X.
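The two encoding steps above can be sketched in Python as follows. This is only an illustrative sketch: Table 1 is not reproduced in this text, so `SYMBOL_TABLE` and its letter assignments are hypothetical placeholders, and the function names `strip_quoted`, `classify`, and `encode_record` are likewise assumptions. Only the I/S/X rule is taken from the description.

```python
import re

# Hypothetical stand-in for Table 1 (the real table is not reproduced here):
# each delimiter or punctuation mark maps to one representative letter.
SYMBOL_TABLE = {" ": "G", ",": "C", ":": "D", "[": "L", "]": "R",
                '"': "Q", "-": "H", "/": "F", ".": "P"}

def strip_quoted(record):
    """Step 1: inside double-quoted spans, drop delimiters and punctuation."""
    def clean(match):
        inner = "".join(ch for ch in match.group(1) if ch not in SYMBOL_TABLE)
        return '"' + inner + '"'
    return re.sub(r'"([^"]*)"', clean, record)

def classify(token):
    """Data type of a field: I = all digits, S = no digits, X = mixed."""
    digits = sum(ch.isdigit() for ch in token)
    if digits == len(token):
        return "I"
    if digits == 0:
        return "S"
    return "X"

def encode_record(record):
    """Step 2: scan left to right, emitting one symbol per separator or field."""
    out, token = [], []
    for ch in strip_quoted(record):
        if ch in SYMBOL_TABLE:
            if token:  # a field just ended: emit its data-type symbol
                out.append(classify("".join(token)))
                token = []
            out.append(SYMBOL_TABLE[ch])
        else:
            token.append(ch)
    if token:  # trailing field at the end of the record
        out.append(classify("".join(token)))
    return "".join(out)
```

With this placeholder table, a record such as `10.0.0.1 - "GET /a b" 200` first loses the space and slash inside the quotes, and every remaining field collapses to a single type symbol, so records of the same format tend to yield the same or very similar strings regardless of field values.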

FIG. 2 shows an example of converting a raw log record using the log format feature encoding table.

Once a record has been converted into a biological-sequence string, the template generation module stores the string as a pairing of log format name and sequence string. The resulting log format template is then saved to the log format template database.

Table 2 gives an example of log format template names. The module stores the sequences of the different log formats in the format of Table 2, filters out duplicate name and sequence pairs, and then builds a fast index through BLAST to produce the log format template database 130.
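A minimal sketch of the deduplication step is shown below, assuming the templates are kept as (name, sequence) pairs and written out as FASTA text so that a BLAST index can be built from them. The function names, the `encode` callback, and the `>name|index` header layout are assumptions made for illustration; Table 2 is not reproduced in this text.

```python
def build_templates(named_records, encode):
    """Encode (format_name, raw_record) pairs and drop duplicate templates.

    `encode` is any function that turns a raw record into a sequence string.
    """
    seen = set()
    templates = []
    for name, record in named_records:
        pair = (name, encode(record))
        if pair not in seen:  # keep only the first copy of each name/sequence pair
            seen.add(pair)
            templates.append(pair)
    return templates

def to_fasta(templates):
    """Render templates as FASTA text; the header layout is hypothetical."""
    lines = []
    for index, (name, sequence) in enumerate(templates):
        lines.append(f">{name}|{index}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"
```

The FASTA text could then be indexed with the BLAST+ tool `makeblastdb`, for example `makeblastdb -in templates.fasta -dbtype prot`, where the file name is a placeholder.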

In the log format template database 130, each log format has more than one representative sequence. The weight adjustment module 113 adjusts the scoring matrix used by BLAST by performing multiple sequence alignment over all sequences of the same log format. As shown in FIG. 4, multiple sequence alignment reveals how each symbol in the log format template database aligns with every other symbol, and a scoring matrix can then be computed with the following formula:

S_ij = (1/λ) log( P_ij / (Q_i × Q_j) + 0.5 )

Here P_ij is the probability that symbols i and j fall in the same column of the multiple sequence alignment; Q_i and Q_j are the background probabilities of symbols i and j appearing in any sequence of the log format template database; λ is a scaling factor that ensures the elements of the scoring matrix are integers; and the constant 0.5 adjusts the matrix elements to avoid the case where a probability is 0. The module uses the biological sequence tool ClustalW to perform multiple sequence alignment over the sequences of the same log format in the log format template database, obtains the symbol-to-symbol alignment probabilities, and then computes the scoring matrix with the formula above. If a new log format is added later, the log format template database and the sequence-alignment scoring matrix are likewise updated through the log format template creation module.
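The computation can be sketched as below, under several assumptions that the description does not fix: the sequences are already aligned (for example by a multiple sequence alignment tool) with `-` as the gap character, the logarithm is natural, λ defaults to 0.5, the background frequencies are taken from the aligned sequences themselves, and integer scores are obtained by rounding. The function name `scoring_matrix` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def scoring_matrix(alignments, lam=0.5, gap="-"):
    """Compute S_ij = (1/lam) * log(P_ij / (Q_i * Q_j) + 0.5), rounded.

    `alignments` is a list of alignments, one per log format; each alignment
    is a list of equal-length, already aligned sequence strings.
    """
    pair_counts = Counter()    # unordered symbol pairs seen in the same column
    symbol_counts = Counter()  # background symbol counts
    for aligned in alignments:
        for column in zip(*aligned):
            symbols = [s for s in column if s != gap]
            symbol_counts.update(symbols)
            for a, b in combinations(symbols, 2):
                pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    total_symbols = sum(symbol_counts.values())
    if total_pairs == 0:
        raise ValueError("need at least two aligned sequences")
    matrix = {}
    for i in symbol_counts:
        for j in symbol_counts:
            p_ij = pair_counts[tuple(sorted((i, j)))] / total_pairs
            q_i = symbol_counts[i] / total_symbols
            q_j = symbol_counts[j] / total_symbols
            # The 0.5 keeps the logarithm defined when P_ij is 0.
            matrix[(i, j)] = round(math.log(p_ij / (q_i * q_j) + 0.5) / lam)
    return matrix
```

In this sketch, symbol pairs that never share a column get a negative score, while pairs that co-occur more often than their background frequencies predict get a positive one.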

When network logs are being collected, the user inputs a raw log file of unknown format, and the log format identification module identifies its log format name.

After the user inputs a raw log file of unknown format, the file sampling module uses a subset of the records to represent the contents of the whole file. Sampling is done by taking one record out of every few records.
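Interval sampling of this kind might look like the following sketch. The step size of 10 and the decision to skip blank lines are assumptions; the description only says that one record is taken every few records.

```python
from itertools import islice

def sample_records(lines, step=10):
    """Keep every `step`-th record, starting with the first, skipping blanks."""
    return [line.rstrip("\n")
            for line in islice(lines, 0, None, step)
            if line.strip()]
```

Because `islice` works on any iterable, an open file object can be passed directly, so a large log file does not need to be read into memory first.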

Once the samples have been taken, the log encoding module converts the sampled records into biological-sequence strings according to the log format feature encoding table. The sequence alignment module then searches the log format template database for the most similar sequence and outputs the result. It uses the bioinformatics search tool BLAST (Basic Local Alignment Search Tool) to find the most similar sequence in the log format template database 130. The scoring matrix (score matrix) used by BLAST can be adjusted according to the distribution of the sequences in the log format template database. FIG. 3 shows BLOSUM 62, the default scoring matrix of BLAST.
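To illustrate what "most similar sequence" means here, the sketch below scores a query against each template with a plain Smith-Waterman local alignment. This is a stand-in and not BLAST itself, which uses an indexed heuristic search; the match, mismatch, and gap values and the function names are assumptions, and `score` could instead look up a scoring matrix adjusted as described above.

```python
def simple_score(a, b):
    """Assumed scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

def local_alignment_score(query, target, score=simple_score, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    previous = [0] * (len(target) + 1)
    best = 0
    for qc in query:
        current = [0]
        for j, tc in enumerate(target, start=1):
            cell = max(0,
                       previous[j - 1] + score(qc, tc),  # align qc with tc
                       previous[j] + gap,                # gap in target
                       current[j - 1] + gap)             # gap in query
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best

def most_similar(query, templates, score=simple_score):
    """Return the (name, sequence) template that scores highest for `query`."""
    return max(templates,
               key=lambda t: local_alignment_score(query, t[1], score))
```

Exhaustive scoring like this is quadratic per template, which is why an indexed search tool is the practical choice once the template database grows.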

Once the most similar sequence has been found, the corresponding log format name 140 is output as the log format of the raw log file. If all sampled records match the same log format, a single log format name is output; if they match several formats, a list of log format names sorted by alignment score is output.
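One way to combine the per-record results into this output is sketched below. It assumes each sampled record contributes one (format name, score) hit and that a format is ranked by its best score; the description does not say how scores from several records are combined, so that choice and the name `summarize_hits` are hypothetical.

```python
def summarize_hits(hits):
    """Rank format names by their best score across all sampled records.

    `hits` is a list of (format_name, score) pairs. The result has one
    entry when every record agrees, several entries sorted from highest
    to lowest score otherwise, and is empty when nothing matched.
    """
    best = {}
    for name, score in hits:
        if name not in best or score > best[name]:
            best[name] = score
    return sorted(best, key=lambda name: best[name], reverse=True)
```

An empty result corresponds to the case where the format cannot be determined.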

FIG. 5 shows the relationships among the internal components of the log format template creation module and the log format identification module. The sequence alignment module 540 queries the log format template database 541 with the encoded sequence 530, and the format name of the most similar sequence template found is output by the sequence alignment module as the log format name 550. The log format template database 541 also serves as input to the weight adjustment module 542, and the scoring matrix that module produces affects the alignment results of the sequence alignment module 540.

FIG. 6 is the log format identification flowchart of the invention. The process is as follows:
S601: input a log file of unknown format;
S602: sample records from the log file and encode the format features of the records into sequences of the kind used in bioinformatics;
S603: compare against the established log format database to find the most similar log format name;
S604: if the comparison succeeds, output the log format name of the log file;
S605: if the comparison fails, output that the format cannot be determined.

FIG. 7 is the flowchart for building the log format template database of the invention. The process is as follows:
S701: input a log file of known format;
S702: take each record of the log file, encode the format features of the record into a sequence of the kind used in bioinformatics, and produce a log format template that pairs the log format name with the sequence;
S703: filter out duplicate log format templates to produce the log format template database;
S704: recompute the scoring weights for sequence alignment from the distribution of the sequences in the log format template database.

The detailed description above explains one feasible embodiment of the invention. That embodiment is not intended to limit the patent scope of the invention, and any equivalent implementation or modification that does not depart from the spirit of the invention falls within the patent scope of this application.

In summary, this application is innovative in its technical concept and offers the several benefits described above, which conventional methods do not achieve. It therefore fully meets the statutory requirements of novelty and inventive step for an invention patent. The application is filed in accordance with the law, and the Office is respectfully requested to approve it so as to encourage invention.

100‧‧‧Computer system

111, 122‧‧‧Log encoding module

112‧‧‧Template generation module

113‧‧‧Weight adjustment module

120‧‧‧Log format identification module

121‧‧‧File sampling module

123‧‧‧Sequence alignment module

130‧‧‧Log format template database

140‧‧‧Log format name

Claims

A network log format identification system, mainly comprising: a log format identification module, used during the collection of network logs, which identifies the log format name of a file of unknown log format to aid the analysis and processing of subsequent data; and a log format template creation module, which collects training data to generate a log format template database for use by the log format identification module.

The network log format identification system of claim 1, wherein the log format template creation module further comprises: a log encoding module, which converts each record in a file into a biological-sequence string according to a log format feature encoding table; a template generation module, which stores the string as a pairing of log format name and sequence string to build the log format template database; and a weight adjustment module, which adjusts the weights of the sequence-alignment scoring matrix according to the content distribution of the log format template database.

The network log format identification system of claim 1, wherein the log format identification module further comprises: a file sampling module, which samples the contents of the raw log file; a log encoding module, which converts the records sampled by the file sampling module into biological-sequence strings according to the log format feature encoding table; a sequence alignment module, which compares the biological-sequence strings with the sequences in the log format template database to find the most similar sequence; and the log format template database, which stores the biological-sequence strings.

The network log format identification system of claim 2, wherein the log format name is the log format determined for the raw log file and is provided as output.

The network log format identification system of claim 3, wherein the log format feature encoding table assigns a representative symbol to each delimiter and punctuation mark, field contents are handled according to their data type with one symbol representing each data type, and the record is thereby converted into a biological-sequence string.

A method for identifying a network log format, comprising: first building a log format template sequence database, in which a user collects raw log files of known format and builds a log format template database through a log format template creation module; adjusting the scoring matrix for sequence alignment according to the contents of the log format template database, where any log format added later likewise updates the log format template database and the sequence-alignment scoring matrix through the log format template creation module; after the log format template database has been built, the user inputting a raw log file of unknown log format; a log format identification module first sampling the contents of the raw log file and encoding the sampled records into biological-sequence strings; searching the sequences in the log format template database with those strings to find the most similar sequence; and outputting the log format name corresponding to that sequence as the log format of the raw log file.

The method for identifying a network log format of claim 6, wherein the sequence alignment determines the similarity of sequences according to the scoring mechanism with adjusted weights.

The method for identifying a network log format of claim 6, wherein the log format identification comprises the following process: inputting a log file of unknown format; sampling records from the log file and encoding the format features of the records into sequences of the kind used in bioinformatics; comparing against the established log format database to find the most similar log format name; if the comparison succeeds, outputting the log format name of the log file; and if the comparison fails, outputting that the format cannot be determined.

The method for identifying a network log format of claim 6, wherein building the log format template database comprises the following process: inputting a log file of known format; taking each record of the log file, encoding the format features of the record into a sequence of the kind used in bioinformatics, and producing a log format template that pairs the log format name with the sequence; filtering out duplicate log format templates to produce the log format template database; and recomputing the scoring weights for sequence alignment from the distribution of the sequences in the log format template database.