TWI514823B

TWI514823B - A traffic classification system based on message size sequence and method thereof

Info

Publication number: TWI514823B
Application number: TW102136709A
Authority: TW
Inventors: Chun Nan Lu; Chun Ying Huang; Ying Dar Lin; Yuan Cheng Lai
Original assignee: Univ Nat Chiao Tung
Priority date: 2013-10-11
Filing date: 2013-10-11
Publication date: 2015-12-21
Also published as: TW201515410A

Description

Network traffic identification system based on message length sequence and method thereof

The present invention relates to network traffic classification and, more particularly, to a network traffic identification system and method that analyze packet attributes based on message length sequences.

With the rapid growth of the Internet, network applications of all kinds exchange messages, transfer data, and communicate over the network. When a network application transfers data, the data is split into multiple packets. Because many harmful junk packets circulate on today's networks, identifying the content of traffic effectively has become essential to keeping data transfer secure.

Packets carry an IP address and a port, so early approaches identified the network application behind a packet by its well-known port: port numbers were planned in advance, and each network application was expected to use its assigned port. Port 80, for example, is reserved for the HTTP protocol. Today, however, many communication programs also use port 80, so the port number no longer identifies the application that a packet belongs to. Moreover, many harmful packets on today's networks use port randomization, so the port-based method cannot reliably detect them.

A second approach, matching packet content signatures, is now widely used: a packet is identified by checking whether its content contains specific patterns. Antivirus software, for example, can compare packet content against keywords in a database to decide whether to block it. However, more and more network applications encrypt their packets, which prevents this method from parsing the content, and some malicious programs disguise their packet content to evade signature matching. The result can be false blocks or missed blocks, and inspecting packet content also raises personal privacy concerns. A further consideration is that transport-layer behavior matching approaches must collect enough transport-layer information before they can classify correctly, which makes classification slow.

Therefore, how to overcome the shortcomings of existing network traffic classification techniques, and in particular how to identify network traffic effectively without infringing personal privacy and without being defeated by packet encryption, is a problem in urgent need of a solution.

In view of the above drawbacks of the prior art, an object of the present invention is to provide a network traffic identification system and method based on message length sequences, which determine the network application that traffic belongs to by examining packet sizes and their order.

To achieve the foregoing and other objects, the present invention provides a network traffic identification system based on message length sequences, comprising a database, a traffic collection module, a traffic disassembly module, an identification module, and a determination module. The database pre-stores a common length subsequence set for each of a variety of network applications. The traffic collection module collects network traffic. The traffic disassembly module splits the network traffic into multiple connections according to flow information and extracts the transmission direction and length of the packets in each connection, so as to generate a length feature sequence for each connection. The identification module compares the length feature sequence with the common length subsequence sets of the network applications in the database to find the entry that is most similar to the length feature sequence. Finally, the determination module decides, according to the number of most-similar entries found by the identification module, whether the connection belongs to a known network application or an unknown one.

In one embodiment, the traffic disassembly module removes, from the packets of each connection, packets whose length equals the maximum transmission unit and packets whose payload length is zero.

In another embodiment, the network traffic identification system based on message length sequences further comprises an application representative set generation module. This module is trained on known network traffic: the traffic disassembly module splits the known traffic into connections and generates a length feature sequence for each connection; the longest common length subsequence is then computed for every pair of connections, and the results for all pairs are collected to form the common length subsequence set of the application.

The present invention also provides a network traffic identification method based on message length sequences, comprising: providing common length subsequence sets corresponding to a variety of network applications; collecting network traffic, splitting it into multiple connections, and extracting the transmission direction and length of the packets in each connection to generate a length feature sequence for each connection; comparing the length feature sequence with the common length subsequence sets of the network applications to find the entry that is most similar to the length feature sequence; and deciding, according to the number of most-similar entries, whether the connection belongs to a known network application or an unknown one.

In one embodiment, before the transmission direction and length of the packets in each connection are extracted, the method further comprises removing, from the packets of each connection, packets whose length equals the maximum transmission unit and packets whose payload length is zero.

In yet another embodiment, the common length subsequence set is obtained before identification by training on known network traffic, which comprises: splitting the known network traffic into multiple connections; extracting the transmission direction and length of the packets in each connection to generate a length feature sequence for each connection; computing the longest common length subsequence for every pair of connections; and collecting the longest common length subsequences computed for all pairs to form the common length subsequence set.

Compared with the prior art, the network traffic identification system and method based on message length sequences can detect network applications that are encrypted or that deliberately hide their protocol and content, giving network administrators sufficient information to manage traffic. Conventional traffic detection, whether it relies on the well-known ports used by ordinary network applications or on matching packet content signatures, is prone to misclassification and slow to decide. The present invention instead extracts transport-layer behavioral features, using the connection behavior of a network application as the basis for analysis rather than parsing packet content as before. It can therefore detect network applications that use encrypted protocols, and because it never parses packet content, it does not infringe personal privacy. The invention thus solves the conventional inability to detect network applications that encrypt their communication or deliberately hide their packet content.

1‧‧‧Flowchart
11‧‧‧First stage
110~113‧‧‧Processes
12‧‧‧Second stage
120~123‧‧‧Processes
2, 3‧‧‧Network traffic identification system based on message length sequences
21, 31‧‧‧Database
22, 32‧‧‧Traffic collection module
23, 33‧‧‧Traffic disassembly module
24, 34‧‧‧Identification module
25, 35‧‧‧Determination module
36‧‧‧Application representative set generation module
100‧‧‧Network traffic
200‧‧‧Known network traffic
S601~S604‧‧‧Steps

FIG. 1 is a flowchart of the two stages, training and classification, of the network traffic identification system based on message length sequences of the present invention.
FIG. 2 is a system architecture diagram of the system in the classification stage.
FIG. 3 is a system architecture diagram of the system in the training stage.
FIG. 4 is a schematic diagram of an embodiment of the training process of the system.
FIG. 5 is a schematic diagram of an embodiment of the identification process of the system.
FIG. 6 is a step diagram of the network traffic identification method based on message length sequences of the present invention.

The following specific embodiments illustrate how the present invention may be implemented. Those skilled in the art can readily understand other features and effects of the invention from the content disclosed in this specification. The invention may also be practiced or applied through other, different embodiments.

Referring to FIG. 1, a flowchart of the two stages, training and classification, of the network traffic identification system based on message length sequences is shown. Flowchart 1 divides the invention into a first stage 11 and a second stage 12. To obtain connection behavior features of network applications as the basis for identification, the invention is trained in advance to collect those features for a variety of network applications, and then uses them to determine which network application a packet belongs to.

The first stage 11 is the training stage. First, process 110 traces a network application to perform traffic collection. To analyze the behavior of each network application, the invention uses a network application traffic collection technique: the application of interest is run on a single host, and the application and the port numbers it uses are restricted so that only that application's traffic passes through the host's network interface. Traffic recording at the point where traffic enters and leaves then captures the required traffic for analysis. In other words, process 110 collects the traffic of the network application, and enough traffic must be collected, because the completeness of the traffic affects classification accuracy. Training thus handles one network application at a time, recording the packets that pass through the host for analysis of that application's connection behavior.

Next, process 111 performs flow characterizing: the recorded network traffic is split into multiple connections (flows). Although the training stage trains only one network application at a time, the application behaves differently with different peers, which produces different packet sequences. The traffic is therefore grouped by peer, for example by IP address, so that traffic for different peers is kept apart. Mixing the traffic of several peers together would otherwise cause misclassification.

Process 112 derives the longest common subsequences among the flows. Each flow yields the sequence that represents it; the longest subsequence shared between different flows is then found and used to represent the network application being trained. When some network traffic later exhibits a sequence similar to that longest common subsequence, the traffic may be judged to belong to the corresponding network application.

Process 113 finds the set of representative length feature sequences of the network application, that is, the common length subsequence set across the application's different flows. Process 112 may produce several longest common subsequences, so process 113 gathers them into a set, and this set represents the trained network application.

The second stage 12 is the actual classification stage. The common length subsequence sets obtained in the first stage 11, each representing a network application, serve as the baseline against which real network traffic is compared. The network application that a captured connection belongs to is inferred from its similarity to the common length subsequence set of each application.

First, processes 120 and 121 likewise collect actual network traffic and perform traffic decomposition. Using a criterion such as the IP address, the captured traffic is split into multiple connections so that each peer's traffic is kept separate, which simplifies the later identification of packet sequences.

Next, process 122 classifies the network traffic to achieve flow classification. The packet sequence of the traffic obtained in process 121 is compared with the common length subsequence sets obtained in the first stage 11, each representing a different network application, to find the entry closest to that packet sequence.

Finally, process 123 takes the entry that process 122 found closest to the packet sequence under test and looks up the network application it corresponds to, which gives the network application that the captured traffic belongs to.

As described above, training establishes the connection behavior of a network application, which is then compared with newly captured connections to determine the network application each new connection belongs to. Note that the description above covers comparison against a single network application. If several network applications need to be compared, the same process is simply repeated for each of them.

Referring to FIG. 2, a system architecture diagram of the network traffic identification system based on message length sequences in the classification stage is shown. The network traffic identification system 2 mainly illustrates the traffic identification performed in the second stage of FIG. 1, and comprises a database 21, a traffic collection module 22, a traffic disassembly module 23, an identification module 24, and a determination module 25.

The database 21 pre-stores the common length subsequence sets corresponding to the various network applications. As noted above, a common length subsequence set represents one network application, so before identification a set representing each network application is obtained through training. The training procedure is described in more detail below.

The traffic collection module 22 collects network traffic 100, that is, the traffic passing through a network device. The network traffic 100 may mix packets from several network applications, or packets from a single application communicating with different peers, so after collection it must be disassembled before it can be compared and classified.

The traffic disassembly module 23 splits the network traffic 100 collected by the traffic collection module 22 into multiple connections according to flow information, that is, according to the different peers. The flow information may be the source IP address, source port, destination IP address, destination port, and transport protocol. After the traffic is split into connections, the transmission direction and length of the packets in each connection are extracted, and the packets of a connection are used to generate the length feature sequence (message size sequence) of that connection.

More specifically, once the traffic is split, each connection carries the packets exchanged with a single peer. Taking one connection as an example, its packets are arranged in order and the transmission direction and length of each packet are determined. The direction of packets sent by the connection initiator may be taken as positive and the opposite direction as negative. The packets of the connection then yield its length feature sequence, which serves as the packet sequence representing the connection.
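The splitting and sequence-building steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Packet` record, the function names, and the rule that the sender of the first packet seen on a connection is its initiator are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical record holding only the flow information named in the text.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    payload_len: int


def flow_key(p):
    # Direction-independent key so both directions map to one connection.
    a = (p.src_ip, p.src_port)
    b = (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def size_sequences(packets):
    """Group packets into connections and build signed length sequences."""
    initiator = {}
    seqs = defaultdict(list)
    for p in packets:
        key = flow_key(p)
        src = (p.src_ip, p.src_port)
        # Assumption: the first packet observed comes from the initiator.
        initiator.setdefault(key, src)
        sign = 1 if src == initiator[key] else -1
        seqs[key].append(sign * p.payload_len)
    return dict(seqs)
```

Packets sent by the initiator contribute positive values and replies contribute negative values, matching the sign convention described above.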

In a specific implementation, the traffic disassembly module 23 also removes, from the packets of each connection, packets whose length equals the maximum transmission unit and packets whose payload length is zero. Not every captured packet is useful for finding the packet sequence of a network application. Packets can be divided into those carrying control messages and those carrying data messages. Once the server and the client are connected, data is transferred in packets of the largest possible size to shorten transfer time, so packets carrying data messages reach the maximum transmission length and show no distinguishing differences. Packets carrying control messages, by contrast, may hold information such as an account name and password, so their sizes vary with the connection process. The traffic disassembly module 23 therefore first removes from each connection the packets whose length equals the maximum transmission unit and the packets whose payload size is zero, which makes the resulting packet sequences more useful.
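The filtering rule can be sketched in a few lines. The representation of a packet as a `(total_len, payload_len)` pair and the default MTU of 1500 bytes (the usual Ethernet value) are assumptions for illustration; the text itself does not fix either.

```python
ASSUMED_MTU = 1500  # assumption: typical Ethernet MTU; adjust for the actual link


def drop_uninformative(packets, mtu=ASSUMED_MTU):
    """Keep only packets that are neither MTU-sized nor empty.

    Each packet is a hypothetical (total_len, payload_len) pair.
    """
    return [
        (total_len, payload_len)
        for total_len, payload_len in packets
        if total_len != mtu and payload_len != 0
    ]
```

Applied before the length feature sequence is built, this leaves mostly control-message packets, whose sizes vary from application to application.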

The identification module 24 compares the length feature sequence of a connection, as obtained by the traffic disassembly module 23, with the common length subsequence sets of the network applications pre-stored in the database 21, and finds in those sets the entry most similar to the connection's length feature sequence. The higher the similarity, the closer the connection behavior of the two.
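The passage above does not define the similarity measure. The sketch below assumes one plausible choice, the fraction of a stored subsequence that also appears, in order, in the connection's sequence; the function names and the dictionary layout of the database are likewise hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (rolling DP row)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def best_similarity(flow_seq, signature_set):
    # Assumed metric: share of a stored subsequence matched by the flow.
    return max(
        (lcs_length(flow_seq, s) / len(s) for s in signature_set if s),
        default=0.0,
    )


def score_applications(flow_seq, database):
    """Map each application name to its best similarity for this flow."""
    return {app: best_similarity(flow_seq, sigs) for app, sigs in database.items()}
```

Other measures, such as edit distance, would fit the same interface; the choice here is only for illustration.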

Finally, the determination module 25 decides, according to the number of most-similar entries found by the identification module 24, whether the connection belongs to a known network application or an unknown one. When the identification module 24 performs identification, it may find no candidate, or more than one candidate, that is similar to the length feature sequence of the connection being identified. If the number is zero or greater than one, the connection is judged to belong to an unknown network application. If exactly one candidate is most similar, the connection is judged to belong to a known network application.
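The decision rule can be expressed directly. In this sketch the input is a hypothetical mapping from application name to similarity score, and a best score of zero is assumed to mean that nothing similar was found.

```python
def decide(scores):
    """Return the application name if exactly one best match exists, else None.

    None stands for "unknown network application".
    """
    if not scores:
        return None
    best = max(scores.values())
    if best <= 0:
        # Assumption: a zero score means no similar candidate was found.
        return None
    winners = [app for app, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None
```

A tie between two applications is treated the same as no match, which follows the rule that more than one most-similar candidate yields an unknown result.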

In a specific implementation, the length feature sequence formed by the packets of a flow is produced by assigning each packet a numeric value according to its transmission direction and length, and arranging those values in order. As described above, the direction is expressed by sign and the packet size by magnitude. For example, a packet sent from the connection initiator to the receiver with a size of 20 KB may be assigned the value +20. The length feature sequence is thus defined by the values assigned to the packets according to their direction and length, and the resulting sequence is used for the later similarity comparison.

Referring to FIG. 3, a system architecture diagram of the network traffic identification system based on message length sequences in the training stage is shown. In addition to a database 31, a traffic collection module 32, a traffic disassembly module 33, an identification module 34, and a determination module 35, the network traffic identification system 3 includes an application representative set generation module 36, which is used in the training stage to generate the set of representative length feature sequences of each network application.

Note that the identification module 34 and the determination module 35 do not need to operate during the training stage. In addition, the data received by the network traffic identification system 3 during training is known network traffic 200, for which the originating network application is known, rather than network traffic 100, for which it is not. Network traffic 100 therefore does not need to be supplied during training.

The application representative set generation module 36 is trained on the known network traffic 200. The traffic disassembly module 33 splits the known network traffic into connections and generates a length feature sequence for each. The length feature sequences are then taken two at a time to compute the longest common length subsequence (longest size subsequence) of each pair of connections, and the longest common length subsequences computed for all pairs are collected to form the common length subsequence set. In this way, the application representative set generation module 36 finds the common length subsequence sets corresponding to different network applications, which capture the packet sequences each application is likely to produce.

Specifically, the known network traffic 200 is likewise collected by the traffic collection module 32. Here, known network traffic 200 means packets collected from a single network application only. The traffic is passed to the traffic disassembly module 33, which splits it into multiple connections according to the flow information described above, such as the source IP address, source port, destination IP address, destination port, and transport protocol. The transmission direction and length of the packets in each connection are then extracted, and the packets of a connection are used to generate its length feature sequence. The traffic collection module 32 and the traffic disassembly module 33 function in the same way as the traffic collection module 22 and the traffic disassembly module 23 of FIG. 2.

Next, the length feature sequences generated for the different connections are passed to the application representative set generation module 36, which takes the connections in pairs (any two connections form a pair) and computes the longest common length subsequence of the two connections, that is, the longest of the subsequences the two connections have in common. Once the longest common length subsequences of all pairings have been collected, they form the set of common length subsequences. The set of common length subsequences is therefore a collection of the packet sequences that the network application may transmit.
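The pairwise computation can be sketched as follows. This sketch assumes the classic dynamic-programming longest common subsequence (elements need not be adjacent) and exact equality between sequence elements; the text does not specify the algorithm, and the function names are illustrative.

```python
from itertools import combinations


def lcs(a, b):
    """Return one longest common subsequence of sequences a and b as a tuple."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return tuple(reversed(out))


def common_subsequence_set(sequences):
    """Collect the LCS of every pair of connections, skipping empty results."""
    result = set()
    for a, b in combinations(sequences, 2):
        shared = lcs(a, b)
        if shared:
            result.add(shared)
    return result
```

With `n` training connections this evaluates `n * (n - 1) / 2` pairs, so the cost grows quadratically with the number of connections collected for one application.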

In addition, if during this computation a shorter sequence is found to exist within a longer common subsequence, that is, the shorter sequence is already contained in another, longer common subsequence, the shorter one can be deleted. A sequence already contained in another common subsequence does not need to be a member of the set of common length subsequences.

Refer to FIG. 4, a schematic diagram of an embodiment of the training process of the message-length-sequence-based network traffic identification system of the present invention; the figure mainly illustrates how the longest common length subsequence is found. As shown, the possible packet sequence characterizing the connections of a certain network application is 1-2-3-4-5. During training, the length feature sequence obtained for connection A is 1-2-3-4 and that for connection B is 2-3-4-5. The application representative set generation module 36 of FIG. 3 then pairs the connections and finds the longest common length subsequence between the two, which in this example is 2-3-4.

As described above, the collected known network traffic is disassembled into multiple connections, the longest common length subsequence is computed for each pair, and the result is a number of longest common length subsequences. These form the set of common length subsequences of the corresponding network application produced by the application representative set generation module 36 of FIG. 3. In addition, if one of the longest common length subsequences found is, for example, 3-4, it is contained in the longest common length subsequence (2-3-4) obtained from connections A and B in the figure, so the subsequence 3-4 does not need to be added to the set of common length subsequences.
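The removal of contained sequences, such as 3-4 inside 2-3-4, can be sketched as below. The sketch assumes that "contained" means "is a subsequence of"; a stricter contiguous-run test would be an equally plausible reading of the text. The function names are illustrative.

```python
def is_subsequence(short, long):
    """True if every element of short appears in long in the same order."""
    it = iter(long)
    return all(x in it for x in short)


def prune_contained(candidates):
    """Keep only sequences that are not contained in a longer kept sequence."""
    unique = sorted(set(map(tuple, candidates)), key=len, reverse=True)
    kept = []
    for cand in unique:
        if not any(is_subsequence(cand, longer) for longer in kept):
            kept.append(cand)
    return kept
```

Processing candidates from longest to shortest means each candidate only has to be checked against sequences that have already been kept.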

Refer to FIG. 5, a schematic diagram of an embodiment of the identification process of the message-length-sequence-based network traffic identification system of the present invention; FIG. 5 mainly illustrates how the similarity between length feature sequences is determined. After a certain network application has been trained, one possible packet sequence in its set of common length subsequences is 1-2-3-4-5. To make the similarity comparison easier, each packet can be assigned a numerical value according to its direction and length. As shown, the possible packet sequence can be expressed, in order, as the values +10-10+15-10+20, where the numbers 10, 15, and 20 are merely example packet sizes.

Then, during identification of new network traffic, the network traffic is likewise disassembled into multiple connections. In case 1, one connection contains the packet sequence 2-3-4-5, which by packet direction and size yields the values -10+15-10+20. Compared with the possible packet sequence (1-2-3-4-5), four packets in the sequence match and their values are equal; that is, the connection of case 1 is highly similar to the possible packet sequence.

Now consider case 2, in which another connection contains the packet sequence 1-2-3-4-5 and, by packet direction and size, yields the values +10-8+16-10+19. Compared with the possible packet sequence (1-2-3-4-5), five packets in the sequence match in order, and their values are close to those of the packet sequence (+10-10+15-10+20), so the similarity between the two is also high. The differences in value reflect differences in packet size. As long as the signs are the same, the transmission directions are the same, and the packet size can be given an error tolerance (set as circumstances require) to avoid misjudgments caused by very small differences in packet size.

Comparing the similarity of case 1 and case 2, case 2 already has one more matching packet than case 1. If the values of case 2 also all fall within the tolerable range, the similarity of case 2 can be judged higher than that of case 1.
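Cases 1 and 2 can be reproduced with a small sketch. The text does not define a similarity formula, so the score used here, the number of packets that match in order with the same sign and a size difference within a tolerance, is an assumption, as are the function names and the tolerance value of 2.

```python
def packets_match(x, y, tol):
    """Same direction (sign) and sizes within tol of each other."""
    same_direction = (x > 0) == (y > 0)
    return same_direction and abs(abs(x) - abs(y)) <= tol


def similarity(observed, candidate, tol=0):
    """Count in-order matches between two signed length sequences.

    This is a longest-common-subsequence length in which element equality
    is replaced by the tolerance test above (an assumed scoring rule).
    """
    m, n = len(observed), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if packets_match(observed[i], candidate[j], tol):
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]


candidate = [10, -10, 15, -10, 20]   # +10-10+15-10+20
case1 = [-10, 15, -10, 20]           # -10+15-10+20
case2 = [10, -8, 16, -10, 19]        # +10-8+16-10+19
```

Under these assumptions `similarity(case1, candidate)` is 4 and `similarity(case2, candidate, tol=2)` is 5, which agrees with the ranking described above. With a tolerance of 0, case 2 would score lower than case 1, which is why the tolerance matters.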

The present invention does not consider the timing of the captured packets, only the order in which they appear. After the sender has sent several packets, the receiver holds them in a buffer and then determines whether any pre-stored packet sequence resembles the order of the collected packets. Put simply, if the packet order is wrong (different from the possible packet sequences of a given network application), the packets are judged not to belong to that network application; only when the order conforms to a packet order defined in the database is the closeness of the packet sizes considered further.

This prevents packets in a different order from being misjudged, because within the same network application different packet orders carry different application meanings. For example, if the packet sequence obtained in training is 1-2-3 and the packet sequence of a connection disassembled from new network traffic is 3-1-2, the two cannot be considered highly similar merely because they contain packets of the same sizes. Moreover, earlier approaches judged similarity only by whether the packets were similar (for instance in packet count or packet content), so two connections having packets of the same sizes in different orders might be judged similar, even though differently ordered packets may well have been generated by different network applications. The present invention therefore gives first priority to whether the packet order is correct, which improves the accuracy of the determination.

Refer to FIG. 6, a flowchart of the steps of the message-length-sequence-based network traffic identification method of the present invention. As shown, in step S601, sets of common length subsequences corresponding to various network applications are provided. These sets can be obtained through prior training, and each set represents the many possible packet sequences of a network application. The process then proceeds to step S602.

In step S602, network traffic is collected and disassembled into multiple connections, and the transmission direction and length of the packets in each connection are extracted to generate a length feature sequence corresponding to each connection. In this step, the collected network traffic is disassembled into multiple connections according to traffic information, which may include source IP address, source port number, destination IP, destination port number, and transport protocol, that is, information that identifies the communicating parties.

In addition, before the transmission direction and length of the packets in each connection are extracted, packets whose length equals the maximum transmission unit and packets whose content length is zero are removed from the packets of each connection. Such packets cannot be members of the packet sequence because they provide no discriminating power and instead tend to confuse the determination, so these useless packets must be removed before the length feature sequence is built. The process then proceeds to step S603.
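The filtering step can be sketched in a few lines. For brevity the sketch filters a list of already signed lengths rather than raw packets, and the maximum length is passed in as a parameter because the text does not fix a value; the 1500 used in the usage note is only an assumed Ethernet MTU.

```python
def drop_uninformative(lengths, max_len):
    """Remove zero-length packets and packets whose size equals max_len.

    lengths: signed packet lengths (sign encodes direction).
    max_len: the maximum transmission unit to filter on (caller-supplied).
    """
    return [x for x in lengths if x != 0 and abs(x) != max_len]
```

For example, `drop_uninformative([10, 0, -1500, 1500, -20], 1500)` keeps only `[10, -20]`.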

In step S603, the length feature sequence is compared with the sets of common length subsequences of the various network applications, to obtain from those sets the entry with the highest similarity to the length feature sequence. In this step, the length feature sequence of each connection obtained in step S602 is compared with the predefined sets of common length subsequences of the various network applications to find the best match. The comparison takes into account how similar the packet sequences are in order, number, and size. The order of a packet sequence means the order of the useful packets in the connection; if the ordering clearly differs, no further comparison is needed at all. Similarity is then judged by the number and sizes of the packets.

For example, each of the packets can be assigned a numerical value according to its transmission direction and length, and the values arranged in order to produce the length feature sequence representing the connection. As shown in FIG. 5 above, the sign and magnitude represent direction and size, with a suitable tolerance applied where necessary to avoid misjudgments caused by slight differences in packet size. The process then proceeds to step S604.

In step S604, the connection is determined to belong to a known network application or an unknown network application according to how many entries of the sets of common length subsequences attain the highest similarity. In this step, after the length feature sequence of any connection has been compared with the sets of common length subsequences, one or more length feature sequences with the highest similarity may be found. If only one has the highest similarity, the connection can be determined to belong to the network application corresponding to that length feature sequence in the set. Conversely, if more than one does, the network application to which the connection belongs cannot be determined, so it is judged to be an unknown network application.
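The decision rule of step S604 can be sketched as follows, assuming the comparison step has already produced one best similarity score per network application. The `scores` mapping, the function name, and the `"unknown"` label are illustrative assumptions.

```python
def classify(scores):
    """Return the application with the unique highest score, else 'unknown'.

    scores: mapping from application name to its best similarity score.
    """
    if not scores:
        return "unknown"
    best = max(scores.values())
    winners = [app for app, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else "unknown"
```

A tie between two applications therefore yields `"unknown"`. The text does not mention a minimum similarity threshold, so this sketch does not apply one.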

In another embodiment, step S601 of the present invention provides sets of common length subsequences corresponding to various network applications, and these sets are obtained through prior training. Specifically, a set of common length subsequences is trained in advance on known network traffic, which includes: disassembling the known network traffic into multiple connections and extracting the transmission direction and length of the packets in each connection, thereby obtaining a length feature sequence for each connection; then pairing the length feature sequences of the connections and computing the longest common length subsequence between the two connections of each pair; and finally collecting the longest common length subsequences computed for all pairings to obtain the set of common length subsequences.

The set of common length subsequences is trained one network application at a time, by computing the packet sequences shared among multiple connections to serve as the representative of that network application. Thus, when a connection disassembled from new network traffic shows high similarity to any longest common length subsequence in a set of common length subsequences, the connection can be determined to belong to the corresponding network application.

In summary, the message-length-sequence-based network traffic identification system and method proposed by the present invention use transport-layer connection behavior, taking the message length sequence specific to a network application as the basis for identifying network applications in network traffic. During identification, the length sequence of connection behavior extracted from a network application's connection at the transport layer serves as the representative feature of that connection and is compared for similarity against the representative length feature sequences of the various known network applications; the network application with the greatest similarity is taken as the final attribution. By comparison, conventional network application traffic detection and identification techniques mostly rely on the known ports used by network applications or on matching packet content signatures. The present invention overcomes two shortcomings of the prior art: (1) network applications that use dynamic ports cannot be detected, and (2) packet content that is encrypted by the network application cannot be identified by content signature matching. The present invention thus solves the existing problems that traffic cannot be identified from packet content and that applications using dynamic ports cannot be identified, and it provides an identification mechanism that can be used as an online gateway.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify and vary the above embodiments without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore as set out in the claims below.

2‧‧‧Network traffic identification system based on message length sequence

21‧‧‧Database

22‧‧‧Traffic collection module

23‧‧‧Traffic disassembly module

24‧‧‧Identification module

25‧‧‧Determination module

100‧‧‧Network traffic

Claims

A network traffic identification system based on a message length sequence, comprising: a database for pre-storing sets of common length subsequences corresponding to various network applications; a traffic collection module for collecting network traffic; a traffic disassembly module for disassembling the network traffic into a plurality of connections according to traffic information and extracting the transmission direction and length of the plurality of packets in each connection, so as to generate from the plurality of packets a length feature sequence corresponding to the connection; an identification module for comparing the length feature sequence with the sets of common length subsequences of the various network applications in the database, to obtain from the sets of common length subsequences those having the highest similarity to the length feature sequence; and a determination module for determining, according to the number of highest-similarity results obtained by the identification module, whether the connection belongs to a known network application or an unknown network application.

The network traffic identification system based on a message length sequence according to claim 1, wherein the traffic disassembly module removes, from the plurality of packets of the connection, packets whose length equals the maximum transmission unit and packets whose content length is zero.

The network traffic identification system based on a message length sequence according to claim 1, wherein the length feature sequence is produced by assigning each of the plurality of packets a numerical value according to its transmission direction and length and arranging the values in order.

The network traffic identification system based on a message length sequence according to claim 1, wherein the traffic information includes a source IP address, a source port number, a destination IP, a destination port number, and a transport protocol.

The network traffic identification system based on a message length sequence according to claim 1, further comprising an application representative set generation module that is trained on known network traffic, wherein the traffic disassembly module disassembles the known network traffic to generate a length feature sequence for each connection, the length feature sequences of the connections are taken in pairs to compute the longest common length subsequence of each two connections, and the longest common length subsequences computed for the various pairings are collected to produce the set of common length subsequences.

A network traffic identification method based on a message length sequence, comprising: providing sets of common length subsequences corresponding to various network applications; collecting network traffic, disassembling the network traffic into a plurality of connections, and extracting the transmission direction and length of the plurality of packets in each connection to generate a length feature sequence corresponding to each connection; comparing the length feature sequence with the sets of common length subsequences of the various network applications, to obtain from the sets of common length subsequences those having the highest similarity to the length feature sequence; and determining, according to the number of sets of common length subsequences so obtained, whether the connection belongs to a known network application or an unknown network application.

The network traffic identification method based on a message length sequence according to claim 6, further comprising, before extracting the transmission direction and length of the plurality of packets in each connection, removing from the plurality of packets of the connection packets whose length equals the maximum transmission unit and packets whose content length is zero.

The network traffic identification method based on a message length sequence according to claim 6, wherein the length feature sequence is produced by assigning each of the plurality of packets a numerical value according to its transmission direction and length and arranging the values in order.

The network traffic identification method based on a message length sequence according to claim 6, wherein the network traffic is disassembled into the plurality of connections according to traffic information including a source IP address, a source port number, a destination IP, a destination port number, and a transport protocol.

The network traffic identification method based on a message length sequence according to claim 6, wherein the set of common length subsequences is trained on known network traffic before identification, the training including: disassembling the known network traffic into a plurality of connections; extracting the transmission direction and length of the plurality of packets in each connection to generate a length feature sequence corresponding to each connection; taking the length feature sequences of the connections in pairs to compute the longest common length subsequence of each two connections; and collecting the longest common length subsequences computed for the various pairings to produce the set of common length subsequences.