CN104023018A - Text protocol reverse resolution method and system - Google Patents

Publication number: CN104023018A
Authority: CN
Grant status: Application
Prior art keywords: data, item, text, feature, protocol
Application number: CN 201410258281
Other languages: Chinese (zh)
Inventors: 李建宇, 刘媛媛
Original Assignee: 中国联合网络通信集团有限公司

Abstract

The invention provides a reverse analysis method and system for a text protocol. The method comprises the following steps: a plurality of network data packets of a text protocol are captured, the application layer data is extracted from each network data packet and converted into a data stream in text form, and the resulting data streams form a data set; each data stream in the data set is segmented with delimiters, so that its content is divided into a plurality of data items, and the position information of each data item is recorded; for each data item, whether it is a feature item is determined from its position information and from the number of times it occurs in the data streams and in the data set; the feature items that are identical across the data streams are extracted line by line, the feature items are taken as fixed fields, and adjacent non-feature items are merged into variable fields; and cluster analysis is carried out to determine the format features of the text protocol. Because the method and system analyze commonality and correlation across a large number of data packets of the same text protocol to obtain the protocol format, the reverse analysis is more precise.

Description

Reverse analysis method and system for a text protocol

Technical Field

[0001] The present invention relates to protocol reverse analysis, and more particularly to a reverse analysis method and system for a text protocol.

Background

[0002] Network communication protocols play an irreplaceable role when different computer systems exchange information, and they are receiving growing attention. The format of protocol messages is essential to many network security applications, such as vulnerability discovery and intrusion detection systems. However, many network application protocols are not open. In enterprise networks in particular, formal protocol description documents are increasingly rare, so the characteristics of protocol messages cannot be obtained directly from a protocol specification. Protocol reverse analysis does not depend on a protocol specification or description. By analyzing the protocol's data packets or its interaction process, it recovers information such as the protocol's format fields, which can then be used in other applications such as automatic test case generation.

[0003] There are currently two approaches to protocol reverse analysis: analysis based on protocol message sequences, and analysis based on program execution traces.

[0004] The core technique of message sequence analysis is sequence alignment. Different message samples of the same protocol are similar to one another, and the purpose of sequence alignment is to find that similarity in content or structure. The procedure is as follows: compute the relative distance between message byte sequences with a local sequence alignment algorithm; build a phylogenetic tree from those distances with the UPGMA (unweighted pair group method with arithmetic mean) algorithm; traverse the phylogenetic tree with a progressive alignment algorithm, applying a global sequence alignment algorithm level by level to obtain a multiple sequence alignment; and finally derive the protocol structure.
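As an illustration of the pairwise building block behind the alignment approach described in [0004], the following minimal Python sketch computes a Needleman-Wunsch global alignment score between two byte sequences. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration; the prior-art tools referred to above may score differently, and the UPGMA tree construction and progressive alignment are not shown.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two byte sequences.

    The scoring parameters are illustrative assumptions, not values
    taken from the patent.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] with an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = cur[j - 1] + gap   # gap inserted in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]
```

A higher score means the two messages are more similar; a distance for tree building could be derived from it, for example by negating or normalizing the score.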

[0005] The limitation of sequence alignment is that it works well for compact, simple protocols whose sequence structures are fairly similar, but its efficiency and accuracy drop markedly for complex protocols with a lot of redundancy, and in practice many protocol formats are fairly complex. In addition, sequence alignment uses the byte as its primitive. That granularity is small and untargeted, which makes the analysis cumbersome.

[0006] Protocol reverse analysis based on program execution traces rests on the observation that the way a network application processes protocol data carries a great deal of information about the protocol message format. The protocol format can be obtained by analyzing the instructions and data flow with which the application processes the protocol data. Execution trace analysis is a dynamic analysis method, and its key technique is dynamic taint analysis. The basic procedure is as follows: dynamically monitor the processing of network protocol data and record the program execution trace in real time; mark every input byte that the program reads from the protocol message into memory as tainted data, each with a unique taint label (its offset in the input message); then analyze the execution trace and derive the protocol's fields, such as delimiters, keywords, fixed-length fields and variable-length fields, according to specific parsing strategies. For example, the strategy for delimiters is to look in the executed instructions for constant characters that are compared against consecutive positions of the protocol data, and the strategy for fixed-length fields is that the parsed range of the field is specified by a constant parameter in an instruction. Associating the parsed fields with their taint labels (that is, their positions) yields the basic protocol structure.

[0007] The drawback of dynamic analysis based on program execution traces is that the trace must be recorded in real time while the program runs, and no publicly available tool for such real-time recording exists at present, so the object of analysis is hard to obtain. Moreover, recording the execution trace in real time has a large time and space overhead, and each run processes only a single protocol message, so the accuracy is not high either.

Summary of the Invention

[0008] The technical problem to be solved by the present invention is to provide a more accurate reverse analysis method and system for text protocols.

[0009] To solve the above technical problem, the present invention provides a reverse analysis method for a text protocol, comprising:

[0010] capturing a plurality of network data packets of a text protocol, extracting the application layer data from each network data packet and converting it into a data stream in text form, the resulting data streams forming a data set;

[0011] for each data stream in the data set, performing word segmentation with delimiters, dividing the content of the data stream into a plurality of data items and recording the position information of each data item;

[0012] for each data item, determining whether the data item is a feature item according to its position information and statistics on the number of times it occurs in the data streams and in the data set;

[0013] extracting, line by line, the feature items that are identical across the data streams, taking the feature items as fixed fields, merging adjacent non-feature items into variable fields, performing cluster analysis, and determining the format features of the text protocol.

[0014] Preferably,

[0015] the delimiters are the delimiters of a delimiter set comprising the carriage return/line feed sequence, space, tab, comma, colon and semicolon;

[0016] performing word segmentation with the delimiters on each text file (that is, each data stream) comprises:

[0017] first dividing the content of the data stream into one or more lines according to the carriage return/line feed sequence;

[0018] then dividing each line with the other delimiters to obtain one or more data items, and recording the position information of each data item, including the row and column values where the data item is located.

[0019] Preferably,

[0020] determining, for each data item, whether the data item is a feature item according to its position information and statistics on the number of times it occurs in the data streams and in the data set comprises:

[0021] judging whether the data item satisfies one of the following conditions:

[0022] condition one: its position is fixed;

[0023] condition two: t(DF) ≥ K1 × N and t(TF) < K2 × N;

[0024] if either condition is satisfied, the data item is a feature item; otherwise it is a non-feature item;

[0025] where t(DF) is the number of data streams in which the data item occurs, t(TF) is the number of times the data item occurs in the data set, N is the total number of data streams in the data set, K1 is a preset first ratio with K1 < 1, and K2 is a preset second ratio with K2 > 1.

[0026] Preferably,

[0027] K1 takes a value in the range 1/5 to 1/2, and K2 takes a value in the range 3 to 6.

[0028] Preferably,

[0029] taking the feature items as fixed fields, merging adjacent non-feature items into variable fields, performing cluster analysis and determining the format features of the text protocol comprises:

[0030] clustering the fixed fields to obtain the set of values of each fixed field;

[0031] clustering the variable fields to obtain the field type and length information of each variable field, where the field type is one of text field and numeric field, and the length information is one of variable length and fixed length.
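The two clustering outcomes named in [0030] and [0031] can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, "numeric" is assumed to mean that every observed value consists solely of decimal digits, and the patent's actual clustering procedure may differ.

```python
def fixed_field_values(observed):
    """Return the set of values seen for one fixed field, sorted for display."""
    return sorted(set(observed))

def describe_variable_field(observed):
    """Classify a variable field by type and length from its observed values.

    Assumption: a field is 'numeric' only if every value is all digits,
    and 'fixed' length only if every value has the same length.
    """
    if not observed:
        raise ValueError("at least one observed value is required")
    field_type = "numeric" if all(v.isdigit() for v in observed) else "text"
    length = "fixed" if len({len(v) for v in observed}) == 1 else "variable"
    return field_type, length
```

For instance, values collected at a method position would yield a small value set such as ACK, BYE, INVITE, while values following a Content-Length field would be classified as numeric.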

[0032] Correspondingly, the reverse analysis system for a text protocol provided by the present invention comprises:

[0033] a preprocessing module, configured to capture a plurality of network data packets of a text protocol, extract the application layer data from each network data packet and convert it into a data stream in text form, the resulting data streams forming a data set;

[0034] a word segmentation module, configured to perform word segmentation with delimiters on each data stream in the data set, divide the content of the data stream into a plurality of data items and record the position information of each data item;

[0035] a feature extraction module, configured to determine, for each data item, whether the data item is a feature item according to its position information and statistics on the number of times it occurs in the data streams and in the data set;

[0036] a structure modeling module, configured to extract, line by line, the feature items that are identical across the data streams, take the feature items as fixed fields, merge adjacent non-feature items into variable fields, perform cluster analysis, and determine the format features of the text protocol.

[0037] Preferably,

[0038] the delimiters used by the word segmentation module are the delimiters of a delimiter set comprising the carriage return/line feed sequence, space, tab, comma, colon and semicolon;

[0039] the word segmentation module performing word segmentation with the delimiters on each text file (that is, each data stream) comprises:

[0040] first dividing the content of the data stream into one or more lines according to the carriage return/line feed sequence;

[0041] then dividing each line with the other delimiters to obtain one or more data items, and recording the position information of each data item, including the row and column values where the data item is located.

[0042] Preferably,

[0043] the feature extraction module determining, for each data item, whether the data item is a feature item according to its position information and statistics on the number of times it occurs in the data streams and in the data set comprises:

[0044] judging whether the data item satisfies one of the following conditions:

[0045] condition one: its position is fixed;

[0046] condition two: t(DF) ≥ K1 × N and t(TF) < K2 × N;

[0047] if either condition is satisfied, the data item is a feature item; otherwise it is a non-feature item;

[0048] where t(DF) is the number of data streams in which the data item occurs, t(TF) is the number of times the data item occurs in the data set, N is the total number of data streams in the data set, K1 is a preset first ratio with K1 < 1, and K2 is a preset second ratio with K2 > 1.

[0049] Preferably,

[0050] K1 takes a value in the range 1/5 to 1/2, and K2 takes a value in the range 3 to 6.

[0051] Preferably,

[0052] the structure modeling module taking the feature items as fixed fields, merging adjacent non-feature items into variable fields, performing cluster analysis and determining the format features of the text protocol comprises:

[0053] clustering the fixed fields to obtain the set of values of each fixed field;

[0054] clustering the variable fields to obtain the field type and length information of each variable field, where the field type is one of text field and numeric field, and the length information is one of variable length and fixed length.

[0055] The above protocol reverse analysis method and system, which are based on text mining, target text protocols. They analyze the commonality and correlation across a large number of data packets of the same protocol and derive the protocol format from them, which makes the reverse analysis more accurate.

Brief Description of the Drawings

[0056] FIG. 1 is a schematic diagram of the overall flow of the protocol reverse analysis method according to an embodiment of the present invention;

[0057] FIG. 2 is a flowchart of the protocol reverse analysis method according to an embodiment of the present invention;

[0058] FIG. 3 is a schematic diagram of word segmentation using a binary tree according to an embodiment of the present invention;

[0059] FIG. 4 is a module diagram of the protocol reverse analysis system according to an embodiment of the present invention.

Detailed Description

[0060] To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments of the present invention and the features in the embodiments may be combined with one another arbitrarily.

[0061] Embodiment 1

[0062] The reverse analysis in this embodiment targets text protocols. SIP, for example, is a text protocol, and each SIP message consists of the following three parts:

[0063] Start line: every SIP message begins with a start line (a request line in a request message, a status line in a response message). The start line conveys the message type (the method type in a request, the response code in a response) and the protocol version.

[0064] SIP headers: used to convey message attributes and message meaning, in the format <name>: <value>.

[0065] Message body: used to describe the session being initiated, for example the audio and video coding types and sampling rates in a multimedia session. A message body can appear in both requests and responses.

[0066] Taking the start line of a request message as an example, the format defined in the protocol is as follows:

[0067] Request-Line = Method SP Request-URI SP SIP-Version CRLF

[0068] where:

[0069] "Method" is the method type, whose value may be "INVITE", "ACK", "OPTIONS", "BYE", "CANCEL", "REGISTER" or an extension-method;

[0070] "SP" is a space;

[0071] "Request-URI" is the uniform resource identifier (URL) of the request, which may be a SIP-URL or an absoluteURL;

[0072] "SIP-Version" is the protocol version, for example "SIP/2.0";

[0073] "CRLF" is a carriage return followed by a line feed.
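The Request-Line grammar above can be checked against a sample line with a short Python sketch. The regular expression is a simplification made for illustration (any run of non-space characters is accepted as the method and as the Request-URI), and the names REQUEST_LINE and parse_request_line are hypothetical.

```python
import re

# Request-Line = Method SP Request-URI SP SIP-Version CRLF
# Simplifying assumption: Method and Request-URI are any non-space runs.
REQUEST_LINE = re.compile(
    r"(?P<method>\S+) (?P<uri>\S+) (?P<version>SIP/\d+\.\d+)\r\n"
)

def parse_request_line(line):
    """Return the three Request-Line components, or None if the line does not match."""
    m = REQUEST_LINE.fullmatch(line)
    return m.groupdict() if m else None
```

A status line such as "SIP/2.0 200 OK" does not match, because its third component is not a SIP-Version.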

[0074] An example of a SIP request message and an example of a SIP response message are given below:

[0075] Request message example:

[0076] INVITE sip:12125551212@211.123.66.222 SIP/2.0 (CRLF)

[0077] Via: SIP/2.0/UDP 211.123.66.223:5060;branch=a71b6d57-507c77f2 (CRLF)

[0078] Via: SIP/2.0/UDP 10.0.0.1:5060;received=202.123.211.25;rport=12345 (CRLF)

[0079] From: <sip:2125551000@211.123.66.223>;tag=108bcd14 (CRLF)

[0080] To: sip:12125551212@211.123.66.222 (CRLF)

[0081] Contact: sip:2125551000@10.0.0.1 (CRLF)

[0082] Call-ID: 4c88fd1e-62bb-4abf-b620-a75659435b76@10.3.19.6 (CRLF)

[0083] CSeq: 703141 INVITE (CRLF)

[0084] Content-Length: 138 (CRLF)

[0085] Content-Type: application/sdp (CRLF)

[0086] User-Agent: HearMe SoftPHONE (CRLF)

[0087] (CRLF)

[0088] ... (message body omitted)

[0089] Response message example:

[0090] SIP/2.0 200 OK (CRLF)

[0091] Via: SIP/2.0/UDP pc33.atlanta.com;branch=z9hG4bKhjhs8ass877;received=192.0.2.4 (CRLF)

[0092] To: <sip:carol@chicago.com>;tag=93810874 (CRLF)

[0093] From: Alice <sip:alice@atlanta.com>;tag=1928301774 (CRLF)

[0094] Call-ID: a84b4c76e66710 (CRLF)

[0095] CSeq: 63104 OPTIONS (CRLF)

[0096] Contact: <sip:carol@chicago.com> (CRLF)

[0097] Contact: mailto:carol@chicago.com (CRLF)

[0098] Allow: INVITE, ACK, CANCEL, OPTIONS, BYE (CRLF)

[0099] Accept: application/sdp (CRLF)

[0100] Accept-Encoding: gzip (CRLF)

[0101] Accept-Language: en (CRLF)

[0102] Supported: foo (CRLF)

[0103] Content-Type: application/sdp (CRLF)

[0104] Content-Length: 274 (CRLF)

[0105] (CRLF)

[0106] ... (message body omitted)

[0107] As the two examples above show, the message content is divided into multiple lines, and each line consists of fixed fields and/or variable fields.

[0108] As shown in FIG. 1, the reverse analysis method for a text protocol in this embodiment comprises four processing stages: preprocessing, word segmentation, feature extraction and structure modeling. Preprocessing the network data packets yields protocol data streams; word segmentation of the protocol data streams yields data items; feature extraction on the data items yields feature information; and structure modeling on the feature information yields the protocol structure.

[0109] The detailed flow of the reverse analysis method in this embodiment is shown in FIG. 2 and comprises:

[0110] Step 110, preprocessing: capture a plurality of network data packets of a text protocol, extract the application layer data from each network data packet and convert it into a data stream in text form, the resulting data streams forming a data set;

[0111] In one example, a network communication session can be set up and a packet capture tool such as Wireshark or Ethereal used to collect a large number of packets of a given protocol, stored in pcap format. The pcap file format is the format in which BPF saves raw packets, and it is used by many programs, such as tcpdump, Ethereal and Wireshark. A pcap file is not the raw network byte stream; the captured data is repackaged into a new data format.

[0112] A pcap file consists of a Pcap Header followed by a number of Packet Headers and Packet Data sections. Each Packet Data section holds the complete content of one protocol message, starting from the data link layer. Preprocessing strips away, layer by layer, the Pcap Header, the Packet Headers, and the frame header, IP header and TCP header inside the Packet Data. It extracts the application layer data from each Packet Data section and converts it into a data stream in text form (that is, as ASCII characters). For example, the application layer data extracted from each Packet Data section is stored as one text file, and each text file is one data stream. The data streams obtained by preprocessing the network data packets form a data set, which is the input to the next stage. In this document, each data stream is denoted di (i = 1, 2, …, n), and the data set is denoted S(d1, d2, …, dn).
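The header stripping described in [0112] can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions rather than the patent's implementation: it handles only the classic pcap format (not pcapng), assumes Ethernet frames without VLAN tags carrying IPv4 with TCP or UDP, and performs no TCP stream reassembly. The function names are hypothetical.

```python
import struct

def strip_headers(frame):
    """Return the application-layer bytes of one Ethernet/IPv4 frame, or b''."""
    # 14-byte Ethernet header plus at least a 20-byte IPv4 header.
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":
        return b""
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4                       # IPv4 header length in bytes
    total_len = struct.unpack_from("!H", ip, 2)[0]  # excludes Ethernet padding
    proto = ip[9]
    segment = ip[ihl:total_len]
    if proto == 6 and len(segment) >= 20:           # TCP
        data_offset = (segment[12] >> 4) * 4
        return segment[data_offset:]
    if proto == 17 and len(segment) >= 8:           # UDP
        return segment[8:]
    return b""

def extract_streams(pcap_bytes):
    """Return one text data stream per packet that carries a payload."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic pcap buffer")
    offset = 24                                     # skip the Pcap Header
    streams = []
    while offset + 16 <= len(pcap_bytes):
        # Packet Header: ts_sec, ts_usec, captured length, original length.
        _, _, incl_len, _ = struct.unpack_from(endian + "IIII", pcap_bytes, offset)
        offset += 16
        payload = strip_headers(pcap_bytes[offset:offset + incl_len])
        offset += incl_len
        if payload:
            streams.append(payload.decode("ascii", errors="replace"))
    return streams
```

Each returned string corresponds to one data stream di, and the returned list plays the role of the data set S.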

[0113] Representing the data as ASCII characters during preprocessing, rather than in hexadecimal or binary, makes analysis and judgment more intuitive. It also removes noise such as file headers and packet headers from the subsequent text-mining-based analysis.

[0114] Step 120, word segmentation: for each data stream in the data set, perform word segmentation with delimiters, divide the content of the data stream into a plurality of data items and record the position information of each data item;

[0115] For each data stream di (i = 1, 2, …, n) obtained in the preprocessing stage, this stage performs word segmentation with a series of delimiters and divides the stream into data items. In one example, the delimiter set may be {"\r\n" (carriage return/line feed), "\x20" (space), "\t" (tab), ":" (colon), ";" (semicolon), "," (comma)}.

[0116] First, the data stream is divided into lines according to \r\n.

[0117] Then the other delimiters in the delimiter set are read in turn and used to divide each line. In one example, a binary tree model can be used to describe this (without being limited to it): each line of data is treated as a root node and divided by the first delimiter in the set other than \r\n. At the first position where the delimiter occurs, the node is split into two parts; the right child is then split in the same way, and so on. The next delimiter is then read and used to divide the leaf nodes of the tree produced in the previous step: each leaf node that contains the delimiter is split into two parts, and its right child is split again. This loop continues until all delimiters in the set have been used, and the contents of the leaf nodes of the binary tree are the resulting data items.
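The binary-tree division in [0117] can be sketched in Python by keeping only the current list of leaves. The function names, the delimiter order and the choice to discard empty leaves are assumptions made for illustration; the patent does not state how empty fragments are handled.

```python
def split_right(text, delimiter):
    """Split at the first occurrence, then keep splitting the right child."""
    left, sep, right = text.partition(delimiter)
    if not sep:            # delimiter absent: this node stays a leaf
        return [text]
    return [left] + split_right(right, delimiter)

def segment_line(line, delimiters=(":", ";", ",", " ", "\t")):
    """Apply each delimiter in turn to the current leaves of the tree."""
    leaves = [line]
    for delimiter in delimiters:
        leaves = [part for leaf in leaves for part in split_right(leaf, delimiter)]
    # Assumption: empty fragments (for example between ',' and ' ') are dropped.
    return [leaf for leaf in leaves if leaf]
```

Applied to an Accept header line with three media ranges and two q parameters, this yields seven data items in left-to-right order.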

[0118] In this way a data stream is divided into a plurality of data items (terms), denoted di(t1, t2, …, tk), and the position information of each term is recorded, including its row and its column (that is, its offset relative to the start of the line). Each di is segmented in turn, and the results are the input to the next stage.
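A minimal Python sketch of recording positions during segmentation is given below. It assumes the example delimiter set from the embodiment, drops empty fragments, and records the column as the 1-based ordinal of the item within its line; the text also allows the column to be read as an offset from the start of the line. The names segment_stream and INLINE_DELIMITERS are hypothetical.

```python
import re

# Delimiters other than CRLF: space, tab, colon, semicolon, comma.
INLINE_DELIMITERS = re.compile(r"[ \t:;,]")

def segment_stream(stream):
    """Return (value, row, column) triples for every data item in a stream."""
    items = []
    for row, line in enumerate(stream.split("\r\n"), start=1):
        parts = [p for p in INLINE_DELIMITERS.split(line) if p]
        for col, value in enumerate(parts, start=1):
            items.append((value, row, col))
    return items
```

The resulting triples are the kind of input that the feature extraction stage needs: the value for counting occurrences and the (row, column) pair for checking whether a position is fixed.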

[0119] An example is given below. Data stream dn is as follows:

[0120] GET /img/logo.gif HTTP/1.1 (CRLF)

[0121] Accept: image/png, image/svg+xml, image/*;q=0.8, */*;q=0.5 (CRLF)

[0122] Referer: http://bbs.byr.cn/ (CRLF)

[0123] Accept-Language: zh-CN (CRLF)

[0124] User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; 360SE) (CRLF)

[0125] Accept-Encoding: gzip, deflate (CRLF)

[0126] Host: static.byr.cn (CRLF)

[0127] Connection: Keep-Alive (CRLF)

[0128] After the data stream has been divided into lines according to \r\n, each line is processed as follows. For example, FIG. 3 shows the division of the line Accept: image/png, image/svg+xml, image/*;q=0.8, */*;q=0.5. In the figure, ① denotes the first division, made with the delimiter ":". ② and ③ denote the second and third divisions, made with the delimiter ";". ④, ⑤ and ⑥ denote the fourth, fifth and sixth divisions, made with the delimiter ",". The numbers 1 to 7 denote the seven text fragments obtained, which occupy columns 1 to 7 of the line respectively. The other lines are divided in the same way.

[0129] In the above processing, the ASCII representation and the choice of delimiters control the granularity of the subsequent analysis. Not using the byte as the primitive avoids the excessive time and space overhead of too fine a granularity, and segmenting with delimiters avoids the inaccuracy of too coarse a granularity.

[0130] 步骤130,特征提取:对每一数据项,根据其位置信息及在所述数据流、数据集中出现次数的统计结果,确定该数据项是否为特征项; [0130] Step 130, the feature extraction: for each data item, and its location information in the data stream in accordance with the data set number of occurrences statistics, determines whether the data item is a feature item;

[0131] 本实施例中,判断出现该数据项是否满足以下条件之一: [0131] In this embodiment, it is determined whether the data item occurs one of the following conditions:

[0132] 条件一,位置固定; [0132] Conditions a, fixed position;

[0133]条件二,t (DF)≥ K1X N 且t (TF) <K2 XN ; [0133] The second condition, t (DF) ≥ K1X N and t (TF) <K2 XN;

[0134] 如果满足其中的任一条件,则该数据项为特征项,否则非特征项; [0134] If either of which satisfies a condition, wherein the term data item, or a non-feature item;

[0135] 其中,t(DF)为出现该数据项的数据流的个数,t (TF)为所述数据集中该数据项出现的次数,N为所述数据集包括的数据流的总数,K1为预设的第一比值,1〈1,经研究对比,较佳地,在1/5~1/2的范围取值,如取1/3 ;K2为预设的第二比值,Κ2>1,较佳地,在3~6的范围取值,如取为4。 [0135] wherein, t (DF) of a data item of the data stream number appears, t (TF) is the number of times the data item appearing dataset, N being the total number of the data set comprises a data stream, K1 is a first predetermined ratio, 1 <1, the comparative study, preferably, in the range of from 1/5 to 1/2 values, such as taking 1/3; K2 is a predetermined second ratio, 2 higher > 1, preferably, in the range of 3 to 6 values, is taken as 4.

[0136] 如果数据项的位置(行,列值)固定,则将其作为特征项。 [0136] If the location (row, column values) data items is fixed, it is characterized as a key.

[0137] If the position of a data item is not fixed, its frequency of occurrence in the data set and in the data streams can be considered as well. If the t(DF) of a data item (items with the same value count as the same data item) is low (below the threshold K1 × N), the item is considered a weak representative of the set of data streams and is not taken as a feature item. If t(DF) is greater than or equal to the threshold K1 × N, the item's t(TF) is then considered: if t(TF) is high (greater than or equal to the threshold K2 × N), the item is considered to have no discriminating power for representing the text set and may be noise or meaningless characters, so it is also excluded as a feature item. If t(TF) is less than the threshold K2 × N, the item can be taken as a feature item. The larger the data set S, the more effective this measure is.
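The two conditions can be sketched in code as below. This is an illustrative reading, not the patent's implementation. The name `select_features` and the parameter `min_streams` are hypothetical. In particular, the text does not define "fixed position" precisely; the sketch assumes it means that the item is always seen at the same (row, column) and that it appears in at least `min_streams` data streams, so that an item seen only once does not qualify trivially.

```python
from collections import Counter, defaultdict

def select_features(dataset, k1=1/3, k2=4, min_streams=2):
    """dataset: list of data streams; each stream is a list of (item, row, col)."""
    n = len(dataset)
    df = Counter()                # t(DF): number of streams containing the item
    tf = Counter()                # t(TF): total occurrences in the whole data set
    positions = defaultdict(set)  # every (row, col) at which the item was seen
    for stream in dataset:
        df.update({item for item, _, _ in stream})
        for item, row, col in stream:
            tf[item] += 1
            positions[item].add((row, col))

    features = set()
    for item in tf:
        # Condition one (assumed reading): a single position, seen in several streams.
        fixed_position = len(positions[item]) == 1 and df[item] >= min_streams
        # Condition two: frequent across streams, but not so frequent as to be noise.
        frequent = df[item] >= k1 * n and tf[item] < k2 * n
        if fixed_position or frequent:
            features.add(item)
    return features
```

The defaults k1 = 1/3 and k2 = 4 are the example values given in paragraph [0135]; other values in the stated ranges may suit other protocols.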

[0138] After the feature items have been determined, each feature item and its position information are taken as feature information and used as the input to the next stage.

[0139] For example, 558 HTTP request packets were captured, and part of the resulting statistics is shown in the following table:

[0140] (Table reproduced as an image in the original publication: Figure CN104023018AD00101.)

[0141] Statistical results:

[0142] The t(DF) and t(TF) values of the data item "Get" satisfy the feature-item condition, and its position is fixed, so it is extracted as a feature item;

[0143] The t(DF) and t(TF) values of the data items "Host", "Accept" and "HTTP/1.1" satisfy the feature-item condition, so they are extracted as feature items;

[0144] For the data item "POST", although 1 < t(DF) < 558/3, its position is fixed at (1,1), so it is also extracted as a feature item.
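As a quick check of the thresholds in this example, assume K1 = 1/3 (implied by the bound 558/3 above) and K2 = 4 (the illustrative value from paragraph [0135]; the example itself does not state K2). With N = 558:

```latex
K_1 N = \tfrac{1}{3} \times 558 = 186, \qquad K_2 N = 4 \times 558 = 2232
```

Under these assumed values, condition two holds for an item when t(DF) ≥ 186 and t(TF) < 2232.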

[0145] Step 140, structural modeling: extract the same feature items in each data stream by line, take the feature items as fixed fields, merge adjacent non-feature items into variable fields, and perform cluster analysis to determine the format features of the text protocol.

[0146] The format features of the text protocol include information such as the value sets of the fixed fields and the basic format or length of the variable fields.

[0147] For example, for a given data stream and taking the feature item "Referer" as an example, the information of all lines in the data stream that contain "Referer" is extracted; the feature items in those lines are taken as fixed fields, and adjacent non-feature items in those lines are merged into variable fields. Assume the following is obtained after processing:

[0148] Referer http //www.renren.con./

[0149] Referer http //www.renren.com./

[0150] Referer http //bb3.byr.cn/

[0151] Referer http //bb3.kyr.cn/

[0152] Referer http //bb3.byr.cn/

[0153] Referer http //bb3.byr.cn/

[0154] Referer http //bb3.byr.cn/

[0155] Referer http //bb3.kyr.cn/

[0156] Referer http //bb3.kyr.cn/

[0157] Referer http //bb3.kyr.cn/

[0158] Referer http //bb3.byr.cn/

[0159] Referer http //123.3ogou.com/

[0160] Referer http //123.3ogou.com/

[0161] Here, the first and second columns are fixed fields, whose value sets are {Referer} and {http} respectively. The third column is a variable field, on which cluster analysis is performed: for every entry in the third column, its field type and string length are recorded, from which the field type, length information or other information of the variable field is summarized.
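The division of a line into fixed and variable fields can be sketched as below. This is a minimal illustration under stated assumptions: the function name `to_fields` is hypothetical, and merged non-feature items are joined with a single space because the original delimiters are not kept by the segmentation step as described.

```python
def to_fields(row_items, features):
    """row_items: the item strings of one line, in column order.

    Returns a list of ("fixed", value) and ("variable", value) pairs.
    """
    fields = []
    for item in row_items:
        if item in features:
            # A feature item becomes a fixed field on its own.
            fields.append(("fixed", item))
        elif fields and fields[-1][0] == "variable":
            # Adjacent non-feature items collapse into one variable field.
            fields[-1] = ("variable", fields[-1][1] + " " + item)
        else:
            fields.append(("variable", item))
    return fields

if __name__ == "__main__":
    print(to_fields(["Referer", "http", "//bb3.byr.cn/"], {"Referer", "http"}))
```

Collecting the values of each fixed field over all lines then gives its value set, for example {Referer} and {http} for the lines listed above.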

[0162] The field type is either a text field or a numeric field; the string of a numeric field contains only digits. For example, in GET /img/logo.gif HTTP/1.1, GET and HTTP/1.1 are feature items and therefore fixed fields, and both are text fields. In HTTP/1.1 200 OK, the items HTTP/1.1, 200 and OK are all extracted as feature items by the feature extraction rules, and 200 is a numeric field. The length information of a variable field is one of variable-length and fixed-length.
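Summarizing one variable field from its observed values might look like the sketch below. The function name `summarize_variable_field` and the shape of the returned dictionary are hypothetical; the text only states that a field type (text or numeric) and length information (fixed-length or variable-length) are derived.

```python
def summarize_variable_field(values):
    """values: the strings observed for one variable field across many lines."""
    if not values:
        raise ValueError("at least one observed value is required")
    # Numeric only if every observed value consists solely of ASCII digits.
    is_numeric = all(v.isascii() and v.isdigit() for v in values)
    lengths = {len(v) for v in values}
    return {
        "type": "numeric" if is_numeric else "text",
        "length": "fixed" if len(lengths) == 1 else "variable",
        "min_len": min(lengths),
        "max_len": max(lengths),
    }

if __name__ == "__main__":
    print(summarize_variable_field(["//bb3.byr.cn/", "//www.renren.com./"]))
    print(summarize_variable_field(["200", "404", "301"]))
```

A single non-digit value is enough to classify the whole field as text, which matches the rule that a numeric field contains only digits.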

[0163] In this step, correlation analysis between a feature item and the variable field that follows it can also yield semantic features.

[0164] Through the four processes above, namely preprocessing, word segmentation, feature extraction and structural modeling, the reverse parsing of the protocol is completed and the protocol format is obtained. Compared with the prior art, this improves the breadth of the analysis and the accuracy of its results.

[0165] As shown in FIG. 4, the reverse parsing system for a text protocol of this embodiment comprises:

[0166] a preprocessing module 10, configured to capture a plurality of network data packets of a text protocol, extract application-layer data from each network data packet and convert it into a data stream in text form, the resulting plurality of data streams forming a data set;

[0167] a word segmentation module 20, configured to perform, for each data stream in the data set, word segmentation using delimiters, dividing the content of the data stream into a plurality of data items and recording the position information of each data item;

[0168] a feature extraction module 30, configured to determine, for each data item, whether the data item is a feature item according to its position information and the statistics of its occurrences in the data streams and in the data set;

[0169] a structural modeling module 40, configured to extract the same feature items in each data stream by line, take the feature items as fixed fields, merge adjacent non-feature items into variable fields, and perform cluster analysis to determine the format features of the text protocol.

[0170] Wherein,

[0171] the delimiters used by the word segmentation module are delimiters from a delimiter set that includes the carriage return/line feed, space, tab, comma, colon and semicolon;

[0172] the word segmentation module performing word segmentation on each text file using delimiters comprises:

[0173] first dividing the content of the data stream into one or more lines at the carriage return/line feed;

[0174] then dividing each line using the other delimiters to obtain one or more data items, and recording the position information of each data item, including the row and column values of the data item.

[0175] Wherein,

[0176] the feature extraction module determining, for each data item, whether the data item is a feature item according to its position information and the statistics of its occurrences in the data streams and in the data set comprises:

[0177] judging whether the data item satisfies one of the following conditions:

[0178] Condition one: its position is fixed;

[0179] Condition two: t(DF) ≥ K1 × N and t(TF) < K2 × N;

[0180] if either condition is satisfied, the data item is a feature item; otherwise it is a non-feature item;

[0181] where t(DF) is the number of data streams in which the data item appears, t(TF) is the number of times the data item occurs in the data set, N is the total number of data streams in the data set, K1 is a preset first ratio with K1 < 1, and K2 is a preset second ratio with K2 > 1.

[0182] Wherein,

[0183] K1 takes a value in the range 1/5 to 1/2, and K2 takes a value in the range 3 to 6.

[0184] Wherein,

[0185] the structural modeling module taking the feature items as fixed fields, merging adjacent non-feature items into variable fields, and performing cluster analysis to determine the format features of the text protocol comprises:

[0186] clustering the fixed fields to obtain the value sets of the fixed fields;

[0187] clustering the variable fields to obtain the field type and length information of the variable fields, where the field type is one of text field and numeric field, and the length information is one of variable-length and fixed-length.

[0188] For other specific functions of the above modules, refer to the corresponding steps of the method.

[0189] Those of ordinary skill in the art will understand that all or some of the steps of the above method may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented with one or more integrated circuits; accordingly, each module/unit in the above embodiments may be implemented in hardware or as a software functional module. The invention is not limited to any particular combination of hardware and software.

[0190] The above are merely preferred embodiments of the invention and are not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

  1. A reverse parsing method for a text protocol, comprising: capturing a plurality of network data packets of a text protocol, extracting application-layer data from each network data packet and converting it into a data stream in text form, the resulting plurality of data streams forming a data set; for each data stream in the data set, performing word segmentation using delimiters, dividing the content of the data stream into a plurality of data items and recording the position information of each data item; for each data item, determining whether the data item is a feature item according to its position information and the statistics of its occurrences in the data streams and in the data set; and extracting the same feature items in each data stream by line, taking the feature items as fixed fields, merging adjacent non-feature items into variable fields, and performing cluster analysis to determine the format features of the text protocol.
  2. The method of claim 1, characterized in that: the delimiters are delimiters from a delimiter set that includes the carriage return/line feed, space, tab, comma, colon and semicolon; and performing word segmentation on each text file using delimiters comprises: first dividing the content of the data stream into one or more lines at the carriage return/line feed; then dividing each line using the other delimiters to obtain one or more data items, and recording the position information of each data item, including the row and column values of the data item.
  3. The method of claim 1 or 2, characterized in that: determining, for each data item, whether the data item is a feature item according to its position information and the statistics of its occurrences in the data streams and in the data set comprises: judging whether the data item satisfies one of the following conditions: condition one, its position is fixed; condition two, t(DF) ≥ K1 × N and t(TF) < K2 × N; if either condition is satisfied, the data item is a feature item, and otherwise it is a non-feature item; where t(DF) is the number of data streams in which the data item appears, t(TF) is the number of times the data item occurs in the data set, N is the total number of data streams in the data set, K1 is a preset first ratio with K1 < 1, and K2 is a preset second ratio with K2 > 1.
  4. The method of claim 3, characterized in that: K1 takes a value in the range 1/5 to 1/2, and K2 takes a value in the range 3 to 6.
  5. The method of claim 1, 2 or 4, characterized in that: taking the feature items as fixed fields, merging adjacent non-feature items into variable fields, and performing cluster analysis to determine the format features of the text protocol comprises: clustering the fixed fields to obtain the value sets of the fixed fields; and clustering the variable fields to obtain the field type and length information of the variable fields, where the field type is one of text field and numeric field, and the length information is one of variable-length and fixed-length.
  6. A reverse parsing system for a text protocol, comprising: a preprocessing module, configured to capture a plurality of network data packets of a text protocol, extract application-layer data from each network data packet and convert it into a data stream in text form, the resulting plurality of data streams forming a data set; a word segmentation module, configured to perform, for each data stream in the data set, word segmentation using delimiters, dividing the content of the data stream into a plurality of data items and recording the position information of each data item; a feature extraction module, configured to determine, for each data item, whether the data item is a feature item according to its position information and the statistics of its occurrences in the data streams and in the data set; and a structural modeling module, configured to extract the same feature items in each data stream by line, take the feature items as fixed fields, merge adjacent non-feature items into variable fields, and perform cluster analysis to determine the format features of the text protocol.
  7. The system of claim 6, characterized in that: the delimiters used by the word segmentation module are delimiters from a delimiter set that includes the carriage return/line feed, space, tab, comma, colon and semicolon; and the word segmentation module performing word segmentation on each text file using delimiters comprises: first dividing the content of the data stream into one or more lines at the carriage return/line feed; then dividing each line using the other delimiters to obtain one or more data items, and recording the position information of each data item, including the row and column values of the data item.
  8. The system of claim 6 or 7, characterized in that: the feature extraction module determining, for each data item, whether the data item is a feature item according to its position information and the statistics of its occurrences in the data streams and in the data set comprises: judging whether the data item satisfies one of the following conditions: condition one, its position is fixed; condition two, t(DF) ≥ K1 × N and t(TF) < K2 × N; if either condition is satisfied, the data item is a feature item, and otherwise it is a non-feature item; where t(DF) is the number of data streams in which the data item appears, t(TF) is the number of times the data item occurs in the data set, N is the total number of data streams in the data set, K1 is a preset first ratio with K1 < 1, and K2 is a preset second ratio with K2 > 1.
  9. The system of claim 8, characterized in that: K1 takes a value in the range 1/5 to 1/2, and K2 takes a value in the range 3 to 6.
  10. The system of claim 6, 7 or 9, characterized in that: the structural modeling module taking the feature items as fixed fields, merging adjacent non-feature items into variable fields, and performing cluster analysis to determine the format features of the text protocol comprises: clustering the fixed fields to obtain the value sets of the fixed fields; and clustering the variable fields to obtain the field type and length information of the variable fields, where the field type is one of text field and numeric field, and the length information is one of variable-length and fixed-length.
CN 201410258281 2014-06-11 2014-06-11 Text protocol reverse resolution method and system CN104023018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201410258281 CN104023018A (en) 2014-06-11 2014-06-11 Text protocol reverse resolution method and system

Publications (1)

Publication Number Publication Date
CN104023018A (en) 2014-09-03

Family

ID=51439588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201410258281 CN104023018A (en) 2014-06-11 2014-06-11 Text protocol reverse resolution method and system

Country Status (1)

Country Link
CN (1) CN104023018A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1574000A2 (en) * 2002-07-29 2005-09-14 Qosmos Method for protocol recognition and analysis in data networks
CN102891852A (en) * 2012-10-11 2013-01-23 中国人民解放军理工大学 Message analysis-based protocol format automatic inferring method
CN103414708A (en) * 2013-08-01 2013-11-27 清华大学 Method and device for protocol automatic reverse analysis of embedded equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
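Zhong Xiaohuan (钟晓欢): "Research on reverse parsing technology for text-type application-layer protocols", China Master's Theses Full-text Database, Information Science and Technology series *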

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
WD01