CN108667839A

CN108667839A - Protocol format inference method based on closed sequential pattern mining

Info

Publication number: CN108667839A
Application number: CN201810450347.4A
Authority: CN
Inventors: 吴礼发; 张洪泽; 丁兆锟; 谢波; 廖赟
Original assignee: Nanjing Sky Control Information Technology Co Ltd
Current assignee: Nanjing Sky Control Information Technology Co Ltd
Priority date: 2018-05-11
Filing date: 2018-05-11
Publication date: 2018-10-16

Abstract

The present invention provides a protocol format inference method based on closed sequential pattern mining, comprising the following steps: message preprocessing, protocol keyword identification and keyword sequence extraction, and message structure inference. The invention applies a two-stage closed pattern mining strategy to communication messages, which addresses the heavy memory consumption and long computation time of existing closed sequential pattern mining algorithms. The method avoids the influence of noise in the message samples, accurately identifies protocol keywords, and generates keyword sequences that capture the order relationships between keywords. On the basis of the keyword sequence information, the sequential, parallel, and hierarchical relationships between protocol keywords are inferred, yielding accurate message structure information.

Description

Protocol format inference method based on closed sequential pattern mining

Technical field

The present invention relates to the field of network technology, and in particular to a method that analyzes the input and output messages of a protocol entity program and infers the protocol message format from the structural and semantic similarity of messages of the same type.

Background art

As a key element of network communication, a network protocol directly affects the stability, reliability, and security of communication through its quality. Analyzing network protocols, discovering the security vulnerabilities in protocols and in the programs that implement them, and applying security protections in time all help reduce the occurrence of security problems.

If the specification of a network protocol is known, analyzing the protocol is relatively easy. For example, the well-known open source software Wireshark can parse more than 2,000 known protocols and extract a large amount of valuable information from them. For network protocols whose specifications are not public, however, protocol analysis software such as Wireshark is of no help. In this case, some researchers attempt to obtain the specification of the unknown protocol through protocol reverse engineering.

Protocol reverse engineering aims to recover the protocol format and the protocol state machine. Protocol format recovery mainly infers protocol keywords, message structure, and field semantics. Protocol state machine recovery usually builds on the protocol format information: it identifies the protocol states that occur while the protocol runs and analyzes the transitions between those states.

Depending on the object of study, protocol reverse engineering is generally divided into two classes: techniques based on execution traces and techniques based on network traffic. Execution-trace-based methods obtain message format information by monitoring how the protocol entity processes a message and how each message fragment is used. Network-traffic-based methods rest on the following observation: each protocol message is a concrete instance of the protocol specification, and messages of the same type are similar. This similarity reflects the relatively stable parts of the message format, so the protocol message format can be inferred from it. Compared with execution-trace-based techniques, network-traffic-based techniques make it easier to collect and analyze samples and offer a higher degree of automation, so they are more widely applicable. The present invention provides a method for protocol format inference based on network traffic.

A number of research results on traffic-based format inference for unknown protocols already exist in China and abroad. AutoReEngine uses the Apriori algorithm to mine frequent character strings; frequent strings whose rate of position change is below a threshold are taken to be protocol keywords, and the method achieves a high keyword identification accuracy. The ProWord system introduces word segmentation and phrase identification techniques from natural language processing to identify protocol keywords. However, these methods only obtain individual keywords in a message and ignore the constraint relationships between keywords.

Some researchers segment complete messages and extract message features, attempting to obtain the combinational constraints between keywords while identifying them. The PI project analyzes the target protocol by introducing sequence alignment algorithms from bioinformatics, and the protocol reverse engineering tool Netzob also uses sequence alignment; for messages with large positional variation or too many variable-length fields, however, the accuracy of the inferred results is low. The PRISMA method identifies protocol keywords with the n-gram method, computes the correlation between keywords with the Pearson correlation coefficient, and infers semantic templates of the protocol. Although these methods can obtain the constraint relationships between protocol keywords to some extent, they only produce flat keyword sequences and do not analyze the sequential, parallel, and hierarchical relationships between message keywords.

On the whole, existing traffic-based methods for format inference of unknown protocols achieve limited keyword inference accuracy and do not fully consider structural properties such as the sequential, parallel, and hierarchical relationships between message keywords. In addition, traffic-based analysis faces a noise handling problem. On the one hand, message samples of an unknown protocol are often mixed with messages of other protocols. On the other hand, because of communication link quality problems, the samples often contain messages that are out of order or incomplete. For protocol analysis these messages are noise. Protocol noise interferes with the reverse engineering result and reduces the accuracy of the analysis.

Summary of the invention

In view of the problems in the prior art, the present invention aims to provide a protocol format inference method based on network traffic. Starting from the collected communication messages of an unknown protocol entity program, the method applies a two-stage closed pattern mining strategy to perform closed sequential pattern mining on the messages, identifies protocol keywords, and generates keyword sequences that capture the order relationships between keywords. From the keyword sequences the method analyzes the constraint relationships between keywords and then infers the message structure, producing detailed protocol format information that covers the sequential, parallel, and hierarchical relationships between keywords. Protocol keyword identification is based on a support threshold: a keyword is identified only if its frequency of occurrence exceeds the configured support threshold. This helps reduce the influence of protocol noise in the message samples and ensures the accuracy of keyword identification.

To achieve the above object, the technical solution adopted by the present invention is as follows:

A protocol format inference method based on closed sequential pattern mining, comprising the following steps:

The first step is the message preprocessing stage. The message samples are divided into sessions according to the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol type>. Protocol messages are then extracted from each session, and messages of the same type are clustered into one class.

The second step is the protocol keyword identification and keyword sequence extraction stage, which applies a two-stage closed sequential pattern mining method. The first stage is the segmentation stage: for a message group that gathers messages of the same type, closed frequent adjacent segments are extracted from the group by closed sequential pattern mining, and keyword recognition strategies decide whether each closed frequent adjacent segment is a protocol keyword. The second stage is the pattern mining stage: starting from the closed frequent adjacent segments that the first stage inferred to be keywords, a closed sequential pattern mining algorithm generates closed frequent sequences composed of multiple keywords, finally yielding all closed frequent sequences in the message group.

The third step is the message structure inference stage. Based on the extracted keyword sequences, message structure inference distinguishes the sequential, parallel, and hierarchical relationships between keywords and obtains detailed message structure information.

The workflow of the message preprocessing stage is as follows. The message samples are divided into sessions according to the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol type>. Within a session, each message is labeled with a sequence number according to its order of appearance, and messages with the same sequence number are treated as one class. Because packet loss, reordering, and similar factors may occur during network transmission, an identical sequence number alone does not guarantee that messages are of the same type. The message payloads are therefore also extracted, and messages of the same type are found by computing the similarity between payloads and gathered into one message group. After this processing, the messages in a message group have the same type and the same message characteristics, and serve as the basis for the analysis in the next stage.

The workflow of the protocol keyword identification and keyword sequence extraction stage is as follows. To identify protocol keywords, the first task is to find the closed frequent adjacent segments in a message group whose messages share the same type. Given a message group C_L and a minimum support threshold Min_Sup, if a character string t contained in the messages of the group has a support in C_L greater than Min_Sup, and every other character string that contains t has a support in C_L lower than that of t, then t is called a closed frequent adjacent segment. The support of t in C_L is the number of messages in C_L that contain t divided by the total number of messages in C_L; it reflects how frequently t occurs in C_L. Because message data is characterized by long sequences, dense data, and large scale, the present invention designs a two-stage closed sequential pattern mining method that improves efficiency and reduces memory consumption.

The first stage is the segmentation stage. For a message group that gathers messages of the same type, all closed frequent adjacent segments in the group are searched, and keyword recognition strategies decide whether each of them is a keyword.

The first step of the segmentation stage generates candidate character strings. Candidates are extracted from the message group: each message is cut into character strings with an n-gram model, and the characters within each string keep their original order and adjacency.

The second step of the segmentation stage prunes the candidates. The messages in the group have been discretized into character strings, which serve as candidate character segments for determining the protocol keywords in the messages. The candidate set may contain many segments that are duplicates or that do not meet the basic requirements of a keyword. To shrink the redundant search space, the candidate segments are pruned according to the frequent-adjacency property. If the support of a candidate segment does not reach the minimum support threshold, the support of any character string that contains the segment cannot exceed the threshold either. All strings containing that segment can therefore be deleted from the n-gram string sets of the various lengths, and their supports need not be computed.

The third step of the segmentation stage is closure detection. For a given character segment, the detection checks whether another character string exists that both contains the given segment and has a support not lower than that of the given segment. If such a string exists, the given segment is not closed; otherwise it is closed. Because a protocol keyword is necessarily a closed frequent adjacent segment, frequent character segments that fail the closure requirement are not protocol keywords.

The second stage of the two-stage method is the pattern mining stage. Starting from the closed frequent adjacent segments that the first stage inferred to be keywords, a closed sequential pattern mining algorithm generates closed frequent sequences composed of multiple keywords, finally yielding the set of closed frequent sequences in the message group. A closed frequent sequence is an ordered set of closed frequent adjacent segments in the message group, denoted θ. The segments in θ are arranged in the same order in which they appear in the messages. The support of θ in C_L must exceed the configured minimum support threshold Min_Sup, and every other ordered set that contains θ must have a support in C_L lower than that of θ. The support of θ in C_L is the number of messages in C_L that contain the ordered set θ divided by the total number of messages in C_L; it reflects how frequently the ordered set of closed frequent adjacent segments represented by θ occurs in C_L.

The main advantage of the two-stage method is that, when the closed sequential pattern mining algorithm mines frequent sequences, the segmentation performed in the first stage reduces the amount of data processed in the second stage. In each evaluation a frequent sequence is extended by one keyword field rather than by a single character, which improves computational efficiency and makes the method suitable for analyzing large message samples. In addition, the closed frequent sequences obtained in the second stage reflect the relatively stable order of keywords in messages and lay the foundation for the message structure inference in the next step.

The workflow of the message structure inference stage is as follows. After the second stage, all closed frequent sequences in the message group C_L, whose messages share the same type, are available. These closed frequent sequences reflect the order in which keywords appear in messages of the type corresponding to C_L.

Message structure inference first determines the sequential relationships between keywords, that is, which keywords must appear before which other keywords. For example, in the HTTP protocol there is a strict sequential relationship between the keywords GET and HTTP/1.0: GET always appears before HTTP/1.0. Sequential relationships are determined by intersecting the closed frequent sequences of C_L: if certain keywords always appear before certain other keywords, a strict sequential relationship exists between them.

Next, the parallel relationships between keywords are determined. If several keywords can appear together in one message but never appear together in one closed frequent sequence, they are in a parallel relationship. For example, in the HTTP protocol the keywords Host: and User-Agent: appear in the same message but never together in any closed frequent sequence, so Host: and User-Agent: are considered parallel.

Finally, the hierarchical relationships between keywords are analyzed. In protocol messages, a format distinguisher (FD) field is closely tied to the format of the rest of the message: different values of the FD field imply different subsequent message formats. The hierarchical analysis aims to determine which keywords in a message are FD fields and to identify the keyword sequences associated with each FD field. By the nature of FD fields, an FD field and its associated keyword sequence usually occur together in a message, so the search can be based on confidence. Given a message L_i in the message group C_L and two non-overlapping subsequences t_i and t_j of L_i, the confidence of t_i to t_j in C_L is written Conf_CL(t_i → t_j). Its value equals the number of messages in C_L that contain both t_i and t_j divided by the number of messages in C_L that contain t_i. The higher the confidence of t_i to t_j in C_L, the more likely t_j is to appear in C_L whenever t_i appears. To identify FD fields, an association analysis algorithm from the field of sequential pattern mining searches the message group using the following two conditions: Conf_CL(k_i → k_m, …, k_n) ≥ Min_Conf and Conf_CL(k_m, …, k_n → k_i) ≥ Min_Conf. A keyword k_i that satisfies both is regarded as an FD field, and k_m, …, k_n as the keyword sequence corresponding to k_i. After this stage, the structural relationships between keywords are obtained.
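The three relationship checks described above can be sketched as follows. This is only an illustrative sketch: the function names, the representation of closed frequent sequences as tuples of keyword strings, and the sample messages are assumptions made for the example and are not taken from the patent text.

```python
def strictly_before(a, b, closed_seqs):
    """Sequential relation: a precedes b in every closed sequence holding both."""
    together = [s for s in closed_seqs if a in s and b in s]
    return bool(together) and all(s.index(a) < s.index(b) for s in together)


def parallel(a, b, messages, closed_seqs):
    """Parallel relation: a and b share a message but never a closed sequence."""
    co_occur = any(a in m and b in m for m in messages)
    share_seq = any(a in s and b in s for s in closed_seqs)
    return co_occur and not share_seq


def contains_in_order(message, parts):
    """True if all parts occur in the message, in order and without overlap."""
    pos = 0
    for part in parts:
        idx = message.find(part, pos)
        if idx < 0:
            return False
        pos = idx + len(part)
    return True


def confidence(antecedent, consequent, messages):
    """Conf(antecedent -> consequent): messages holding both / messages holding antecedent."""
    with_a = [m for m in messages if contains_in_order(m, antecedent)]
    if not with_a:
        return 0.0
    return sum(contains_in_order(m, consequent) for m in with_a) / len(with_a)


def is_format_distinguisher(k_i, keyword_seq, messages, min_conf):
    """Both confidence conditions must hold for k_i to be treated as an FD field."""
    return (confidence([k_i], keyword_seq, messages) >= min_conf
            and confidence(keyword_seq, [k_i], messages) >= min_conf)
```

For instance, with the hypothetical messages "TYPE=A;x=1;y=2", "TYPE=A;x=5;y=7", "TYPE=B;name=foo", and "TYPE=B;name=bar" and Min_Conf = 0.9, the keyword "TYPE=A" together with the sequence ("x=", "y=") satisfies both conditions, whereas "TYPE=" alone does not, because only half of the messages containing it also contain that sequence.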

As the technical solution above shows, the beneficial effects of the present invention are as follows. Keywords are identified accurately by performing closed sequential pattern mining on network traffic, and the influence of noise in the message samples is avoided. The keyword determination method of the invention requires that the frequency of a closed frequent adjacent segment in a message group of the same message type be no lower than the configured minimum support threshold. Because noise usually occurs in messages with low probability, it does not affect the mining of closed frequent adjacent segments. In addition, the two-stage closed sequential pattern mining method for messages addresses the heavy memory consumption and long computation time of existing closed sequential pattern mining algorithms; it improves efficiency, reduces memory consumption, and is suitable for analyzing large-scale network traffic. Furthermore, on the basis of the closed frequent sequences, the invention infers the sequential, parallel, and hierarchical relationships between protocol keywords and obtains accurate protocol format information, which increases the practical value of the protocol reverse engineering result.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall implementation flow of the present invention.

Fig. 2 is an example of a message group in the present invention.

Fig. 3 shows the closed frequent adjacent segments corresponding to the message group in Fig. 2.

Fig. 4 shows the closed frequent sequences corresponding to the message group in Fig. 2.

Detailed description of the embodiments

For a better understanding of the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.

As shown in Fig. 1, according to a preferred embodiment of the present invention, the protocol format inference method based on closed sequential pattern mining comprises the following steps:

(1) Message preprocessing: the network traffic is preprocessed. Session information in the network communication is obtained first, protocol messages are then extracted from the sessions, and messages of the same type are gathered into one message group.

(2) Protocol keyword identification and keyword sequence extraction: the specification of a network protocol usually defines keywords, which occur frequently in the communication data as fixed-pattern character strings. Based on the idea of frequent pattern mining, the present invention extracts these fixed-pattern strings and identifies protocol keywords. Protocol keyword identification mainly relies on closed sequential pattern mining to extract fixed-pattern strings, then uses keyword recognition strategies to identify which of them are keywords, and finally obtains the order information of keywords in messages through two-stage closed sequential pattern mining.

(3) Message structure inference: based on the keyword sequences obtained in the keyword identification stage, the sequential, parallel, and hierarchical relationships between protocol keywords are analyzed in turn to obtain detailed message structure information.

With reference to the overall implementation flow shown in Fig. 1, the protocol format inference method of this embodiment consists of three parts: message preprocessing, protocol keyword identification and keyword sequence extraction, and message structure inference. Each part is described below.

(1) Message preprocessing

Traffic-based protocol reverse engineering analyzes the similarity between communication messages. Protocol format inference therefore begins by preprocessing the captured network messages: messages with the same format are gathered together, and their common characteristics are then analyzed to infer the protocol format. During message preprocessing, the network communication is first divided into sessions according to the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol type>. Within a session, messages are labeled with sequence numbers according to their order of appearance, and messages with the same sequence number are treated as one class, because messages that appear at the same position in a session are usually of the same kind. However, packet loss, reordering, and similar conditions may occur during network transmission, so messages with the same sequence number are not necessarily of the same type. In this case the message payloads are extracted and the similarity between payloads is computed with a string matching algorithm to determine which messages have the same type; messages of the same type are gathered into one message group. After this processing, the messages in a message group share the same message characteristics and the same type, and serve as the input of the next stage.
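The preprocessing steps above can be sketched as follows. The text does not name a specific string matching algorithm, so this sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the packet dictionary layout, the function name `cluster_messages`, and the 0.6 similarity threshold are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def cluster_messages(packets, min_sim=0.6):
    """Split packets into sessions, then group messages by position and similarity."""
    # Step 1: one session per five-tuple, keeping messages in arrival order.
    sessions = defaultdict(list)
    for p in packets:
        key = (p["src_ip"], p["dst_ip"], p["src_port"], p["dst_port"], p["proto"])
        sessions[key].append(p["payload"])

    # Step 2: messages with the same sequence number form a first, coarse class.
    by_index = defaultdict(list)
    for payloads in sessions.values():
        for index, payload in enumerate(payloads):
            by_index[index].append(payload)

    # Step 3: refine each class by payload similarity to a group representative.
    groups = []
    for index in sorted(by_index):
        local = []
        for payload in by_index[index]:
            for group in local:
                if SequenceMatcher(None, group[0], payload).ratio() >= min_sim:
                    group.append(payload)
                    break
            else:
                local.append([payload])
        groups.extend(local)
    return groups
```

Comparing each payload only with the first member of a group keeps the sketch short; a real implementation would likely use a more robust clustering criterion.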

(2) Protocol keyword identification and keyword sequence extraction

From the perspective of a communication protocol, a protocol entity program must serialize the data to be transmitted into a byte stream for transmission over the network. Fixed-pattern character strings in the byte stream often carry specific meanings. Some identify the message type, such as a version number (for example "HTTP/1.0" in the HTTP protocol) or a protocol name. Others carry control information, such as a command code (for example "GET" in the HTTP protocol). These fixed-pattern strings with specific meanings are called protocol keywords. A protocol specification usually defines a number of keywords, which occur frequently in the communication data as fixed-pattern strings, so these strings can be extracted and protocol keywords identified on the basis of frequent pattern mining.

After the message preprocessing stage, messages with the same format are placed in the same message group; Fig. 2 is an example of a message group. Protocol keywords are determined on the basis of the closed frequent adjacent segments in the message group. A closed frequent adjacent segment is defined as follows: given a message group C_L and a minimum support threshold Min_Sup, if a character string t contained in the messages of the group has a support in C_L greater than Min_Sup, and every other character string that contains t has a support in C_L lower than that of t, then t is called a closed frequent adjacent segment. The support of t in C_L is the number of messages in C_L that contain t divided by the total number of messages in C_L; it reflects how frequently t occurs in C_L. Taking Fig. 2 as an example, the character string GET occurs in all 5 messages, so its support is 1, but the support falls below 1 as soon as any adjacent character is appended to GET. If Min_Sup is set to 0.9, GET is therefore a closed frequent adjacent segment. The closed frequent adjacent segments contained in Fig. 2 when Min_Sup is set to 0.9 are shown in Fig. 3.
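The support computation defined above is simple enough to show directly. The sketch below is illustrative only: the function name `support` and the five sample messages are assumptions and do not reproduce the message group of Fig. 2.

```python
def support(segment, group):
    """Fraction of messages in the group that contain the segment."""
    return sum(1 for message in group if segment in message) / len(group)


# Hypothetical message group of one message type (not the data of Fig. 2).
group = [
    "GET /index.html HTTP/1.0",
    "GET /a.gif HTTP/1.0",
    "GET /b/c.css HTTP/1.0",
    "GET /d HTTP/1.0",
    "GET /e.js HTTP/1.0",
]

print(support("GET", group))    # contained in every message
print(support(".html", group))  # contained in one message out of five
```

A segment counts once per message no matter how often it repeats inside that message, which matches the definition of support as a ratio of message counts.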

Because message data is characterized by long sequences, dense data, and large scale, the present invention designs a two-stage closed sequential pattern mining method that improves efficiency and reduces memory consumption. The first stage is the segmentation stage, whose main task is to determine protocol keywords from the closed frequent adjacent segments in the message group. The second stage is the pattern mining stage, whose main task is to obtain the relatively stable order relationships between protocol keywords with a closed sequential pattern mining algorithm.

The first step of the segmentation stage generates candidate character strings. Candidates are extracted from the message group: each message is cut into character strings with an n-gram model, and the characters within each string keep their original order and adjacency. For a message group, the messages are cut with a fixed length. The cutting length starts at 1 and is increased by 1 in each subsequent round. For example, the first round of cutting the message fragment "GET /index.html" yields the set of strings of length 1, n1 = {"G", "E", "T", " ", "/", ..., "h", "t", "m", "l"}; the second round yields the set of strings of length 2, n2 = {"GE", "ET", "T ", ..., "tm", "ml"}; and so on, up to n15 = {"GET /index.html"}.
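The round-by-round cutting described above corresponds to ordinary character n-gram extraction, sketched below. The function name `ngrams` is an assumption made for the example.

```python
def ngrams(message, n):
    """All contiguous substrings of length n, with duplicates removed."""
    return {message[i:i + n] for i in range(len(message) - n + 1)}


fragment = "GET /index.html"
for n in (1, 2, len(fragment)):
    print(n, sorted(ngrams(fragment, n)))
```

Using a set removes repeated segments within one message at generation time, which is one of the redundancies the later pruning step is meant to eliminate.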

The second step of the segmentation stage prunes the candidate character strings. The messages in the group have been discretized into character strings, which serve as candidate character segments for determining the protocol keywords in the messages. The candidate set may contain many segments that are duplicates or that do not meet the basic requirements of a keyword. To shrink the redundant search space, the candidate segments are pruned according to the frequent-adjacency property. Pruning here means that if a candidate segment does not meet the requirements of a closed frequent adjacent segment, there is no need to test whether the character strings containing it are closed frequent adjacent segments.

Frequency detection for a candidate segment checks whether its support meets the minimum support threshold. If the number of messages in the group that contain the candidate segment, divided by the total number of messages in the group, exceeds the minimum support threshold Min_Sup, the candidate is added to the set of frequent character segments of the corresponding length. Frequency detection starts with candidates of length 1 and proceeds length by length, yielding in turn the set of frequent segments of length 1, the set of frequent segments of length 2, and so on up to the set of frequent segments of length k. If the support of a candidate segment does not reach the minimum support threshold, the support of any character string containing that segment cannot exceed the threshold either, so all strings containing the segment can be deleted from the string sets generated by the n-gram model and their supports need not be computed. The point of candidate pruning is that, once a candidate segment is found not to be frequent, none of the strings containing it needs frequency detection, which reduces computational overhead and improves processing efficiency.
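The level-by-level frequency detection with pruning can be sketched in the Apriori style shown below. This is a sketch under assumptions: the function names are hypothetical, and the pruning test (both sub-segments one character shorter must already be frequent) is one straightforward way to apply the rule that a string containing an infrequent segment cannot be frequent.

```python
def support(segment, group):
    """Fraction of messages in the group that contain the segment."""
    return sum(1 for message in group if segment in message) / len(group)


def frequent_segments(group, min_sup):
    """Return {length: set of frequent segments}, built one length at a time."""
    levels = {}
    previous = None  # frequent segments of length n - 1
    n = 1
    while True:
        candidates = set()
        for message in group:
            for i in range(len(message) - n + 1):
                seg = message[i:i + n]
                # Pruning: skip any candidate that contains an infrequent
                # segment of length n - 1, without computing its support.
                if previous is None or (seg[:-1] in previous and seg[1:] in previous):
                    candidates.add(seg)
        frequent = {s for s in candidates if support(s, group) > min_sup}
        if not frequent:
            return levels
        levels[n] = frequent
        previous = frequent
        n += 1
```

The loop stops at the first length with no frequent segment, since no longer string can then be frequent either.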

The third step of the segmentation stage is closure detection. A character string t is called a closed frequent adjacent segment if its support in the message group exceeds Min_Sup and every character string that contains t has a lower support in the group than t. Closure detection therefore checks, for a given character segment, whether another character string exists that both contains the given segment and has a support not lower than that of the given segment. If such a string exists, the given segment is not closed; otherwise it is closed. For a frequent segment of length k, it suffices to search the set of frequent segments of length k+1 to reach the verdict. Because a protocol keyword is necessarily a closed frequent adjacent segment, frequent character segments that fail the closure requirement are not protocol keywords.
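A minimal sketch of closure detection follows, checking each frequent segment of length k only against the frequent segments of length k+1 as described above. The helper `frequent_by_length` is a brute-force stand-in used only to make the example self-contained; all names and the sample data are assumptions.

```python
def support(segment, group):
    """Fraction of messages in the group that contain the segment."""
    return sum(1 for message in group if segment in message) / len(group)


def frequent_by_length(group, min_sup):
    """Brute-force {length: frequent segments}; only suitable for toy inputs."""
    levels = {}
    for message in group:
        for i in range(len(message)):
            for j in range(i + 1, len(message) + 1):
                seg = message[i:j]
                if support(seg, group) > min_sup:
                    levels.setdefault(len(seg), set()).add(seg)
    return levels


def closed_segments(group, levels):
    """Keep frequent segments that no one-character-longer segment matches in support."""
    closed = set()
    for k, segments in levels.items():
        longer = levels.get(k + 1, set())
        for seg in segments:
            s = support(seg, group)
            if not any(seg in big and support(big, group) >= s for big in longer):
                closed.add(seg)
    return closed
```

Looking only at length k+1 is sufficient because support never increases as a string grows: if a much longer string had the same support as t, every string between them, including some one-character extension of t, would have that support too.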

The steps above yield the closed frequent adjacent segments in the message group. To improve the accuracy of keyword inference and to avoid classifying ordinary character strings that merely occur frequently in messages as protocol keywords, the present invention uses two keyword recognition strategies.

First, a short character string necessarily occurs at least as often as a longer string that contains it. For example, in HTTP messages the support of the string HOS is higher than that of the keyword HOST, because HOS appears not only inside the keyword HOST but may also arise from accidental combinations of characters. The first keyword recognition strategy adopted by the present invention is therefore: if a closed frequent adjacent segment is contained in another closed frequent adjacent segment, it is not a protocol keyword and can be filtered out.
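The first strategy amounts to a containment filter over the closed frequent adjacent segments. A small sketch, with the function name `drop_contained` and the `{segment: support}` input chosen here for illustration only:

```python
def drop_contained(closed):
    """Remove every segment that is contained in another closed segment.

    closed: {segment: support} of closed frequent adjacent segments.
    """
    segments = list(closed)
    return {s: closed[s] for s in segments
            if not any(s != t and s in t for t in segments)}
```

With the HOS/HOST example from the text, HOS is dropped because HOST contains it, even though HOS has the higher support.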

Second, a closed frequent adjacent segment that lies at a fixed offset from the start of the message or at a fixed offset from the end of the message is usually a protocol keyword. For example, the first 4 bytes of the application-layer payload of the Xunlei (Thunder) protocol, 0x39, 0x00, 0x00, 0x00, carry protocol control information and are a keyword of that protocol. The second keyword recognition strategy adopted by the present invention is therefore: if a closed frequent adjacent segment has a fixed offset relative to the start or the end of the message, it is regarded as a protocol keyword.
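A rough sketch of the fixed-offset test follows. It makes simplifying assumptions that the text does not state: only the first occurrence is measured from the start and only the last occurrence from the end, and messages that do not contain the segment are ignored. The name `has_fixed_offset` is hypothetical.

```python
def has_fixed_offset(segment, messages):
    """True if segment sits at one constant offset from the start, or one
    constant offset from the end, in every message that contains it."""
    from_start, from_end = set(), set()
    for msg in messages:
        first = msg.find(segment)
        if first < 0:
            continue  # segment absent from this message
        last = msg.rfind(segment)
        from_start.add(first)
        from_end.add(len(msg) - (last + len(segment)))
    return len(from_start) == 1 or len(from_end) == 1
```

For instance, the 4-byte prefix quoted above would pass because its offset from the start is 0 in every payload that carries it.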

The second stage of the two-stage closed sequential pattern mining method is the pattern mining stage. Starting from the closed frequent adjacent segments that the first stage inferred to be keywords, it uses a closed sequential pattern mining algorithm to generate closed frequent sequences made up of multiple keywords, finally obtaining the set of closed frequent sequences in a message group and, with it, the relatively stable ordering relations among protocol keywords. A closed frequent sequence, denoted θ, is an ordered set of closed frequent adjacent segments in the message group. The segments in θ are arranged in the same order in which they appear in the messages; the support of θ in message group C_L must exceed the configured minimum support threshold Min_Sup; and any other ordered set containing θ must have a support in C_L lower than that of θ. The support of θ in C_L is the number of messages in C_L that contain the ordered set θ divided by the total number of messages in C_L. It reflects how often the ordered set of closed frequent adjacent segments represented by θ occurs in C_L.

Taking Fig. 2 as an example, with Min_Sup = 0.9 the keyword sequence <<GET>, <HTTP/1.0>, <Host:>> occurs in the 5 messages of the message group in Fig. 2, giving a support of 1. Appending any further keyword to the sequence yields a support below 1, so it is a closed frequent sequence. Fig. 4 shows the closed frequent sequences of the message group in Fig. 2 when Min_Sup is set to 0.9.
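The support of an ordered keyword sequence can be computed as sketched below. The helper names `contains_in_order` and `sequence_support` are illustrative, and the sample messages in the usage note are made up for this sketch; they are not the messages of Fig. 2.

```python
def contains_in_order(message, keywords):
    """True if the keywords occur in the message in the given order,
    without overlapping (leftmost match for each keyword in turn)."""
    pos = 0
    for kw in keywords:
        idx = message.find(kw, pos)
        if idx < 0:
            return False
        pos = idx + len(kw)
    return True

def sequence_support(messages, keywords):
    """Fraction of messages in the group that contain the ordered sequence."""
    return sum(contains_in_order(m, keywords) for m in messages) / len(messages)
```

On a two-message group such as `["GET /a HTTP/1.0\r\nHost: x\r\n", "GET /b HTTP/1.0\r\nHost: y\r\nUser-Agent: z\r\n"]`, the sequence `["GET", "HTTP/1.0", "Host:"]` is contained in both messages, while `["GET", "User-Agent:"]` is contained in only one.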

When processing messages, the pattern mining stage describes each message in the segmented form "key field + variable field", where the part between two key fields is treated as a variable field. Because the input is a segmented message whose fields are delimited by keywords, the amount of data to process is much smaller than when single characters are the processing unit, so an existing closed sequential pattern mining algorithm can be used. The embodiment uses the BIDE algorithm, which yields all closed frequent sequences of keywords in the message group.
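The "key field + variable field" form can be illustrated with a small splitter. This sketch is an assumption about how the segmentation might be done: the tags `"K"` and `"V"`, the function name `segment_message`, and the tie-breaking rule (earliest match, longest keyword first) are all introduced here, and the keyword list is assumed to contain no empty strings.

```python
def segment_message(message, keywords):
    """Split a message into ('K', keyword) and ('V', variable part) fields."""
    fields, pos = [], 0
    while pos < len(message):
        # Earliest keyword occurrence at or after pos; longest keyword wins ties.
        hits = [(message.find(kw, pos), -len(kw), kw) for kw in keywords]
        hits = [h for h in hits if h[0] >= 0]
        if not hits:
            fields.append(("V", message[pos:]))  # trailing variable field
            break
        start, _, kw = min(hits)
        if start > pos:
            fields.append(("V", message[pos:start]))
        fields.append(("K", kw))
        pos = start + len(kw)
    return fields
```

A sequence miner such as BIDE would then operate on the `"K"` items only, so each extension step adds one keyword instead of one character.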

Overall, the main advantage of the two-stage closed sequential pattern mining method is that, before a closed sequential pattern mining algorithm is applied to mine frequent sequences, the first stage identifies protocol keywords and segments the messages on them, describing each message in the form "key field + variable field". Segmentation reduces the amount of data handled in the second stage: each extension step of a frequent sequence adds one key field rather than a single character, which improves computational efficiency and makes the method suitable for analyzing fairly large message samples. In addition, the closed frequent sequences inferred in the second stage directly reflect the relatively stable ordering of keywords in messages and lay the foundation for the subsequent message structure inference.

(3) Message structure inference

The preceding steps yield all closed frequent sequences in a message group C_L of identical message type; these sequences reflect the ordering of keywords in the messages of C_L. Message structure inference uses the extracted keyword sequences to distinguish the sequential, parallel and hierarchical relations among different keywords and thus obtains detailed message structure information.

Message structure inference first infers the sequential relations between keywords, that is, which keywords must appear before which others. The method is to take the intersection of the closed frequent sequences in message group C_L: if some keywords always appear before certain other keywords, a strict sequential relation holds between them. For example, in the closed frequent sequences shown in Fig. 4, the keyword GET always precedes the keyword HTTP/1.0, so GET and HTTP/1.0 are in a strict sequential relation.
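One way to read "taking the intersection" is to keep an ordered pair (a, b) only if a precedes b in every closed frequent sequence that contains both. The sketch below follows that reading, which is an interpretation rather than the text's stated procedure; it assumes each keyword occurs at most once per sequence, and the name `strict_orders` is hypothetical.

```python
from itertools import permutations

def strict_orders(sequences):
    """Return the set of pairs (a, b) such that a precedes b in every
    closed frequent sequence containing both keywords."""
    keywords = {kw for seq in sequences for kw in seq}
    orders = set()
    for a, b in permutations(keywords, 2):
        together = [s for s in sequences if a in s and b in s]
        if together and all(s.index(a) < s.index(b) for s in together):
            orders.add((a, b))
    return orders
```

Given two illustrative sequences `("GET", "HTTP/1.0", "Host:")` and `("GET", "HTTP/1.0", "User-Agent:")`, the pair `("GET", "HTTP/1.0")` is reported, while no order is reported between `Host:` and `User-Agent:` because they never share a sequence.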

Next, the parallel relations between keywords are determined. If two keywords can appear together in one message but never appear together in one closed frequent sequence, they are in a parallel relation. For example, in the HTTP protocol the keywords Host: and User-Agent: appear together in a message but never appear together in any closed frequent sequence, so Host: and User-Agent: can be judged to be parallel.
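The parallel test combines a co-occurrence check on messages with a co-occurrence check on closed frequent sequences. A minimal sketch, in which the function name `parallel_pairs` and the representation (messages as strings, sequences as tuples of keywords) are assumptions made for illustration:

```python
from itertools import combinations

def parallel_pairs(messages, sequences, keywords):
    """Pairs of keywords that share some message but no closed frequent sequence."""
    pairs = set()
    for a, b in combinations(sorted(keywords), 2):
        co_message = any(a in m and b in m for m in messages)     # substring test
        co_sequence = any(a in s and b in s for s in sequences)   # tuple membership
        if co_message and not co_sequence:
            pairs.add((a, b))
    return pairs
```

In the HTTP example, `Host:` and `User-Agent:` can swap positions between messages, so no single ordered sequence containing both stays frequent, and the pair is reported as parallel.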

The final step determines the hierarchical relations between keywords. In protocol messages, a format distinguisher (FD) field is tightly coupled to the format of the message that follows it: different FD values imply different subsequent formats. For example, in the eMule protocol, when the FD field (the opcode) is the keyword "0x58", the associated subsequent format is a sequence of keywords such as "file ID" and "file status", whereas when the FD field is 0x59 the associated subsequent format is a sequence of keywords such as "file ID" and "Name Length". The analysis of hierarchical relations aims to determine which keywords in a message are FD fields and to recover the keyword sequences associated with each FD field.

FD fields are identified mainly with an association analysis algorithm from sequential pattern mining, using confidence as the criterion. Given a message L_i in message group C_L and two non-overlapping subsequences t_i and t_j of L_i, the confidence of t_i to t_j in C_L is written Conf_CL(t_i→t_j). Its value equals the number of messages in C_L that contain both t_i and t_j divided by the number of messages in C_L that contain t_i. The higher this confidence, the more likely t_j is to appear in C_L whenever t_i appears. FD field identification uses the rule generation procedure of the Apriori algorithm, a common association analysis algorithm in sequential pattern mining. The message group is searched under the following two conditions: (1) Conf_CL(k_i→k_m,…,k_n) ≥ Min_Conf; (2) Conf_CL(k_m,…,k_n→k_i) ≥ Min_Conf. The resulting k_i is regarded as an FD field, and k_m,…,k_n as the keyword sequence corresponding to k_i.
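The two confidence conditions can be sketched as below. For brevity each message is modeled as the set of keywords it contains, which ignores keyword order; that simplification, the function names `confidence` and `is_fd_field`, and the sample data in the usage note are assumptions of this sketch rather than details from the text.

```python
def confidence(messages, antecedent, consequent):
    """Conf(antecedent -> consequent): among messages containing every item of
    antecedent, the fraction that also contain every item of consequent.
    Each message is a set of keywords."""
    with_antecedent = [m for m in messages if antecedent <= m]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= m for m in with_antecedent) / len(with_antecedent)

def is_fd_field(messages, k_i, followers, min_conf):
    """Check conditions (1) and (2): both directions must reach Min_Conf."""
    fd, seq = frozenset([k_i]), frozenset(followers)
    return (confidence(messages, fd, seq) >= min_conf
            and confidence(messages, seq, fd) >= min_conf)
```

With three messages containing `{"0x58", "file ID", "file status"}` and two containing `{"0x59", "file ID", "Name Length"}`, the keyword `0x58` and the followers `file ID`, `file status` satisfy both conditions, whereas `file ID` alone does not qualify as an FD field for either opcode because it appears under both.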

Through the above processing steps, the sequential, parallel and hierarchical relations among protocol keywords are obtained, giving detailed message structure information.

In summary, the protocol format inference method based on closed sequential pattern mining of the present invention, having collected network communication messages related to an unknown protocol, applies a two-stage closed pattern mining strategy to mine closed sequential patterns from the messages, infers the protocol keywords in them, and generates keyword sequences that reflect the ordering of keywords; on this basis it extracts the sequential, parallel and hierarchical relations among keywords and obtains message structure information. By mining closed sequential patterns from network traffic, the invention identifies keywords accurately and avoids the influence of noise in the message samples. The keyword determination method requires the frequency of a closed frequent adjacent segment within a message group of identical message type to be no lower than the configured minimum support threshold; since noise usually occurs in messages with low probability, it does not affect the mining of closed frequent adjacent segments. In addition, the two-stage closed sequential pattern mining method addresses the large memory consumption and long computation time of existing closed sequential pattern mining algorithms, improving efficiency, reducing memory consumption, and making the method suitable for analyzing large-scale network traffic. Furthermore, on the basis of the closed frequent sequences, the invention infers the sequential, parallel and hierarchical relations among protocol keywords and obtains accurate protocol format information, which increases the practical value of protocol reverse engineering results.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. A person of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims

1. A protocol format inference method based on closed sequential pattern mining, characterized by comprising the following steps:

(1) Message preprocessing: divide the message samples into sessions according to the five-tuple <source IP address, destination IP address, source port, destination port, transport-layer protocol type>, extract protocol messages from each session, and then cluster messages of the same type into one class to facilitate analysis.

(2) Protocol keyword identification and keyword sequence extraction: apply the two-stage closed sequential pattern mining method. The first stage is the segmentation stage: for a message group that gathers messages of the same type, closed frequent adjacent segments are extracted from the group by closed sequential pattern mining, and keyword recognition strategies are used to judge whether these segments are protocol keywords. The second stage is the pattern mining stage: starting from the closed frequent adjacent segments inferred to be keywords in the first stage, a closed sequential pattern mining algorithm generates closed frequent sequences made up of multiple keywords, finally yielding all closed frequent sequences in a message group.

(3) Message structure inference: using the extracted keyword sequences, distinguish the sequential, parallel and hierarchical relations among different keywords to obtain detailed message structure information.

The workflow of the aforementioned message preprocessing stage is as follows: the message samples are divided into sessions according to the five-tuple <source IP address, destination IP address, source port, destination port, transport-layer protocol type>; within each session the messages are numbered in order of occurrence, and messages with the same sequence number are treated as one class. Because packet loss, reordering and similar factors can occur in network transmission, an identical sequence number alone does not guarantee an identical message type; the message payloads are therefore also extracted, the similarity between payloads is computed to find messages of the same type, and these are gathered into one message group. After this processing, the messages in a message group have the same type and share the same characteristics, and can serve as the basis for the next stage of analysis.

The workflow of the aforementioned protocol keyword identification and keyword sequence extraction stage is as follows: the two-stage closed sequential pattern mining method is applied. The first stage is the segmentation stage: for a message group that gathers messages of the same type, closed frequent adjacent segments are extracted from the group by closed sequential pattern mining, and keyword recognition strategies are used to judge whether these segments are keywords. The first step of the segmentation stage generates candidate character strings: candidates are extracted from the message group by splitting messages into character strings with an n-gram model, the characters in each string keeping their original order and adjacency. The second step of the segmentation stage prunes the candidate character strings. The messages in the message group are discretized into character strings, which serve as candidate character segments for determining the protocol keywords in the messages. The candidate set may contain many segments that are duplicated or that do not meet the basic requirements of a keyword, so to reduce the redundant search space the candidates are pruned according to the frequent-adjacency property. During pruning, if the support of a candidate segment does not reach the minimum support threshold, no character string containing that segment can exceed the threshold either; all strings containing the segment are therefore deleted from the string sets of different lengths generated by the n-gram model, and their support is not computed. The third step of the segmentation stage is closure detection: for a given character segment, it checks whether there exists a character string that contains the segment and has a support not less than that of the segment. Since a protocol keyword is necessarily a closed frequent adjacent segment, frequent character segments that fail the closure requirement cannot be protocol keywords. The second stage of the two-stage closed sequential pattern mining method is the pattern mining stage: starting from the closed frequent adjacent segments inferred to be keywords in the first stage, it processes the segmented messages whose fields are delimited by keywords, uses a closed sequential pattern mining algorithm to generate closed frequent sequences made up of multiple keywords, and finally obtains all closed frequent sequences in a message group, from which the relatively stable ordering relations among protocol keywords are obtained.

The workflow of the aforementioned message structure inference stage is as follows: message structure inference first determines the sequential relations between keywords, that is, which keywords must appear before which others. The method is to take the intersection of the closed frequent sequences in the message group: if some keywords always appear before certain other keywords, a strict sequential relation holds between them. Next, the parallel relations between keywords are determined: if some keywords can appear together in one message but never appear together in one closed frequent sequence, they are judged to be parallel. Finally, the hierarchical relations between keywords are analyzed. In protocol messages, a format distinguisher (FD) field is tightly coupled to the format of the message that follows it: different FD values imply different subsequent formats. The analysis of hierarchical relations aims to determine which keywords in a message are FD fields and to recover the keyword sequences associated with each FD field. Because an FD field and its associated keyword sequence usually occur together in messages, an association analysis algorithm from the field of sequential pattern mining is used to search the message group on the basis of confidence and to determine the FD fields and their associated keyword sequences.