CN108667839A - A protocol format inference method based on closed sequential pattern mining
- Publication number
- CN108667839A
- Application number
- CN201810450347.4A
- Authority
- CN
- China
- Prior art keywords
- message
- keyword
- closed
- protocol
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/06—Notations for structuring of protocol data, e.g. abstract syntax notation one [ASN.1]
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/50—Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate
Abstract
The present invention provides a protocol format inference method based on closed sequential pattern mining, comprising the following steps: message preprocessing, protocol keyword identification and keyword sequence extraction, and message structure inference. The invention applies a two-stage closed pattern mining strategy to communication messages, which addresses the heavy memory consumption and long computation time of existing closed sequential pattern mining algorithms. The method avoids the influence of noise in the message samples, accurately identifies protocol keywords, and generates keyword sequences that capture the order relationships among keywords. On the basis of the keyword sequences, the order, parallel, and hierarchical relationships among protocol keywords are inferred, yielding accurate message structure information.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a method that analyzes the input and output messages of a protocol entity program and infers the protocol message format from the structural and semantic similarity of messages of the same type.
Background art
Network protocols are a key element of network communication, and their quality directly affects the stability, reliability, and security of communication. Analyzing network protocols, discovering the security vulnerabilities in the protocols and in the programs that implement them, and applying security protections in time all help reduce security incidents.
If the specification of a network protocol is known, analyzing the protocol is relatively easy. For example, the well-known open-source software Wireshark can parse more than 2,000 known protocols and extract a large amount of valuable information from them. For network protocols whose specifications are not public, however, protocol analyzers such as Wireshark are of no help. In such cases, some researchers attempt to obtain the specification of an unknown protocol through protocol reverse engineering.
Protocol reverse engineering aims to recover the protocol format and the protocol state machine. Protocol format recovery mainly infers protocol keywords, message structure, and field semantics. Protocol state machine recovery usually builds on the protocol format information: it identifies the protocol states that occur while the protocol runs and analyzes the transitions between them.
Depending on the object of study, protocol reverse engineering is generally divided into two classes: techniques based on execution traces and techniques based on network traffic. Execution-trace-based methods obtain message format information by monitoring how a protocol entity processes a message and how it uses each message fragment. Network-traffic-based methods rest on the following observation: each protocol message is a concrete instance of the protocol specification, and messages of the same type are similar; this similarity reflects the relatively stable parts of the message format, from which the protocol message format can be inferred. Compared with execution-trace-based techniques, network-traffic-based techniques make sample collection and analysis easier and are more highly automated, so they are more widely applicable. The present invention provides a method for protocol format inference based on network traffic.
A number of research results on network-traffic-based protocol format inference for unknown protocols already exist in China and abroad. AutoReEngine uses the Apriori algorithm to mine frequent strings; frequent strings whose position-change frequency is below a threshold are taken as protocol keywords, and the method achieves relatively high keyword identification accuracy. The ProWord system introduces word segmentation and phrase recognition techniques from natural language processing to identify protocol keywords. However, these methods obtain only individual keywords in a message and ignore the constraint relationships among keywords.
Some researchers segment complete messages and extract message features, attempting to obtain the combination constraints among keywords while identifying them. The PI project analyzes the target protocol by introducing sequence alignment algorithms from bioinformatics, and the protocol reverse engineering tool Netzob also uses sequence alignment; however, for messages whose field positions vary widely or that contain many variable-length fields, the accuracy of the inferred results is low. The PRISMA method identifies protocol keywords with n-grams, computes the correlation between keywords with the Pearson correlation coefficient, and infers semantic templates of the protocol. Although these methods can obtain constraint relationships among protocol keywords to some extent, they yield only flat keyword sequences and do not analyze the order, parallel, and hierarchical relationships among message keywords.
Overall, existing network-traffic-based methods for inferring the format of unknown protocols have limited keyword inference accuracy and do not fully consider structural characteristics such as the order, parallel, and hierarchical relationships among message keywords. In addition, network-traffic-based analysis must deal with noise. On the one hand, message samples of an unknown protocol are often mixed with messages of other protocols; on the other hand, because of communication link quality problems, the samples often contain messages that are out of order or incomplete. For protocol analysis, these messages are noise. Protocol noise interferes with the reverse engineering results and reduces the accuracy of the analysis.
Summary of the invention
In view of the problems in the prior art, the present invention aims to provide a protocol format inference method based on network traffic. After collecting the communication messages of an unknown protocol entity program, the method applies a two-stage closed pattern mining strategy to perform closed sequential pattern mining on the messages, identifies protocol keywords, and generates keyword sequences that capture the order relationships among keywords. From the keyword sequences it analyzes the constraint relationships among keywords and then infers the message structure, producing detailed protocol format information that covers the order, parallel, and hierarchical relationships among keywords. Protocol keyword identification is based on a support threshold: a keyword is identified only if its frequency of occurrence exceeds the configured support threshold. This helps reduce the influence of protocol noise in the message samples and preserves the accuracy of keyword identification.
To achieve the above purpose, the technical solution adopted by the present invention is as follows:
A protocol format inference method based on closed sequential pattern mining, comprising the following steps:
The first step is the message preprocessing stage. The message samples are divided into sessions according to the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol type>; protocol messages are then extracted from each session, and messages of the same type are clustered together.
The second step is the protocol keyword identification and keyword sequence extraction stage. A two-stage closed sequential pattern mining method is applied. The first stage is the segmentation stage: for a message group that gathers messages of the same type, closed frequent adjacent segments are extracted by closed sequential pattern mining, and a keyword identification strategy decides whether each closed frequent adjacent segment is a protocol keyword. The second stage is the pattern mining stage: starting from the closed frequent adjacent segments inferred to be keywords in the first stage, a closed sequential pattern mining algorithm generates closed frequent sequences composed of multiple keywords, finally yielding all closed frequent sequences in the message group.
The third step is the message structure inference stage. Based on the extracted keyword sequences, message structure inference distinguishes the order, parallel, and hierarchical relationships among different keywords and obtains detailed message structure information.
The workflow of the message preprocessing stage is as follows. The message samples are divided into sessions according to the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol type>. Within each session, the messages are numbered in order of appearance, and messages with the same sequence number are initially treated as one class. Because packet loss, reordering, and similar factors can occur in network transmission, an identical sequence number alone does not guarantee that messages are of the same type. The message payloads are therefore also extracted, messages of the same type are found by computing the similarity between payloads, and they are gathered into one message group. After this processing, the messages in a message group are of the same type and share the same features, and they serve as the basis for the next stage of analysis.
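The session splitting and sequence numbering described above can be illustrated with a minimal Python sketch. It is not part of the disclosed embodiment: the packet dictionary layout and the names `split_sessions` and `group_by_sequence_number` are assumptions made for illustration, and the five-tuple is treated as a directed key.

```python
from collections import defaultdict

def split_sessions(packets):
    """Group payloads by the (directed) five-tuple, preserving capture order."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
               pkt["dst_port"], pkt["proto"])
        sessions[key].append(pkt["payload"])
    return dict(sessions)

def group_by_sequence_number(sessions):
    """Number the messages of each session by order of appearance and
    pool messages that carry the same sequence number."""
    groups = defaultdict(list)
    for payloads in sessions.values():
        for seq_no, payload in enumerate(payloads, start=1):
            groups[seq_no].append(payload)
    return dict(groups)

# Hypothetical capture: two sessions to the same server.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "src_port": 40001,
     "dst_port": 80, "proto": "TCP", "payload": "GET /a HTTP/1.0"},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "src_port": 40002,
     "dst_port": 80, "proto": "TCP", "payload": "GET /b HTTP/1.0"},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "src_port": 40001,
     "dst_port": 80, "proto": "TCP", "payload": "QUIT"},
]
sessions = split_sessions(packets)
groups = group_by_sequence_number(sessions)
```

The groups produced this way are only a first approximation; as the text notes, payload similarity is still needed to confirm that pooled messages are really of the same type.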
The workflow of the protocol keyword identification and keyword sequence extraction stage is as follows. To identify protocol keywords, the first task is to find the closed frequent adjacent segments in a message group whose messages are of the same type. Given a message group C_L and a minimum support threshold Min_Sup, if a string t contained in messages of the group has support in C_L greater than Min_Sup, and every other string that contains t has support in C_L strictly lower than that of t, then t is called a closed frequent adjacent segment. The support of t in C_L is the number of messages in C_L that contain the string t divided by the total number of messages in C_L; it reflects how frequently t occurs in C_L.
Because message data consists of long sequences, is dense, and is large in scale, the present invention uses a two-stage closed sequential pattern mining method, which improves efficiency and reduces memory consumption. The first stage is the segmentation stage: for a message group that gathers messages of the same type, all closed frequent adjacent segments in the group are searched, and a keyword identification strategy decides whether each of them is a keyword.
The first step of the segmentation stage generates candidate strings. Candidate strings are extracted from the message group: messages are split into strings with an n-gram model, and the characters in each string keep their original order and adjacency.
The second step of the segmentation stage prunes the candidate strings. The messages in the group are discretized into strings that are used to determine the protocol keywords in the messages, and these strings serve as candidate character segments. The candidate set may contain many segments that are duplicates or that do not meet the basic requirements of a keyword. To reduce the redundant search space, the candidate segments are pruned according to the frequent-adjacency property: if the support of a candidate segment does not reach the minimum support threshold, the support of any string containing that segment cannot exceed the threshold either. All strings containing that segment can therefore be deleted from the n-gram string sets of every length, and their support need not be computed.
The third step of the segmentation stage is closure checking. For a given character segment, it is determined whether some other string both contains the segment and has support no lower than that of the segment. If such a string exists, the given segment is not closed; otherwise it is closed. Because a protocol keyword must be a closed frequent adjacent segment, frequent character segments that fail the closure requirement are not protocol keywords.
The second stage of the two-stage closed sequential pattern mining method is the pattern mining stage. Starting from the closed frequent adjacent segments inferred to be keywords in the first stage, a closed sequential pattern mining algorithm generates closed frequent sequences composed of multiple keywords, finally yielding the set of closed frequent sequences in the message group. A closed frequent sequence is an ordered set of closed frequent adjacent segments of the message group, denoted θ. The segments in θ are arranged in the order in which they appear in the messages; the support of θ in C_L must exceed the configured minimum support threshold Min_Sup; and every other ordered set that contains θ must have support in C_L strictly lower than that of θ. The support of θ in C_L is the number of messages in C_L that contain the ordered set θ divided by the total number of messages in C_L; it reflects how frequently the ordered set of closed frequent adjacent segments represented by θ occurs in C_L.
The advantage of the two-stage closed sequential pattern mining method is that, when a closed sequential pattern mining algorithm is used to mine frequent sequences, the segmentation of the first stage reduces the amount of data processed in the second stage: each extension of a frequent sequence adds a whole keyword field rather than a single character, which improves computational efficiency and makes the method suitable for analyzing large message samples. In addition, the closed frequent sequences obtained in the second stage reflect the relatively stable order of keywords within messages and lay the foundation for the subsequent message structure inference.
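As a hedged illustration of the second (pattern mining) stage, the following Python sketch enumerates closed frequent sequences by brute force. It assumes that each message has already been reduced to the ordered list of keywords found in the first stage; the function names are hypothetical, and the exhaustive enumeration merely stands in for whichever closed sequential pattern mining algorithm an implementation would actually use. It is practical only for small inputs.

```python
from itertools import combinations

def is_subsequence(theta, tokens):
    """True if the items of theta occur in tokens in the same order."""
    it = iter(tokens)
    return all(item in it for item in theta)

def sequence_support(theta, group):
    """Share of messages in the group that contain the ordered set theta."""
    return sum(is_subsequence(theta, toks) for toks in group) / len(group)

def closed_frequent_sequences(group, min_sup):
    # Every order-preserving sub-list of every message is a candidate.
    candidates = {c for toks in group
                  for r in range(1, len(toks) + 1)
                  for c in combinations(toks, r)}
    frequent = {}
    for cand in candidates:
        sup = sequence_support(cand, group)
        if sup > min_sup:
            frequent[cand] = sup
    # A sequence is closed if no longer sequence containing it is equally supported.
    return sorted(
        seq for seq, sup in frequent.items()
        if not any(seq != other and is_subsequence(seq, other)
                   and frequent[other] >= sup for other in frequent)
    )

# Hypothetical keyword lists for three messages of one type.
group = [
    ["GET", "HTTP/1.0", "Host:", "User-Agent:"],
    ["GET", "HTTP/1.0", "User-Agent:", "Host:"],
    ["GET", "HTTP/1.0", "Host:", "User-Agent:"],
]
result = closed_frequent_sequences(group, min_sup=0.9)
```

Because each candidate extension is a whole keyword rather than a single character, the candidate space here is far smaller than it would be on raw byte strings, which is the efficiency argument made in the text.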
The workflow of the message structure inference stage is as follows. After the second stage, all closed frequent sequences in a message group C_L of one message type are available; these closed frequent sequences reflect the order in which keywords occur in messages of the type corresponding to C_L.
Message structure inference first determines the order relationships among keywords, that is, which keywords must appear before which others. For example, in the HTTP protocol there is a strict order relationship between the keywords GET and HTTP/1.0: GET always appears before HTTP/1.0. Order relationships are determined by intersecting the closed frequent sequences of C_L: if certain keywords always appear before certain other keywords, a strict order relationship exists between them.
Next, the parallel relationships among keywords are determined. If keywords can appear in the same message but never appear together in the same closed frequent sequence, they are in a parallel relationship. For example, in the HTTP protocol the keywords Host: and User-Agent: appear in the same message but never appear together in any closed frequent sequence, so Host: and User-Agent: are considered parallel.
Finally, the hierarchical relationships among keywords are analyzed. In protocol messages, a format distinguisher (FD) field is tightly coupled to the format of the message that follows it: different FD values imply different subsequent message formats. The analysis of hierarchical relationships aims to decide which keywords in a message are FD fields and to determine the keyword sequences associated with each FD field. By the nature of FD fields, an FD field and its associated keyword sequence usually occur together, so they can be searched for on the basis of confidence. Given a message L_i in the message group C_L and two non-overlapping subsequences t_i and t_j of L_i, the confidence of t_i with respect to t_j in C_L is written Conf_CL(t_i → t_j). Its value is the number of messages in C_L that contain both t_i and t_j divided by the number of messages in C_L that contain t_i. The higher this confidence, the more likely t_j is to appear in C_L whenever t_i appears. To identify FD fields, an association analysis algorithm from the field of sequential pattern mining searches the message group using the following two conditions: Conf_CL(k_i → k_m, …, k_n) ≥ Min_Conf and Conf_CL(k_m, …, k_n → k_i) ≥ Min_Conf. The resulting k_i is regarded as an FD field, and k_m, …, k_n as the keyword sequence associated with k_i. This stage yields the structural relationships among keywords.
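The three relationship tests described above can be sketched in Python as follows. This is an illustrative sketch under simplifying assumptions, not the disclosed implementation: messages are plain strings, a keyword sequence is approximated by a single substring when computing confidence, and the function names and sample data are hypothetical.

```python
def confidence(group, antecedent, consequent):
    """Conf(antecedent -> consequent): among the messages that contain the
    antecedent, the share that also contain the consequent."""
    with_antecedent = [msg for msg in group if antecedent in msg]
    if not with_antecedent:
        return 0.0
    return sum(consequent in msg for msg in with_antecedent) / len(with_antecedent)

def is_fd_pair(group, fd, sequence, min_conf):
    """Both confidence conditions from the text must hold."""
    return (confidence(group, fd, sequence) >= min_conf
            and confidence(group, sequence, fd) >= min_conf)

def always_before(closed_sequences, first, second):
    """Strict order: 'first' precedes 'second' in every closed frequent
    sequence that contains both, and at least one such sequence exists."""
    both = [s for s in closed_sequences if first in s and second in s]
    return bool(both) and all(s.index(first) < s.index(second) for s in both)

def is_parallel(group, closed_sequences, a, b):
    """Parallel: a and b co-occur in some message but in no closed frequent sequence."""
    co_occur = any(a in msg and b in msg for msg in group)
    same_sequence = any(a in s and b in s for s in closed_sequences)
    return co_occur and not same_sequence

# Hypothetical data for the checks.
typed = ["TYPE=A x=1", "TYPE=A x=2", "TYPE=B y=1"]
http = ["GET / HTTP/1.0 Host: a User-Agent: b",
        "GET / HTTP/1.0 User-Agent: b Host: a"]
seqs = [("GET", "HTTP/1.0", "Host:"), ("GET", "HTTP/1.0", "User-Agent:")]
```

In this toy data, `TYPE=A` and `x=` always appear together, so the pair passes both confidence conditions, whereas the bare prefix `TYPE=` does not predict `x=` reliably.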
As the above technical solution shows, the beneficial effects of the present invention are as follows. Performing closed sequential pattern mining on network traffic identifies keywords accurately while avoiding the influence of noise in the message samples. The keyword decision method of the invention requires that the frequency with which a closed frequent adjacent segment occurs in a message group of one type be no lower than the configured minimum support threshold; since noise usually occurs in messages with low probability, it does not affect the mining of closed frequent adjacent segments. In addition, the two-stage closed sequential pattern mining method addresses the heavy memory consumption and long computation time of existing closed sequential pattern mining algorithms, improving efficiency and reducing memory consumption, and is suitable for analyzing large-scale network traffic. Furthermore, on the basis of the closed frequent sequences, the invention infers the order, parallel, and hierarchical relationships among protocol keywords, obtains accurate protocol format information, and increases the practical value of the reverse engineering results.
Description of the drawings
Fig. 1 is a schematic diagram of the overall implementation flow of the present invention.
Fig. 2 is an example of a message group in the present invention.
Fig. 3 shows the closed frequent adjacent segments corresponding to the message group of Fig. 2.
Fig. 4 shows the closed frequent sequences corresponding to the message group of Fig. 2.
Detailed description
To better explain the technical content of the present invention, specific embodiments are described below with reference to the drawings.
As shown in Fig. 1, according to a preferred embodiment of the present invention, the protocol format inference method based on closed sequential pattern mining comprises the following steps:
(1) Message preprocessing: the network traffic is preprocessed. The session information in the network communication is obtained first; protocol messages are then extracted from the sessions, and messages of the same type are gathered into a message group.
(2) Protocol keyword identification and keyword sequence extraction: a protocol specification usually defines keywords, which occur frequently in the communication data as fixed-pattern strings. Following the idea of frequent pattern mining, the invention extracts these fixed-pattern strings and identifies the protocol keywords. Protocol keyword identification mainly extracts fixed-pattern strings by closed sequential pattern mining, uses a keyword identification strategy to select the fixed-pattern strings that are keywords, and then obtains the order information of keywords in messages through two-stage closed sequential pattern mining.
(3) Message structure inference: based on the keyword sequences obtained in the keyword identification stage, the order, parallel, and hierarchical relationships among protocol keywords are analyzed in turn to obtain detailed message structure information.
With reference to the overall implementation flow shown in Fig. 1, the protocol format inference method of this embodiment mainly comprises three parts: message preprocessing, protocol keyword identification and keyword sequence extraction, and message structure inference. Each part is described below.
(1) Message preprocessing
Network-traffic-based protocol reverse engineering relies on the similarity among communication messages. Therefore, before protocol format inference, the captured network messages must be preprocessed so that messages with the same format are gathered together; their common features are then analyzed to infer the protocol format. During preprocessing, the network communication is first divided into sessions according to the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol type>. Within a session, messages are numbered in order of appearance, and messages with the same sequence number are treated as one class, since messages at the same position in a session are usually of the same kind. However, because packet loss, reordering, and similar situations can occur in network transmission, messages with the same sequence number are not necessarily of the same type. In that case the message payloads are extracted, a string matching algorithm computes the similarity between payloads to determine which messages are of the same type, and messages of the same type are gathered into one message group. After this processing, the messages in a message group share the same features and are of the same type; they are the input to the next stage.
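The payload similarity step can be sketched as follows. The text does not name a specific string matching algorithm, so this sketch assumes Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the greedy grouping scheme, the function names, and the threshold value 0.6 are likewise illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two payloads."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_payloads(payloads, threshold=0.6):
    """Greedy grouping: a payload joins the first group whose representative
    (its first member) is similar enough; otherwise it starts a new group."""
    groups = []
    for payload in payloads:
        for group in groups:
            if similarity(payload, group[0]) >= threshold:
                group.append(payload)
                break
        else:
            groups.append([payload])
    return groups

# Hypothetical payloads that share one sequence number in different sessions.
payloads = ["GET /a.html HTTP/1.0", "GET /b.html HTTP/1.0", "\x00\x01\x02\x03binary"]
message_groups = cluster_payloads(payloads)
```

A greedy single pass keeps the sketch short; an implementation facing noisier samples might prefer a proper clustering algorithm with a tuned threshold.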
(2) Protocol keyword identification and keyword sequence extraction
From the perspective of the communication protocol, a protocol entity program serializes the data to be transmitted into a byte stream for transmission over the network. Fixed-pattern strings in the byte stream often carry specific meanings: some identify the message type, such as a version number (for example "HTTP/1.0" in the HTTP protocol) or a protocol name; others convey control information, such as a command code (for example "GET" in the HTTP protocol). These fixed-pattern strings with specific meanings are called protocol keywords. A protocol specification usually defines a number of keywords, which occur frequently in the communication data as fixed-pattern strings; these strings can be extracted, and the protocol keywords identified, on the basis of frequent pattern mining.
After the message preprocessing stage, messages with the same format are placed in the same message group; Fig. 2 shows an example of a message group. Protocol keywords are determined on the basis of the closed frequent adjacent segments in the message group. Given a message group C_L and a minimum support threshold Min_Sup, if a string t contained in messages of the group has support in C_L greater than Min_Sup, and every other string containing t has support in C_L lower than that of t, then t is called a closed frequent adjacent segment. The support of t in C_L is the number of messages in C_L that contain the string t divided by the total number of messages in C_L; it reflects how frequently t occurs in C_L. Taking Fig. 2 as an example, the string GET occurs in all 5 messages, so its support is 1, whereas appending any adjacent character to GET gives a support below 1; with Min_Sup set to 0.9, GET is therefore a closed frequent adjacent segment. With Min_Sup set to 0.9, the closed frequent adjacent segments contained in Fig. 2 are those shown in Fig. 3.
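The definition above can be checked on a toy message group with the following brute-force Python sketch. It is illustrative only: the sample messages are hypothetical (they are not the messages of Fig. 2), the function names are assumptions, and enumerating every substring is feasible only for very small inputs.

```python
def support(segment, group):
    """Share of messages in the group that contain the segment."""
    return sum(segment in msg for msg in group) / len(group)

def substrings(msg):
    """All contiguous substrings of a message."""
    return {msg[i:j] for i in range(len(msg)) for j in range(i + 1, len(msg) + 1)}

def closed_frequent_segments(group, min_sup):
    candidates = set().union(*(substrings(msg) for msg in group))
    frequent = {}
    for seg in candidates:
        sup = support(seg, group)
        if sup > min_sup:
            frequent[seg] = sup
    # Closed: no longer string containing the segment has support >= its own.
    return sorted(
        seg for seg, sup in frequent.items()
        if not any(seg != other and seg in other and frequent[other] >= sup
                   for other in frequent)
    )

# Hypothetical message group of one type.
group = ["GET /a HTTP/1.0", "GET /b HTTP/1.0", "GET /c HTTP/1.0"]
closed = closed_frequent_segments(group, min_sup=0.9)
```

It is sufficient to look for equally supported superstrings among the frequent segments only, because any superstring whose support is at least that of a frequent segment is itself frequent.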
Because message data consists of long sequences, is dense, and is large in scale, the present invention uses a two-stage closed sequential pattern mining method, which improves efficiency and reduces memory consumption. The first stage is the segmentation stage, whose main task is to determine the protocol keywords from the closed frequent adjacent segments in the message group. The second stage is the pattern mining stage, whose main task is to obtain the relatively stable order relationships among protocol keywords with a closed sequential pattern mining algorithm.
The first step of the segmentation stage is to generate candidate character strings. Candidate character strings are extracted from the message group: an n-gram model splits each message into character strings whose characters keep their original order and adjacency. For a message group, the messages are cut with a fixed length, first with a cutting length of 1, then with the cutting length increased by 1 each round, and so on. For example, the first round of cutting the message segment GET /index.html yields the set of character strings of length 1, n1 = {G, E, T, " ", /, ..., h, t, m, l}; the second round yields the set of character strings of length 2, n2 = {GE, ET, "T ", ..., tm, ml}; and so on until n15 = {GET /index.html}.
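The candidate generation step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `ngrams` and `candidate_strings` are hypothetical, and the example assumes the message segment is `GET /index.html` with a space, which gives the 15 characters implied by n15.

```python
def ngrams(message: str, n: int) -> list[str]:
    """Return every substring of length n, keeping order and adjacency."""
    return [message[i:i + n] for i in range(len(message) - n + 1)]


def candidate_strings(message: str) -> dict[int, set[str]]:
    """Map each cutting length n (1..len(message)) to its set of n-grams."""
    return {n: set(ngrams(message, n)) for n in range(1, len(message) + 1)}


# Hypothetical usage on the example segment (15 characters with the space).
cands = candidate_strings("GET /index.html")
```

Sets are used per length because repeated n-grams (for example the two occurrences of `t`) only need to be counted once as candidates.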
The second step of the segmentation stage is candidate character string pruning. The messages in the message group have been discretized into character strings that are used to determine the protocol keywords in the messages; these character strings serve as candidate character segments. The set of candidate character segments may contain many segments that are repeated or that do not meet the basic requirements of a keyword. To reduce the redundant search space, the candidate character segments are pruned according to the frequent adjacency property. Pruning here means that if a candidate character segment does not meet the frequency requirement for a closed frequent adjacent segment, there is no need to judge whether any character string containing that candidate character segment is a closed frequent adjacent segment.
Frequency detection for a candidate character segment checks whether its support meets the minimum support threshold. If the number of messages in the message group that contain the candidate character segment, divided by the total number of messages in the message group, exceeds the minimum support threshold Min_Sup, the candidate character segment is added to the frequent character segment set for its length. Frequency detection starts with candidate character segments of length 1 and proceeds level by level, yielding in turn the frequent character segment set of length 1, the frequent character segment set of length 2, and so on up to the frequent character segment set of length k. If the support of a candidate character segment does not reach the minimum support threshold, the support of every character string containing that segment cannot exceed the threshold either. All character strings containing the segment can therefore be deleted from the character string sets generated by the n-gram model, and their support is not calculated. The point of candidate character string pruning is that when a candidate character segment is not frequent, none of the character strings containing it need frequency detection, which reduces computational overhead and improves processing efficiency.
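The level-by-level frequency detection with pruning can be sketched as below. The sketch is an assumption-laden illustration: `support` and `frequent_segments` are hypothetical names, messages are modeled as Python strings, and pruning is realized by keeping a length-k candidate only when both of its length-(k-1) substrings were frequent, which has the same effect as deleting every string that contains an infrequent segment.

```python
def support(segment: str, messages: list[str]) -> float:
    """Fraction of messages containing the segment as a contiguous substring."""
    return sum(segment in m for m in messages) / len(messages)


def frequent_segments(messages: list[str], min_sup: float) -> dict[int, set[str]]:
    """Return, per length k, the set of segments whose support exceeds min_sup."""
    levels: dict[int, set[str]] = {}
    prev = None  # frequent segments of the previous length
    k = 1
    while True:
        cands = {m[i:i + k] for m in messages for i in range(len(m) - k + 1)}
        if prev is not None:
            # Pruning: a segment containing an infrequent shorter segment
            # cannot be frequent, so its support is never computed.
            cands = {c for c in cands if c[:-1] in prev and c[1:] in prev}
        freq = {c for c in cands if support(c, messages) > min_sup}
        if not freq:
            return levels
        levels[k] = freq
        prev = freq
        k += 1


# Hypothetical message group of three HTTP request lines.
group = ["GET /a HTTP/1.0", "GET /b HTTP/1.0", "GET /c HTTP/1.0"]
levels = frequent_segments(group, 0.9)
```

The strict comparison `> min_sup` follows the wording of the paragraph above; a deployment that wants "not less than" would use `>=` instead.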
The third step of the segmentation stage is closure detection. A character string t is called a closed frequent adjacent segment if its support in the message group is greater than Min_Sup and every character string containing t has lower support in the message group than t. Closure detection therefore checks, for a given character segment, whether there is another character string that contains it and whose support is not lower than that of the given segment. If such a character string exists, the given character segment is not closed; otherwise it is closed. For a frequent character segment of length k, the result is obtained by searching only the frequent character segment set of length k+1. Because a protocol keyword is necessarily a closed frequent adjacent segment, frequent character segments that fail the closure requirement are not protocol keywords.
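Closure detection can be sketched as follows, under the assumption that the frequent segments are already available as a dictionary from length to segment set. The names `support` and `closed_frequent_segments` are hypothetical. Only the length-(k+1) set is searched: support never increases when a string grows, so a longer containing string with equal support implies one of length k+1 with equal support.

```python
def support(segment: str, messages: list[str]) -> float:
    """Fraction of messages containing the segment as a contiguous substring."""
    return sum(segment in m for m in messages) / len(messages)


def closed_frequent_segments(frequent: dict[int, set[str]],
                             messages: list[str]) -> set[str]:
    """Keep a length-k frequent segment only if no length-(k+1) frequent
    segment contains it with support that is not lower."""
    closed = set()
    for k, segments in frequent.items():
        longer = frequent.get(k + 1, set())
        for s in segments:
            s_sup = support(s, messages)
            if not any(s in t and support(t, messages) >= s_sup for t in longer):
                closed.add(s)
    return closed


# Hypothetical toy group: "a" occurs in 3 of 3 messages, "b" and "ab" in 2 of 3.
toy_messages = ["ab", "ab", "a"]
toy_frequent = {1: {"a", "b"}, 2: {"ab"}}
toy_closed = closed_frequent_segments(toy_frequent, toy_messages)
```

In the toy group, `b` is dropped because `ab` contains it with the same support, while `a` survives because every string containing it is rarer.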
After the above steps, the closed frequent adjacent segments in the message group are obtained. To improve the accuracy of keyword inference and avoid identifying ordinary character strings that merely occur frequently in messages as protocol keywords, the present invention uses two keyword recognition strategies.
First, a short character string necessarily occurs at least as often as a longer character string that contains it. For example, in HTTP protocol messages the support of the character string HOS is higher than that of the keyword HOST, because HOS appears not only inside the keyword HOST but may also arise from accidental combinations of characters. The first keyword recognition strategy used by the present invention is therefore: if a closed frequent adjacent segment is contained in another closed frequent adjacent segment, it is not a protocol keyword and can be filtered out.
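A minimal sketch of this first strategy, assuming the closed frequent adjacent segments are held in a Python set; the function name `drop_contained` is hypothetical.

```python
def drop_contained(closed: set[str]) -> set[str]:
    """Discard every closed frequent adjacent segment that is a substring
    of a different closed frequent adjacent segment."""
    return {s for s in closed if not any(s != t and s in t for t in closed)}


# Hypothetical input: HOS is contained in HOST and is therefore filtered out.
keywords = drop_contained({"HOS", "HOST", "GET"})
```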
Second, a closed frequent adjacent segment that lies at a fixed offset from the start of the message, or at a fixed offset from the end of the message, is usually a protocol keyword. For example, the first 4 bytes of the application-layer payload of the Xunlei (Thunder) protocol, 0x39, 0x00, 0x00, 0x00, carry protocol control information and are a keyword of that protocol. The second keyword recognition strategy used by the present invention is therefore: if a closed frequent adjacent segment has a fixed offset relative to the start or the end of the message, it is regarded as a protocol keyword.
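The fixed-offset test can be sketched as below. This is a simplification under stated assumptions: payloads are modeled as `bytes`, only the first occurrence of the segment in each message is considered, and the function name `fixed_offsets` and the sample payloads are hypothetical.

```python
def fixed_offsets(segment: bytes, messages: list[bytes]) -> tuple[bool, bool]:
    """Return (fixed_from_start, fixed_from_end) over the messages that
    contain the segment, looking at its first occurrence only."""
    from_start, from_end = set(), set()
    for m in messages:
        i = m.find(segment)
        if i >= 0:
            from_start.add(i)
            from_end.add(len(m) - (i + len(segment)))
    return (len(from_start) == 1, len(from_end) == 1)


# Hypothetical payloads that start with the 4 control bytes from the example.
header = bytes([0x39, 0x00, 0x00, 0x00])
payloads = [header + b"abc", header + b"defgh"]
at_start, at_end = fixed_offsets(header, payloads)
```

Here the segment always sits at offset 0 from the start, while its distance to the end varies with the payload length.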
The second stage of the two-stage closed sequential pattern mining method is the pattern mining stage. Starting from the closed frequent adjacent segments that the first stage inferred to be keywords, a closed sequential pattern mining algorithm generates closed frequent sequences composed of multiple keywords, finally yielding the set of closed frequent sequences in a message group and thus the relatively stable sequence relations between protocol keywords. A closed frequent sequence, denoted θ, is an ordered set of closed frequent adjacent segments in the message group. The segments in θ are arranged in the same order in which they occur in the messages; the support of θ in message group C_L must be greater than the configured minimum support threshold Min_Sup; and every other ordered set that contains θ must have lower support in C_L than θ. The support of θ in message group C_L is the number of messages in C_L that contain the ordered set θ divided by the total number of messages in C_L; it reflects how frequently the ordered set of closed frequent adjacent segments represented by θ occurs in C_L.
Taking Fig. 2 as an example with Min_Sup = 0.9, the keyword sequence <<GET>, <HTTP/1.0>, <Host:>> occurs in all 5 messages of the Fig. 2 message group, so its support is 1, and the support drops below 1 if any further keyword is added to the sequence, so it is a closed frequent sequence. Fig. 4 shows the closed frequent sequences of the Fig. 2 message group when Min_Sup is set to 0.9.
When processing messages, the pattern mining stage describes each message in the segmented form "key field + variable field", where the part between two key fields is treated as a variable field. Because the objects processed are messages segmented with keywords as the field boundaries, the amount of data to process is smaller than when single characters are the processing unit, and an existing closed sequential pattern mining algorithm can be used. The embodiment uses the BIDE algorithm, which yields all closed frequent sequences composed of keywords in the message group.
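The pattern mining stage can be illustrated with the sketch below. It is deliberately not BIDE: it is an exhaustive enumeration, exponential in the number of keywords per message and only suitable for small examples, that produces the same kind of output (closed frequent keyword sequences). All function names are hypothetical, and each message is reduced to its keywords in order of first occurrence.

```python
from itertools import combinations


def keyword_sequence(message: str, keywords: list[str]) -> list[str]:
    """Keywords present in the message, ordered by first occurrence."""
    hits = sorted((message.find(k), k) for k in keywords if k in message)
    return [k for _, k in hits]


def is_subsequence(pattern, seq) -> bool:
    """True if pattern occurs in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)


def closed_frequent_sequences(seqs: list[list[str]], min_sup: float) -> set[tuple]:
    """Brute-force stand-in for a closed sequential pattern miner."""
    # Every ordered subsequence of every message's keyword sequence.
    cands = {tuple(c) for s in seqs
             for r in range(1, len(s) + 1) for c in combinations(s, r)}
    freq = {}
    for p in cands:
        sup = sum(is_subsequence(p, s) for s in seqs) / len(seqs)
        if sup > min_sup:
            freq[p] = sup
    # Closed: no longer frequent sequence contains p with equal support.
    return {p for p, sp in freq.items()
            if not any(len(q) > len(p) and sq >= sp and is_subsequence(p, q)
                       for q, sq in freq.items())}


# Hypothetical group of 5 messages that all reduce to the same keyword sequence.
seqs = [["GET", "HTTP/1.0", "Host:"]] * 5
patterns = closed_frequent_sequences(seqs, 0.9)
```

Working on keyword sequences rather than raw characters is what keeps the candidate space small enough for a sequence miner, which is the design point made in the paragraph above.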
Overall, the main advantage of the two-stage closed sequential pattern mining method is that, before the closed sequential pattern mining algorithm mines frequent sequences, the first stage identifies the protocol keywords and segments the messages by them, describing each message in the segmented form "key field + variable field". Segmentation reduces the amount of data the second stage has to process: the unit by which a frequent sequence is extended in each check is a key field rather than a single character, which improves computational efficiency and makes the method suitable for analyzing larger message samples. In addition, the closed frequent sequences inferred in the second stage directly reflect the relatively stable sequence relations of keywords within messages and lay the foundation for the next step, message structure inference.
(3) Message structure inference
The preceding steps yield all closed frequent sequences in a message group C_L whose messages share the same message type; these closed frequent sequences reflect the sequence relations of the keywords in the messages of C_L. Message structure inference uses the extracted keyword sequences to distinguish the ordinal, parallel, and hierarchical relations between different keywords and thereby obtains detailed message structure information.
Message structure inference first infers the ordinal relations between keywords, that is, which keywords must appear before which other keywords. The ordinal relations are inferred by intersecting the closed frequent sequences in message group C_L: if some keywords always occur before certain other keywords, a strict ordinal relation holds between them. For example, in the closed frequent sequences shown in Fig. 4, the keyword GET always precedes the keyword HTTP/1.0, so there is a strict ordinal relation between GET and HTTP/1.0.
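One way to read the ordinal-relation step is sketched below. The interpretation is an assumption: a pair (a, b) is reported when a and b occur together in at least one closed frequent sequence and a precedes b in every sequence that contains both. The function name `strict_order` and the two sample sequences are hypothetical and are not taken from Fig. 4.

```python
from itertools import permutations


def strict_order(sequences: list[tuple[str, ...]]) -> set[tuple[str, str]]:
    """Pairs (a, b) where a precedes b in every sequence containing both."""
    keywords = {k for s in sequences for k in s}
    ordered = set()
    for a, b in permutations(keywords, 2):
        both = [s for s in sequences if a in s and b in s]
        if both and all(s.index(a) < s.index(b) for s in both):
            ordered.add((a, b))
    return ordered


# Hypothetical closed frequent sequences of an HTTP request message group.
sample = [("GET", "HTTP/1.0", "Host:"), ("GET", "HTTP/1.0", "User-Agent:")]
order = strict_order(sample)
```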
Next, the parallel relations between keywords are determined. If two keywords can appear in the same message but never appear in the same closed frequent sequence, the two keywords are in a parallel relation. For example, in the HTTP protocol the keywords Host: and User-Agent: appear in the same message but never appear together in any closed frequent sequence, so Host: and User-Agent: are judged to be in a parallel relation.
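The parallel-relation test can be sketched directly from its two conditions. The function name `parallel_pairs` and the sample messages and sequences are hypothetical; messages are modeled as strings and closed frequent sequences as tuples of keywords.

```python
from itertools import combinations


def parallel_pairs(messages: list[str], keywords: list[str],
                   sequences: list[tuple[str, ...]]) -> set[tuple[str, str]]:
    """Keyword pairs that share a message but never a closed frequent sequence."""
    pairs = set()
    for a, b in combinations(sorted(keywords), 2):
        in_same_message = any(a in m and b in m for m in messages)
        in_same_sequence = any(a in s and b in s for s in sequences)
        if in_same_message and not in_same_sequence:
            pairs.add((a, b))
    return pairs


# Hypothetical group where the two header fields appear in either order.
http_messages = ["GET / HTTP/1.0\r\nHost: a\r\nUser-Agent: x",
                 "GET / HTTP/1.0\r\nUser-Agent: y\r\nHost: b"]
http_sequences = [("GET", "Host:"), ("GET", "User-Agent:")]
parallel = parallel_pairs(http_messages, ["GET", "Host:", "User-Agent:"], http_sequences)
```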
The final step is to determine the hierarchical relations between keywords. In protocol messages, the format distinguisher (FD) field is closely tied to the message format that follows it: different values of the FD field imply different subsequent message formats. For example, in the eMule protocol, when the FD field (the opcode) is the keyword "0x58", the associated message format is a sequence of keywords such as "file ID" and "file status", whereas when the FD field is 0x59, the associated message format is a sequence of keywords such as "file ID" and "Name Length". The analysis of hierarchical relations aims to determine which keywords in a message are FD fields and to obtain the keyword sequences associated with each FD field.
FD fields are identified mainly with an association analysis algorithm from sequential pattern mining, using confidence as the criterion. Given a message L_i in message group C_L and two non-overlapping subsequences t_i and t_j of L_i, the confidence of t_i with respect to t_j in C_L is written Conf_CL(t_i → t_j). Its value equals the number of messages in C_L that contain both t_i and t_j divided by the number of messages in C_L that contain t_i. The higher the confidence of t_i with respect to t_j in C_L, the more likely t_j is to appear when t_i appears in C_L. FD field identification uses the rule generation procedure of the Apriori algorithm, an association analysis algorithm commonly used in sequential pattern mining. The message group is searched with the following two conditions: (1) Conf_CL(k_i → k_m, …, k_n) ≥ Min_Conf; (2) Conf_CL(k_m, …, k_n → k_i) ≥ Min_Conf. A k_i that satisfies both is regarded as an FD field, and k_m, …, k_n as the keyword sequence corresponding to k_i.
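The confidence measure and the two-sided FD test can be sketched as follows. The sketch checks one given candidate pair rather than running Apriori rule generation over all candidates; the function names, the Min_Conf value of 0.9, and the toy eMule-style message group are assumptions made for illustration. Messages are modeled as tuples of keywords.

```python
def contains(message: tuple, pattern: tuple) -> bool:
    """True if pattern occurs in message as an ordered subsequence."""
    it = iter(message)
    return all(k in it for k in pattern)


def confidence(messages: list[tuple], t_i: tuple, t_j: tuple) -> float:
    """Messages containing both t_i and t_j divided by messages containing t_i."""
    with_ti = [m for m in messages if contains(m, t_i)]
    if not with_ti:
        return 0.0
    return sum(contains(m, t_j) for m in with_ti) / len(with_ti)


def is_fd_field(messages: list[tuple], k_i: str, seq: tuple, min_conf: float) -> bool:
    """Both directions of the rule must reach the confidence threshold."""
    return (confidence(messages, (k_i,), seq) >= min_conf
            and confidence(messages, seq, (k_i,)) >= min_conf)


# Hypothetical group: three 0x58 messages and two 0x59 messages.
emule = ([("0x58", "file ID", "file status")] * 3
         + [("0x59", "file ID", "Name Length")] * 2)
fd_full = is_fd_field(emule, "0x58", ("file ID", "file status"), 0.9)
fd_partial = is_fd_field(emule, "0x58", ("file ID",), 0.9)
```

Requiring both directions is what separates an FD field from a keyword that merely co-occurs: `file ID` alone also follows 0x59, so the reverse confidence towards 0x58 is only 3/5.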
The above steps yield the ordinal, parallel, and hierarchical relations between protocol keywords and thus detailed message structure information.
In conclusion the present invention based on closed sequential pattern excavate protocol format estimating method, collect with it is unknown
On the basis of protocol-dependent network communication message, communication message is implemented using two-stage closed mode Mining Strategy to be closed sequence
Row mode excavation, infer communication message in protocol keyword and generate reflection keyword sequences relationship keyword sequences,
Sequence, arranged side by side and hierarchical relationship on the basis of this between extraction keyword, obtain message structure information.The present invention passes through network
Flow carries out closed sequential pattern and excavates to accurately identify keyword, can avoid the noise effect in message sample.Using
Keyword determination method in the present invention, it is desirable that be closed the frequency of occurrences of the frequent adjacent segments in the identical message group of type of message
Not less than the minimum support threshold value of setting, and noise is often the appearance of low probability in messages, and the noise in message will not
Influence the excavation for being closed frequent adjacent segments.In addition, two stage message closed sequential pattern method for digging can solve existing close
The problems such as closing huge Sequential Pattern Mining Algorithm memory consumption, calculating overlong time, improves working efficiency, reduces memory consumption,
It is suitable for the analysis to large-scale network traffic.Furthermore the present invention infers protocol keyword on the basis of being closed Frequent episodes
Between ordinal relation, parallel relation, hierarchical relationship, obtain accurate protocol format information, improve the reverse result of agreement
Application value.
Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. A person of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.
Claims (1)
1. A protocol format inference method based on closed sequential pattern mining, characterized in that it comprises the following steps:
(1) Message preprocessing: the message samples are divided into sessions by the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol type>; protocol messages are then extracted from each session, and messages of the same type are grouped into one class to facilitate analysis.
(2) Protocol keyword identification and keyword sequence extraction: a two-stage closed sequential pattern mining method is applied. The first stage is the segmentation stage: for a message group that gathers messages of the same type, closed frequent adjacent segments are extracted from the message group by closed sequential pattern mining, and keyword recognition strategies decide whether these closed frequent adjacent segments are protocol keywords. The second stage is the pattern mining stage: starting from the closed frequent adjacent segments inferred to be keywords in the first stage, a closed sequential pattern mining algorithm generates closed frequent sequences composed of multiple keywords, finally yielding all closed frequent sequences in a message group.
(3) Message structure inference: based on the extracted keyword sequences, the ordinal, parallel, and hierarchical relations between different keywords are distinguished to obtain detailed message structure information.
The workflow of the aforementioned message preprocessing stage is as follows: the message samples are divided into sessions by the five-tuple <source IP address, destination IP address, source port, destination port, transport layer protocol type>, the messages in each session are numbered in order of occurrence, and messages with the same sequence number are treated as one class. Because packet loss, reordering, and similar factors can occur in network transmission, an identical sequence number alone does not guarantee that messages are of the same type. The message payloads are therefore also extracted, messages of the same type are found by computing the similarity between payloads, and messages of the same type are gathered into one message group. After this processing, the messages in a message group have the same type and share the same characteristics, and can serve as the basis for the next stage of analysis.
The workflow of the aforementioned protocol keyword identification and keyword sequence extraction stage is as follows: a two-stage closed sequential pattern mining method is applied. The first stage is the segmentation stage: for a message group that gathers messages of the same type, closed frequent adjacent segments are extracted from the message group by closed sequential pattern mining, and keyword recognition strategies decide whether these closed frequent adjacent segments are keywords. The first step of the segmentation stage is to generate candidate character strings: candidate character strings are extracted from the message group, with an n-gram model splitting each message into character strings whose characters keep their original order and adjacency. The second step of the segmentation stage is candidate character string pruning: the messages in the message group have been discretized into character strings that are used to determine the protocol keywords in the messages, and these character strings serve as candidate character segments. The set of candidate character segments may contain many segments that are repeated or that do not meet the basic requirements of a keyword, so to reduce the redundant search space the candidate character segments are pruned according to the frequent adjacency property. In pruning, if the support of a candidate character segment does not reach the minimum support threshold, the support of every character string containing that segment cannot exceed the threshold either; all character strings containing the segment are therefore deleted from the character string sets of different lengths generated by the n-gram model, and their support is not calculated. The third step of the segmentation stage is closure detection: for a given character segment, it is checked whether there is a character string that contains the segment and whose support is not lower than that of the segment. Because a protocol keyword is necessarily a closed frequent adjacent segment, frequent character segments that fail the closure requirement are not protocol keywords. The second stage of the two-stage closed sequential pattern mining method is the pattern mining stage: starting from the closed frequent adjacent segments inferred to be keywords in the first stage, it processes the messages segmented with keywords as the field boundaries, and a closed sequential pattern mining algorithm generates closed frequent sequences composed of multiple keywords, finally yielding all closed frequent sequences in a message group and the relatively stable sequence relations between protocol keywords.
The workflow of the aforementioned message structure inference stage is as follows: message structure inference first determines the ordinal relations between keywords, that is, which keywords must appear before which other keywords. The ordinal relations are determined by intersecting the closed frequent sequences in the message group: if some keywords always occur before certain other keywords, a strict ordinal relation holds between them. Next, the parallel relations between keywords are determined: if some keywords can appear in the same message but never appear in the same closed frequent sequence, these keywords are judged to be in a parallel relation. Finally, the hierarchical relations between keywords are analyzed. In protocol messages, the format distinguisher (FD) field is closely tied to the message format that follows it: different values of the FD field imply different subsequent message formats. The analysis of hierarchical relations aims to determine which keywords in a message are FD fields and to obtain the keyword sequences associated with each FD field. Because an FD field and its associated keyword sequence usually occur in pairs in messages, an association analysis algorithm from the field of sequential pattern mining is used to search the message group on the basis of confidence and determine the FD fields and their associated keyword sequences.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810450347.4A CN108667839A (en) | 2018-05-11 | 2018-05-11 | A kind of protocol format estimating method excavated based on closed sequential pattern |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108667839A true CN108667839A (en) | 2018-10-16 |
Family
ID=63779124
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20181016 |