CN110336817B - Unknown protocol frame positioning method based on TextRank - Google Patents

Info

Publication number
CN110336817B
Authority
CN
China
Prior art keywords
node
sequence
state
weight
station
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910609097.9A
Other languages
Chinese (zh)
Other versions
CN110336817A (en)
Inventor
刘治国
宋广跃
蔡文珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN201910609097.9A priority Critical patent/CN110336817B/en
Publication of CN110336817A publication Critical patent/CN110336817A/en
Application granted granted Critical
Publication of CN110336817B publication Critical patent/CN110336817B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers

Abstract

The invention discloses an unknown protocol frame positioning method based on TextRank, which introduces the idea of TextRank into the conventional unknown protocol frame positioning process. The method counts the occurrence frequency of each sequence in the data to determine the voting weight of each node in the bit stream, votes in the manner of the TextRank algorithm to determine the key sequences in the protocol data, and finally segments the bit stream according to the key sequences, calculates the sequence similarity between the resulting segments, and determines the frame header position of the unknown protocol data. With this method, unknown protocol data can be analyzed quickly and effectively, and the position of each frame in the bit stream data can be located accurately.

Description

Unknown protocol frame positioning method based on TextRank
Technical Field
The invention belongs to the field of communication, and particularly relates to an unknown protocol frame positioning method based on TextRank.
Background
With the continuous development of computer network technology, more and more proprietary protocols are applied in data transmission, and such proprietary protocols usually have fixed formats but are not disclosed. Therefore, the research on the proprietary protocol to analyze the format of the proprietary protocol has great significance for constructing a safe network environment. Under the condition that the protocol is known, a receiving party in communication can determine the position of a frame header through a frame synchronization code, and analyze bit stream data according to the protocol format. However, for unknown protocols, a listener cannot perform effective analysis after acquiring communication data of the other party. Currently, how to identify a private protocol from intercepted communication data is an important research topic, and determining a frame header and a frame tail in bitstream data to obtain a complete frame is a primary problem in the research.
Disclosure of Invention
To solve the prior-art problem that frames are difficult to delimit, the invention provides an Unknown Protocol Frame positioning Method (UPFLM) based on TextRank. For unknown protocol data in bit stream form, the method can accurately mine the key sequences in the data, distinguish the frame header sequence and its position among the key sequences, and complete frame positioning and segmentation accordingly.
In order to achieve the purpose, the technical scheme of the application is as follows: an unknown protocol frame positioning method based on TextRank comprises the following steps:
Step one: enumerate all target sequences of the given target sequence length n and construct a dictionary array;
Step two: identify the sequence formed by the first n bits of the bit stream B, and update the corresponding current state station and its importance value;
Step three: read in the value new_bit of the next bit, jump to the next state new_station according to the state jump function, and update the importance value of new_station;
Step four: if the bit stream B has been read completely, go to step five; otherwise, repeat step three until the bit stream B has been read completely;
Step five: sort the states by the importance value in each two-tuple and output the state information.
Further, the state jump function is:
$$\mathrm{new\_station} = (\mathrm{station} \bmod 2^{n-1}) \times 2 + \mathrm{new\_bit}$$
In the formula, n denotes the target sequence length, new_bit denotes the value of the next bit, station denotes the current state, and new_station denotes the state to jump to.
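A minimal sketch of the state jump function, reading the modulus as 2 to the power n-1 and treating each state as the integer value of an n-bit window (both are interpretations of the formula above; the function name `next_state` is chosen here only for illustration). Taking the remainder drops the oldest bit of the window, doubling shifts the remaining bits left, and adding new_bit appends the bit just read.

```python
def next_state(station, new_bit, n):
    """Slide an n-bit window by one position.

    station -- integer value of the current n-bit window
    new_bit -- the bit just read (0 or 1)
    n       -- target sequence length
    """
    # station % 2**(n-1) keeps the lowest n-1 bits, i.e. drops the oldest bit.
    return (station % (1 << (n - 1))) * 2 + new_bit


# With n = 3, the window 101 followed by the bit 1 becomes the window 011.
state = next_state(0b101, 1, n=3)
print(format(state, "03b"))
```

Because the next state depends only on the current state and the incoming bit, no lookup can fail, which is consistent with a counting pass that needs no failure pointer.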
Further, the output state information is processed as follows: calculating the initial weight of the state, calculating the weight of the node and extracting the key sequence.
Further, the purpose of calculating the initial weight of the state is to set the voting weight of the state, and the calculation process is as follows:
[Formula image BDA0002121757690000021: definition of VW(station_i); not reproduced in this text]
where station_i denotes the state corresponding to the sequence of length n that starts at the i-th bit of the bit stream B, VW(station_i) denotes the voting weight of station_i, P(station_i) denotes the actual frequency of occurrence of station_i in the bit stream B, and P_average is the expected frequency of occurrence of a sequence of length n.
Further, the node weight is calculated by letting the neighboring nodes of a node vote for it, which yields the weight WS of the node in the bit stream B; WS is given by the following formula:
[Formula image BDA0002121757690000031: definition of WS(node_i); not reproduced in this text]
In the formula, node_i denotes the node of length n that starts at the i-th bit of the bit stream; WS is the weight of the node; station_t denotes the state corresponding to node_i; VM is the initial weight of the state; d denotes the damping coefficient, i.e. the probability that a node points to any other node, and is usually set to the empirical value 0.85.
Further, once the weights of the states in the bit stream have been calculated, a key sequence in the bit stream necessarily appears as a run of consecutive states with relatively high weights, so long key sequences can be extracted by the following steps:
Step 1: find the maximum value max_WS among the weights WS of all nodes;
Step 2: traverse the nodes in the bit stream B in order;
Step 3: if the weight of the current node is greater than 0.75 × max_WS, judge the sequence corresponding to the node to be a key sequence and execute step 4; otherwise, return to step 2;
Step 4: if the weight of the next node is also greater than 0.75 × max_WS, merge the two sequences into one according to their positions and repeat step 4; otherwise, go to step 5;
Step 5: store the obtained key sequence and record its starting position; if the traversal is finished, go to step 6; otherwise, return to step 2;
Step 6: output the obtained long key sequence information.
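The six extraction steps can be sketched as a single scan over the node weights. This is an illustrative sketch, not the patented implementation: the function name `extract_key_sequences`, the string representation of the bit stream, and the list `ws` of precomputed node weights are all assumptions made for the example. It also assumes `ws` is non-empty.

```python
def extract_key_sequences(bits, ws, n, ratio=0.75):
    """Merge runs of consecutive high-weight nodes into long key sequences.

    bits  -- the bit stream as a string of '0' and '1'
    ws    -- ws[i] is the weight of the node that starts at bit i and spans n bits
    n     -- node (target sequence) length
    Returns a list of (start_position, sequence) pairs.
    """
    threshold = ratio * max(ws)              # Step 1: 0.75 * max_WS
    found = []
    i = 0
    while i < len(ws):                       # Step 2: traverse nodes in order
        if ws[i] > threshold:                # Step 3: node begins a key sequence
            j = i
            while j + 1 < len(ws) and ws[j + 1] > threshold:
                j += 1                       # Step 4: absorb the next node
            # Nodes i..j overlap by n-1 bits, so together they cover bits i..j+n-1.
            found.append((i, bits[i:j + n])) # Step 5: store sequence and start
            i = j + 1
        else:
            i += 1
    return found                             # Step 6: output


bits = "0011010"
ws = [0.1, 0.9, 1.0, 0.2, 0.8]               # one weight per 3-bit node
print(extract_key_sequences(bits, ws, n=3))
```

Merging by position means that two adjacent nodes of length n contribute only one new bit to the merged sequence, which is why the slice ends at j + n rather than at a multiple of n.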
Further, the obtained long key sequence information is used for segmentation, which yields a plurality of sequences. Building on the similarity between two sequences, the average similarity among these sequences is taken as the criterion, and the key sequences are sorted in descending order of average similarity. The average similarity among multiple sequences is as follows:
[Formula image BDA0002121757690000041: definition of dist_average; not reproduced in this text]
In the formula, dist_average is the average similarity among the sequences obtained after segmentation by the key sequence, k is the number of sequences obtained after segmentation, and ComTime denotes the number of comparisons. After sorting, the key sequence with the highest average similarity is located at the frame header, and frame positioning and segmentation can then be completed according to that sequence.
By adopting the above technical solution, the invention achieves the following technical effects: unknown protocol data can be analyzed quickly and effectively, and the position of each frame in the bit stream data can be located accurately. The method addresses the speed of sequence statistics when a large amount of bit stream data is collected, as well as the difficulty of determining the frame start position when dealing with unknown protocol data.
Drawings
FIG. 1 is a statistical time comparison of UPFLM method for target sequences of different lengths;
FIG. 2 is a graph showing a statistical time comparison of target sequences of different lengths in different data sets;
FIG. 3 is a partial node weight graph obtained by the UPFLM method;
FIG. 4 is a comparison graph of frame alignment accuracy of the UPFLM method.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it should be understood that the described examples are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
In the method, the sequence statistics process takes full account of the characteristics of protocol data in bit stream form, and the conventional AC (Aho-Corasick) algorithm is improved so that it counts all target sequences of a specified length that appear in the bit stream. To cover all possible target sequences, the target sequences are generated by enumerating all sequences of length n and are stored in an array that serves as the dictionary, replacing the Trie tree to reduce space usage; each target sequence is defined as a state. Because a match can never fail during the statistics, the failure pointer plays no role in the process. To realize jumps between states, a state jump function is proposed that combines the current state information with the data read in:
$$\mathrm{new\_station} = (\mathrm{station} \bmod 2^{n-1}) \times 2 + \mathrm{new\_bit}$$
In the formula, n denotes the target sequence length, new_bit denotes the value of the next bit, station denotes the current state, and new_station denotes the state to jump to.
The specific process is as follows:
Step one: enumerate all target sequences of the given target sequence length n and construct a dictionary array;
Step two: identify the sequence formed by the first n bits of the bit stream B, and update the corresponding current state station and its importance value;
Step three: read in the value new_bit of the next bit, jump to the next state new_station according to the state jump function, and update the importance value of new_station;
Step four: if the bit stream B has been read completely, go to step five; otherwise, repeat step three until the bit stream B has been read completely;
Step five: sort the states by the importance value in each two-tuple and output the state information.
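Steps one to five amount to a single pass over the bit stream with a sliding n-bit window. The sketch below is a hedged illustration: it assumes that the "importance value" of a state is simply its occurrence count, that the dictionary array is indexed by the integer value of the sequence, and that the bit stream is given as a string; the name `count_sequences` is hypothetical.

```python
def count_sequences(bits, n):
    """Count every length-n sequence in a bit string in one pass.

    Returns (state, count) pairs sorted by descending count,
    omitting states that never occur.
    """
    if len(bits) < n:
        return []
    counts = [0] * (1 << n)              # Step one: dictionary array, index = state
    station = int(bits[:n], 2)           # Step two: state of the first n bits
    counts[station] += 1
    modulus = 1 << (n - 1)
    for ch in bits[n:]:                  # Steps three and four: one jump per bit
        new_bit = int(ch)
        station = (station % modulus) * 2 + new_bit
        counts[station] += 1
    # Step five: sort the (state, importance) pairs by importance.
    ranked = sorted(enumerate(counts), key=lambda pair: pair[1], reverse=True)
    return [(state, count) for state, count in ranked if count > 0]


print(count_sequences("0101010", 2))
```

The array needs 2^n counters regardless of the input, so this layout trades memory that grows exponentially in n for constant-time state transitions.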
To obtain the key sequences in the bit stream from the sequence statistics, the idea of TextRank, a keyword extraction method from natural language processing, is introduced: the weight of each node is obtained by the voting principle, and key sequences are extracted in the same way that TextRank extracts keywords. Because the data is in bit stream form, the method consists of three parts: state initial weight calculation, node weight calculation, and key sequence extraction.
The purpose of the state initial weight calculation is to set the voting weight of the state, and the calculation process is as follows:
[Formula image BDA0002121757690000061: definition of VW(station_i); not reproduced in this text]
where station_i denotes the state corresponding to the sequence of length n that starts at the i-th bit of the bit stream B, VW(station_i) denotes the voting weight of station_i, P(station_i) denotes the actual frequency of occurrence of station_i in the bit stream B, and P_average is the expected frequency of occurrence of a sequence of length n.
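The exact expression for VW is not reproduced in this text, so the following is only an illustrative sketch under two loudly stated assumptions: that the voting weight is the ratio of the observed frequency to the expected frequency, and that the expected frequency of any length-n sequence is 1/2^n (uniformly random bits). Neither assumption is confirmed by the source, and the function name `voting_weights` is hypothetical.

```python
def voting_weights(counts, n):
    """Illustrative voting weight per state.

    counts -- counts[s] is how often state s occurs in the bit stream
    n      -- target sequence length (len(counts) is expected to be 2**n)
    ASSUMPTION: VW = P(station) / P_average, with P_average = 1 / 2**n.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    p_average = 1.0 / (1 << n)           # assumed expected frequency
    return [(c / total) / p_average for c in counts]


# States 01 and 10 each make up half of the windows, twice the expected share.
print(voting_weights([0, 3, 3, 0], n=2))
```

Under this reading, a weight above 1 marks a sequence that occurs more often than chance would suggest, which is the property a frame header candidate is expected to have.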
The node weight is calculated by letting the neighboring nodes of a node vote for it, which yields the weight WS of the node in the bit stream B; WS may be represented by the following formula:
[Formula image BDA0002121757690000062: definition of WS(node_i); not reproduced in this text]
In the formula, node_i denotes the node of length n that starts at the i-th bit of the bit stream; WS is the weight of the node; station_t denotes the state corresponding to node_i; VM is the initial weight of the state; d denotes the damping coefficient, i.e. the probability that a node points to any other node, and is usually set to the empirical value 0.85.
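For orientation only: the formula image for WS is not reproduced in this text, so the expression below is the weighted TextRank update as it is usually stated in the keyword extraction literature, not necessarily the exact expression used in this method. The symbols V_i, In, Out and w_ji belong to that general formulation and are not defined in this document; d is the damping coefficient.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```

In the setting described here, the nodes would be the length-n windows of the bit stream and the edge weights would presumably be derived from the initial state weights, but that mapping is an assumption rather than something the text states.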
Once the weights of the states in the bit stream have been calculated, a key sequence in the bit stream necessarily appears as a run of consecutive states with relatively high weights, so long key sequences can be extracted by the following process:
Step 1: find the maximum value max_WS among the weights WS of all nodes;
Step 2: traverse the nodes in the bit stream B in order;
Step 3: if the weight of the current node is greater than 0.75 × max_WS, judge the sequence corresponding to the node to be a key sequence and execute step 4; otherwise, return to step 2;
Step 4: if the weight of the next node is also greater than 0.75 × max_WS, merge the two sequences into one according to their positions and repeat step 4; otherwise, go to step 5;
Step 5: store the obtained key sequence and record its starting position; if the traversal is finished, go to step 6; otherwise, return to step 2;
Step 6: output the obtained long key sequence information.
The obtained long key sequence information is used for segmentation, which yields a plurality of sequences. Building on the similarity between two sequences, the average similarity among these sequences is taken as the criterion, and the key sequences are sorted in descending order of average similarity. The average similarity among multiple sequences is as follows:
[Formula image BDA0002121757690000071: definition of dist_average; not reproduced in this text]
In the formula, dist_average is the average similarity among the sequences obtained after segmentation by the key sequence, k is the number of sequences obtained after segmentation, and ComTime denotes the number of comparisons. After sorting, the key sequence with the highest average similarity is located at the frame header, and frame positioning and segmentation can then be completed according to that sequence.
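The averaging step can be illustrated as follows. Since neither the pairwise similarity measure nor the formula for dist_average is reproduced in this text, the sketch makes its own assumptions: similarity between two segments is the fraction of equal bits over the shorter length, every unordered pair is compared once so that ComTime = k(k-1)/2, and the names `pair_similarity` and `average_similarity` are hypothetical.

```python
from itertools import combinations


def pair_similarity(a, b):
    """ASSUMED measure: share of positions with equal bits over the shorter length."""
    m = min(len(a), len(b))
    if m == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / m


def average_similarity(segments):
    """Mean pairwise similarity over all unordered pairs of the k segments."""
    pairs = list(combinations(segments, 2))
    com_time = len(pairs)                # assumed: ComTime = k * (k - 1) / 2
    if com_time == 0:
        return 0.0
    return sum(pair_similarity(a, b) for a, b in pairs) / com_time


# Segments produced by cutting the bit stream at one candidate key sequence.
print(average_similarity(["1010", "1010", "1000"]))
```

A candidate whose segments all begin with near-identical header fields would score high under such a measure, which matches the stated rule that the candidate with the highest average similarity marks the frame header.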
The effect of the present invention is demonstrated by way of another example.
The UPFLM method is implemented on the Visual Studio 2015 platform. Communication data of the same host at different times are captured with Wireshark, and the captured data packets are converted into a continuous bit stream to generate the experimental data sets. Data sets J1 and J3 are TCP protocol data, J2 is UDP protocol data, and J4 is mixed TCP/UDP protocol data. The numbers of the data packets contained in J1, J2, J3 and J4 are 500, 1000 and 1500 respectively.
The statistics of sequences of different lengths were first performed in data set J1 and compared with the modified AC algorithm and the conventional AC algorithm, and the statistical time was as shown in fig. 1. In order to observe the influence of the size of the data set on the sequence statistical process, experiments are respectively carried out under different target sequence lengths by using different data sets, and the experimental results are shown in fig. 2.
In order to further verify the effectiveness of the UPFLM method, data obtained by performing sequence statistics on the data set J1 are processed to obtain the partial node weight condition of each data set, which is shown in FIG. 3.
In the experiment, the accuracy is used as the measurement standard of the frame positioning method, and the frame positioning accuracy R can be represented by the following formula:
$$R = \frac{F_{recog}}{F_{total}}$$
In the formula, F_recog denotes the number of accurately positioned frames and F_total denotes the number of frames contained in the data set.
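Reading R as the fraction of frames that are positioned accurately, the metric can be computed from two lists of frame start offsets. In this sketch a frame counts as accurately positioned only when a predicted start offset equals a true start offset exactly; that matching rule and the function name `frame_positioning_accuracy` are assumptions made for illustration.

```python
def frame_positioning_accuracy(predicted_starts, true_starts):
    """Return F_recog / F_total for predicted and reference frame start offsets."""
    true_set = set(true_starts)
    if not true_set:
        raise ValueError("the data set must contain at least one frame")
    f_recog = len(true_set & set(predicted_starts))  # exactly matched frames
    f_total = len(true_set)
    return f_recog / f_total


# Three of the four reference frames are located at the correct bit offset.
print(frame_positioning_accuracy([0, 64, 130, 192], [0, 64, 128, 192]))
```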
The comparison graph of frame alignment accuracy of the UPFLM method obtained after simulation is shown in FIG. 4.
In summary, the following results can be obtained:
1. The invention improves the conventional AC algorithm so that it is better suited to counting target sequences of a specified length in bit stream data.
2. The invention addresses the difficulty of determining the frame start position when dealing with unknown protocol data.
3. The invention effectively shortens the sequence statistics time, thereby speeding up frame positioning, and effectively improves the accuracy of frame positioning.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, it should be noted that, for those skilled in the art, many modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (2)

1. An unknown protocol frame positioning method based on TextRank is characterized by comprising the following steps:
step one: enumerating all target sequences of the given target sequence length n, and constructing a dictionary array;
step two: identifying the sequence formed by the first n bits of the bit stream B, and updating the corresponding current state station and its importance value;
step three: reading in the value new_bit of the next bit, jumping to the next state new_station according to a state jump function, and updating the importance value of new_station;
step four: if the bit stream B has been read completely, going to step five; otherwise, repeating step three until the bit stream B has been read completely;
step five: sorting the states by the importance value in each two-tuple, and outputting the state information;
the state jump function is:
$$\mathrm{new\_station} = (\mathrm{station} \bmod 2^{n-1}) \times 2 + \mathrm{new\_bit}$$
in the formula: n denotes the target sequence length, new_bit denotes the value of the next bit, station denotes the current state, and new_station denotes the state to jump to;
the output state information is processed as follows: calculating state initial weight, calculating node weight and extracting a key sequence;
the purpose of calculating the initial weight of the state is to set the voting weight of the state, and the calculation process is as follows:
[Formula image FDA0003039640030000011: definition of VW(station_i); not reproduced in this text]
where station_i denotes the state corresponding to the sequence of length n that starts at the i-th bit of the bit stream B, VW(station_i) denotes the voting weight of station_i, P(station_i) denotes the actual frequency of occurrence of station_i in the bit stream B, and P_average is the expected frequency of occurrence of a sequence of length n;
the node weight is calculated by letting the neighboring nodes of a node vote for it, which yields the weight WS of the node in the bit stream B, where WS is given by the following formula:
[Formula image FDA0003039640030000021: definition of WS(node_i); not reproduced in this text]
in the formula: node_i denotes the node of length n that starts at the i-th bit of the bit stream; WS is the weight of the node; station_t denotes the state corresponding to node_i; VM is the initial weight of the state; d denotes the damping coefficient, i.e. the probability that a node points to any other node;
the extraction of long key sequences is carried out by the following steps:
step 1: finding the maximum value max_WS among the weights WS of all nodes;
step 2: traversing the nodes in the bit stream B in order;
step 3: if the weight of the current node is greater than 0.75 × max_WS, judging the sequence corresponding to the node to be a key sequence and executing step 4; otherwise, returning to step 2;
step 4: if the weight of the next node is also greater than 0.75 × max_WS, merging the two sequences into one according to their positions and repeating step 4; otherwise, going to step 5;
step 5: storing the obtained key sequence and recording its starting position; if the traversal is finished, going to step 6; otherwise, returning to step 2;
step 6: outputting the obtained long key sequence information.
2. The TextRank-based unknown protocol frame positioning method according to claim 1, characterized in that the obtained long key sequence information is used for segmentation, which yields a plurality of sequences; building on the similarity between two sequences, the key sequences are sorted in descending order of the average similarity among the sequences; the average similarity among multiple sequences is as follows:
[Formula image FDA0003039640030000031: definition of dist_average; not reproduced in this text]
in the formula, dist_average is the average similarity among the sequences obtained after segmentation by the key sequence, k is the number of sequences obtained after segmentation, and ComTime denotes the number of comparisons; after sorting, the key sequence with the highest average similarity is located at the frame header, and frame positioning and segmentation can be completed according to that sequence.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910609097.9A CN110336817B (en) 2019-07-08 2019-07-08 Unknown protocol frame positioning method based on TextRank

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910609097.9A CN110336817B (en) 2019-07-08 2019-07-08 Unknown protocol frame positioning method based on TextRank

Publications (2)

Publication Number Publication Date
CN110336817A CN110336817A (en) 2019-10-15
CN110336817B true CN110336817B (en) 2021-08-10

Family

ID=68143276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910609097.9A Active CN110336817B (en) 2019-07-08 2019-07-08 Unknown protocol frame positioning method based on TextRank

Country Status (1)

Country Link
CN (1) CN110336817B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139593B (en) * 2021-04-19 2022-06-21 湖南大学 Industrial control protocol message classification method and system based on conversation analysis

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101707532A (en) * 2009-10-30 2010-05-12 中山大学 Automatic analysis method for unknown application layer protocol
CN102523167A (en) * 2011-12-23 2012-06-27 中山大学 Optimal segmentation method of unknown application layer protocol message format
CN104767736A (en) * 2015-03-23 2015-07-08 电子科技大学 Method for separating unknown single protocol data stream into different types of data frames
CN104935567A (en) * 2015-04-20 2015-09-23 中国电子科技集团公司第二十九研究所 Unknown protocol message format deduction method
CN105791278A (en) * 2016-02-29 2016-07-20 中国工程物理研究院计算机应用研究所 Unknown binary protocol frame segmentation and hierarchical division method
CN107689899A (en) * 2017-09-01 2018-02-13 南京南瑞集团公司 A kind of unknown protocol recognition methods and system based on bit stream
CN108712414A (en) * 2018-05-16 2018-10-26 东南大学 A kind of binary system unknown protocol message format division methods based on sequence alignment
CN108924010A (en) * 2018-07-25 2018-11-30 北京科东电力控制系统有限责任公司 A kind of communication protocol recognition methods and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7650317B2 (en) * 2006-12-06 2010-01-19 Microsoft Corporation Active learning framework for automatic field extraction from network traffic
IL206240A0 (en) * 2010-06-08 2011-02-28 Verint Systems Ltd Systems and methods for extracting media from network traffic having unknown protocols

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于前导码挖掘的未知协议帧切分算法;雷东;《计算机应用》;20170210;全文 *

Also Published As

Publication number Publication date
CN110336817A (en) 2019-10-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant