CN110336817B - Unknown protocol frame positioning method based on TextRank

Info

Publication number: CN110336817B
Application number: CN201910609097.9A
Authority: CN
Inventors: 刘治国; 宋广跃; 蔡文珠
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2019-07-08
Filing date: 2019-07-08
Publication date: 2021-08-10
Anticipated expiration: 2039-07-08
Also published as: CN110336817A

Abstract

The invention discloses an unknown protocol frame positioning method based on TextRank, which introduces the idea of TextRank into the traditional unknown protocol frame positioning process. The method determines the voting weight of each node in a bit stream by counting the occurrence frequency of each sequence in the data, applies TextRank-style voting to identify the key sequences in the protocol data, and finally segments the bit stream according to each key sequence, calculates the sequence similarity between the resulting segments, and determines the frame header position of the unknown protocol data. With this method, unknown protocol data can be analyzed quickly and effectively, and the position of each frame in the bit stream data can be located accurately.

Description

Unknown protocol frame positioning method based on TextRank

Technical Field

The invention belongs to the field of communication, and particularly relates to an unknown protocol frame positioning method based on TextRank.

Background

With the continuous development of computer network technology, more and more proprietary protocols are used in data transmission. Such protocols usually have fixed formats that are not publicly disclosed, so analyzing their formats is of great significance for building a secure network environment. When the protocol is known, the receiving party can determine the position of the frame header from the frame synchronization code and parse the bit stream data according to the protocol format. For an unknown protocol, however, a listener cannot effectively analyze the communication data it intercepts. How to identify a proprietary protocol from intercepted communication data is currently an important research topic, and determining the frame header and frame tail in the bit stream data to obtain complete frames is the first problem to be solved.

Disclosure of Invention

To solve the prior-art problem that frames are difficult to delimit, the invention provides an Unknown Protocol Frame Positioning Method based on TextRank (UPFLM). For unknown protocol data in bit stream form, the method can accurately mine the key sequences in the data, distinguish the frame header sequence and its position among those key sequences, and complete frame positioning and segmentation according to that sequence.

To achieve this purpose, the technical solution of the application is as follows: an unknown protocol frame positioning method based on TextRank comprises the following steps:

step one: enumerate all target sequences of the given target sequence length n, and construct a dictionary array;

step two: identify the sequence formed by the first n bits of the bit stream B, set the current state station accordingly, and update the corresponding importance value;

step three: read in the value new_bit of the next bit, jump to the next state new_station according to the state jump function, and update the importance value of new_station;

step four: if the bit stream B has been completely read, jump to step five; otherwise, repeat step three until the bit stream B has been completely read;

step five: sort the states according to the importance value in each (state, importance value) two-tuple, and output the state information.
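Steps one to five can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `count_states` is hypothetical, the "importance value" is assumed to be a plain occurrence count, and the state jump function is read as `(station % 2^(n-1)) * 2 + new_bit`.

```python
from typing import List, Tuple


def count_states(bits: List[int], n: int) -> List[Tuple[int, int]]:
    """Count every length-n window of `bits`; return (state, count) pairs, most frequent first."""
    if n <= 0 or len(bits) < n:
        return []
    # Step one: one dictionary slot per possible length-n sequence.
    importance = [0] * (1 << n)
    # Step two: the first n bits give the initial state.
    station = 0
    for b in bits[:n]:
        station = station * 2 + b
    importance[station] += 1
    # Steps three and four: read one bit at a time until the stream ends.
    for new_bit in bits[n:]:
        station = (station % (1 << (n - 1))) * 2 + new_bit
        importance[station] += 1
    # Step five: sort the (state, importance) pairs by importance (ASSUMED: a raw count).
    pairs = [(s, c) for s, c in enumerate(importance) if c > 0]
    pairs.sort(key=lambda p: (-p[1], p[0]))
    return pairs
```

Each bit is read once and each update is a constant-time array access, so the whole pass is linear in the length of the bit stream, at the cost of an array with 2^n entries.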

Further, the state jump function is:

new_station = (station % 2^(n-1)) * 2 + new_bit

in the formula: n denotes the target sequence length, new_bit denotes the value of the next bit, station denotes the current state, and new_station denotes the next state to jump to.
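A short worked example may make the state jump function concrete. Reading it as `(station % 2^(n-1)) * 2 + new_bit`, the modulo drops the oldest bit of the current window, the multiplication shifts the remaining bits left, and the addition appends the new bit. The function names below are hypothetical, and the shift-and-mask variant is an equivalent form added for illustration, not something stated in the source.

```python
def next_station(station: int, new_bit: int, n: int) -> int:
    """State jump: drop the oldest bit, shift left, append the new bit."""
    return (station % (1 << (n - 1))) * 2 + new_bit


def next_station_mask(station: int, new_bit: int, n: int) -> int:
    """Equivalent shift-and-mask form (illustrative alternative, not from the source)."""
    return ((station << 1) & ((1 << n) - 1)) | new_bit


n = 4
station = 0b1011                      # current window: 1011
after = next_station(station, 0, n)   # read a 0, so the window becomes 0110
print(format(after, "04b"))           # prints 0110
```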

Further, the output state information is processed as follows: calculating the initial weight of the state, calculating the weight of the node and extracting the key sequence.

Further, the purpose of calculating the initial weight of the state is to set the voting weight of the state, and the calculation process is as follows:

where station_i denotes the state corresponding to the sequence of length n that starts at the i-th bit of the bit stream B, VW(station_i) denotes the voting weight of station_i, P(station_i) denotes the actual frequency of occurrence of station_i in the bit stream B, and P_average is the expected value of the frequency of occurrence of a sequence of length n.
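The formula for the voting weight is not reproduced in this text, so the following Python sketch only illustrates how the quantities defined above could be computed. Two points are assumptions made here, not statements from the source: P_average is taken as 1 / 2^n (every length-n sequence equally likely), and VW is taken as the ratio P(station_i) / P_average. The function name `voting_weights` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def voting_weights(bits: List[int], n: int) -> Dict[int, float]:
    """Map each observed state to an ASSUMED voting weight P(station) / P_average."""
    total = len(bits) - n + 1          # number of length-n windows in the stream
    if n <= 0 or total <= 0:
        return {}
    counts: Counter = Counter()
    station = int("".join(map(str, bits[:n])), 2)
    counts[station] += 1
    for new_bit in bits[n:]:
        station = (station % (1 << (n - 1))) * 2 + new_bit
        counts[station] += 1
    p_average = 1.0 / (1 << n)         # ASSUMPTION: uniform expectation over 2^n sequences
    # ASSUMPTION: the weight is the actual frequency relative to the expected frequency.
    return {s: (c / total) / p_average for s, c in counts.items()}
```

Under these assumptions a state that occurs exactly as often as expected receives weight 1, and over-represented states receive weights above 1.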

Further, the node weight is calculated by letting the neighboring nodes of a node vote for it, which yields the weight WS of that node in the bit stream B; WS is given by the following formula:

in the formula: node_i denotes the node of length n that starts at the i-th bit of the bit stream; WS is the weight of the node; station_t denotes the state corresponding to node_i; VW is the initial weight of the state; d denotes the damping coefficient, that is, the probability that a node points to any other node, and is usually set to the empirical value 0.85.
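The WS formula is likewise not reproduced in this text, so the sketch below is not the patent's formula. It applies the generic weighted TextRank update to the nodes of a bit stream under assumptions chosen purely for illustration: the neighbors of node i are the nodes at positions i-1 and i+1, the weight of a vote cast for node i is the initial weight of node i's state, and a fixed number of iterations is run. The function name `node_weights` and its parameters are hypothetical.

```python
from typing import Dict, List


def node_weights(states: List[int], vw: Dict[int, float],
                 d: float = 0.85, iterations: int = 50) -> List[float]:
    """Generic weighted TextRank on a chain; states[i] is the state of node i."""
    m = len(states)
    # ASSUMPTION: a node's neighbors are the nodes immediately before and after it.
    neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < m] for i in range(m)]
    # Total weight of the votes each node hands out (needs positive initial weights).
    out_total = [sum(vw[states[k]] for k in neighbors[j]) for j in range(m)]
    ws = [1.0] * m
    for _ in range(iterations):
        # Node j splits its score among its neighbors in proportion to their state weights.
        ws = [
            (1 - d) + d * sum(vw[states[i]] / out_total[j] * ws[j] for j in neighbors[i])
            for i in range(m)
        ]
    return ws
```

With these choices the total score is conserved from one iteration to the next, so the values stay bounded; a production version would iterate until the change falls below a tolerance instead of using a fixed count.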

Further, once the weights of the states in the bit stream have been calculated, a key sequence in the bit stream must appear as a run of consecutive states with high weights, so the long key sequences can be extracted by the following steps:

step 1: find the maximum value max_WS among the weights WS of all nodes;

step 2: traverse the nodes in the bit stream B in order;

step 3: if the weight of the node is greater than 0.75 × max_WS, judge the sequence corresponding to the node to be a key sequence and execute step 4; otherwise, jump to step 2;

step 4: if the weight of the next node is also greater than 0.75 × max_WS, combine the two sequences into one sequence according to their positional relation and repeat step 4; otherwise, jump to step 5;

step 5: store the obtained key sequence and record its start position; if the traversal is finished, jump to step 6; otherwise, execute step 2;

step 6: output the obtained long key sequence information.
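The six extraction steps can be sketched in Python as below. The function name `extract_key_sequences` and the `ratio` parameter are hypothetical. The sketch assumes that node k covers `bits[k : k + n]`, so that merging consecutive nodes "according to their positional relation" means taking the union of the bit ranges they cover.

```python
from typing import List, Tuple


def extract_key_sequences(bits: List[int], ws: List[float], n: int,
                          ratio: float = 0.75) -> List[Tuple[int, List[int]]]:
    """Return (start position, merged bit sequence) for each run of high-weight nodes."""
    if not ws:
        return []
    threshold = ratio * max(ws)                 # step 1: 0.75 x max_WS
    found = []
    i = 0
    while i < len(ws):                          # step 2: traverse the nodes in order
        if ws[i] > threshold:                   # step 3: node belongs to a key sequence
            end = i
            while end + 1 < len(ws) and ws[end + 1] > threshold:
                end += 1                        # step 4: absorb the next high-weight node
            # Nodes i..end together cover bits[i : end + n] (ASSUMED merge rule).
            found.append((i, bits[i:end + n]))  # step 5: store sequence and start position
            i = end + 1
        else:
            i += 1
    return found                                # step 6: output the long key sequences
```

For example, with n = 3 and weights `[0.1, 0.9, 1.0, 0.2, 0.8, 0.3]`, nodes 1 and 2 merge into one four-bit sequence starting at position 1, and node 4 yields a separate three-bit sequence.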

Further, segmentation is performed according to the obtained long key sequence information, which yields a plurality of sequences; building on the similarity between two sequences, the average similarity among these sequences is used as the criterion, and the key sequences are sorted from high to low average similarity; the average similarity among multiple sequences is given by the following formula:

in the formula, dist_average is the average similarity among the sequences obtained after segmentation by the key sequence, k is the number of sequences obtained after segmentation, and ComTime denotes the number of comparisons; after sorting, the key sequence with the highest average similarity is the one located in the frame header, and frame positioning and segmentation can then be completed according to this sequence.
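Because the similarity formula is not reproduced in this text, the following Python sketch only shows one way the ranking could work. Every specific in it is an assumption: the bit stream is cut at each occurrence of a candidate key sequence, the pairwise similarity is the share of equal bits over the longer segment, ComTime is taken as k(k-1)/2 (all pairs), and dist_average is the mean over those pairs. All function names are hypothetical, and bit streams are represented as strings of '0' and '1' for brevity.

```python
from itertools import combinations
from typing import List


def split_on_key(bits: str, key: str) -> List[str]:
    """Cut the stream at each occurrence of `key`; every segment starts with `key`."""
    if not key:
        return []
    starts = []
    pos = bits.find(key)
    while pos != -1:
        starts.append(pos)
        pos = bits.find(key, pos + len(key))
    return [bits[a:b] for a, b in zip(starts, starts[1:] + [len(bits)])]


def similarity(a: str, b: str) -> float:
    """ASSUMED pairwise measure: share of equal bits over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def average_similarity(segments: List[str]) -> float:
    """Mean pairwise similarity; ComTime is ASSUMED to be k * (k - 1) / 2."""
    k = len(segments)
    com_time = k * (k - 1) // 2
    if com_time == 0:
        return 0.0
    return sum(similarity(a, b) for a, b in combinations(segments, 2)) / com_time


def rank_keys(bits: str, keys: List[str]) -> List[str]:
    """Sort candidate key sequences from highest to lowest average similarity."""
    return sorted(keys, key=lambda k: average_similarity(split_on_key(bits, k)),
                  reverse=True)
```

The intuition is that cutting at the true frame header produces segments that are whole frames, which share header fields and therefore resemble each other more than segments cut at an arbitrary recurring pattern.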

By adopting the above technical solution, the invention achieves the following technical effects: unknown protocol data can be analyzed quickly and effectively, and the position of each frame in the bit stream data can be located accurately. The method addresses both the speed of sequence statistics when a large amount of bit stream data has been collected and the difficulty of determining the frame start position when dealing with unknown protocol data.

Drawings

FIG. 1 compares the sequence statistics time of the UPFLM method for target sequences of different lengths;

FIG. 2 compares the sequence statistics time for target sequences of different lengths on different data sets;

FIG. 3 is a partial node weight graph obtained by the UPFLM method;

FIG. 4 compares the frame positioning accuracy of the UPFLM method.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it should be understood that the described examples are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

In the method, the characteristics of protocol data in bit stream form are fully considered in the sequence statistics process, and the traditional Aho-Corasick (AC) algorithm is improved so that it counts all target sequences of a specified length that appear in the bit stream. To cover every possible target sequence, the target sequences are generated by enumerating all sequences of length n and are stored as an array that forms a dictionary; this replaces the Trie and reduces space usage, and each target sequence can be defined as a state. Because every sequence of length n is in the dictionary, no match can fail during the statistics, so the failure pointer plays no role. To realize jumps between states, a state jump function is proposed that takes into account the relation between the state information and the newly read data:

new_station = (station % 2^(n-1)) * 2 + new_bit

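The specific process is as follows:

step one: enumerate all target sequences of the given target sequence length n, and construct a dictionary array;

step two: identify the sequence formed by the first n bits of the bit stream B, set the current state station accordingly, and update the corresponding importance value;

step three: read in the value new_bit of the next bit, jump to the next state new_station according to the state jump function, and update the importance value of new_station;

step four: if the bit stream B has been completely read, jump to step five; otherwise, repeat step three until the bit stream B has been completely read;

step five: sort the states according to the importance value in each (state, importance value) two-tuple, and output the state information.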

To obtain the key sequences in the bit stream from the sequence statistics, the idea of TextRank, a keyword extraction method from natural language processing, is introduced: the weight of each node is obtained through the voting principle, and key sequences are extracted in the way keywords are. Because the data is in bit stream form, the method consists of three parts: state initial weight calculation, node weight calculation, and key sequence extraction.

The purpose of the state initial weight calculation is to set the voting weight of the state, and the calculation process is as follows:

Node weight calculation lets the neighboring nodes of a node vote for it, which yields the weight WS of that node in the bit stream B; WS can be represented by the following formula:

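Once the weights of the states in the bit stream have been calculated, a key sequence in the bit stream must appear as a run of consecutive states with high weights, so the long key sequences can be extracted by the following process:

step 1: find the maximum value max_WS among the weights WS of all nodes;

step 2: traverse the nodes in the bit stream B in order;

step 3: if the weight of the node is greater than 0.75 × max_WS, judge the sequence corresponding to the node to be a key sequence and execute step 4; otherwise, jump to step 2;

step 4: if the weight of the next node is also greater than 0.75 × max_WS, combine the two sequences into one sequence according to their positional relation and repeat step 4; otherwise, jump to step 5;

step 5: store the obtained key sequence and record its start position; if the traversal is finished, jump to step 6; otherwise, execute step 2;

step 6: output the obtained long key sequence information.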

Segmentation is performed according to the obtained long key sequence information, which yields a plurality of sequences; building on the similarity between two sequences, the average similarity among these sequences is used as the criterion, and the key sequences are sorted from high to low average similarity; the average similarity among multiple sequences is given by the following formula:

The effect of the present invention is demonstrated below by way of an experimental example.

The UPFLM method was implemented on the Visual Studio 2015 platform. Communication data of the same host at different times were captured with Wireshark, and the captured data packets were converted into a continuous bit stream to generate the data sets for the experiments. Data sets J1 and J3 are TCP protocol data, J2 is UDP protocol data, and J4 is mixed TCP/UDP protocol data. The numbers of the data packets contained in J1, J2, J3 and J4 are 500, 1000 and 1500 respectively.
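The conversion of captured packets into a continuous bit stream can be illustrated with a few lines of Python. This is not the authors' tooling: the function name `packets_to_bits` is hypothetical, the packets are assumed to be available as raw `bytes` objects already exported from the capture, and the bit order within each byte is assumed to be most significant bit first.

```python
from typing import Iterable, List


def packets_to_bits(packets: Iterable[bytes]) -> List[int]:
    """Concatenate packets and expand every byte into 8 bits, most significant bit first."""
    bits: List[int] = []
    for packet in packets:
        for byte in packet:
            # ASSUMPTION: MSB-first order within each byte.
            bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
    return bits
```

Packet boundaries are deliberately discarded here, since the point of the experiment is to recover frame positions from the undelimited stream.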

Sequences of different lengths were first counted in data set J1, and the results were compared with those of the modified AC algorithm and the conventional AC algorithm; the statistics times are shown in FIG. 1. To observe the influence of data set size on the sequence statistics process, experiments were then carried out with different data sets under different target sequence lengths; the results are shown in FIG. 2.

To further verify the effectiveness of the UPFLM method, the results of the sequence statistics on data set J1 are processed to obtain the node weights; a portion of these weights is shown in FIG. 3.

In the experiment, accuracy is used as the measure of the frame positioning method, and the frame positioning accuracy R can be represented by the following formula:

R = F_recog / F_total

in the formula: F_recog denotes the number of accurately positioned frames, and F_total denotes the number of frames contained in the data set.
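Reading R as the ratio of F_recog to F_total, the accuracy could be computed as in the short Python sketch below. The function name is hypothetical, and "accurately positioned" is assumed here to mean that a predicted frame start offset coincides exactly with a true one.

```python
from typing import Iterable


def frame_positioning_accuracy(predicted: Iterable[int], actual: Iterable[int]) -> float:
    """R = F_recog / F_total, using frame start offsets in bits."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    # ASSUMPTION: a frame counts as recognized only on an exact offset match.
    f_recog = len(actual_set & set(predicted))
    return f_recog / len(actual_set)
```

For instance, predicting starts at offsets 0, 32, 70 and 96 when the true starts are 0, 32, 64 and 96 gives R = 3 / 4.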

The comparison of the frame positioning accuracy of the UPFLM method obtained from the simulation is shown in FIG. 4.

In summary, the following conclusions can be drawn:

1. The invention improves the traditional AC algorithm, making it better suited to counting target sequences of a specified length in bit stream data.

2. The invention solves the problem that the frame start position is difficult to determine when dealing with unknown protocol data.

3. The invention can effectively shorten the sequence statistical time so as to accelerate the frame positioning speed and effectively improve the accuracy of the frame positioning.

The above description is only a preferred embodiment of the present invention and is not intended to limit it. It should be noted that those skilled in the art can make many modifications and variations without departing from the technical principle of the present invention, and these modifications and variations shall also fall within the protection scope of the present invention.

Claims

1. An unknown protocol frame positioning method based on TextRank is characterized by comprising the following steps:

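step one: enumerating all target sequences of the given target sequence length n, and constructing a dictionary array;

step two: identifying the sequence formed by the first n bits of the bit stream B, setting the current state station accordingly, and updating the corresponding importance value;

step three: reading in the value new_bit of the next bit, jumping to the next state new_station according to the state jump function, and updating the importance value of new_station;

step four: if the bit stream B has been completely read, jumping to step five; otherwise, repeating step three until the bit stream B has been completely read;

step five: sorting the states according to the importance value in each (state, importance value) two-tuple, and outputting the state information;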

the state jump function is:

new_station = (station % 2^(n-1)) * 2 + new_bit

in the formula: n represents the target sequence length, new_bit represents the value of the next bit, station represents the current state, and new_station represents the next state to jump to;

the output state information is processed as follows: calculating state initial weight, calculating node weight and extracting a key sequence;

the purpose of calculating the initial weight of the state is to set the voting weight of the state, and the calculation process is as follows:

wherein station_i represents the state corresponding to the sequence of length n that starts at the i-th bit of the bit stream B, VW(station_i) represents the voting weight of station_i, P(station_i) represents the actual frequency of occurrence of station_i in the bit stream B, and P_average is the expected value of the frequency of occurrence of a sequence of length n;

the node weight is calculated by letting the neighboring nodes of a node vote for it, so as to obtain the weight WS of that node in the bit stream B, where WS is represented by the following formula:

in the formula: node_i represents the node of length n that starts at the i-th bit of the bit stream; WS is the weight of the node; station_t represents the state corresponding to node_i; VW is the initial weight of the state; d represents the damping coefficient, that is, the probability that a node points to any other node;

the extraction of long key sequences is carried out by the following steps:

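step 1: finding the maximum value max_WS among the weights WS of all nodes;

step 2: traversing the nodes in the bit stream B in order;

step 3: if the weight of the node is greater than 0.75 × max_WS, judging the sequence corresponding to the node to be a key sequence and executing step 4; otherwise, jumping to step 2;

step 4: if the weight of the next node is also greater than 0.75 × max_WS, combining the two sequences into one sequence according to their positional relation and repeating step 4; otherwise, jumping to step 5;

step 5: storing the obtained key sequence and recording its start position; if the traversal is finished, jumping to step 6; otherwise, executing step 2;

step 6: outputting the obtained long key sequence information.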

2. The TextRank-based unknown protocol frame positioning method according to claim 1, characterized in that segmentation is performed according to the obtained long key sequence information, which yields a plurality of sequences; building on the similarity between two sequences, the average similarity among these sequences is used as the criterion, and the key sequences are sorted from high to low average similarity; the average similarity among multiple sequences is given by the following formula:

in the formula, dist_average is the average similarity among the sequences obtained after segmentation by the key sequence, k is the number of sequences obtained after segmentation, and ComTime represents the number of comparisons; after sorting, the key sequence with the highest average similarity is the one located in the frame header, and frame positioning and segmentation can then be completed according to this sequence.