CN112367325B - Unknown protocol message clustering method and system based on closed frequent item mining

Info

Publication number: CN112367325B
Application number: CN202011266863.5A
Authority: CN
Inventors: 洪征; 李毅豪; 林培鸿
Original assignee: Army Engineering University of PLA
Current assignee: Army Engineering University of PLA
Priority date: 2020-11-13
Filing date: 2020-11-13
Publication date: 2023-11-07
Anticipated expiration: 2040-11-13
Also published as: CN112367325A

Abstract

The application discloses a method and a system for clustering unknown protocol messages based on closed frequent item mining, which convert the datagrams of a target protocol into messages and divide the messages into different types. The messages are first segmented into short sequences, and closed frequent items are mined from the messages according to the segments and their frequencies. On this basis, the messages are vectorized according to the closed frequent items, and the t-SNE algorithm is then used to reduce the dimension of the message vectors. Finally, the messages are clustered by a self-organizing map neural network according to their vector information. The method is suitable for network communication protocols with unknown protocol specifications. It clusters messages by using the closed frequent items in the protocol messages as features, overcomes the low accuracy of traditional sequence alignment methods when applied to message clustering, and offers strong universality and high clustering accuracy.

Description

Unknown protocol message clustering method and system based on closed frequent item mining

Technical Field

The application relates to a clustering method of network communication messages, in particular to a clustering method and a clustering system of unknown protocol messages based on closed frequent item mining, and belongs to the technical field of networks.

Background

A network protocol is a set of rules, standards, or conventions established for data exchange in a computer network. Network protocols are an integral part of computer networks and regulate the communication process between network entities. Network security applications such as network management, traffic monitoring, vulnerability discovery, and intrusion detection all rely on protocol specifications. However, for business or privacy reasons, a large number of protocol specifications are not disclosed. In addition, many malware programs use custom protocols to communicate. All of these are unknown protocols.

Protocol reverse engineering refers to the process of extracting protocol syntax, semantics, and synchronization information by monitoring and analyzing the network input and output, system behavior, and instruction execution flow of protocol entities, without relying on the protocol description. It is the main method of obtaining specification information for unknown protocols.

Clustering network messages, that is, aggregating protocol messages of the same type, is an important step in protocol reverse engineering. In a real network environment, the communication messages of various network protocols are interleaved, and one network protocol generally contains numerous message types, which presents a great challenge for protocol reverse analysis. Therefore, when a protocol is reverse engineered, the communication messages in the network must first be clustered so that messages of the same type are grouped together. Analysis performed on this basis reduces the difficulty of protocol reverse engineering and improves the accuracy of its results.

Each network protocol typically contains multiple message types. For example, the HTTP protocol has "GET" type messages and "POST" type messages. For protocols whose specifications are known, message clustering can use the protocol features to gather messages of the same type, which is relatively easy to implement. If the protocol specification is unknown, however, message clustering is not easy. The application focuses on the message clustering problem for communication protocols whose specifications are unknown.

The clustering of unknown protocol messages requires consideration of how to aggregate the same type of messages together without prior knowledge of the protocol. A network protocol often contains multiple message types, and the present application aims to aggregate captured network protocol messages into multiple clusters, where the messages in each cluster correspond to one message type of the protocol.

The PI project (Protocol Informatics Project) is the earliest automated protocol reverse engineering project. It applies a sequence alignment algorithm from bioinformatics to measure message similarity, builds a message similarity matrix from the similarities, and then clusters the messages using the unweighted pair group method with arithmetic mean (UPGMA). However, when message similarity is measured by sequence alignment and the messages are then clustered, differences in message type caused by local differences cannot be found. For example, two SMTP messages are captured in the network: "HELO CROW.ey RIE.af.mil" and "EHLO CROW.ey RIE.af.mil", which mean a connection requiring no user authentication and a connection requiring user authentication, respectively. The difference in message type comes from the small local difference between "HELO" and "EHLO". Because sequence alignment cannot detect such small local differences, clustering based on it yields results of low accuracy.

Siyu Tao and other researchers measure message similarity with the Needleman-Wunsch algorithm and cluster the messages with a K-means algorithm guided by the silhouette coefficient. This method does not need the value of K to be known in advance, because the optimal K can be selected automatically under the guidance of the silhouette coefficient. However, like the PI project, it relies on a sequence alignment algorithm and therefore has difficulty finding differences in message type caused by small differences between messages.

Other researchers proposed the SeqCluster clustering method on the basis of a sequence alignment algorithm. Unlike traditional sequence alignment, SeqCluster increases the reward for consecutive matches as an arithmetic progression when measuring message similarity, so that consecutive byte matches carry a higher weight and the similarity measure improves. However, the measure is unfair when messages differ in length: long messages tend to earn higher rewards than short ones, which distorts the similarity calculation and makes an ideal clustering effect difficult to achieve.

Protocol messages contain control characters such as carriage return and line feed, which give the message a structural feature called the message contour feature. Li Yang and other researchers use the contour features of a message to convert it into a binary image, and then cluster the messages by image similarity. However, this method is only suitable for messages that have contour features such as carriage return and line feed characters, and it cannot cluster protocols without delimiters.

Mingming Xiao and other researchers observe that most protocols use delimiters to separate protocol fields. They use the delimiters to split messages recursively with a hierarchical tree to obtain the fields, and then reverse engineer the protocol. However, this method requires knowing in advance which delimiters the protocol uses, and this information is difficult to obtain for an unknown protocol.

In general, most existing mainstream message clustering methods use a sequence alignment algorithm to measure message similarity and then cluster the messages according to that similarity. Their limitation is that a sequence alignment algorithm cannot identify changes in message type caused by small differences between messages. In addition, some message clustering methods are only suitable for protocols with specific characteristics and therefore lack universality, such as the contour feature method proposed by Li Yang and others and the delimiter-based method proposed by Mingming Xiao and others.

Disclosure of Invention

The application aims to overcome the defects of the prior art and provide a network message clustering method and system based on closed frequent item mining.

The application adopts the following technical scheme.

In one aspect, the application provides an unknown protocol message clustering method based on closed frequent item mining, which comprises the following steps: converting the acquired datagrams into messages; dividing each message into short sequences; extracting frequent items from the short sequences according to the occurrence frequency of the short sequences and a set frequency threshold, and screening the frequent items according to the closed attribute to obtain closed frequent items; vectorizing the messages based on the closed frequent items, and performing dimension reduction on the vectors to obtain dimension-reduced message vectors;

for the dimension-reduced message vectors, clustering the message vectors with a self-organizing map neural network according to the distance between the vectors, so that message vectors of the same type are grouped together.

Further, for the application layer datagram transmitted through the TCP protocol, a new application layer message is separated from the previous application layer message according to the TCP FIN flag and the TCP SYN flag, and is recombined, so as to obtain a complete application layer message.

For application layer datagrams transmitted over the UDP protocol, the payload of each UDP datagram is considered to be an independent application layer message.

Further, before the messages are divided into short sequences, they are classified into three types: text type messages, binary type messages, and mixed type messages containing both text and binary characters. The method for dividing a message into short sequences comprises the following steps:

binary type messages and text type messages are segmented directly with the n-gram word segmentation method; for mixed type messages, the boundaries between binary content and text content are determined first, and the different types of content are then segmented separately.

Still further, when the n-gram word segmentation method is used for word segmentation, the value of n increases one by one from the set minimum value to the set maximum value.

Further, the process of extracting frequent items from the short sequences according to their occurrence frequency and the set frequency threshold comprises the following steps: counting the number of occurrences of each short sequence, and taking the ratio of the number of occurrences of a short sequence to the total number of short sequences as the frequency of that short sequence; a short sequence is a frequent item if its frequency exceeds the set frequency threshold, and otherwise it is not a frequent item.

Further, screening the frequent items according to the closed attribute specifically includes: checking in turn whether each frequent item has the closed attribute, and selecting the frequent items that have it to form a closed frequent item set. Sequence A in a set is judged to have the closed attribute if and only if no sequence in the set is both a supersequence of sequence A and equal in frequency to sequence A.

Further, based on the closed frequent item, the specific method for vectorizing the message and performing dimension reduction on the vector to obtain the dimension reduced message vector comprises the following steps:

each message is given a vectorized representation based on the closed frequent item set: if the message contains a given closed frequent item, the corresponding element is set to 1; if the closed frequent item does not appear in the message, the corresponding element is set to 0; the t-SNE method is then used to reduce the dimension of the message vectors, converting the high-dimensional message vectors into two-dimensional message vectors.

Further, the self-organizing map clustering process includes: by inputting the message vector after the dimension reduction into the self-organizing map neural network, the neural network discovers the rule of the message vector and the interrelation between the message vectors. The clustering results in a neural network in which a set of message vectors in the vicinity of each neuron are considered to belong to the same cluster, representing that the message vectors belong to the same type.

In a second aspect, the present application provides an unknown protocol message clustering system based on closed frequent item mining, including: a message capturing module, a short sequence segmentation module, a closed frequent item acquisition module, a message vector generation module, and a message vector clustering module;

the message capturing module is used for converting the datagrams captured in the network into messages;

the short sequence segmentation module is used for segmenting the message into short sequences;

the closed frequent item acquisition module is used for extracting frequent items from the short sequences according to the occurrence frequency of the short sequences and a set frequency threshold, and then screening the frequent items according to the closed attribute to obtain closed frequent items;

the message vector generation module is used for vectorizing the message based on the closed frequent item and carrying out dimension reduction on the vector to obtain a dimension-reduced message vector;

the message vector clustering module is used for clustering the dimension-reduced message vectors with a self-organizing map neural network according to the distance between the vectors, so that message vectors of the same type are grouped together.

The beneficial technical effects obtained by the application are as follows:

The method is suitable for clustering network protocol messages with unknown protocol specifications. Because no sequence alignment algorithm is used during message clustering, the method avoids the problem that a sequence alignment algorithm cannot find the local differences that distinguish message types. In addition, the method is universal: the clustering process does not use any feature specific to particular messages, but clusters the messages by the common feature of closed frequent items.

Aiming at the characteristics of an unknown protocol, the application adopts an n-gram algorithm with n value change, thereby obtaining short sequences with different lengths and further obtaining a closed frequent item corresponding to the unknown protocol; in addition, the self-organizing map neural network is adopted to cluster the messages, the self-organizing map clustering method does not need prior knowledge of protocols, the network is adaptively adjusted according to the input samples, the type relation of the input samples is determined, and the method is suitable for cluster analysis of the messages with unknown protocols. In general, the method and the device can effectively improve the accuracy of the clustering of the unknown protocol messages.

Drawings

FIG. 1 is a schematic diagram of an overall implementation flow of an embodiment of the present application.

Detailed Description

In order to make the objects, features and advantages of the present application more obvious and understandable, the technical solutions of the embodiments of the present application will be clearly and completely described in the following in conjunction with the overall implementation flow diagrams of the present application, and it is apparent that the embodiments described below are only some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

As shown in fig. 1, according to a preferred embodiment of the present application, a network message clustering method based on closed frequent item mining includes the following steps:

(1) Data preprocessing: after the datagrams of the target protocol are acquired according to the port information, the datagrams (Packet) captured in the network are converted into messages (Message). Each message is then classified at coarse granularity as a text message, a binary message, or a mixed message, according to whether the maximum number of consecutive printable characters in the message exceeds a set threshold.

(2) Short sequence division: for the different message types, the n-gram idea is adopted to divide each message into a plurality of short sequences. Because the length of the message keywords cannot be known in advance, the value of n is increased step by step from a set minimum value to a set maximum value, giving short sequences of different lengths and ensuring that complete keywords are contained in the n-gram word segmentation result.

(3) Closed frequent item mining: frequent items are extracted from the short sequences according to their occurrence frequency and a set frequency threshold. All frequent items are then screened according to the closed attribute, those without the closed attribute are filtered out, and the resulting closed frequent items are used as message features in the subsequent processing.

(4) Message feature vectorization: the closed frequent items are taken as the message features of the unknown protocol messages, and each message is given a vectorized representation in which the corresponding bit is set to 1 when the message contains a given closed frequent item and to 0 otherwise. After every message has been vectorized in this way, the t-SNE method is used to reduce the dimension of the vectors and obtain two-dimensional output data.

(5) Self-organizing map clustering: by inputting the message vectors subjected to dimension reduction into the self-organizing map neural network, the neural network can find the rule of the message vectors and the relation between the message vectors, and meanwhile, the network is adaptively adjusted according to the message vectors, and finally the output neural network can find the type relation of all the message vectors. After the final neural network is obtained through clustering, the set of message vectors near each neuron is considered to belong to the same cluster, and the clustering result consists of a plurality of clusters, wherein each cluster represents one type of message vector.

Referring to the overall implementation flow shown in fig. 1, the unknown protocol message clustering method based on closed frequent item mining in this embodiment consists of five parts: data preprocessing, short sequence division, closed frequent item mining, message feature vectorization, and self-organizing map clustering. Specific embodiments of each part are described below.

(1) Data preprocessing

The data preprocessing of the embodiment of the application comprises two stages of message extraction and message classification.

In the message extraction stage, network communication messages are first collected with a packet capture tool such as Wireshark. If only the communication messages of a specific port are wanted, the traffic can be filtered by the communication port information so that only the messages of the corresponding port are retained for analysis and processing. Application layer data is transmitted through the TCP protocol or the UDP protocol of the transport layer. For messages transmitted via the TCP protocol, a new application layer message is separated from the previous application layer message by the TCP FIN flag and the TCP SYN flag. In addition, when datagrams (Packet) are transmitted over TCP, the maximum segment size (Maximum Segment Size) is limited: if an application layer message (Message) exceeds a certain length, it may be fragmented and encapsulated into multiple TCP datagrams for transmission. Therefore, the datagrams transmitted by the TCP protocol need to be reassembled to obtain the complete application layer message.

There is no maximum segment size limit when the UDP protocol is used to transfer application layer data. The payload of each UDP datagram may be considered a separate application layer message.
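The message extraction stage can be illustrated with a minimal sketch. It assumes that the segments of one TCP flow have already been captured and put in order, and it represents each segment as a dictionary with hypothetical keys `syn`, `fin`, and `payload`; these names and the function names are illustrative and do not come from the application.

```python
from typing import Dict, List


def reassemble_tcp_messages(segments: List[Dict]) -> List[bytes]:
    """Join in-order TCP payloads into application layer messages.

    Assumption: a SYN starts a new message and a FIN ends the current one,
    following the separation rule described in the text.
    """
    messages: List[bytes] = []
    current = bytearray()
    for seg in segments:
        if seg.get("syn") and current:
            # A new connection begins: flush what has been collected so far.
            messages.append(bytes(current))
            current = bytearray()
        current.extend(seg.get("payload", b""))
        if seg.get("fin") and current:
            # The connection ends: the collected bytes form one message.
            messages.append(bytes(current))
            current = bytearray()
    if current:
        messages.append(bytes(current))
    return messages


def udp_messages(payloads: List[bytes]) -> List[bytes]:
    """Each UDP payload is treated as one independent message."""
    return list(payloads)


flow = [
    {"syn": True, "payload": b""},
    {"payload": b"GET /index"},
    {"payload": b".html"},
    {"fin": True, "payload": b""},
    {"syn": True, "payload": b""},
    {"payload": b"POST /x"},
    {"fin": True, "payload": b""},
]
print(reassemble_tcp_messages(flow))
```

A real capture would also need to handle retransmissions, out-of-order segments, and both directions of a flow, which this sketch leaves out.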

In the message classification stage, in order to acquire the message format accurately, the embodiment of the application distinguishes protocol messages according to the number of consecutive ASCII printable characters, and determines whether a message is a text type message, a binary type message, or a mixed message containing both text and binary characters. The distinction serves two purposes. First, the messages are roughly classified, which makes the subsequent clustering more accurate. For example, a text type message and a binary type message will generally not belong to the same message type. Second, the three types of messages can be processed differently in the subsequent steps. Text type messages and binary type messages are segmented with ordinary word segmentation, whereas for mixed type messages the boundary between binary content and text content must first be established and the different contents are then segmented separately, which makes the processing more efficient.

The specific method for classifying the messages is as follows: if all characters in the message are ASCII printable characters, the message is judged to be a text type message; if the number of consecutive ASCII printable characters in the message exceeds a preset threshold, the message is judged to be a mixed type message; if the number of consecutive ASCII printable characters in the message does not exceed the preset threshold, the message is judged to be a binary type message.
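The classification rule can be sketched as follows. The threshold value used in the example is hypothetical, since the application does not state one, and counting tab, carriage return, and line feed as printable is an assumption made here so that line-oriented text protocols are classified as text.

```python
# Assumption: bytes 0x20-0x7E plus TAB, LF and CR count as "printable".
PRINTABLE = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def max_printable_run(message: bytes) -> int:
    """Length of the longest run of consecutive printable bytes."""
    longest = run = 0
    for byte in message:
        if byte in PRINTABLE:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def classify_message(message: bytes, threshold: int) -> str:
    """Return 'text', 'mixed' or 'binary' following the three rules above."""
    if message and all(byte in PRINTABLE for byte in message):
        return "text"
    if max_printable_run(message) > threshold:
        return "mixed"
    return "binary"


print(classify_message(b"GET / HTTP/1.1\r\n", threshold=4))
print(classify_message(b"\x00\x01hello world\x00", threshold=4))
print(classify_message(b"\x00\x01\x02\x03", threshold=4))
```

The three calls illustrate a text message, a mixed message with an embedded text run, and a purely binary message.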

(2) Short sequence partitioning

The embodiment of the application uses an n-gram method to segment the message. Because the n-gram method needs no delimiter information, it is applicable to binary, text, and mixed type protocols alike. The n-gram method is an important word segmentation method in natural language processing. Its idea is that the probability of generating the i-th word is determined by the n-1 words generated before it, that is, the word at the i-th position is related only to the n-1 words preceding it. For example, for the message sequence "1001001" with the n-gram parameter n=3, the word segmentation result is: "100", "001", "010", "100" and "001". In this step, basic n-gram word segmentation is applied directly to binary type messages and text type messages. For mixed type messages, the boundaries between binary content and text content are determined first, and the different types of content are then segmented separately. In the word segmentation process of the embodiment of the application, n varies, so that short sequences of different lengths are obtained. The value of n increases step by step from a set minimum value MIN_n to a set maximum value MAX_n, which ensures that complete keywords are contained in the n-gram word segmentation result.
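A minimal sketch of n-gram segmentation with a varying n is given below. The function names are illustrative, and `min_n` and `max_n` stand for MIN_n and MAX_n; the code works on both `str` and `bytes` because it only uses slicing.

```python
def ngrams(message, n):
    """All contiguous length-n windows of the message, in order."""
    return [message[i:i + n] for i in range(len(message) - n + 1)]


def variable_ngrams(message, min_n, max_n):
    """Concatenate the n-gram lists for every n from min_n to max_n."""
    short_sequences = []
    for n in range(min_n, max_n + 1):
        short_sequences.extend(ngrams(message, n))
    return short_sequences


print(ngrams("1001001", 3))
print(variable_ngrams("abc", 1, 2))
```

The first call reproduces the "1001001" example from the text; a message shorter than n simply yields no short sequences for that n.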

(3) Closed frequent item mining

The closed frequent item mining of the embodiment of the application comprises two stages of work of frequent item mining and closed attribute screening.

The first stage is frequent item mining. If a sequence occurs very frequently, the sequence is a frequent item. To extract frequent items, the number of occurrences of each short sequence obtained by n-gram word segmentation is counted first. The frequency of every short sequence is then calculated. In the embodiment of the present application, the frequency is the ratio of the total number of occurrences of a given short sequence to the total number of short sequences. To determine frequent items, a frequency threshold must be set in advance. If the frequency of a short sequence exceeds the set threshold, the short sequence is judged to be a frequent item; otherwise it is not a frequent item.

The second stage is closed attribute screening. The goal of word segmentation is to obtain complete keywords to use as message features. In the n-gram method, the value of n increases step by step from a set minimum value to a set maximum value, so the short sequences obtained differ in length. In a communication protocol, protocol keywords tend to be repeated across messages, and this property of protocol keywords also holds for unknown protocols. To obtain complete keywords, the short sequences are analyzed with respect to the Apriori property. In the field of frequent pattern mining, the Apriori property states that any subsequence of a frequent item is also a frequent item, and that any supersequence of an infrequent item is also an infrequent item. On the basis of the Apriori property, the embodiment of the application introduces a closed attribute to screen the frequent short sequences. Sequence A in a set possesses the closed attribute if and only if no sequence in the set is both a supersequence of A and equal in frequency to A. Each frequent item is checked in turn against the closed attribute and removed if it does not satisfy it, which finally yields the set of all frequent items that satisfy the closed attribute.
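The frequency calculation described above can be sketched as follows. The threshold value in the example is hypothetical, and the function name is illustrative.

```python
from collections import Counter


def frequent_items(short_sequences, threshold):
    """Map each frequent short sequence to its frequency.

    Frequency = occurrences of the sequence / total number of short sequences.
    """
    total = len(short_sequences)
    if total == 0:
        return {}
    counts = Counter(short_sequences)
    return {
        seq: count / total
        for seq, count in counts.items()
        if count / total > threshold
    }


sample = ["GET", "GET", "GET", "POST", "xyz"]
print(frequent_items(sample, threshold=0.25))
```

With this sample only "GET" is frequent, because it accounts for three of the five short sequences while the others account for one each.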
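The closed attribute check can be sketched as below. Two assumptions are made for illustration: "supersequence" is read as "strictly longer sequence that contains A as a contiguous substring", which matches how n-grams are produced, and occurrence counts are compared instead of frequencies so that the equality test is exact.

```python
def closed_frequent_items(counts):
    """Keep the items that no longer item with the same count contains."""
    closed = {}
    for a, count_a in counts.items():
        absorbed = any(
            len(b) > len(a) and a in b and count_b == count_a
            for b, count_b in counts.items()
        )
        if not absorbed:
            closed[a] = count_a
    return closed


counts = {"GE": 5, "ET": 5, "GET": 5, "T ": 7}
print(closed_frequent_items(counts))
```

Here "GE" and "ET" are dropped because "GET" contains them and occurs equally often, while "T " is kept because it has no supersequence in the set.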

(4) Message feature vectorization

The message feature vectorization of the embodiment of the application comprises two stages: construction of a sparse matrix, and dimension reduction using the t-SNE method.

The first stage is sparse matrix construction. Each message is vectorized according to the closed frequent item set and represented as a sequence of 0s and 1s, so that the message sample set forms a sparse matrix. Assuming that the closed frequent item set contains k closed frequent items, a message sequence is represented after vectorization as M = {feature_1, feature_2, ..., feature_k}, where feature_i corresponds to the i-th closed frequent item. If the message contains a given closed frequent item, the corresponding element is set to 1; if the closed frequent item does not occur in the message, the corresponding element is set to 0.

For example, if the closed frequent item set contains only the three short sequences "GET", "POST", and "HTTP/1.1", then the message "GET www.baidu.com HTTP/1.1" may be denoted as (1, 0, 1), because the message contains the closed frequent items "GET" and "HTTP/1.1" but does not contain "POST". The message "POST www.sina.cn HTTP/1.1" may be denoted as (0, 1, 1), because the message contains the closed frequent items "POST" and "HTTP/1.1" but does not contain "GET".
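The vectorization rule can be sketched in a few lines. The function name is illustrative, and the closed frequent items are passed as an ordered list so that each position in the vector has a fixed meaning.

```python
def vectorize(message, closed_items):
    """One element per closed frequent item: 1 if it occurs in the message."""
    return tuple(1 if item in message else 0 for item in closed_items)


closed_items = ["GET", "POST", "HTTP/1.1"]
messages = ["GET www.baidu.com HTTP/1.1", "POST www.sina.cn HTTP/1.1"]
matrix = [vectorize(m, closed_items) for m in messages]
print(matrix)
```

Stacking the vectors of all messages row by row gives the sparse matrix of the message sample set.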

The second stage is dimension reduction using the t-SNE method. The embodiment of the application targets an unknown protocol, for which the specification information is lacking, and the vectorized message data is reduced in dimension for two main reasons. First, a frequency threshold that is set too small results in a data dimension that is too high: more closed frequent items are obtained during closed frequent item mining, and the data dimension after message vectorization is correspondingly higher. Second, in n-gram word segmentation, if the maximum value of n is set too small, a frequent item feature may be split, resulting in redundancy. For example, "User-Agent" is a frequent item feature of length 10. If the maximum value of n is set to 8, this feature is described as three features: "User-Age", "ser-Agen", and "er-Agent". When the message is vectorized, a message with the real frequent item feature "User-Agent" is then represented by three dimensions, and this redundancy needs to be handled by dimension reduction. t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimension reduction algorithm suitable for reducing high-dimensional data to 2 or 3 dimensions. While mapping from the high dimension to the low dimension, t-SNE keeps the probability distribution of similarities between samples as unchanged as possible, so that samples with low similarity are placed farther apart and samples with high similarity are placed closer together. This alleviates the crowding of sample data in the sample space caused by traditional dimension reduction algorithms. t-SNE converts Euclidean distances into conditional probabilities to express the similarity between points.
t-SNE is a well-suited dimension reduction method for network message data of this type. In the embodiment of the application, the input of the t-SNE algorithm is a high-dimensional message vector and the dimension after reduction is set to 2, so the output is a two-dimensional message vector.
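The conversion of Euclidean distances into conditional probabilities can be sketched as below. This illustrates only the first step of t-SNE and uses one fixed bandwidth `sigma` for every point, which is a simplification: the full algorithm chooses a bandwidth per point from a perplexity setting and then optimizes the low-dimensional coordinates. In practice a library implementation such as scikit-learn's `sklearn.manifold.TSNE` with `n_components=2` would typically be used; the application does not name a particular implementation.

```python
import math


def conditional_probabilities(points, sigma=1.0):
    """Row i holds p(j|i): Gaussian similarity of j to i, normalised over j != i."""
    n = len(points)
    table = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # a point is not its own neighbour
                continue
            squared = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weights.append(math.exp(-squared / (2.0 * sigma ** 2)))
        total = sum(weights)
        table.append([w / total if total > 0 else 0.0 for w in weights])
    return table


vectors = [(1, 0, 1), (1, 0, 1), (0, 1, 1)]
for row in conditional_probabilities(vectors):
    print([round(p, 3) for p in row])
```

Messages with identical feature vectors receive a higher conditional probability with respect to each other than with respect to a message that differs in two positions.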

(5) Self-organizing map clustering

The embodiment of the application applies the self-organizing map neural network to the clustering of unknown protocol message vectors. Self-organizing map (SOM) neural networks are an important type of neural network based on an unsupervised learning method. By inputting samples into the self-organizing map neural network, the neural network will find the rules of the input samples and the relationships of the input samples to each other, and adaptively adjust the network according to these input samples, so that the neural network of the final output can find the type relationships of all the input samples. When applied to clustering, the self-organizing map neural network's competitive layer classifies input sample points by finding the optimal set of reference vectors for neurons.

Compared with traditional clustering methods, the self-organizing map clustering method requires little prior knowledge: only the topology of the neuron grid and the number of training iterations need to be set. Because prior knowledge is lacking when clustering unknown messages, this property makes the self-organizing map neural network well suited to the task.

Because two-dimensional data is obtained by the dimension reduction in the previous stage, a self-organizing neural network arranged as a two-dimensional planar array is used as the network model for self-organizing map clustering. When a training sample is input, the network calculates the Euclidean distance between the sample and the reference vector of every neuron. The neuron closest to the input sample is called the best matching unit. Once the best matching unit is determined, its reference vector is updated, and the reference vectors of the neurons in its neighborhood are updated as well. The update weight depends on the distance from the best matching unit: neurons near the best matching unit are updated with a large weight, and neurons far from it are updated with a small weight. This difference in update weights increases the distance between different clusters. After all the network nodes have been updated, the next training sample is input. Once the final network has been obtained through training, the set of samples nearest to each neuron is considered to belong to the same cluster; the clustering result consists of several clusters, each representing one type of message. The clustering of the messages of different types is thus completed.
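The training loop described above can be sketched in a few lines. This is an illustrative sketch only: the names `train_som`, `best_matching_unit` and `assign_clusters`, the Gaussian neighborhood function, the linearly decaying learning rate and radius, and all default values are assumptions and are not specified by the embodiment.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the neuron whose reference vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def train_som(samples, rows=2, cols=2, iterations=300,
              lr0=0.5, radius0=1.0, seed=0):
    """Train a rows x cols SOM on 2-D samples (hypothetical parameters)."""
    rng = random.Random(seed)
    # One 2-D reference vector per neuron of the planar grid.
    weights = {(r, c): [rng.random(), rng.random()]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        decay = 1.0 - t / iterations
        lr = lr0 * decay                      # learning rate shrinks over time
        radius = max(radius0 * decay, 1e-3)   # neighborhood shrinks over time
        x = samples[rng.randrange(len(samples))]
        bmu = best_matching_unit(weights, x)
        for pos, w in weights.items():
            grid_sq = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
            # Neurons near the best matching unit get a large update weight,
            # neurons far from it a small one (assumed Gaussian falloff).
            influence = math.exp(-grid_sq / (2.0 * radius ** 2))
            for k in range(2):
                w[k] += lr * influence * (x[k] - w[k])
    return weights


def assign_clusters(weights, samples):
    """Label each sample with the grid position of its nearest neuron."""
    return [best_matching_unit(weights, x) for x in samples]


# Two well-separated groups of two-dimensional message vectors.
group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
group_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
som = train_som(group_a + group_b, rows=1, cols=2)
labels = assign_clusters(som, group_a + group_b)
```

Each label is the grid position of a neuron, so samples that share a label form one cluster. The grid size plays the role of the neural topology mentioned above, and `iterations` the number of training iterations.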

In summary, the unknown protocol message clustering method based on closed frequent item mining works as follows. In the preprocessing stage, the datagrams of the target protocol are converted into messages, and the messages are then divided into a binary type, a text type and a mixed type according to the maximum number of consecutive printable characters. In the short sequence division stage, the messages are segmented by using an n-gram word segmentation method with a variable value of n to obtain short sequences. In the closed frequent item mining stage, the frequent items are first determined from the word segmentation result, and the frequent items that satisfy the closing attribute are then selected. In the message feature vectorization stage, each message is vectorized according to whether it contains each closed frequent item feature, the dimension of the message vectors is reduced by using the t-sne dimension reduction algorithm, and the message vectors are finally clustered by the self-organizing map clustering method so that message vectors of the same type are grouped together. The method is suitable for network communication protocols whose specifications are unknown. It clusters messages by using the closed frequent items in the protocol messages as features, overcomes the low accuracy of the traditional sequence comparison method when applied to message clustering, and has the advantages of strong universality and high clustering accuracy.
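The segmentation, closed frequent item mining and vectorization stages summarized above can be sketched as follows. This is a simplified illustration under stated assumptions: messages are plain strings, the frequency of a short sequence is its occurrence count divided by the total number of short sequences, a supersequence is taken to mean a containing substring, and the function names (`ngrams`, `closed_frequent_items`, `vectorize`) are hypothetical. Message type classification and the t-sne and self-organizing map stages are omitted.

```python
from collections import Counter


def ngrams(message, n_min, n_max):
    """All contiguous short sequences with length from n_min to n_max."""
    return [message[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(message) - n + 1)]


def closed_frequent_items(messages, n_min, n_max, threshold):
    """Frequent short sequences that have no equally frequent supersequence."""
    counts = Counter(g for m in messages for g in ngrams(m, n_min, n_max))
    total = sum(counts.values())
    # A short sequence is frequent if its frequency exceeds the threshold.
    frequent = {g: c for g, c in counts.items() if c / total > threshold}
    closed = []
    for item, count in frequent.items():
        # Closing attribute: no frequent supersequence has the same count.
        absorbed = any(item != other and item in other
                       and frequent[other] == count
                       for other in frequent)
        if not absorbed:
            closed.append(item)
    return sorted(closed)


def vectorize(message, items):
    """Element is 1 if the closed frequent item occurs in the message, else 0."""
    return [1 if item in message else 0 for item in items]


messages = ["abcX", "abcY", "abZ"]
items = closed_frequent_items(messages, n_min=2, n_max=3, threshold=0.1)
vectors = [vectorize(m, items) for m in messages]
```

In this toy input, "bc" is frequent but is dropped because its supersequence "abc" occurs equally often, while "ab" is kept because it also occurs in "abZ". The closed frequent items are therefore "ab" and "abc", and the three messages map to the vectors [1, 1], [1, 1] and [1, 0].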

Corresponding to the unknown protocol message clustering method based on closed frequent item mining provided in the above embodiment, a specific embodiment of the present application further provides an unknown protocol message clustering system based on closed frequent item mining, comprising: a message capturing module, a short sequence segmentation module, a closed frequent item acquisition module, a message vector generation module and a message vector clustering module.

It will be apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the units described above may refer to the corresponding processes in the foregoing method embodiments, which are not described herein again.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Those having ordinary skill in the art may make many variations without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims

1. An unknown protocol message clustering method based on closed frequent item mining is characterized by comprising the following steps: converting the acquired datagram into a message; dividing the message into short sequences; extracting frequent items in the short sequence according to the occurrence frequency of the short sequence and a set frequency threshold value, and screening the frequent items according to the closing attribute to obtain closed frequent items; based on the closed frequent item, vectorizing the message, and performing dimension reduction on the vector to obtain a dimension-reduced message vector;

for the dimension-reduced message vectors, clustering the message vectors according to the distance between the vectors by using a self-organizing map neural network, so that message vectors of the same type are clustered together;

before dividing the message into short sequences, dividing the messages into three types, namely text type messages, binary type messages and mixed type messages containing both text and binary characters; the method for dividing the message into short sequences comprises the following steps:

for binary type messages and text type messages, word segmentation is performed directly by using the n-gram word segmentation method; for mixed type messages, the contents of different types are first separated at the predetermined boundaries between binary content and text content and then segmented;

when the n-gram word segmentation method is used for word segmentation, the value of n increases from the minimum value to the maximum value;

the datagrams include application layer datagrams transmitted via the TCP protocol and application layer datagrams transmitted via the UDP protocol; converting the acquired datagram into a message specifically includes: for application layer datagrams transmitted via the TCP protocol, separating each new application layer message from the previous application layer message according to the TCP FIN flag and the TCP SYN flag, and reassembling it to obtain a complete application layer message;

for application layer datagrams transmitted via the UDP protocol, the payload of each UDP datagram is regarded as an independent application layer message;

based on closed frequent items, the method for vectorizing the message and performing dimension reduction on the vector to obtain the dimension reduced message vector comprises the following steps:

carrying out vectorization representation on each message based on the closed frequent item set, and setting the corresponding element to be 1 in the vectorization process if the message has a certain closed frequent item; if the closed frequent item does not appear in the message, setting the corresponding element to 0 in the vectorization process; then, reducing the dimension of the message vector by using a t-sne method, and converting the high-dimension message vector into a two-dimension message vector;

the self-organizing map clustering process comprises the following steps: inputting the dimension-reduced message vectors into the self-organizing map neural network, whereby the neural network discovers the regularities of the message vectors and the interrelations between them; in the neural network obtained by clustering, the set of message vectors near each neuron is considered to belong to the same cluster, indicating that those message vectors belong to the same type.

2. The unknown protocol message clustering method based on closed frequent item mining according to claim 1, wherein the process of extracting frequent items in the short sequences according to the occurrence frequency of the short sequences and the set frequency threshold comprises the following steps: counting the total number of occurrences of each short sequence, and taking the ratio of that number to the total number of short sequences as the frequency of the short sequence; a short sequence is a frequent item if its frequency exceeds the set frequency threshold, and otherwise it is not a frequent item.

3. The unknown protocol message clustering method based on closed frequent item mining according to claim 1, wherein screening the frequent items according to the closing attribute specifically comprises:

checking in turn whether each frequent item has the closing attribute, and selecting the frequent items that satisfy the closing attribute to form a closed frequent item set, wherein the method for judging that a sequence A in a set has the closing attribute is as follows: sequence A has the closing attribute if and only if no sequence in the set is both a supersequence of sequence A and has a frequency equal to the frequency of sequence A.

4. An unknown protocol message clustering system based on closed frequent item mining, characterized by comprising: a message capturing module, a short sequence segmentation module, a closed frequent item acquisition module, a message vector generation module and a message vector clustering module;

the message vector clustering module is used for clustering the dimension-reduced message vectors according to the distance between the vectors through the self-organizing map neural network, so that message vectors of the same type are clustered together.

5. A computer readable storage medium storing a computer program, which when executed by a processor performs the steps of the method according to any one of claims 1 to 3.