CN115334179A - Unknown protocol reverse analysis method based on named entity recognition - Google Patents

Info

Publication number
CN115334179A
Authority
CN
China
Prior art keywords
protocol
field
character
label
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210848786.7A
Other languages
Chinese (zh)
Other versions
CN115334179B (en)
Inventor
王俊峰
李晓慧
丛培鑫
张永光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202210848786.7A priority Critical patent/CN115334179B/en
Publication of CN115334179A publication Critical patent/CN115334179A/en
Application granted granted Critical
Publication of CN115334179B publication Critical patent/CN115334179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/03 Protocol definition or specification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/06 Notations for structuring of protocol data, e.g. abstract syntax notation one [ASN.1]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/30 Definitions, standards or architectural aspects of layered protocol stacks
    • H04L69/32 Architecture of open systems interconnection [OSI] 7-layer type protocol stacks, e.g. the interfaces between the data link level and the physical level
    • H04L69/322 Intralayer communication protocols among peer entities or protocol data unit [PDU] definitions
    • H04L69/329 Intralayer communication protocols among peer entities or protocol data unit [PDU] definitions in the application layer [OSI layer 7]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Communication Control (AREA)

Abstract

The invention discloses an unknown protocol reverse analysis method based on named entity recognition. Field named entities are obtained by summarizing the field composition specifications of a large number of known protocols, and protocol sequences are labeled character by character with the BIO labeling method in combination with these named entities. A BiLSTM model extracts features from the contextual semantics of the protocol sequence and outputs the recognized label probabilities to a CRF layer, which uses conditional transition probabilities to constrain the named entity recognition result. The trained agent is then used to predict the specification of an unknown protocol. The method improves the ability of protocol reverse analysis to recover field structure and semantic information, extracts protocol semantic information automatically without manual intervention, improves portability, and can reverse-analyze protocol sequences of different types. It achieves reverse parsing of unknown protocols in binary character form and infers the composition specifications of both known and unknown protocols well.

Description

Unknown protocol reverse analysis method based on named entity recognition
Technical Field
The invention relates to the field of protocol reverse engineering, and in particular to an unknown protocol reverse analysis method based on named entity recognition, which performs reverse inference on unknown network protocols in a computer network.
Background
In recent years, the Internet has become an important carrier supporting society's scientific and technological development, and a growing share of production and daily life depends on it. Because the Internet is readily available and easy to access, the number of people connecting to the Internet and to Web applications through different devices is increasing rapidly. According to the statistical report recently issued by the China Internet Network Information Center (CNNIC), the number of Internet users in China reached 1.032 billion by the end of 2021, and both the number of mobile phone users and the Internet penetration rate have kept growing rapidly for many years. In the same period, usage of personal Web applications rose steadily, and the user base of online applications such as online food delivery grew markedly, at a rate above 10 percent. Instant messaging, short video, and online payment held the top three places by number of users, with the largest accounting for 97.3 percent of all Internet users. As the number of Internet users and the variety and number of connected devices expand, the volume of data carried on the Internet keeps increasing. Even in the industrial Internet, there are over 59 million industry-related applications and over 7000 million connected devices (sets). People's production and daily life depend more and more on the Internet.
This holds both for public networks, whose main users are the general public, and for private networks, whose main users are enterprises and institutions handling confidential information. To ensure that devices and applications operate stably, efficiently, and securely in a network, data transmission must be safe and reliable. Driven by technology, the types and number of intelligent products such as wearable devices and Web applications are increasing rapidly, and these products enrich people's production and daily life. However, the proliferation of such devices and applications has sharply increased traffic and data volume in the network, and the variety and number of network protocols required for data exchange on the Internet have grown with it. Protocols in network traffic fall into two broad categories, known protocols and unknown protocols, according to whether their specifications are published. In the transmitted bit stream, unknown protocols that are difficult to identify and analyze account for over 40% of the traffic data. The rapid growth in the number and types of unknown protocols makes it much harder for network owners to control their own networks.
Because the ways of obtaining protocol data are limited, the data used for protocol reverse engineering currently falls into two categories: offline network traffic and instruction execution traces. The latter costs more time and effort in both data acquisition and manual analysis, since it requires monitoring the instruction sequence and the corresponding context. Methods based on offline network traffic work directly on the raw traffic, need no complicated preprocessing, and are more widely applicable. However, they still suffer from incomplete inference of semantic information, reliance on manual analysis, and reliance on the characteristic characters of text protocols. In addition, network protocols themselves can be classified into two types, text protocols and binary protocols. The former express data more explicitly, while the latter have a single data form.
For parsing known protocols, current tools are relatively mature: common tools such as tshark and tcpdump can output the format fields of more than 2000 known protocols, but they cannot handle unknown protocols whose formats are unpublished. Unknown protocols have unique, undisclosed, self-contained structures, often created by their designers for security reasons. Early reverse engineering of unknown protocols relied entirely on manual analysis, and the time it required was considerable. For example, the open source project Samba took 12 years to produce a specification of the undocumented SMB protocol through manual analysis. Automatic protocol reverse engineering has therefore become a key problem; by analyzing the two kinds of data objects, execution traces and network traffic, it can produce outputs such as a protocol format, a protocol state machine, or a complete protocol specification.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an unknown protocol reverse analysis method based on named entity recognition (abbreviated as NEREUP for convenience), so as to perform reverse inference on the composition, design specification, and semantic information of unknown protocols in network traffic.
The technical scheme adopted by the invention is as follows: an unknown protocol reverse analysis method based on named entity recognition comprises the following steps:
step 1: collecting data containing different application layer protocols, summarizing the field components of the headers of known application layer protocols, identifying the field types that make up existing known application layer protocols, and defining field type named entities;
step 2: capturing network traffic in a real network environment, classifying and selecting different application layer protocols, extracting their binary protocol sequences, and preprocessing the original binary protocol sequences;
step 3: building a character vector conversion dictionary for the preprocessed data and mapping each converted protocol sequence character to a character vector; the resulting One-hot vectors of the original characters are used as the input of a bidirectional long short-term memory network (BiLSTM); the input layer feeds the values into the forward and backward LSTM units, and the features and information of the previous and next time steps are combined and spliced to obtain the output layer vectors, which are output as the hidden state vectors of the characters;
step 4: mapping the hidden state vector of each character to predicted probability values for all label types; inputting the output layer vectors into a conditional random field (CRF), which performs a final probability correction using the label transition probabilities observed in real data to obtain the label probability for each predicted named entity class; training then yields the agent;
step 5: converting the unknown protocol sequence to be reverse-analyzed as in step 2 and step 3, inputting it into the agent, and obtaining the named entity recognition result of the unknown protocol by prediction;
step 6: improving the extended Backus-Naur form (EBNF) to formally describe the network protocol format, field structure, and semantic information; normalizing the named entity recognition result of the unknown protocol using this paradigm, and finally obtaining the protocol specification of the unknown protocol.
Further, the field type named entity definition specifically includes:
1) An index field: the most direct identifying field for distinguishing different protocol sequences within the same application layer protocol; it represents the sequence number of the protocol sequence or the sequence number of the object the protocol acts on;
2) A type field: indicates basic properties of the current protocol and expresses semantics in binary form, so that a single binary character can constitute a field; when transmitting information, a meaning is expressed by one binary bit ('0' or '1') or by two binary bits ('0' to '3'); it contains the state information of the protocol and semantic information that can be conveyed as 'yes' or 'no'; consecutive type fields are merged into a 'flag' field at parsing time and expressed in bytes;
3) A function field: carries information that is relatively intuitive in the data and can take effect directly; in one case, the information to be transmitted is expressed with a limited set of numbers, each number representing a particular type of function; in the other case, the decimal value of the field is itself the important data being transmitted, and the field carries the data using more bytes than the value strictly requires.
Further, the step 2 specifically includes:
step 2.1: selecting PCAP packets of traffic collected from each port and routing node in a secure network; computing statistics over all PCAP packets and counting the application layer protocols from which a data set can be built; each protocol subset is classified by application layer protocol type; the number of application layer protocol types is denoted g and the number of protocols in each Packet is denoted m; all derived Packets, each containing a single protocol, are expressed as:
[Equation (1): image not reproduced]
where [symbol image not reproduced] denotes the d-th protocol of the g-th application layer protocol type;
step 2.2: extracting the original binary protocol sequence of each application layer protocol and converting it to hexadecimal; labeling the hexadecimal sequence character by character with the BIO (Begin, Inside, Outside) scheme, using both protocol header / protocol payload labels and protocol header field named entity labels; the protocol header, which follows the protocol specification, is separated from the protocol payload, which does not, and the composition specification of the protocol header is then reverse-analyzed.
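The hexadecimal conversion in step 2.2 can be illustrated with a minimal sketch. The function names and the sample bytes are hypothetical and chosen only for illustration; the sketch assumes the "binary protocol sequence" is a string of '0'/'1' characters, so that every four binary characters collapse into one hexadecimal character.

```python
def to_binary_string(payload: bytes) -> str:
    """Render raw bytes as a string of '0'/'1' characters."""
    return "".join(f"{byte:08b}" for byte in payload)


def binary_to_hex_chars(bits: str) -> list:
    """Convert a '0'/'1' string into a list of hexadecimal characters.

    Every 4 binary characters become 1 hexadecimal character, so the
    sequence length drops to one quarter of the original.
    """
    if len(bits) % 4 != 0:
        raise ValueError("bit string length must be a multiple of 4")
    return [format(int(bits[i:i + 4], 2), "x") for i in range(0, len(bits), 4)]


# Hypothetical 4-byte header fragment, used only for illustration.
sample = bytes([0x00, 0x2A, 0x81, 0x80])
bits = to_binary_string(sample)
hex_chars = binary_to_hex_chars(bits)
print(len(bits), len(hex_chars))  # 32 binary characters, 8 hexadecimal characters
print("".join(hex_chars))         # 002a8180
```

Each hexadecimal character then becomes one labeling unit for the BIO scheme.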
Further, the step 3 specifically includes:
step 3.1: traversing the data in the data set to generate a character vector conversion dictionary, defining the vector conversion for hexadecimal characters, and predefining other values that may occur;
the character vector conversion dictionary Dic is defined as follows:
[Equation (2): image not reproduced]
where p is the key of a key-value pair representing a hexadecimal character; <UNK> represents any character outside the dictionary, and <PAD> represents the padding character;
step 3.2: after a single character is mapped to a character vector, the resulting One-hot vector of the original character is used as the input value x and fed into the forward LSTM network and the backward LSTM network respectively; for a single LSTM network, the output value $O_t$ at time step t is:
$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (3)
where σ denotes processing by the Sigmoid function in the gate structure; $h_{t-1}$ is the output value at the previous time step, $x_t$ is the input value at the current time step, and $W_o$ and $b_o$ are the weights and bias coefficients, which are determined through experiments;
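As a worked illustration of equation (3), the following minimal sketch evaluates the output gate for one time step in plain Python. The sizes and weight values are toy assumptions chosen for illustration; in practice the weights are learned during training.

```python
import math


def sigmoid(z):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def output_gate(W_o, b_o, h_prev, x_t):
    """Compute O_t = sigmoid(W_o . [h_prev, x_t] + b_o) element by element.

    W_o has one row per hidden unit and len(h_prev) + len(x_t) columns.
    """
    concat = h_prev + x_t  # list concatenation realises [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_o, b_o)
    ]


# Toy sizes and weights, chosen only for illustration.
h_prev = [0.0, 0.0]              # hidden size 2
x_t = [0, 1, 0]                  # one-hot input of size 3
W_o = [[0.1, 0.2, 0.3, 0.4, 0.5],
       [0.0, 0.0, -1.0, 0.0, 1.0]]
b_o = [0.0, 0.0]
O_t = output_gate(W_o, b_o, h_prev, x_t)
print(O_t)
```

Because the input is one-hot, the product with the input part of $W_o$ simply selects one column of weights, which is why the second gate value here is exactly 0.5.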
step 3.3: taking the output value vector obtained by splicing as the hidden state vector of the character; the splicing process is as follows:
the final outputs of the two directions of the bidirectional long short-term memory network are combined and output; each direction has e LSTM structure models, and the final output value h is expressed as:
$h = [h_{fe}, h_{be}]$ (4)
where $h_{fe}$ and $h_{be}$ are the output value of the forward LSTM network structure and the output value of the backward LSTM network structure, respectively;
after processing by the bidirectional long short-term memory network, a probability matrix P giving the value of each label for each character is obtained; the original input sequence is denoted X, and Y denotes the label sequence corresponding to the prediction output by the bidirectional long short-term memory network; both are expressed as:
$X = \{x_1, x_2, \ldots, x_n\}$ (5)
$Y = \{y_1, y_2, \ldots, y_n\}$ (6)
where n is the number of characters in the sequence;
the number of labels in the label set l is denoted k, so the output matrix P has dimension n × k, and the entry in row i and column j is defined as:
$P_{n \times k} = \{P_{ij} : x_i \to l_j \mid 1 \le i \le n,\ 1 \le j \le k\}$ (7)
where $P_{ij}$ is the probability value that character i is predicted as label j, and $l_j$ is the j-th label in the label set;
the mapped probabilities are input to the next layer as non-normalized probabilities.
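The splicing of equation (4) and the mapping to the n × k matrix P of equation (7) can be sketched as follows. This is a minimal illustration under assumptions: the text does not state how hidden vectors are mapped to label scores, so a single linear projection (the hypothetical `W` and `bias`) is assumed here, and all numbers are toy values.

```python
def concat_states(h_forward, h_backward):
    """Join forward and backward hidden states per character: h = [h_f, h_b]."""
    return [f + b for f, b in zip(h_forward, h_backward)]


def emission_scores(hidden, W, bias):
    """Project each hidden vector to k unnormalized label scores (matrix P, n x k)."""
    return [
        [sum(w * h for w, h in zip(row, vec)) + b for row, b in zip(W, bias)]
        for vec in hidden
    ]


# Toy values: n = 2 characters, hidden size 2 per direction, k = 3 labels.
h_f = [[1.0, 0.0], [0.0, 1.0]]
h_b = [[0.5, 0.5], [1.0, 0.0]]
W = [[1.0, 0.0, 0.0, 0.0],   # assumed projection weights, one row per label
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
h = concat_states(h_f, h_b)
P = emission_scores(h, W, bias)
print(P)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
```

The rows of P are left unnormalized, matching the statement that the mapped values are passed on as non-normalized probabilities.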
Further, the step 4 specifically includes:
step 4.1: the prediction result for the i-th character is denoted $\delta_i$ and the probability of transition to label j is denoted $p_{ij}$; the prediction result is expressed as:
$\delta_i = \max(p_{ij}),\quad 1 \le i \le n,\ 1 \le j \le k$ (8)
step 4.2: adding the Start and End identifiers before and after the prediction sequence as columns, adding the new label types Start and End as rows, and establishing the constrained conditional probability transition matrix $A_{(k+2) \times (k+2)}$;
step 4.3: the probability s(X, y) of a possible value y for the input sequence X is computed jointly from $P_{n \times k}$ and $A_{(k+2) \times (k+2)}$:
[Equation (9): image not reproduced]
where [symbol image not reproduced] is the transition probability from label i to label i+1 of adjacent characters under real conditions, and [symbol image not reproduced] is the actual probability that a single character is predicted as label i;
the probability value of each label is normalized to obtain p(y | X):
[Equation (10): image not reproduced]
where $Y_{allPossible}$ is the set of all possible predicted label sequences and [symbol image not reproduced] is one candidate label sequence;
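The symbol definitions in step 4.3 match the widely used BiLSTM-CRF formulation. As a hedged reconstruction under that assumption (not a transcription of the original equation images), the path score and its normalization can be written as follows, with $y_0$ = Start and $y_{n+1}$ = End:

```latex
s(X, y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}

p(y \mid X) = \frac{\exp\bigl(s(X, y)\bigr)}
                   {\sum_{\tilde{y} \in Y_{allPossible}} \exp\bigl(s(X, \tilde{y})\bigr)}
```

Here $A_{y_i, y_{i+1}}$ is the transition term between adjacent labels, $P_{i, y_i}$ is the entry of the matrix P for character i and its label, and $\tilde{y}$ ranges over all candidate label sequences.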
the loss function Loss is defined as the negative log-likelihood, and the output predicted label sequence [symbol image not reproduced] is obtained by a dynamic programming algorithm:
[Equation (11): image not reproduced]
Thus, the output of the bidirectional long short-term memory network is input into a linear-chain conditional random field (CRF) layer to obtain the final prediction sequence y.
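The role of the CRF layer in step 4 can be sketched in plain Python. This is a minimal illustration under assumptions, not the patented implementation: the Start and End rows of the transition matrix are kept as separate hypothetical vectors `start` and `end`, the loss is taken as the negative log-likelihood, the dynamic programming algorithm is assumed to be Viterbi decoding, and the partition sum is brute-forced, which is only feasible for toy inputs.

```python
import math
from itertools import product


def path_score(P, A, start, end, path):
    """s(X, y): transition scores (including Start/End) plus emission scores."""
    score = start[path[0]] + end[path[-1]]
    score += sum(P[i][tag] for i, tag in enumerate(path))
    score += sum(A[a][b] for a, b in zip(path, path[1:]))
    return score


def viterbi(P, A, start, end):
    """Return the highest-scoring label path by dynamic programming."""
    k = len(P[0])
    best = [start[j] + P[0][j] for j in range(k)]
    back = []
    for row in P[1:]:
        prev = [max(range(k), key=lambda a: best[a] + A[a][j]) for j in range(k)]
        best = [best[prev[j]] + A[prev[j]][j] + row[j] for j in range(k)]
        back.append(prev)
    path = [max(range(k), key=lambda j: best[j] + end[j])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]


def neg_log_likelihood(P, A, start, end, gold):
    """Loss = -log p(y|X); brute-force normalization over every label path."""
    k = len(P[0])
    log_z = math.log(sum(
        math.exp(path_score(P, A, start, end, list(y)))
        for y in product(range(k), repeat=len(P))
    ))
    return log_z - path_score(P, A, start, end, gold)


# Toy example with labels 0 = O, 1 = B, 2 = I (hypothetical label ids).
BAD = -1e4                                   # large penalty for forbidden moves
P = [[0.0, 1.0, 2.0], [0.0, 0.5, 2.0], [1.0, 0.0, 0.0]]
A = [[0.0, 0.0, BAD], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # O -> I is forbidden
start = [0.0, 0.0, BAD]                      # a sequence cannot start with I
end = [0.0, 0.0, 0.0]

greedy = [max(range(3), key=lambda j: row[j]) for row in P]
decoded = viterbi(P, A, start, end)
print(greedy)   # [2, 2, 0]: I I O, an invalid label sequence
print(decoded)  # [1, 2, 0]: B I O, valid under the constraints
```

The per-character maximum of equation (8) picks an I label at the first position, while decoding with the transition constraints yields a well-formed B, I, O sequence; this is the kind of unreasonable prediction the CRF layer is meant to rule out.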
Further, the step 6 specifically includes:
step 6.1: expressing the semantic composition of the protocol using the field named entity definitions given in step 1;
step 6.2: describing, in units of bytes, the specific composition of each type of field according to the field composition characteristics of each protocol;
step 6.3: giving the value range of the byte characters; a common formal representation is given for all network protocols, in which the protocol name is replaced by Protocol, and Index, Type, and Function denote the type definitions of the index field, the type field, and the function field respectively; the paradigm is as follows:
Protocol = {F | O} (12)
[Equation (13): image not reproduced]
the braces in the paradigm denote a set of character strings composed of [0, +∞] characters; [symbol image not reproduced] denotes an optional field type;
thus the composition and semantic information of the unknown protocol are formally described.
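How a predicted label sequence might be turned into a field-level description can be sketched as follows. The label names (`B-INDEX`, `I-TYPE`, and so on) and the output notation `ENTITY(nB)` are hypothetical; they stand in for the label tables and for paradigm (13), which are not reproduced in this text.

```python
def extract_fields(hex_chars, tags):
    """Group BIO-tagged hexadecimal characters into (entity, hex_string) fields."""
    fields = []
    current = None  # entity type of the field currently being extended
    for ch, tag in zip(hex_chars, tags):
        if tag.startswith("B-"):
            current = tag[2:]
            fields.append((current, ch))
        elif tag.startswith("I-") and tag[2:] == current:
            entity, text = fields[-1]
            fields[-1] = (entity, text + ch)
        else:
            current = None  # 'O' or an inconsistent 'I-' tag closes the field
    return fields


def describe(fields):
    """Render fields in order as 'ENTITY(nB)', where n is the length in bytes."""
    return " ".join(f"{entity}({len(text) // 2}B)" for entity, text in fields)


# Hypothetical prediction for a 4-byte header followed by 2 payload bytes.
hex_chars = list("002a81050a0b")
tags = ["B-INDEX", "I-INDEX", "I-INDEX", "I-INDEX",
        "B-TYPE", "I-TYPE", "B-FUNCTION", "I-FUNCTION",
        "O", "O", "O", "O"]
fields = extract_fields(hex_chars, tags)
print(fields)            # [('INDEX', '002a'), ('TYPE', '81'), ('FUNCTION', '05')]
print(describe(fields))  # INDEX(2B) TYPE(1B) FUNCTION(1B)
```

Two hexadecimal characters correspond to one byte, which is why field lengths are reported as half the character count.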
Compared with the prior art, the invention has the beneficial effects that:
1) By introducing the named entity recognition method from NLP, the invention raises the degree of automation of unknown protocol reverse analysis; using deep learning, a BiLSTM learns the contextual semantic information between fields in a protocol sequence to extract field structure and semantic information;
2) The prediction probabilities are output to a conditional random field (CRF), which prevents unreasonable predictions within a field and improves the accuracy of the invention;
3) The protocol is processed as a sequence, which removes the boundary between text protocols and binary protocols; the method works without distinguishing protocol types, which widens its range of application.
Drawings
FIG. 1 is a system architecture diagram of the method NEREUP of the present invention.
Fig. 2 is an exemplary diagram of the protocol sequence labeled by the BIO labeling method.
Fig. 3 is a block diagram of a bi-directional LSTM network.
Fig. 4 is a schematic diagram of the operation mode between any two adjacent LSTM units in the BiLSTM model.
Fig. 5 compares the trained agent with other methods when segmenting application layer protocol header and protocol payload data.
Fig. 6 is a schematic diagram of model convergence over training rounds for different hidden layer dimensions, when selecting the training hyper-parameters for application layer protocol header format analysis.
Fig. 7 is a schematic diagram of model convergence over training rounds for different batch sizes, when selecting the training hyper-parameters for application layer protocol header format analysis.
Fig. 8 is a schematic diagram of model convergence over training rounds for different learning rates, when selecting the training hyper-parameters for application layer protocol header format analysis.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The system structure of the method is shown in FIG. 1. It consists of three parts: data preprocessing, agent training based on BiLSTM and CRF, and reverse inference of the format of unknown protocols.
Step 1: A large amount of data containing different application layer protocols is collected, and the protocol header fields are extracted to summarize the header components of known application layer protocols. Manual analysis shows that the field composition logic in a protocol header usually follows a certain ordering. Reading the header sequence from front to back, the fields tend to be arranged global first, local second. The leading fields usually describe the application layer protocol sequence as a whole, including the sequence number, length, and other basic information about the sequence. The following fields then carry, in order, the control information of the data transmission, including content that indicates protocol functions and basic control information. Three field types, INDEX, TYPE, and FUNCTION, are defined to represent the protocol specification. The network protocol field types are defined as follows:
1) The INDEX field (INDEX) is the most direct identifying field for distinguishing different protocol sequences within the same application layer protocol. It usually represents the sequence number of the protocol sequence or the sequence number of the object the protocol acts on. By comparing the index fields of sequences of the same protocol, a protocol sequence can be uniquely identified, or the protocol sequences that handle the same numbered object can be told apart.
2) The TYPE field (TYPE) indicates basic properties of the current protocol. Such fields typically express semantics in binary form, and a single binary character can constitute a field. When transmitting information, a meaning is often expressed by only one binary bit ("0" or "1") or by two binary bits ("0" to "3"). The field therefore contains the state information of the protocol and semantic information that can be conveyed as yes or no. Consecutive TYPE fields are often merged into a "flag" field at parsing time so that they can be expressed in bytes.
3) The FUNCTION field (FUNCTION) mostly appears in the latter half of the protocol header and generally carries information that is relatively intuitive in the data and can take effect directly. This type of field covers two cases. The first is similar to the TYPE field: a limited set of numbers expresses the information to be transmitted, and each number represents a particular type of function. Such fields are mostly expressed in units of bytes; when more bits are allocated than needed, the front part of the field contains many "0" bits and only the last few bits carry the value. This leaves room for protocol updates that need to express more data types. The second case is more direct: the decimal value of the field is itself the important data being transmitted. Here too, the field uses far more bytes than the value requires, which leaves a margin for rapid growth in the data to be transmitted and avoids data that the protocol cannot carry.
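The TYPE and FUNCTION field descriptions above can be made concrete with a small sketch. The flag byte layout and the field values are hypothetical; the sketch only shows how consecutive one-bit and two-bit TYPE fields can share one "flag" byte, and how a FUNCTION field may carry a small value in a wider, zero-padded slot.

```python
def split_flags(flag_byte, widths):
    """Split one byte into consecutive bit fields, most significant bits first.

    widths: bit widths of the individual TYPE fields; they must sum to 8.
    """
    if sum(widths) != 8:
        raise ValueError("bit widths must cover exactly one byte")
    values, shift = [], 8
    for width in widths:
        shift -= width
        values.append((flag_byte >> shift) & ((1 << width) - 1))
    return values


# Hypothetical flag byte 0x85 = 1000 0101, split into 1-, 1-, 2-, and 4-bit fields.
flags = split_flags(0x85, [1, 1, 2, 4])
print(flags)  # [1, 0, 0, 5]

# A FUNCTION field carrying the small code 5 in a 2-byte, zero-padded slot.
function_field = bytes.fromhex("0005")
code = int.from_bytes(function_field, "big")
print(code)  # 5
```

The leading zero byte in the FUNCTION field is the kind of headroom the description refers to: the same slot could later carry values up to 65535 without a format change.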
Step 2: Network traffic is captured in a real network environment, different application layer protocols are classified and selected, their binary protocol sequences are extracted, and the original binary protocol sequences are preprocessed.
PCAP packets of traffic collected from each port and routing node in a secure network are selected. Statistics are computed over all PCAP packets, and the application layer protocols from which a data set can be built are counted. Each protocol subset is classified by application layer protocol type; the number of application layer protocol types is denoted g and the number of protocols in each Packet is denoted m. All finally derived Packets, each containing a single protocol, can be expressed as:
[Equation (1): image not reproduced]
where [symbol image not reproduced] denotes the d-th protocol of the g-th application layer protocol type.
The original binary protocol sequence of each application layer protocol is extracted and converted to hexadecimal, which shortens the sequence to 1/4 of its original length and reduces the size of the data set while preserving the semantic and field structure information. The hexadecimal protocol sequence is then labeled with the BIO (B-begin, I-inside, O-outside) label set, using both protocol header / protocol payload labels and protocol header field named entity labels; the protocol header, which follows the protocol specification, is separated from the protocol payload, which does not, and the composition specification of the protocol header is then reverse-analyzed.
The label sets are shown in Tables 1 and 2. Table 1 gives the protocol header and protocol payload segmentation labels, which separate out the protocol header data that needs to be reverse-parsed. Table 2 gives the labels for the protocol header specification fields, which label the protocol fields according to the composition specification. An example of the labeling is shown in FIG. 2.
Table 1: Protocol header and protocol payload segmentation labels
[Table 1: image not reproduced]
Table 2: Protocol header specification field labels
[Table 2: image not reproduced]
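The character-level BIO labeling of a hexadecimal header can be sketched as follows. The label names (`B-INDEX`, `I-TYPE`, and so on) and the field layout are hypothetical, since the actual label tables are not reproduced in this text; only the B/I/O convention itself is taken from the description.

```python
def bio_label(hex_chars, fields):
    """Assign one BIO tag per hexadecimal character.

    fields: list of (start, end, entity) spans over character positions,
    with end exclusive. Characters outside every span get the tag 'O'.
    """
    tags = ["O"] * len(hex_chars)
    for start, end, entity in fields:
        if not (0 <= start < end <= len(hex_chars)):
            raise ValueError(f"span out of range: {(start, end, entity)}")
        tags[start] = f"B-{entity}"          # first character of the field
        for pos in range(start + 1, end):
            tags[pos] = f"I-{entity}"        # remaining characters of the field
    return tags


# Hypothetical layout: 2-byte index, 1-byte type/flag field, 1-byte function
# field, followed by 2 payload bytes that stay outside every field.
hex_chars = list("002a81050a0b")
fields = [(0, 4, "INDEX"), (4, 6, "TYPE"), (6, 8, "FUNCTION")]
tags = bio_label(hex_chars, fields)
for ch, tag in zip(hex_chars, tags):
    print(ch, tag)
```

Labeled sequences of this shape are what the BiLSTM and CRF layers are trained on, one tag per hexadecimal character.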
Step 3: A character vector conversion dictionary is built for the preprocessed data, mapping each character of the converted protocol sequence to a specific value. The generated One-hot vectors of the original characters are used as the input of a bidirectional long-short term memory network: the input layer feeds the values into the forward and backward long-short term memory network units, and the output-layer vectors, obtained by combining the features and information of the previous and next moments, are spliced and output as the hidden state vector of the character.
The bidirectional long-short term memory network (BiLSTM network) is composed of a forward LSTM group and a backward LSTM group. Its structure is shown schematically in fig. 3. The circles in the figure represent the One-hot vectors of the original hexadecimal characters; before conversion, the data in the data set must be traversed to generate the character vector conversion dictionary.
The character vector conversion dictionary Dic is defined as follows:
Figure BDA0003752464410000083
where p is the key in the key-value pair representation of a hexadecimal character; <UNK> represents any character outside the dictionary, and <PAD> represents the padding character.
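A character vector conversion dictionary of this kind, and the One-hot encoding built on it, might look like the sketch below. The index order, the fixed hexadecimal alphabet (instead of traversing a data set), and the names `build_char_dict`, `one_hot` and `encode` are assumptions made for illustration; the patent's own Dic is given only as a figure.

```python
from __future__ import annotations

HEX_CHARS = "0123456789abcdef"


def build_char_dict() -> dict[str, int]:
    """Map <PAD>, <UNK> and the 16 hex characters to integer indices.

    The index order is an assumption made for this sketch.
    """
    dic = {"<PAD>": 0, "<UNK>": 1}
    for ch in HEX_CHARS:
        dic[ch] = len(dic)
    return dic


def one_hot(index: int, size: int) -> list[int]:
    """Return a vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode(seq: str, dic: dict[str, int], max_len: int) -> list[list[int]]:
    """Convert a hex string to One-hot vectors, truncating or padding to max_len."""
    ids = [dic.get(ch, dic["<UNK>"]) for ch in seq.lower()[:max_len]]
    ids += [dic["<PAD>"]] * (max_len - len(ids))
    return [one_hot(i, len(dic)) for i in ids]


dic = build_char_dict()
vectors = encode("1fz", dic, max_len=5)  # 'z' is outside the dictionary
print([v.index(1) for v in vectors])
```

Characters outside the dictionary fall back to `<UNK>`, and short sequences are filled with `<PAD>`, which is what lets the same model accept data it was not built from.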
The hyper-parameters of the BiLSTM network are set as follows:
1) The mini-batch size (the number of samples input to the network at a time) is 64;
2) The dimension of the hidden layer is set to 300;
3) The update interval of the BiLSTM network is 40 updates;
4) The learning rate is set to 0.001;
5) Gradient clipping is set to 5.
To keep the model portable across different data, a vector conversion is defined for the hexadecimal characters, and any other values that may occur are predefined. After a single character is mapped to a character vector, the value x is input into the forward LSTM network and the backward LSTM network respectively. For a single LSTM network, the output value $O_t$ at time step t is:
$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (3)
In equation (3), σ denotes the Sigmoid function applied in the gate structure, $h_{t-1}$ is the output value at the previous moment, $x_t$ is the input value at the current moment, and $W_o$ and $b_o$ are the weights and bias coefficients, which need to be determined experimentally. After a character has been processed by the forward and backward LSTM networks, the output value vectors are spliced to form the hidden state vector of the character, which is output to the CRF layer. The overall splicing flow is shown in fig. 4. The final outputs of the LSTMs in the two directions are combined and output; each direction contains e LSTM structures, and the final output value h is expressed as:
$h = [h_{fe}, h_{be}]$ (4)
where $h_{fe}$ and $h_{be}$ are the output value of the forward LSTM network structure and the output value of the backward LSTM network structure, respectively.
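Equations (3) and (4) can be traced by hand with a small pure-Python sketch. It covers only the output gate and the splicing step, not a complete LSTM cell, and the function names and toy weights are assumptions chosen so that the result is easy to check.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function used by the gate structure."""
    return 1.0 / (1.0 + math.exp(-z))


def output_gate(W_o, b_o, h_prev, x_t):
    """Equation (3): O_t = sigmoid(W_o . [h_{t-1}, x_t] + b_o), one row per unit."""
    concat = list(h_prev) + list(x_t)  # the concatenation [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_o, b_o)
    ]


def splice(h_fe, h_be):
    """Equation (4): h = [h_fe, h_be], forward output followed by backward output."""
    return list(h_fe) + list(h_be)


h_prev = [0.0, 0.0]          # previous output, hidden size 2 (toy value)
x_t = [0.0, 1.0, 0.0]        # One-hot input of size 3 (toy value)
W_o = [[0.0] * 5, [0.0] * 5]  # zero weights, so every gate value is sigmoid(0)
b_o = [0.0, 0.0]
o_t = output_gate(W_o, b_o, h_prev, x_t)
h = splice([0.1, 0.2], [0.3, 0.4])
print(o_t, h)
```

With a hidden size of 300 per direction, as in the hyper-parameter list, the spliced vector h passed to the CRF layer would have 600 components.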
Specifically, after processing by the bidirectional long-short term memory model, the probability matrix P of each label for the current character is obtained. Let X be the original input sequence and Y the label sequence corresponding to the prediction result output by the bidirectional LSTM model. They can be expressed as:
$X = \{x_1, x_2, \ldots, x_n\}$ (5)
$Y = \{y_1, y_2, \ldots, y_n\}$ (6)
where n is the number of characters in the sequence. With k denoting the number of labels in the label set and l denoting a label, the output matrix P has dimension n × k, and the entry in row i and column j is defined as:
$P_{n \times k} = \{P_{ij} : x_i \to l_j \mid 1 \le i \le n,\ 1 \le j \le k\}$ (7)
where $P_{ij}$ is the probability value that character i is predicted as label j, and $l_j$ is the j-th label in the label set.
The mapped probability is input to the next layer as a non-normalized probability.
Step 4: The predicted probability values are matched to all label types through the hidden state vector of the character; the output-layer vector is input into a conditional random field (CRF), and a final probability correction is performed using the transition probabilities of the characters in the real-time data to obtain the predicted label probability for each named entity class, which completes the training of the agent.
In general, once the probability matrix P is obtained, selecting for each character the label type with the maximum probability as its prediction already gives good results. Let the prediction result of the i-th character be $\delta_i$ and the probability of its transition to label j be $p_{ij}$; the prediction can then be expressed as:
$\delta_i = \max(p_{ij}), \quad 1 \le i \le n,\ 1 \le j \le k$ (8)
However, because fields of the same type must be labeled contiguously, a conditional random field is added to correct $P_{n \times k}$. First, Start and End marks are added before and after the prediction sequence as columns, and the new Start and End label types are added as rows, to build the constrained conditional probability transition matrix $A_{(k+2) \times (k+2)}$. Then the probability s(X, y) of a possible label sequence y for the sequence X is computed jointly from $P_{n \times k}$ and $A_{(k+2) \times (k+2)}$:
Figure BDA0003752464410000101
where
Figure BDA0003752464410000102
represents the transition probability from label i to label i+1 between adjacent labeled characters under real conditions, and
Figure BDA0003752464410000103
represents the actual probability that a single character is predicted as label i.
The probability value of each label is normalized to give p(y|X):
Figure BDA0003752464410000104
where $Y_{allPossible}$ denotes the set of all possible predicted label sequences, and
Figure BDA0003752464410000105
denotes a possible predicted label sequence.
The loss function Loss is defined as the negative log-likelihood, and the output predicted label sequence is obtained by a dynamic programming algorithm:
Figure BDA0003752464410000107
Figure BDA0003752464410000106
Thus the output of the bidirectional long-short term memory network is fed into a linear-chain conditional random field (CRF) layer, which adds constraints to the output label sequence. Building on the output sequence Y, the final result also satisfies the internal transition relations between labels, unreasonable predicted sequence combinations are discarded, and the final prediction sequence Y is obtained.
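The CRF correction can be illustrated with the standard linear-chain formulation, which is assumed here to match the equations shown as figures above: the sequence score sums emission scores from P and transition scores from A (including Start and End), the loss is the negative log-likelihood, and decoding uses the Viterbi dynamic programming algorithm. The two labels, all numeric values and all function names are hypothetical.

```python
from __future__ import annotations

import math


def logsumexp(values: list[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(P, A, y):
    """s(X, y): emission scores plus transition scores, including Start and End."""
    k = len(P[0])
    start, end = k, k + 1  # indices of the two extra labels in A
    score = A[start][y[0]] + A[y[-1]][end]
    score += sum(P[i][y[i]] for i in range(len(y)))
    score += sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return score


def log_partition(P, A):
    """Forward algorithm: log of the sum of exp(s(X, y')) over all sequences y'."""
    k = len(P[0])
    start, end = k, k + 1
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [P[i][j] + logsumexp([alpha[q] + A[q][j] for q in range(k)])
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][end] for j in range(k)])


def neg_log_likelihood(P, A, y):
    """Loss = -log p(y | X) = log Z - s(X, y)."""
    return log_partition(P, A) - path_score(P, A, y)


def viterbi(P, A):
    """Dynamic programming search for the highest-scoring label sequence."""
    k = len(P[0])
    start, end = k, k + 1
    best = [A[start][j] + P[0][j] for j in range(k)]
    back = []
    for i in range(1, len(P)):
        pointers, nxt = [], []
        for j in range(k):
            q = max(range(k), key=lambda r: best[r] + A[r][j])
            pointers.append(q)
            nxt.append(best[q] + A[q][j] + P[i][j])
        back.append(pointers)
        best = nxt
    path = [max(range(k), key=lambda j: best[j] + A[j][end])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Two hypothetical labels: 0 = "B", 1 = "I". A sequence may not start with "I".
P = [[0.2, 1.0], [0.1, 0.9], [0.9, 0.3]]
A = [
    [0.0, 0.5, 0.0, 0.0],   # from B
    [0.0, 0.5, 0.0, 0.0],   # from I
    [0.0, -1e4, 0.0, 0.0],  # from Start: Start -> I is effectively forbidden
    [0.0, 0.0, 0.0, 0.0],   # from End (unused)
]
greedy = [max(range(len(row)), key=row.__getitem__) for row in P]
best = viterbi(P, A)
loss = neg_log_likelihood(P, A, best)
print("greedy :", greedy)  # per-character maximum, as in equation (8)
print("viterbi:", best)    # sequence-level decision with constraints
```

In this toy case the per-character maximum of equation (8) gives `[1, 1, 0]`, which starts with an "I" label, while the constrained decoding returns `[0, 1, 0]`. This is the kind of unreasonable combination the CRF layer is meant to discard.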
Step 5: The unknown protocol sequence to be reversely analyzed is converted as in step 2 and step 3 and input into the agent, and the named entity recognition result of the unknown protocol is obtained through prediction.
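Once the agent has predicted one BIO tag per hexadecimal character, the tags still have to be grouped into fields. The sketch below shows one way to do this; the function name `bio_to_fields` and the tag names built from Index, Type and Function are assumptions, since the real label set is given only in Table 2.

```python
from __future__ import annotations


def bio_to_fields(hex_chars: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group per-character BIO tags into (field_type, hex_string) spans.

    An "I-" tag whose type differs from the open span starts a new span,
    which keeps the decoder tolerant of imperfect predictions.
    """
    fields = []
    current_type, current_chars = None, []
    for ch, tag in zip(hex_chars, tags):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and label != current_type)
        if prefix == "O" or starts_new:
            if current_type is not None:
                fields.append((current_type, "".join(current_chars)))
            if prefix == "O":
                current_type, current_chars = None, []
            else:
                current_type, current_chars = label, [ch]
        else:
            current_chars.append(ch)
    if current_type is not None:
        fields.append((current_type, "".join(current_chars)))
    return fields


predicted_tags = [
    "B-Index", "I-Index", "I-Index", "I-Index",
    "B-Type", "I-Type",
    "B-Function", "I-Function",
]
fields = bio_to_fields(list("0001a1ff"), predicted_tags)
print(fields)
```

Each resulting span pairs a field type with the hexadecimal characters it covers, which is the input a later formal description of the protocol would need.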
The protocol header and the payload information it carries must first be separated so that the protocol header data can be extracted for subsequent protocol format inference. Once the protocol header data has been obtained, reverse format inference is performed to achieve the final parsing goal. A data set divided into application layer protocol headers and load information was selected, every hexadecimal character of the 2101395 pieces of labeled protocol data was labeled, and the established model was then used for training and prediction. The results are shown in fig. 5. In the figure, the horizontal axis shows four evaluation indexes, the vertical axis shows the calculated percentage, and each horizontal-axis value contains experimental data for several methods. All evaluation indexes of the method on this data set exceed 95%, ahead of the other classical models. To show the effect of applying named entity recognition to protocol sequence data, an accuracy index is additionally introduced to evaluate the experimental results. Accuracy is defined as follows:
Figure BDA0003752464410000111
Unlike the other evaluation indexes, accuracy is computed not over the regions identified as named entities but over the predicted label type of each character; that is, the predicted label sequence is compared with the correctly labeled sequence to obtain the final ratio. As the figure shows, the accuracy values in the results of the invention are all higher than the other evaluation indexes. This indicates that the named entity label sequence finally obtained is very close to the correct label sequence, so that protocol header data and protocol load information are accurately separated.
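The character-level accuracy described above can be sketched as the share of positions where the predicted label equals the correct label. The exact formula appears only as a figure, so this reading, the function name and the tag names are assumptions.

```python
from __future__ import annotations


def char_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of characters whose predicted label equals the correct label."""
    if len(predicted) != len(gold):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


gold = ["B-HEAD", "I-HEAD", "I-HEAD", "B-LOAD", "I-LOAD"]
pred = ["B-HEAD", "I-HEAD", "B-LOAD", "B-LOAD", "I-LOAD"]  # one wrong label
acc = char_accuracy(pred, gold)
print(acc)
```

Because a single wrong character costs only one position here, while entity-level metrics discard the whole span, accuracy computed this way tends to be the highest of the reported indexes.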
Next, after the field sequence of the protocol header has been separated from the payload information carried in the protocol, the format of the fields contained in the protocol header must be reversely inferred. The 2119796 pieces of protocol header data produced during data preprocessing were selected as data, and field type named entity recognition training and prediction were carried out on the protocol header sequences. To find the optimal hyper-parameters for training, several groups of experiments were designed in which a single variable was changed at a time; the results are shown in figs. 6, 7 and 8. After the optimal hyper-parameters were selected, model training and prediction were performed on the protocol header data, and the final evaluation index values are shown in Table 3. Compared with various other models, the invention performs better on unknown protocol parsing. Its high accuracy shows that it identifies and divides the field structure well across named entities of the various field types, and that the format and field semantic information of a complete application layer protocol header message sequence can be reversely inferred from the division of field types.
TABLE 3 protocol header field Format inference results
Figure BDA0003752464410000112
Step 6: The extended Backus-Naur form is improved to formally describe the network protocol format, field structure and semantic information; the result of named entity recognition on the unknown protocol is standardized in the form of the paradigm, finally yielding the protocol specification and conventions of the unknown protocol.
Natural language is not standardized enough to describe strict protocol specifications. The extended Backus-Naur form, a standardized notation for describing programming languages, is therefore modified to express protocol specifications and is defined here. First, the overall composition of the protocol is expressed semantically using the three field types given in step 1. Next, the specific form of each field type is described in units of bytes according to the field composition characteristics of each protocol. Finally, the possible values of the byte characters are given. To ensure that the improved paradigm can fully describe the field and semantic information, a generic formal representation is given for all network protocols. The protocol name is replaced by "Protocol", and Index, Type and Function represent the field type definitions proposed in step 1. The paradigm is as follows:
Protocol={F|O} (13)
Figure BDA0003752464410000121
Unlike in mathematical formulas, the braces in the paradigm denote a set of strings consisting of [0, +∞] characters. In particular,
Figure BDA0003752464410000122
denotes an optional field type, that is, a field area that may be added or removed as the protocol type changes within the same kind of protocol. In this way, the composition and semantic information of the unknown protocol are formally described.

Claims (6)

1. An unknown protocol reverse analysis method based on named entity recognition is characterized by comprising the following steps:
step 1: collecting data containing different application layer protocols, summarizing the field components of the headers of known application layer protocols, identifying the types of component fields of existing known application layer protocols, and defining field named entities;
step 2: capturing network traffic in a real network environment, classifying and selecting different application layer protocols, extracting the binary protocol sequences of the protocols, and performing data preprocessing on the original binary protocol sequences;
step 3: building a character vector conversion dictionary for the preprocessed data, mapping the characters of the converted protocol sequence to character vectors, taking the generated One-hot vectors of the original characters as the input of a bidirectional long-short term memory network, the input layer feeding the values into the forward and backward long-short term memory network units, splicing the output-layer vectors obtained by combining the features and information of the previous and next moments, and outputting them as the hidden state vectors of the characters;
step 4: matching the predicted probability values to all label types through the hidden state vector of the character; inputting the output-layer vector into a conditional random field (CRF), performing a final probability correction using the transition probabilities of the characters in the real-time data to obtain the predicted label probability for each named entity class, and completing the training of the agent;
step 5: converting the unknown protocol sequence to be reversely analyzed as in step 2 and step 3, inputting it into the agent, and obtaining the named entity recognition result of the unknown protocol through prediction;
step 6: improving the extended Backus-Naur form to formally describe the network protocol format, field structure and semantic information, standardizing the result of named entity recognition on the unknown protocol in the form of the paradigm, and finally obtaining the protocol specification and conventions of the unknown protocol.
2. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 1, wherein said field named entity definition specifically comprises:
1) An index field: the most direct identification field for distinguishing different protocol sequences within the same application layer protocol, representing the serial number of the protocol sequence or the serial number of the object acted on by the protocol;
2) A type field: indicates basic properties of the current protocol and expresses semantics in binary form, a field being formed by a one-bit binary character; when information is transmitted, the meaning is expressed by only one binary bit ('0' or '1') or by two binary bits ('0' to '3'); it contains state information of the protocol and semantic information that can be conveyed as yes or no; during parsing, consecutive type fields are merged into a 'flag' field and expressed in byte form;
3) A function field: carries information that is relatively intuitive and takes effect directly in the data; in one form, the information to be transmitted is expressed in a limited number of bits, each bit alone representing a particular type of function; in the other form, the decimal value of the field itself is the important data transmitted by the field, and more bytes than the required number of bits are used to carry the data.
3. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 1, wherein the step 2 is specifically:
step 2.1: selecting a PCAP data packet of flow collected from each port and a routing node in a secure network; performing data statistics on all PCAP packets, and counting the number of application layer protocols which can be made into a data set; each protocol subset is classified according to the type of the application layer protocol, the number of the types of all the application layer protocols is represented by g, the number of the protocols in each Packet is represented by m, and all the derived data Packet packets containing a single protocol are represented as follows:
Figure FDA0003752464400000021
wherein ,
Figure FDA0003752464400000022
represents the d protocol in the g application layer protocol;
step 2.2: extracting the original binary protocol sequence of each application layer protocol, converting it into hexadecimal, and performing BIO (begin, inside, outside) labeling on the converted hexadecimal sequence character by character, using protocol header and protocol load labeling and protocol header field named entity labeling; the protocol header, which has a protocol specification, is segmented from the protocol load, which has none, and the composition specification of the protocol header is then reversely analyzed.
4. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 1, wherein said step 3 is specifically:
step 3.1: traversing the data in the data set to generate a character vector conversion dictionary, carrying out vector conversion definition on hexadecimal characters, and predefining other values which may occur;
the character vector conversion dictionary Dic is defined as follows:
Figure FDA0003752464400000023
where p is the key in the key-value pair representation of a hexadecimal character; <UNK> represents any character outside the dictionary, and <PAD> represents the padding character;
step 3.2: after a single character is mapped to a character vector, the generated One-hot vector of the original character is used as the input value x and is input into a forward LSTM network and a backward LSTM network respectively; for a single LSTM network, the output value $O_t$ at time step t is:
$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (3)
where σ denotes the Sigmoid function applied in the gate structure; $h_{t-1}$ is the output value at the previous moment, $x_t$ is the input value at the current moment, and $W_o$ and $b_o$ are the weights and bias coefficients, determined experimentally;
step 3.3: taking the spliced output value vector as the hidden state vector of the character, the overall splicing process being as follows:
the final outputs of the bidirectional long-short term memory network are combined and output, each direction having e LSTM structures, and the final output value h is expressed as:
$h = [h_{fe}, h_{be}]$ (4)
where $h_{fe}$ and $h_{be}$ are the output value of the forward LSTM network structure and the output value of the backward LSTM network structure, respectively;
after processing by the bidirectional long-short term memory network, the probability matrix P of each label for the current character is obtained; the original input sequence is denoted X, and Y denotes the label sequence corresponding to the prediction result output by the bidirectional long-short term memory network; both are expressed as:
$X = \{x_1, x_2, \ldots, x_n\}$ (5)
$Y = \{y_1, y_2, \ldots, y_n\}$ (6)
where n is the number of characters in the sequence;
with k denoting the number of labels in the label set and l denoting a label, the output matrix P has dimension n × k, and the entry in row i and column j is defined as:
$P_{n \times k} = \{P_{ij} : x_i \to l_j \mid 1 \le i \le n,\ 1 \le j \le k\}$ (7)
where $P_{ij}$ is the probability value that character i is predicted as label j; $l_j$ is the j-th label in the label set; the mapped probability is input to the next layer as a non-normalized probability.
5. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 4, wherein said step 4 is specifically:
step 4.1: letting the prediction result of the i-th character be $\delta_i$ and the probability of its transition to label j be $p_{ij}$, the prediction result is expressed as:
$\delta_i = \max(p_{ij}), \quad 1 \le i \le n,\ 1 \le j \le k$ (8)
step 4.2: adding Start and End marks before and after the prediction sequence as columns, adding the new Start and End label types as rows, and building the constrained conditional probability transition matrix $A_{(k+2) \times (k+2)}$;
step 4.3: the probability s(X, y) of a possible label sequence y for the input sequence X is computed jointly from $P_{n \times k}$ and $A_{(k+2) \times (k+2)}$:
Figure FDA0003752464400000031
where
Figure FDA0003752464400000032
represents the transition probability from label i to label i+1 between adjacent labeled characters under real conditions;
Figure FDA0003752464400000033
represents the actual probability that a single character is predicted as label i;
normalizing the probability value of each label, and outputting p(y|X) using Softmax:
Figure FDA0003752464400000041
where $Y_{allPossible}$ denotes the set of all possible predicted label sequences;
Figure FDA0003752464400000042
denotes a possible predicted label sequence;
the loss function Loss is defined as the negative log-likelihood, and the output predicted label sequence is obtained by a dynamic programming algorithm:
Figure FDA0003752464400000043
Figure FDA0003752464400000044
Thus the output of the bidirectional long-short term memory network is input into a linear-chain conditional random field (CRF) layer to obtain the final prediction sequence y.
6. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 1, wherein said step 6 is specifically:
step 6.1: expressing the overall composition of the protocol semantically using the field named entity definitions given in step 1;
step 6.2: describing the specific form of each field type in units of bytes according to the field composition characteristics of each protocol;
step 6.3: giving the possible values of the byte characters; for all network protocols, a generic formal representation is given in which the protocol name is replaced by Protocol, and Index, Type and Function represent the type definitions of the index field, the type field and the function field respectively; the paradigm is as follows:
Protocol={F|O} (12)
Figure FDA0003752464400000045
the braces in the paradigm denote a set of strings consisting of [0, +∞] characters;
Figure FDA0003752464400000046
denotes an optional field type;
in this way, the composition and semantic information of the unknown protocol are formally described.
CN202210848786.7A 2022-07-19 2022-07-19 Unknown protocol reverse analysis method based on named entity recognition Active CN115334179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210848786.7A CN115334179B (en) 2022-07-19 2022-07-19 Unknown protocol reverse analysis method based on named entity recognition

Publications (2)

Publication Number Publication Date
CN115334179A true CN115334179A (en) 2022-11-11
CN115334179B CN115334179B (en) 2023-09-01

Family

ID=83917003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210848786.7A Active CN115334179B (en) 2022-07-19 2022-07-19 Unknown protocol reverse analysis method based on named entity recognition

Country Status (1)

Country Link
CN (1) CN115334179B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080140589A1 (en) * 2006-12-06 2008-06-12 Microsoft Corporation Active learning framework for automatic field extraction from network traffic
CN102523167A (en) * 2011-12-23 2012-06-27 中山大学 Optimal segmentation method of unknown application layer protocol message format
US20120210426A1 (en) * 2009-10-30 2012-08-16 Sun Yat-Sen University Analysis system for unknown application layer protocols
CN104468262A (en) * 2014-11-17 2015-03-25 中国科学院信息工程研究所 Network protocol recognition method and system based on semantic sensitivity
KR101950374B1 (en) * 2018-05-25 2019-02-20 뉴브로드테크놀러지(주) Non-standard protocol reverse engineering analysis apparatus
CN109951464A (en) * 2019-03-07 2019-06-28 西安电子科技大学 The sequence of message clustering method of unknown binary system proprietary protocol
CN110049023A (en) * 2019-03-29 2019-07-23 中国空间技术研究院 A kind of reverse recognition methods of unknown protocol based on machine learning and system
KR20200093455A (en) * 2019-01-28 2020-08-05 건국대학교 산학협력단 Method and system for network protocol state inference through reverse engineering
CN112702235A (en) * 2020-12-21 2021-04-23 中国人民解放军陆军炮兵防空兵学院 Method for automatically and reversely analyzing unknown protocol
CN114553983A (en) * 2022-03-03 2022-05-27 沈阳化工大学 Deep learning-based high-efficiency industrial control protocol analysis method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
R. JI, H. LI AND C. TANG: "Extracting Keywords of UAVs Wireless Communication Protocols Based on Association Rules Learning", 《2016 12TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS)》 *
丛培鑫等: "基于生物信息的未知二进制协议聚类方法", 《四川大学学报(自然科学版)》, vol. 59, no. 03 *
张路煜等: "基于卷积神经网络的未知协议识别方法", 《微电子学与计算机》, vol. 35, no. 07 *
张钊;温巧燕;唐文;: "协议规范挖掘研究综述", 计算机工程与应用, no. 09 *
白彬彬: "网络未知协议语法的智能逆向分析方法研究", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》 *
陈曼等: "基于网络数据流的未知密码协议逆向分析", 《信息安全与通信保密》, no. 06 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115883398A (en) * 2022-11-25 2023-03-31 电子科技大学 Reverse analysis method and device for proprietary network protocol format and state
CN115883398B (en) * 2022-11-25 2024-03-22 电子科技大学 Reverse analysis method and device for private network protocol format and state

Also Published As

Publication number Publication date
CN115334179B (en) 2023-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant