CN113965631B - SECS2 data packet identification method for lost HSMS header information

Info

Publication number: CN113965631B
Application number: CN202111274024.2A
Authority: CN
Inventors: 吴承荣; 伍鹏; 唐璇; 张志华; 蔡骏飞
Original assignee: Fudan University; Semiconductor Manufacturing International Shanghai Corp
Current assignee: Fudan University; Semiconductor Manufacturing International Shanghai Corp
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2023-10-13
Anticipated expiration: 2041-10-29
Also published as: CN113965631A

Abstract

The invention belongs to the technical field of communication networks, and particularly relates to a method for identifying SECS2 data packets whose HSMS header information has been lost. The method comprises the following steps: establishing a HashMap to store session link state information; reading an unknown data packet and looking up its session information by the packet's five-tuple; if the packet cannot be judged directly from the session information, entering the main judging process, which first detects whether the data packet has an HSMS header and, if it does not, searches for an enumeration value to use as an entry point and judges whether the data following that value conforms to the characteristics of SECS2 data; after the data packet has been scanned, making the final judgment by combining two dimensions, the proportion of conforming data and the weight; and updating the session information with the result to simplify the next judgment. The invention ensures the accuracy and efficiency of identification; the session state manager, data identifier, and comprehensive evaluator implemented on the basis of the method cover the protocol identification function.

Description

SECS2 data packet identification method for lost HSMS header information

Technical Field

The invention belongs to the technical field of communication networks, and particularly relates to a method for identifying SECS2 data packets whose HSMS header information has been lost.

Background

With the development of network technology, Internet service types have become increasingly diverse. In conventional traffic identification, data traffic is identified according to the packet header format, so when header information is lost, the efficiency of detecting the traffic type drops sharply. Accurately identifying each service type in this situation, that is, identifying each type of network traffic, is a focus of both academic network research and operational deployment.

Network traffic is an important carrier that records and reflects network and user activity, and network traffic identification can be used for evaluating the network situation, analyzing application development, fine-grained operation, and so on. For application layer protocols without fixed TCP ports, the application layer header is typically located at the beginning of the connection or interaction session, and the most distinctive features of the protocol are found in that header, for example the HTTP operation instructions (GET, POST) or the SMTP instructions (EHLO, MAIL FROM, RCPT TO, etc.). While the data payload of an application layer protocol is being transmitted, no obvious protocol features are present; that is, a randomly intercepted data packet is unlikely to show obvious protocol features, which reduces the efficiency and accuracy of traditional traffic identification. At present, traffic identification technology at home and abroad has relatively mature theoretical support, and mainly includes DPI-based identification methods, DFI-based identification methods, data-mining-based identification methods, and the like.

Deep packet inspection (DPI) extends conventional IP packet inspection (the inspection and analysis of packet elements between OSI L2 and L4) with application protocol identification, packet content inspection, and deep decoding of application layer data. Technically, DPI can be divided into three types: signature-based identification, application layer gateway identification, and behavior pattern identification. Different applications typically rely on different protocols, each of which has its own fingerprint, which may be a specific port, a specific string, or a specific bit sequence; signature-based identification determines the application carried by a traffic flow by detecting this fingerprint information in specific data messages of the flow. In some services the control flow and the service flow are separate, and the service flow carries no distinguishing features. In this case application layer gateway identification is needed: the gateway first identifies the control flow, parses it according to its protocol, and identifies the corresponding service flow from the protocol content. Behavior pattern identification determines the action that a user is performing or is about to perform by analyzing behavior the terminal has already carried out, and is typically used for traffic that cannot be judged from the protocol. For example, SPAM traffic and normal Email traffic are completely consistent in terms of Email content, and SPAM traffic can be identified accurately only by analyzing user behavior. The three techniques identify different types of protocols and cannot replace one another. When a DPI system is deployed, a layered DPI solution that applies all three techniques together is adopted to optimize detection efficiency and flexibility.

The main principle of deep flow inspection (DFI) is to build a machine learning classification model from a large number of traffic statistics and use it to classify network traffic. The method only needs the TCP/IP headers to calculate statistical features such as average packet size, flow duration, total packet count, packet inter-arrival time, and the number of TCP flags set. These features mostly describe the macroscopic behavior of a flow, and no application layer payload needs to be extracted, so identification is fast; this makes DFI one of the current research hotspots in academia. The advantage of DFI is that it does not require application layer payload features, can identify both encrypted and unencrypted traffic, and is applicable to any traffic; the disadvantage is that it requires a large number of labeled samples for training.

Data mining is the process of extracting valid, novel, potentially useful, and ultimately understandable patterns from large volumes of data. Data mining is also known as knowledge discovery in databases (KDD), although some regard it as only one essential step in the knowledge discovery process. The knowledge discovery process has the following steps:

(1) data cleaning; (2) data integration; (3) data selection; (4) data transformation; (5) data mining; (6) pattern evaluation; (7) knowledge representation.

Data mining has now been applied in many fields. Its analysis methods can be divided into two categories: direct data mining and indirect data mining. In direct data mining, the goal is to build a model from the available data that describes one particular variable of the remaining data. In indirect data mining, no specific variable is selected to be described by a model; instead, relationships are established among all the variables. Common data mining methods and types are shown in the following table.

Common data mining methods

Classification algorithms in data mining generally fall into three types: supervised, unsupervised, and semi-supervised learning algorithms. In research on supervised learning algorithms, one approach uses the duration of a flow and the average number of bytes per message as features for traffic classification, and proposes these features as a standard for classifying network applications; another applies a Bayesian classification method to network traffic, using manually classified network data as the input to supervised naive Bayes estimation.

The data-mining-based application identification method automatically extracts application features from application sessions and then identifies the application by feature matching. The most important part is extracting the application features, and the way these features appear in a session is the basis for designing the feature extraction algorithm. Application feature extraction means extracting from application layer data the set of all features that can represent a given application. During communication, application features generally have high frequency and strong correlation, and their offsets in a session (Message Offset, MO, and Byte Offset, BO) are relatively fixed. On this basis, an automatic feature extraction algorithm (SS-selec) has been proposed. It improves the classical association rule discovery algorithm Apriori so that it is suitable for extracting the set of frequent session fragments in application sessions; the feature set representing a given application is then obtained by screening with appropriate filtering rules.

There are several methods for extracting network application features: (1) finding application features by consulting the application layer protocol specification; (2) analyzing and counting application layer data captured on the network with tools such as wireshark and tcpdump to obtain application features; (3) designing a feature extraction algorithm. In the third method, the characteristics that application features exhibit at the session level are first ascertained, and captured traffic of a single application is used as the training sample set for that application. The traffic is then divided into sessions according to the two-tuple (sourceIP, destinationIP) in the packet header and the TCP flags that mark session establishment and termination (SYN/ACK/RST/FIN), and each session is stored in a data file in chronological order, which completes the reassembly of the message data. Automatic session-based extraction of application features is carried out on the reassembled session messages.

The application identification model consists of a training process and an identification process. The training process mainly comprises data preprocessing and feature extraction: preprocessing mainly performs data reassembly, and the SS-selec algorithm automatically extracts features from the set of application sessions. The identification process comprises application layer data reassembly and application identification. After a message arrives, the message contents of the same session are stored in the same cache in chronological order, and the cache content is treated as a piece of ordinary text on which feature-matching-based application identification is performed. The output is the application type of the session to which the packet belongs.

For application layer protocols without fixed TCP ports, traffic whose header information has been lost cannot be identified. The application layer header is typically located at the beginning of the connection or interaction session, and the most distinctive features of the protocol are found in that header, for example the HTTP operation instructions (GET, POST) or the SMTP instructions (EHLO, MAIL FROM, RCPT TO, etc.). When the data payload of an application layer protocol is being transferred, the obvious protocol features are no longer present; for example, an FTP data connection, or HTTP during a file download or upload, transfers the file content directly with no protocol features. In data captured by bypass interception, the data packet containing the application layer protocol header is often missing for various reasons, and in this case it is impossible to judge which protocol a data packet belongs to, which reduces the efficiency and accuracy of traditional traffic identification. For SECS2 data packets, port-based identification cannot handle dynamically changing communication ports, while identifying the HSMS header first and then the SECS2 data packets is not applicable to SECS2 data packets whose HSMS header is missing.

How to combine network traffic identification technology to design a method that accurately identifies the SECS2 industrial control protocol when the HSMS header is lost is a problem that remains to be studied and solved.

Disclosure of Invention

The invention aims to solve the above problems by providing a method for accurately identifying SECS2 data packets when the HSMS header is missing.

The invention provides a SECS2 data packet identification method for lost HSMS header information. It builds on traffic identification technology and performs protocol identification based on the data types and binary encoding defined by the SECS2 protocol. The specific steps are as follows.

Step 1: use a HashMap to store the information and state of session connections. Read an unknown packet and look up whether information about its connection exists in the HashMap. If not, create a new node and store the connection. If so, check its state: the subsequent packet can be judged to be a SECS2 data packet when (1) a preceding packet carrying the HSMS header has been captured and the connection has not been closed, or (2) a preceding packet has been judged to be a SECS2 data segment and the connection has not been closed. Otherwise, go to the next step.

Step 2: perform a preliminary check on the unknown packet to judge whether it has an HSMS header. If so, judge the data packet to be a SECS2 data packet directly, record this in the HashMap, and jump straight to the result output step (step 4). Otherwise, go to step 3 for further judgment.

Step 3: identify the packet by the format of SECS2 data. Because SECS2 data has a unique format, shown in fig. 3, a SECS2 data packet can be identified on that basis. Intercept the payload (PAYLOAD) of the unknown data packet and scan it byte by byte, searching for the first enumeration value to use as the entry point for judgment. Extract the length information indicated by the enumeration value, skip that length, and match the next enumeration value, continuing until the boundary is reached or jumped over. If a byte reached during matching is not an expected enumeration value, the previous entry point byte was wrong; jump back and continue searching. When matching finishes, return the weight and go to step 4.

Step 3 comprises the following sub-steps:

step 301: intercept the unknown data packets and read them in a loop; preprocess each packet, extract its data content, store the content in a cache, and treat the cache content as a piece of ordinary text;

step 302: scan the PAYLOAD of the data packet byte by byte, searching for the first byte that could be a type field in SECS2 data, that is, a byte whose value belongs to the set of enumeration values, and extract the length and type information from that byte;

step 303: perform jump identification and judgment according to the length and type information extracted in step 302; if the byte reached by a jump is not a valid enumeration value, the initial enumeration value is considered wrong, and the scan returns to the position just after it and continues with step 302;

step 304: if the jumps reach the boundary or pass beyond it while every byte landed on conforms to the enumeration value rule, the entry point is considered correct; if the boundary is reached or passed without finding any section of data that conforms to the SECS2 rule, the data is considered very unlikely to be SECS2 data. Compute the proportion of data that conforms to SECS2 and the weight, and go to step 4;

the specific method for finding the entry-point enumeration value and performing jump identification is as follows: scan the PAYLOAD portion of the packet byte by byte, comparing against the enumeration values of the SECS2 protocol shown in fig. 2. Extract bits 3-8 of a single byte; if they form an enumeration value, take the byte as the entry point and extract bits 1-2 of the byte as the number of length bytes L, which means that the following L bytes hold the length of the data. If the enumeration value indicates ASCII data, verify the type of the data over the indicated length; otherwise skip the data without examining it, because only ASCII data can and needs to be verified reliably. When ASCII verification succeeds, a higher weight is given. The identification process is shown in fig. 4.

Step 4: after the data packet has been scanned byte by byte, evaluate the probability that it is a SECS2 data packet comprehensively in two dimensions, and output the result for each dimension together with the combined result. The judgment is then complete.
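The entry-point test of step 3 can be illustrated with a minimal Python sketch. The format codes below are taken from the worked bit patterns in this document and from the commonly published SECS-II item format codes; they may not match fig. 2 exactly. The names `FORMAT_CODES` and `decode_format_byte` are hypothetical, and treating a length-byte count of 0 as invalid is an assumption of this sketch.

```python
# Illustrative sketch only: the table and names are assumptions, not the patent's code.
FORMAT_CODES = {
    0b000000: "LIST",
    0b001000: "BINARY",
    0b001001: "BOOLEAN",
    0b010000: "ASCII",
    0b010001: "JIS8",
    0b011000: "I8",
    0b011001: "I1",
    0b011010: "I2",
    0b011100: "I4",
    0b100000: "F8",
    0b100100: "F4",
    0b101000: "U8",
    0b101001: "U1",
    0b101010: "U2",
    0b101100: "U4",
}


def decode_format_byte(byte):
    """Return (type_name, len_bytes) if byte could start a SECS2 item, else None."""
    type_code = byte >> 2      # bits 3-8: item type
    len_bytes = byte & 0b11    # bits 1-2: number of length bytes that follow
    if type_code not in FORMAT_CODES or len_bytes == 0:
        return None            # not a plausible entry point
    return FORMAT_CODES[type_code], len_bytes
```

A byte that returns `None` here is simply skipped by the byte-by-byte scan; a byte that decodes is only a candidate entry point until the subsequent jumps confirm it.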

In step 1, storing session link information and state with a HashMap includes: designing the data structure of the HashMap node, and forming the HashKey from the five-tuple of the data packet (source IP address, source port number, destination IP address, destination port, session ID) by XOR-ing it two bytes at a time;

reading in the unknown packet and looking up whether information about its connection exists in the HashMap means: looking up the session link information and state of the data packet by the HashKey formed from its five-tuple, and judging whether the packet can be decided quickly from earlier work; if not, adding or updating the node information and going to the next step;

the session state information has the following possible values:

(1) All earlier data packets were judged to be non-SECS2 data packets, or it could not be judged whether they were SECS2 data packets;

(2) No HSMS header information was captured in the earlier data packets, but an earlier data packet was judged to be a SECS2 data segment;

(3) A data packet carrying HSMS header information was captured among the earlier data packets;

of these states, a link in session state (1) requires the subsequent judging process; for links in session states (2) and (3), subsequent data packets can be judged directly to be SECS2 data packets, and when session closing information (such as FIN) is received, the node is deleted.
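The session bookkeeping of step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `SessionState`, `hash_key`, `SessionTable`, and `is_known_secs2` are hypothetical, IPv4 addresses are given as dotted strings, the session ID is assumed to fit in 16 bits, and collisions of the 16-bit XOR key are handled by keeping the full five-tuple inside each bucket (the patent does not say how collisions are resolved).

```python
from enum import Enum


class SessionState(Enum):
    UNDETERMINED = 1   # state (1): non-SECS2 so far, or not yet decidable
    SECS2_DATA = 2     # state (2): no HSMS header seen, but judged SECS2 data
    HSMS_SEEN = 3      # state (3): a packet with an HSMS header was captured


def hash_key(five_tuple):
    """Fold the five-tuple into a 16-bit key by XOR-ing it two bytes at a time."""
    src_ip, src_port, dst_ip, dst_port, session_id = five_tuple
    raw = (bytes(int(p) for p in src_ip.split(".")) + src_port.to_bytes(2, "big")
           + bytes(int(p) for p in dst_ip.split(".")) + dst_port.to_bytes(2, "big")
           + session_id.to_bytes(2, "big"))
    key = 0
    for i in range(0, len(raw), 2):
        key ^= int.from_bytes(raw[i:i + 2], "big")
    return key


class SessionTable:
    def __init__(self):
        self._buckets = {}   # hash key -> {five_tuple: SessionState}

    def lookup(self, five_tuple):
        """Return the stored state, creating an UNDETERMINED node if absent."""
        bucket = self._buckets.setdefault(hash_key(five_tuple), {})
        return bucket.setdefault(five_tuple, SessionState.UNDETERMINED)

    def update(self, five_tuple, state):
        self._buckets.setdefault(hash_key(five_tuple), {})[five_tuple] = state

    def close(self, five_tuple):
        """Delete the node when session-closing information such as FIN is seen."""
        self._buckets.get(hash_key(five_tuple), {}).pop(five_tuple, None)


def is_known_secs2(state):
    """States (2) and (3) allow later packets to be judged SECS2 directly."""
    return state in (SessionState.SECS2_DATA, SessionState.HSMS_SEEN)
```

Because XOR is commutative, this particular folding gives both directions of a connection the same 16-bit key, although the sketch still stores each direction under its own five-tuple.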

In step 2, whether the unknown packet has an HSMS header is judged according to the HSMS message format shown in fig. 5. The message header occupies 10 bytes and contains 6 fields: Session ID, Header Byte 2, Header Byte 3, PType, SType, and System Bytes. In data messages, the Stream and Function information is carried in the Header Bytes of the message header, and the difference between control messages and data messages is reflected in the Session ID and SType.
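A rough Python sketch of such a header check is given below. It is a heuristic under several assumptions that the patent does not spell out: the TCP payload is assumed to begin at an HSMS message boundary, with the 4-byte message length that HSMS places in front of the 10-byte header; PType is assumed to be 0 (SECS-II encoding); and the accepted SType values in `KNOWN_STYPES` are an assumption. The function name `parse_hsms_header` is hypothetical.

```python
import struct

# Assumption: SType values treated as valid by this sketch.
KNOWN_STYPES = {0, 1, 2, 3, 4, 5, 6, 7, 9}


def parse_hsms_header(payload):
    """Return a dict of header fields if payload plausibly starts an HSMS message."""
    if len(payload) < 14:          # 4-byte length prefix + 10-byte header
        return None
    msg_len, session_id, byte2, byte3, ptype, stype, system_bytes = struct.unpack(
        ">IHBBBBI", payload[:14])
    if msg_len < 10 or ptype != 0 or stype not in KNOWN_STYPES:
        return None
    header = {"length": msg_len, "session_id": session_id, "ptype": ptype,
              "stype": stype, "system_bytes": system_bytes}
    if stype == 0:                 # data message: header bytes carry stream/function
        header["w_bit"] = bool(byte2 & 0x80)
        header["stream"] = byte2 & 0x7F
        header["function"] = byte3
    return header
```

A check this loose can produce false positives on arbitrary binary data, so in practice it would likely be combined with further constraints, for example comparing the length field with the actual payload size.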

The specific flow of step 4 is as follows:

step 401: design a two-dimensional probability function that comprehensively evaluates, in two dimensions, the probability that the data packet is a SECS2 data packet. Dimension A is the proportion of the PAYLOAD occupied by data conforming to SECS2; dimension B is the weight obtained during identification. Given the features of these two dimensions, the probability function must satisfy the following requirements:

when A is large and B is small, the probability must not exceed the judgment threshold; this excludes packets so short that little data matches the SECS2 format yet the proportion appears high;

when A is small and B is large, the probability must not exceed the judgment threshold; this excludes packets so long that a lot of unrelated data is misjudged as SECS2 data, producing a high weight;

in both of these cases, the larger the gap between A and B, the lower the probability, and, within a certain range, the smaller the gap, the higher the probability;

when A is small and B is small, the probability must not exceed the judgment threshold, since neither dimension qualifies;

when A is large and B is large, the packet should be judged a SECS2 data packet with high probability.

According to the above conditions, the probability function prototype is designed as follows:

assuming that the values of A and B are Pa and Pb, respectively, then:

P=k(Pa*Pb)/|Pa-Pb|

where Pa ≠ Pb, and the value of k should be chosen with reference to the thresholds of A and B so that P falls within a suitable interval.

Step 402: based on the above analysis, use the probability function defined in step 401, together with the proportion statistics and the weight calculated in step 304, to judge comprehensively whether the unknown packet is a SECS2 data packet.

Step 403: output the result of step 402. The result is returned in the form of a function return value: for each data packet processed in the loop, the function returns its judgment, and the loop continues to read unprocessed data packets until all traffic data has been processed. After processing, the HashMap node information is updated according to the result to simplify subsequent identification.
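The probability function prototype of step 401, P=k(Pa*Pb)/|Pa-Pb|, can be written as a small Python function. This is a sketch only: the function name `secs2_probability`, the default `k=1.0`, and the small constant `eps` that guards the case Pa = Pb (which the prototype excludes) are all assumptions, and Pa and Pb are assumed to have been normalized to comparable scales beforehand.

```python
def secs2_probability(pa, pb, k=1.0, eps=1e-6):
    """Evaluate P = k * (Pa * Pb) / |Pa - Pb|.

    pa: proportion of the PAYLOAD that conforms to SECS2 (dimension A)
    pb: weight obtained during identification (dimension B)
    eps: assumed guard against division by zero when pa == pb
    """
    return k * (pa * pb) / max(abs(pa - pb), eps)


# Two large, close values score far higher than one large and one small value.
balanced = secs2_probability(0.9, 0.8)
lopsided = secs2_probability(0.9, 0.1)
```

The judgment threshold that P is compared against is not given in the text, so it is left to the caller here.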

The invention improves on several existing traffic identification methods so that the SECS2 protocol can be identified even when HSMS header information is lost. The identification process is designed around the SECS2 data structure with efficiency in mind, which ensures both accuracy and efficiency. The three components implemented on the basis of the method, namely the session state manager, the data identifier, and the comprehensive evaluator, essentially cover the protocol identification function introduced in the invention.

Drawings

Fig. 1 shows the SECS2 protocol identification process.

Fig. 2 shows the data type enumeration values defined by the SECS2 protocol.

Fig. 3 shows the SECS2 protocol data field format definition.

Fig. 4 is a flowchart of the identification algorithm.

Fig. 5 shows the HSMS message format.

Detailed Description

So that those skilled in the art can understand the technical method of the present invention more clearly and quickly, a detailed description is given below with reference to the accompanying drawings.

The identification method of the invention, shown in fig. 1, mainly comprises the following steps:

step 1: use a HashMap to store the information and state of session connections; read an unknown packet and look up whether information about its connection exists in the HashMap; if it does not, create a node and store the connection; if it does, check its state, and judge the subsequent data packet to be a SECS2 data packet when (1) a preceding packet carrying the HSMS header has been captured and the connection has not been closed, or (2) a preceding packet has been judged to be a SECS2 data segment and the connection has not been closed; otherwise, go to the next step;

step 2: perform a preliminary check on the unknown packet to detect whether it has an HSMS header; if so, it can be judged directly to be a SECS2 data packet, which is recorded in the HashMap, and the process jumps straight to the result output step; otherwise, go to the next step for further judgment;

step 3: because SECS2 data has a unique format, shown in fig. 3, a SECS2 data packet can be identified on that basis; intercept the payload (PAYLOAD) of the unknown data packet and scan it byte by byte, searching for the first byte whose value belongs to the enumeration values of the SECS2 type field and taking it as the entry point; treat the entry point byte as the type byte type_byte, and extract from it the type value type and the number of length bytes len_bytes; read the following len_bytes bytes as the length value length; use the type value type to judge whether the item is of ASCII type, and if so, check whether the following length bytes form an ASCII character string; skip length bytes, extract the next type byte type_byte2, and judge whether it belongs to the enumeration value range defined by the protocol; if type_byte2 belongs to the range, continue parsing the subsequent data with that byte as the next entry point; repeat the jump detection until the boundary is reached or jumped over; if the byte reached after a jump is not an expected enumeration value, the previous entry point byte was wrong, so jump back and continue searching; when matching finishes, return the weight.
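The jump-and-backtrack scan of step 3 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `walk_from` and `scan_payload` are hypothetical, the weight increments (+1 per item, +2 for verified ASCII) are arbitrary, the proportion is approximated as the share of the payload from the accepted entry point to the end, a failed ASCII check only withholds the bonus, and a LIST is assumed to jump zero bytes because its length value counts elements rather than bytes.

```python
ASCII, LIST = 0b010000, 0b000000
# Assumed set of valid type codes (bits 3-8 of the type byte).
VALID_TYPES = {0b000000, 0b001000, 0b001001, 0b010000, 0b010001,
               0b011000, 0b011001, 0b011010, 0b011100, 0b100000,
               0b100100, 0b101000, 0b101001, 0b101010, 0b101100}


def walk_from(payload, start):
    """Jump item by item from start; return the weight, or None on a bad jump."""
    pos, weight, n = start, 0, len(payload)
    while pos < n:
        type_byte = payload[pos]
        type_code, len_bytes = type_byte >> 2, type_byte & 0b11
        if type_code not in VALID_TYPES or len_bytes == 0:
            return None                      # landed on a non-enumeration byte
        length_field = payload[pos + 1:pos + 1 + len_bytes]
        if len(length_field) < len_bytes:
            break                            # length bytes cross the boundary
        length = int.from_bytes(length_field, "big")
        body_start = pos + 1 + len_bytes
        weight += 1
        if type_code == LIST:
            pos = body_start                 # list length counts elements, not bytes
            continue
        if type_code == ASCII:
            body = payload[body_start:body_start + length]
            if len(body) == length and all(0x20 <= b < 0x7F for b in body):
                weight += 2                  # verified ASCII text earns extra weight
        pos = body_start + length            # jump over the item body
    return weight


def scan_payload(payload):
    """Return (proportion, weight) for the first entry point that survives all jumps."""
    for start in range(len(payload)):
        weight = walk_from(payload, start)
        if weight is not None:
            return (len(payload) - start) / len(payload), weight
    return 0.0, 0
```

When a candidate entry point fails, `scan_payload` resumes from the very next byte, which corresponds to the jump-back rule in the text. How the raw weight is normalized before being combined with the proportion is left open here.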

Further, step 1 includes:

step 101: design the data structure of the HashMap node, form the HashKey from the five-tuple of the data packet (source IP address, source port number, destination IP address, destination port, session ID) by XOR-ing it two bytes at a time, and use the HashMap to store session link information and state;

step 102: read in an unknown packet, form the HashKey from its five-tuple, look up the session link information and state of the data packet by the HashKey, and judge whether the packet can be decided quickly from earlier work; if not, add or update the node information and go to the next step. The session state has the following possible values:

1. all earlier data packets were judged to be non-SECS2 data packets, or it could not be judged whether they were SECS2 data packets;

2. no HSMS header information was captured in the earlier data packets, but an earlier data packet was judged to be a SECS2 data segment;

3. a data packet carrying HSMS header information was captured among the earlier data packets;

of these states, a link in session state 1 requires the subsequent judging process; for links in session states 2 and 3, subsequent data packets can be judged directly to be SECS2 data packets, and when session closing information (such as FIN) is received, the node is deleted.

Further, step 2 includes:

step 201: check the unknown data packet to see whether it has an HSMS header;

step 202: if it has an HSMS header, go directly to step 4 and update the HashMap node according to the HashKey; otherwise, go to step 3;

the HSMS header is identified according to the HSMS message format shown in fig. 5. The message header occupies 10 bytes and contains 6 fields: Session ID, Header Byte 2, Header Byte 3, PType, SType, and System Bytes. In data messages, the Stream and Function information is carried in the Header Bytes of the message header, and the difference between control messages and data messages is reflected in the Session ID and SType.

Further, step 3 includes:

the specific method for finding the entry-point enumeration value and performing jump identification is as follows: scan the PAYLOAD portion of the packet byte by byte, comparing against the enumeration values of the SECS2 protocol shown in fig. 2. Extract bits 3-8 of a single byte; if they form an enumeration value, take the byte as the entry point and extract bits 1-2 of the byte as the number of length bytes L, which means that the following L bytes hold the length of the data. If the enumeration value indicates ASCII data, verify the type of the data over the indicated length; otherwise skip the data without examining it, because only ASCII data can and needs to be verified reliably. When ASCII verification succeeds, a higher weight is given. The identification process is shown in fig. 4.

To illustrate the identification process of the above steps, several common data items are listed and explained below:

1. bit 87654321

00100001 item, binary, 1 length byte

00000001 1 byte long

10101010 data byte

the above represents one binary data item whose value is 10101010;

2. bit 87654321

01000001 item, ASCII, 1 length byte

00000011 3 byte long

01000001 ASCII A

01000010 ASCII B

01000011 ASCII C

the above represents three ASCII characters, the string ABC;

3. bit 87654321

01101001 item, 2-byte integers

00000110 6 bytes long (total), 6/2 = 3 integers

XXXXXXXX MSByte number X

XXXXXXXX LSByte number X

YYYYYYYY MSByte number Y

YYYYYYYY LSByte number Y

ZZZZZZZZ MSByte number Z

ZZZZZZZZ LSByte number Z

the above represents three 2-byte integers.

4. bit 87654321

10010001 item, 4-byte floating point

00000100 4 byte long (total) 4/4=1 floating point

ffffffff

ffffffff floating point number in IEEE 754

ffffffff

The above represents one 4-byte floating-point number.

5. bit 87654321

00000001 List

00000011 3 Elements

00100001 Binary Item next byte length

00000001 1 byte long

00000100 Alarm set, category 4

01100101 Item, 1-byte integer, next byte length

00000001 1 byte long

00010001 Alarm 17

01000001 Item, ASCII, next byte length

00000111 7 characters

01010100 ASCII T

00110001 ASCII 1

00100000 ASCII space

01001000 ASCII H

01001001 ASCII I

01000111 ASCII G

01001000 ASCII H

The above represents a List of 3 items: a binary item, an integer item, and an ASCII string.
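As a cross-check of the worked examples, the following Python sketch decodes a SECS2 item recursively. It is illustrative only: the names `INT_FORMATS` and `parse_item` are hypothetical, only the LIST, ASCII, and integer formats are decoded (everything else is returned as raw bytes), and no bounds checking is done, so it assumes well-formed input.

```python
# type code (bits 3-8) -> (size in bytes, signed); an assumed subset of formats
INT_FORMATS = {
    0b011000: (8, True), 0b011001: (1, True), 0b011010: (2, True), 0b011100: (4, True),
    0b101000: (8, False), 0b101001: (1, False), 0b101010: (2, False), 0b101100: (4, False),
}


def parse_item(data, pos=0):
    """Decode one SECS2 item starting at pos; return (value, next_pos)."""
    type_code, len_bytes = data[pos] >> 2, data[pos] & 0b11
    length = int.from_bytes(data[pos + 1:pos + 1 + len_bytes], "big")
    pos += 1 + len_bytes
    if type_code == 0b000000:                 # LIST: length counts elements
        items = []
        for _ in range(length):
            value, pos = parse_item(data, pos)
            items.append(value)
        return items, pos
    body = data[pos:pos + length]
    if type_code == 0b010000:                 # ASCII string
        value = body.decode("ascii")
    elif type_code in INT_FORMATS:            # one or more fixed-size integers
        size, signed = INT_FORMATS[type_code]
        value = [int.from_bytes(body[i:i + size], "big", signed=signed)
                 for i in range(0, length, size)]
    else:                                     # other formats: keep raw bytes
        value = bytes(body)
    return value, pos + length


# Example 5 above, rebuilt from its bit patterns.
bits = ("00000001 00000011 00100001 00000001 00000100 01100101 00000001 "
        "00010001 01000001 00000111 01010100 00110001 00100000 01001000 "
        "01001001 01000111 01001000")
example5 = bytes(int(b, 2) for b in bits.split())
decoded, end = parse_item(example5)
```

For example 5 this yields a three-element list holding the binary byte 0x04, the integer 17, and the string "T1 HIGH", consuming all 17 bytes.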

Further, the specific flow of step 4 is as follows:

step 401: design a two-dimensional probability function so that the results of the two dimensions can be evaluated together: A. the proportion of the PAYLOAD occupied by data conforming to SECS2; B. the weight obtained during identification. Given the features of these two dimensions and the requirements they place on the probability function, the function is designed as follows:

assuming that the values of A and B are Pa and Pb, respectively, then:

P=k(Pa*Pb)/|Pa-Pb|
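As a worked illustration of the formula, the following Python sketch evaluates P for a few value pairs. The function name `combined_probability`, the default `k = 1.0`, and the sample values are assumptions for demonstration; the document does not fix k. The formula is undefined when Pa equals Pb, so the sketch rejects that input.

```python
def combined_probability(pa: float, pb: float, k: float = 1.0) -> float:
    """Evaluate P = k * (Pa * Pb) / |Pa - Pb| (illustrative helper)."""
    if pa == pb:
        # Division by zero: the formula only applies when Pa != Pb.
        raise ValueError("Pa and Pb must differ")
    return k * (pa * pb) / abs(pa - pb)


# A and B both large with a small gap -> large P.
print(combined_probability(0.9, 0.8))
# A small and B large (large gap) -> small P.
print(combined_probability(0.1, 0.9))
```

With these sample inputs the first call yields a much larger value than the second, which matches the intent that a large gap between the two dimensions lowers the result.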

An example of the identification process is given below.

A data packet is intercepted on the network. Assuming that the session link to which the packet belongs has not been determined to be a SECSII data session, HSMS header detection is performed on the packet, and the packet is found to have no HSMS header. The packet is preprocessed and its data extracted, giving the following array: raw[] = "0104410d5050315f466f726d6174746564410645714d444c4e4106312e312e303101030102a90200620102a50101410554657374310102a90200010102a50101a501020102a90200020109a50100a9020001b10400000001650164690203e87104000f424091043 dccccd 2501004132413132333435363738395f3132333435363738395f3132333435363738395f3132333435363738395f313233343536373839".

Using the identification method of the present invention, a possible access point is first sought. The first byte, 01 (00000001), meets the requirements of an access point: its bits 3-8 are 000000, which according to fig. 2 means that the following data is of type LIST, and its bits 1-2 are 01, which means that 1 byte is used to represent the length; for the LIST type, that byte gives the number of elements contained in the LIST. Assuming this access point is correct, the next byte extracted is 04, meaning that the LIST has 4 elements. The next byte extracted is 41 (01000001): its bits 3-8 (010000) indicate that the data type is ASCII, and the following 1 byte gives the ASCII data length. That byte is 0d, so the ASCII data is 13 bytes long. These 13 bytes are skipped directly and extraction continues; the next byte is again 41, indicating ASCII data, and so on. The resulting jump identification process is as follows:

Step	Data type	Number of length bytes	Jump length	Access point
1	LIST	1	-	Yes
2	ASCII	1	13	
3	ASCII	1	6	
4	ASCII	1	6	
5	LIST	1	-	
6	LIST	1	-	
7	U2	1	2	
8	LIST	1	-	
9	U1	1	1	
10	ASCII	1	5	
11	LIST	1	-	
12	U2	1	2	
13	LIST	1	-	
14	U1	1	1	
15	U1	1	1	
16	LIST	1	-	
17	U2	1	2	
18	LIST	1	-	
19	U1	1	1	
20	U2	1	2	
21	U4	1	4	
22	I1	1	1	
23	I2	1	2	
24	I4	1	4	
25	F4	1	4	
26	BOOLEAN	1	1	
27	ASCII	1	50	

After the jumps across the packet are complete, the result is fed back: the total length of the packet is 162 bytes, the number of bytes of SECSII data is 162, 19 jumps were made, averaging one jump every 8.53 bytes, and SECSII data accounts for 100% of the packet. The packet is therefore judged to be a SECSII packet.
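These figures can be rechecked from the jump table above. The Python sketch below is a simple arithmetic check with illustrative variable names; it assumes, as the table's length-byte column indicates, that each of the 27 items occupies one type byte plus one length byte in addition to any skipped data.

```python
# Jump lengths of the 19 non-LIST items, in table order.
jump_lengths = [13, 6, 6, 2, 1, 5, 2, 1, 1, 2, 1, 2, 4, 1, 2, 4, 4, 1, 50]
list_items = 8                      # LIST rows carry no skipped data

# Each item contributes 1 type byte + 1 length byte of header.
header_bytes = (len(jump_lengths) + list_items) * 2
secs_bytes = header_bytes + sum(jump_lengths)

total_bytes = 162                   # packet length stated in the text
ratio = secs_bytes / total_bytes
avg_bytes_per_jump = total_bytes / len(jump_lengths)

print(secs_bytes, f"{ratio:.0%}", f"{avg_bytes_per_jump:.2f}")
```

The recomputed byte count equals the stated packet length, so the identified SECSII data covers the whole payload.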

The invention improves on several existing flow identification methods and identifies data packets based on the distinctive data structure of SECSII data, so that SECS2 data packets whose HSMS header information has been lost are accurately identified. The efficiency of the identification process is also taken into account: the process is designed around the SECS2 data structure, which ensures both the accuracy and the efficiency of identification.

Claims

1. The SECS2 data packet identification method for the HSMS head information loss is characterized by comprising the following specific steps:

step 1: using a HashMap to store the information and state of session connections; reading in an unknown packet and searching the HashMap for information on its connection; if none exists, creating a new node and storing the connection; if it exists, checking its state, and directly judging the packet to be a SECS2 data packet in either of the following cases: (1) a preceding packet carried an HSMS header and the connection has not been closed; (2) a preceding packet has been judged to be a SECS2 data segment and the connection has not been closed; otherwise, entering the next step;

step 2: performing preliminary detection on the unknown packet to judge whether it has an HSMS header; if so, directly judging the packet to be a SECS2 data packet, recording this in the HashMap, and jumping directly to the result output step; otherwise, entering step 3 for further judgment;

step 3: determining the format of SECS2 data packets;

intercepting the PAYLOAD of the unknown data packet and scanning it byte by byte to find the first enumeration value as the access point for judgment; extracting length information from the enumeration value and skipping that length to match the enumeration value of the next round, until the boundary is reached or passed; if the byte found during matching is not an expected enumeration value, the preceding access-point byte is considered wrong, and the scan jumps back and continues searching; when matching is complete, the weight is fed back and step 4 is entered; the specific flow is as follows:

step 302: scanning the PAYLOAD of the data packet byte by byte to find the first byte that could be a type field in SECS2 data, that is, a byte whose value belongs to the set of enumeration values, and extracting length and type information from that byte;

step 304: when the jumps reach or pass the boundary and conform to the enumeration-value rules, the access point found is considered correct; if the boundary is reached or passed without any segment of data being found to conform to the set SECS2 rules, the data is considered unlikely to be SECS2 data; the proportion of data conforming to SECS2 is counted, the weight is calculated, and step 4 is entered;

the specific method for finding the access-point enumeration value and performing jump identification is as follows: scanning the PAYLOAD portion of the packet byte by byte; extracting bits 3-8 of each byte and, if they form an enumeration value, taking that byte as an access point and extracting bits 1-2 of the byte as the length-byte count L, meaning that the following L bytes hold the length of the data; if the enumeration value indicates ASCII data, checking the data of that length against the type, and otherwise skipping the data of that length directly without identifying it; when the ASCII type check succeeds, giving a higher weight in the evaluation.

2. The method according to claim 1, wherein in step 1:

the storing of session link information and state by using a HashMap comprises: designing the data structure of the HashMap node, and using the five-tuple of the data packet, namely the source IP address, source port number, destination IP address, destination port number and session ID, combined by two-byte exclusive OR, as the HashKey;

when the unknown packet is read in, the HashMap is searched, using the HashKey formed from the packet's five-tuple, for the session link information and state of the packet, to determine whether the packet can be judged quickly on the basis of previous work; if it cannot, the node information is added or updated and the next step is entered;

wherein the possible session states are set as follows:

among the above states, a link whose session state is (1) requires the subsequent judgment process; for links whose session state is (2) or (3), subsequent data packets are directly judged to be SECS2 data packets; when session-close information is received, the node is deleted.

3. The method according to claim 2, wherein in step 2, whether the unknown packet has an HSMS header is judged according to the HSMS message format, in which the message header occupies 10 bytes and contains 6 fields: Session ID, Header Byte 2, Header Byte 3, PType, SType and System Bytes; when data is exchanged, the Stream and Function information of a data message is contained in the Header Bytes of the Message Header, and the difference between control messages and data messages is reflected in the Session ID and SType.

4. A method according to claim 3, wherein the specific procedure of step 4 is as follows:

step 401: designing a two-dimensional probability function for comprehensively evaluating, in two dimensions, the probability that the data packet is a SECS2 data packet; dimension A is the proportion of the PAYLOAD part that is judged to be SECS2 data; dimension B is the weight obtained in the identification process; based on the characteristics of these two dimensions, the following requirements are placed on the probability function:

when A is small and B is large, the probability must not exceed the judgment threshold, which excludes the case where an overly long packet causes a large amount of irrelevant data to be misjudged as SECS2 data and thus shows a high weight;

in these two cases, the larger the gap between A and B, the lower the probability, and, within a certain range, the smaller the gap, the higher the probability;

when A and B are both large, the data packet is judged to be a SECS2 data packet with high probability;

assuming that the values of A and B are Pa and Pb, respectively, then:

P=k(Pa*Pb)/|Pa-Pb|

wherein Pa ≠ Pb, and the value of k should be chosen with reference to the thresholds of A and B so that P falls within a suitable interval;

step 402: combining the proportion statistics and the weight calculation result from step 304, and determining the probability that the unknown data packet is a SECS2 data packet according to the two-dimensional probability function set in step 401;

step 403: outputting the judgment result of step 402; the result is returned in function form, a judgment result being returned for each data packet in the processing loop, and unprocessed data packets continue to be read in the loop until all flow data has been processed; after processing is complete, the HashMap node information is updated according to the result to facilitate subsequent identification.