CN113965631A

CN113965631A - SECS2 data packet identification method for HSMS header information loss

Info

Publication number: CN113965631A
Application number: CN202111274024.2A
Authority: CN
Inventors: 吴承荣; 伍鹏; 唐璇; 张志华; 蔡骏飞
Original assignee: Fudan University; Semiconductor Manufacturing International Shanghai Corp
Current assignee: Fudan University; Semiconductor Manufacturing International Shanghai Corp
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-01-21
Anticipated expiration: 2041-10-29
Also published as: CN113965631B

Abstract

The invention belongs to the technical field of communication networks, and in particular relates to a method for identifying SECS2 data packets when the HSMS header information is lost. The method comprises the following steps: a HashMap is established to store session link state information; an unknown data packet is read in and its session information is looked up by the packet's five-tuple; if the packet cannot be judged directly from the session state, the main judgment process is entered, which first detects whether the packet has an HSMS header and, if it does not, searches for an enumeration value to use as an entry point and judges whether the data following that value conform to the characteristics of SECS2 data. After the packet has been scanned and identified, a final judgment is made by combining two dimensions, proportion and weight, and the session information is updated with the result to speed up subsequent judgments. The invention ensures the accuracy and efficiency of identification; the session state manager, the data identifier and the comprehensive evaluator implemented on the basis of the method together cover the protocol identification function.

Description

SECS2 data packet identification method for HSMS header information loss

Technical Field

The invention belongs to the technical field of communication networks, and particularly relates to an SECS2 data packet identification method for HSMS header information loss.

Background

With the development of network technology, Internet service types are increasingly diversified. Conventional traffic identification classifies data traffic according to the packet header format, so when header information is lost, the efficiency of detecting the network traffic type drops sharply. Accurately identifying each service type, that is, each type of network traffic, has therefore become a focus of both academic research and deployment and operation.

Network traffic is an important carrier that records and reflects the activity of a network and its users. Traffic identification can be used to assess the state of the network, to develop and analyze applications, and to support fine-grained operation. For application layer protocols without a fixed TCP port, the application layer header is generally located at the beginning of a connection or interactive session, and it carries the most obvious features of the protocol, for example the GET and POST operation instructions of HTTP, or the HELO/EHLO, MAIL FROM and RCPT TO instructions of SMTP. While the data payload of an application layer protocol is being transmitted, these features are absent; that is, a randomly intercepted data packet is unlikely to carry obvious protocol features, which reduces the efficiency and accuracy of traditional traffic identification. At present, traffic identification technology at home and abroad has mature theoretical support and mainly comprises DPI-based identification methods, DFI-based identification methods, data-mining-based identification methods, and the like.

Deep Packet Inspection (DPI) extends traditional IP packet inspection (inspection and analysis of the packet elements in OSI layers 2 to 4) with application protocol identification, packet content inspection and deep decoding of application layer data. DPI techniques fall into three types: identification based on characteristic words, application layer gateway identification, and behavior pattern identification. Different applications usually rely on different protocols, and each protocol has its own fingerprint, which may be a specific port, a specific character string or a specific bit sequence; identification based on characteristic words determines the application carried by a traffic flow by detecting this fingerprint in specific data messages of the flow. For some services the control flow and the service flow are separated and the service flow has no features of its own. In this case application layer gateway identification is needed: the gateway first identifies the control flow, parses it according to the control flow's protocol, and identifies the corresponding service flow from the protocol content. Behavior pattern identification infers the action a user is performing or is about to perform from an analysis of the behavior the terminal has already exhibited. It is typically used for traffic that cannot be judged from the protocol alone. For example, SPAM traffic and normal email traffic are identical in terms of email content, and SPAM can only be identified accurately by analyzing user behavior. These three techniques target different types of protocols and cannot replace one another. A DPI system is therefore usually deployed as a layered solution that applies all three techniques together to obtain the best detection efficiency and flexibility.

The main principle of Deep Flow Inspection (DFI) is to build a machine learning classification model that classifies network traffic using a large number of statistical traffic features. The method only needs the TCP/IP headers to compute these features, such as average packet size, flow duration, total number of packets, packet inter-arrival time and counts of TCP flag bits. Most of these features describe the macroscopic behavior of a flow, so the application layer payload does not need to be extracted and identification is fast; DFI is currently one of the research hotspots in academia. Its advantages are that no application layer payload features are required and that both encrypted and unencrypted traffic can be identified, which makes it applicable to any traffic; its disadvantage is that a large number of labeled samples are needed for training.
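As a minimal illustration of the header-only statistics mentioned above, the following Python sketch computes a few flow features. The function name `flow_features` and the input layout (a time-ordered list of timestamp and size pairs) are assumptions made for this example only and do not come from any particular DFI system.

```python
from statistics import mean


def flow_features(packets):
    """Compute simple statistical features of one flow.

    `packets` is a hypothetical input format: a time-ordered list of
    (timestamp_seconds, size_bytes) tuples taken from packet headers only.
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival gaps between consecutive packets.
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "mean_size": mean(sizes),
        "duration": times[-1] - times[0],
        "mean_interarrival": mean(gaps) if gaps else 0.0,
    }
```

A feature dictionary of this kind would then be fed to whatever classifier is trained on labeled flows; the choice of classifier is outside the scope of this sketch.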

Data mining is the process of obtaining valid, novel, potentially useful and ultimately understandable patterns from a large amount of data by analyzing each piece of data. Data mining is also known as Knowledge Discovery in Databases (KDD), and is also regarded as one basic step of the knowledge discovery process in databases. The knowledge discovery process comprises the following steps:

(1) data cleaning; (2) data integration; (3) data selection; (4) data transformation; (5) data mining; (6) pattern evaluation; (7) knowledge representation.

Currently, data mining has been applied in many fields. Data mining analysis methods can be divided into two categories: direct data mining and indirect data mining. Direct data mining aims to build a model from the available data that describes a particular variable in the remaining data. Indirect data mining does not select a specific target variable to be described by a model; instead it establishes relationships among all the variables. Common data mining methods and types are shown in the following table.

[Table: Common data mining methods]

Classification algorithms in data mining generally fall into three types: supervised, unsupervised and semi-supervised learning algorithms. In research on supervised learning, one approach uses the average number of bytes in continuous time and in flow messages as traffic classification features, and takes these features as the criterion for classifying network applications; another applies Bayesian classification to network traffic, using manually classified network data as the input of a supervised naive Bayes estimator.

Application identification based on data mining automatically extracts application features from application sessions and then identifies the application by feature matching. The key step is extracting the application features, and the way these features appear in sessions is the basis for designing the feature extraction algorithm. Application feature extraction means extracting from the application layer data the set of all features that can represent a given application. During communication, application features generally occur with high frequency and are correlated, and their offsets within a session (message offset, MO, and byte offset, BO) are relatively fixed. On this basis an automatic feature extraction algorithm, SS-selec, has been proposed. It improves the classical association rule discovery algorithm Apriori so that it is suited to extracting frequent sets of session fragments from application sessions, and a feature set representing a given application is obtained after screening with suitable filtering rules.

Network application features can be extracted in the following ways: (1) finding application features by consulting the content of the application layer protocol; (2) analyzing and collecting statistics on application layer data captured on the network with tools such as wireshark and tcpdump to obtain application features; (3) designing a feature extraction algorithm. In the third approach, the way an application feature appears at the session level is first clarified, and a captured flow of a single application is used as the training sample set for that application. The flow is then divided into sessions according to the two-tuple (source IP and destination IP) in the packet header and the flags (SYN/ACK/RST/FIN) that mark the establishment and termination of a TCP session at the transport layer. The message data of each session are stored in a data file in chronological order to complete session reassembly, and session-based automatic extraction of application features is carried out on the reassembled session messages.

The application identification model consists of a training process and an identification process. The training process mainly comprises data preprocessing and feature extraction: data preprocessing mainly performs data reassembly, and the SS-selec algorithm performs automatic feature extraction on the set of application sessions. The identification process comprises application layer data reassembly and application identification. When messages arrive, the contents of messages belonging to the same session are stored in the same cache in chronological order; the cache content is treated as a piece of ordinary text, application identification is performed by feature matching, and the output is the application type of the session to which the packet belongs.

For an application layer protocol without a fixed TCP port, identification fails if the header information is lost. The application layer header is generally located at the beginning of the connection or interactive session, and it carries the most distinctive features of the protocol, for example the GET and POST operation instructions of HTTP, or the HELO/EHLO, MAIL FROM and RCPT TO instructions of SMTP. While the data payload is being transmitted there are no obvious protocol features: an FTP data connection, or an HTTP file download or upload, transfers file content directly and carries no protocol features. In data intercepted by a bypass tap, the packet containing the application layer protocol header is frequently missed for various reasons; in that case the protocol to which the captured packets belong cannot be determined, which reduces the efficiency and accuracy of traditional traffic identification. For SECS2 packets, port-based identification cannot cope with dynamically changing communication ports, while identifying the HSMS header first and then the SECS2 packet does not work for SECS2 packets whose HSMS header has been lost.

How to combine network traffic identification techniques to design a method that accurately identifies the SECS2 industrial control protocol when the HSMS header is lost is therefore a problem that needs to be studied and solved.

Disclosure of Invention

The present invention aims to solve the above problems and provide a method for accurately identifying SECS2 packets in the case of missing HSMS headers.

The SECS2 data packet identification method for HSMS header information loss provided by the invention builds on traffic identification technology and performs protocol identification based on the data types and binary encoding defined by the SECS2 protocol. The specific steps are as follows.

Step 1: a HashMap is used to store the information and state of each session connection. An unknown packet is read in and the HashMap is searched for information on its connection. If none exists, a new node is created and the connection is stored. If it exists, its state is checked, and the packet can be judged directly to be an SECS2 data packet in either of the following cases: an HSMS header has been captured in an earlier packet and the connection has not been closed; or an earlier packet has been judged to be an SECS2 data segment and the connection has not been closed. Otherwise, go to the next step.

Step 2: a preliminary check is performed on the unknown packet to determine whether it has an HSMS (High-Speed SECS Message Services) header. If so, the packet is directly judged to be an SECS2 data packet, the result is recorded in the HashMap, and the process jumps directly to the result output step (step 4); otherwise, step 3 is entered.

Step 3: the format of the SECSII data packet is checked. Because SECSII data has a unique format, shown in FIG. 3, it can be identified by that format. The payload (PAYLOAD) of the unknown packet is extracted and scanned byte by byte for the first enumeration value, which is taken as the entry point for judgment. Length information is extracted from the enumeration value, and that length is skipped to match the next enumeration value; this repeats until the payload boundary is reached or passed. If a byte encountered during matching is not an enumeration value where one should appear, the previous entry point byte was wrong, and the scan returns to the position after that byte and continues searching. When matching ends, a weight value is fed back and step 4 is entered.

Step 3 comprises the following sub-steps:

step 301: the unknown data packets are captured and read in a loop; each packet is first preprocessed, its data content is extracted and stored in a cache, and the cache content is treated as a piece of ordinary text;

step 302: the PAYLOAD of the packet is scanned byte by byte for the first byte that may be a type field of SECSII data, that is, a byte whose value belongs to the set of enumeration values; the length and type information are extracted from this byte;

step 303: jump identification and judgment are performed according to the length and type information extracted in step 302; if after some jump the next byte is not a valid enumeration value, the initial enumeration value is considered to have been misjudged, and the process returns to the position after it and resumes the operation of step 302;

step 304: if the jumps reach or pass the boundary with every type byte conforming to the enumeration value rule, the entry point is considered correct; since no data violating the set SECS2 rules has been found up to the boundary, the data segment is considered to be basically SECS2 data, the proportion of data conforming to SECS2 is counted, the weight is calculated, and step 4 is entered;

The specific method for finding the entry point enumeration value and for jump identification is as follows. The PAYLOAD portion of the packet is scanned byte by byte; the enumeration values of the SECS2 protocol are shown in fig. 2. Bits 3-8 of a single byte are extracted, and if they form an enumeration value, the byte is taken as an entry point. Bits 1-2 of the byte are extracted as the number of length bytes L, meaning that the following L bytes hold the length of the data. If the enumeration value indicates ASCII data, type identification is performed on the data of that length; otherwise the data are skipped without identification, because only ASCII data can, and need to, be identified correctly. When the ASCII type identification succeeds, a higher weight is given in the evaluation. The identification process is shown in fig. 4.

Step 4: after the packet has been scanned byte by byte, the probability that it is an SECS2 data packet is evaluated comprehensively along two dimensions; the judgment result for each dimension and the combined judgment result are given, and the judgment ends.
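The split of a candidate type byte into its two fields can be sketched in Python as follows. The table `FORMAT_CODES` is an assumption: it lists the item format codes of the SECS-II standard and stands in for the enumeration table of fig. 2, which is not reproduced in this text. The function name `decode_type_byte` is likewise hypothetical.

```python
# Format codes (bits 8-3 of the type byte), written in octal.
# Assumption: this set stands in for the enumeration table of fig. 2.
FORMAT_CODES = {
    0o00: "LIST",
    0o10: "BINARY",
    0o11: "BOOLEAN",
    0o20: "ASCII",
    0o21: "JIS8",
    0o22: "CHAR2",
    0o30: "I8",
    0o31: "I1",
    0o32: "I2",
    0o34: "I4",
    0o40: "F8",
    0o44: "F4",
    0o50: "U8",
    0o51: "U1",
    0o52: "U2",
    0o54: "U4",
}


def decode_type_byte(b):
    """Split a candidate type byte into (format name, number of length bytes).

    Returns None when the byte cannot be a valid type byte.
    """
    code = b >> 2          # bits 8-3: format code
    len_bytes = b & 0x03   # bits 2-1: how many length bytes follow
    if code not in FORMAT_CODES or len_bytes == 0:
        return None
    return FORMAT_CODES[code], len_bytes
```

For instance, the byte 0x41 (01000001) yields the ASCII format with one length byte, while 0xFF is rejected because 111111 is not a defined format code.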

In step 1, storing the session link information and state with a HashMap includes: designing the data structure of a HashMap node, and using the five-tuple of the data packet (source IP address, source port number, destination IP address, destination port and session ID), combined by two-byte XOR, as the HashKey;

when the unknown packet is read in and the HashMap is searched for its connection, the session link information and state of the packet are looked up with the HashKey formed from the packet's five-tuple, and it is determined whether the packet can be judged quickly from previous work; if not, the node information is added or updated and the next step is entered;

the session state takes one of the following values:

(1) the earlier data packets have all been judged to be non-SECS2 data packets, or could not be judged to be SECS2 data packets;

(2) no HSMS header information has been captured in the earlier data packets, but a packet has been judged to be an SECS2 data segment;

(3) a data packet carrying HSMS header information has been captured among the earlier data packets;

For a link in session state (1), the subsequent judgment process must be carried out. For links in session states (2) and (3), subsequent packets can be directly judged to be SECS2 packets, and the node is deleted when a session close message (e.g., FIN) is received.
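The session bookkeeping described above might be sketched in Python as follows. All names (`SessionState`, `FiveTuple`, `SessionTable`, `hash_key`) are hypothetical. The sketch assumes IPv4 addresses and interprets "two-byte XOR" as XOR-folding the packed five-tuple into one 16-bit word; the source does not spell out the exact folding, so this is one plausible reading rather than the patented implementation.

```python
from dataclasses import dataclass
from enum import Enum
import ipaddress
import struct


class SessionState(Enum):
    UNDETERMINED = 1   # state (1): not (yet) judged to be SECS2
    SECS2_DATA = 2     # state (2): no HSMS header seen, but judged SECS2 data
    HSMS_SEEN = 3      # state (3): a packet with an HSMS header was captured


@dataclass(frozen=True)
class FiveTuple:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    session_id: int


def hash_key(t):
    """Fold the five-tuple into 16 bits by XOR-ing it two bytes at a time."""
    raw = (ipaddress.IPv4Address(t.src_ip).packed
           + struct.pack(">H", t.src_port)
           + ipaddress.IPv4Address(t.dst_ip).packed
           + struct.pack(">H", t.dst_port)
           + struct.pack(">H", t.session_id))
    key = 0
    for (word,) in struct.iter_unpack(">H", raw):
        key ^= word
    return key


class SessionTable:
    def __init__(self):
        # hash key -> {five-tuple: state}; the inner dict resolves collisions
        self._nodes = {}

    def lookup(self, t):
        """Return the stored state, creating an UNDETERMINED node if absent."""
        bucket = self._nodes.setdefault(hash_key(t), {})
        return bucket.setdefault(t, SessionState.UNDETERMINED)

    def update(self, t, state):
        self._nodes.setdefault(hash_key(t), {})[t] = state

    def close(self, t):
        """Delete the node when a session close message (e.g. FIN) is seen."""
        self._nodes.get(hash_key(t), {}).pop(t, None)

    def is_known_secs2(self, t):
        """True for states (2) and (3), where no further scan is needed."""
        return self.lookup(t) in (SessionState.SECS2_DATA, SessionState.HSMS_SEEN)
```

Because a 16-bit XOR key collides easily, the sketch keeps the full five-tuple inside each bucket; a real implementation could choose a different collision strategy.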

In step 2, whether the unknown packet has an HSMS header is determined according to the HSMS message format shown in fig. 5. The message header occupies 10 bytes and comprises 6 fields: Session ID, Header Byte 2, Header Byte 3, PType, SType and System Bytes. At the beginning of a data message, the Stream and Function information of the message is carried in the header bytes (Header Byte 2 and Header Byte 3) of the message header, and control messages are distinguished from data messages by the Session ID and the SType.
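A minimal Python sketch of parsing this 10-byte header is given below. The field layout follows the HSMS standard (SEMI E37); the plausibility rule in `looks_like_hsms_header` (PType equal to 0 and a defined SType) is an assumption made for illustration, since the source does not state the exact detection criteria. The sketch also assumes that `data` starts at the header itself; on the wire, an HSMS message is preceded by a 4-byte length field.

```python
import struct
from collections import namedtuple

HsmsHeader = namedtuple(
    "HsmsHeader", "session_id w_bit stream function ptype stype system_bytes")

# SType values defined by HSMS; 0 is a data message, the others are control
# messages. Assumption: the patent's own check may differ from this set test.
KNOWN_STYPES = {0, 1, 2, 3, 4, 5, 6, 7, 9}


def parse_hsms_header(data):
    """Parse the 10-byte HSMS message header at the start of `data`.

    For data messages, header byte 2 holds the W-bit and the stream, and
    header byte 3 holds the function; control messages reuse these bytes.
    """
    if len(data) < 10:
        return None
    session_id, byte2, byte3, ptype, stype, system_bytes = struct.unpack(
        ">HBBBBI", data[:10])
    return HsmsHeader(session_id, byte2 >> 7, byte2 & 0x7F, byte3,
                      ptype, stype, system_bytes)


def looks_like_hsms_header(data):
    """Heuristic plausibility test, not a full protocol validation."""
    h = parse_hsms_header(data)
    return h is not None and h.ptype == 0 and h.stype in KNOWN_STYPES
```

Such a check is deliberately loose; in the method described here, a positive result is recorded in the session table so that later packets of the same session need no further inspection.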

The specific process of step 4 is as follows:

step 401: a two-dimensional probability function is designed to evaluate comprehensively the probability that the packet is an SECS2 data packet. Dimension A is the proportion of the PAYLOAD that can be judged to be SECS2 data; dimension B is the weight value obtained during identification. Based on the characteristics of these two dimensions, the probability function must meet the following requirements:

when A is large and B is small, the probability must not exceed the judgment threshold; this excludes the case where the packet is so short that only a little data matches the SECS2 format yet the proportion appears high;

when A is small and B is large, the probability must not exceed the judgment threshold; this excludes the case where the packet is so long that a lot of irrelevant data is misjudged as SECS2 data and the weight appears high;

in both of the above cases, the larger the difference between A and B, the lower the probability, and within a certain range, the smaller the difference, the higher the probability;

when both A and B are small, the probability must not exceed the judgment threshold, since neither dimension qualifies;

when both A and B are large, the packet should be judged to be an SECS2 packet with high probability.

According to the above conditions, the prototype of the probability function is designed as follows:

assuming that values of A and B are Pa and Pb respectively, then:

P=k(Pa*Pb)/|Pa-Pb|

where Pa ≠ Pb, and the value of k should be tuned to an appropriate range with reference to the thresholds of A and B.
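The prototype can be evaluated directly, as in the following Python sketch. The function name `combined_probability` and the sample value k = 0.1 are assumptions for illustration; the source leaves k to be tuned and does not define the result for Pa = Pb, so the sketch simply rejects that input.

```python
def combined_probability(pa, pb, k=1.0):
    """Evaluate the prototype P = k * (Pa * Pb) / |Pa - Pb| for Pa != Pb."""
    if pa == pb:
        # The prototype is only stated for Pa != Pb.
        raise ValueError("the prototype is undefined for Pa == Pb")
    return k * (pa * pb) / abs(pa - pb)
```

With k = 0.1, the inputs (0.9, 0.8) give roughly 0.72, (0.9, 0.1) give roughly 0.011 and (0.1, 0.2) give roughly 0.02, which matches the requirement that only two large, similar values produce a high result. Note that the expression is not bounded by 1 when Pa and Pb are very close, which is one reason k has to be tuned.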

Step 402: based on the above analysis, the probability that the unknown packet is an SECS2 packet is determined with the probability function set in step 401, combining the proportion statistics and the weight calculated in step 304.

Step 403: the judgment result of step 402 is output. The result is returned in the form of a function return value: a judgment result is returned for each packet processed in the loop, and reading continues in a loop for unprocessed packets until the traffic data has been fully processed. After processing, the HashMap node information is updated according to the result to facilitate subsequent identification.

The invention improves on and innovates beyond several existing traffic identification methods, so that the SECS2 protocol can still be identified when the HSMS header information is lost. The identification process is designed around the SECS2 data structure with efficiency in mind, which ensures both the accuracy and the efficiency of identification. The session state manager, the data identifier and the comprehensive evaluator implemented on the basis of the method together cover the protocol identification function described in the invention.

Drawings

Fig. 1 shows the SECS2 protocol identification process.

Fig. 2 shows the data type enumeration values defined by the SECS2 protocol.

Fig. 3 shows the data field format defined by the SECS2 protocol.

Fig. 4 shows the flow of the identification algorithm.

Fig. 5 shows the HSMS message format.

Detailed Description

In order to make the technical method of the present invention more clearly and quickly understood by those skilled in the art, the detailed description will be further provided with reference to the attached drawings.

The identification method of the invention, as shown in fig. 1, mainly comprises the following steps:

step 1: a HashMap is used to store the information and state of each session connection. An unknown packet is read in and the HashMap is searched for information on its connection. If none exists, a new node is created and the connection is stored. If it exists, its state is checked, and the packet can be judged directly to be an SECS2 data packet in either of the following cases: an HSMS header has been captured in an earlier packet and the connection has not been closed; or an earlier packet has been judged to be an SECS2 data segment and the connection has not been closed. Otherwise, the next step is entered;

step 2: a preliminary check is performed on the unknown packet to determine whether it has an HSMS (High-Speed SECS Message Services) header. If so, the packet is directly judged to be an SECS2 data packet, the result is recorded in the HashMap, and the process jumps directly to the result output step; otherwise, the next step is entered;

step 3: because SECSII data has a unique format, shown in FIG. 3, it can be identified by that format. The payload (PAYLOAD) of the unknown packet is extracted and scanned byte by byte for the first byte whose value belongs to the enumeration values of the SECSII type field; this byte is taken as the entry point and as the type byte type_byte, from which the type value type and the number of length bytes len_bytes are extracted. The following len_bytes bytes are extracted as the length value length. The type value type determines whether the item is of ASCII type; if so, the following length bytes are checked to see whether they form an ASCII character string. The corresponding bytes are then skipped according to length, the next type byte type_byte2 is extracted, and it is checked whether it belongs to the enumeration value range defined by the protocol. If type_byte2 belongs to that range, parsing continues with this byte as the next entry point, jumping and checking until the boundary is reached or passed. If the byte reached after some jump is not an enumeration value where one should appear, the previous entry point byte was wrong, and the scan returns to the position after it and continues searching. When matching ends, a weight value is fed back;

Further, the step 1 comprises:

step 101: the data structure of a HashMap node is designed; the five-tuple of the data packet (source IP address, source port number, destination IP address, destination port and session ID) is combined by two-byte XOR to form the HashKey, and the HashMap is used to store the session link information and state;

step 102: an unknown packet is read in, a HashKey is formed from its five-tuple, and the session link information and state of the packet are looked up with the HashKey to determine whether the packet can be judged quickly from previous work; if not, the node information is added or updated and the next step is entered. The session state takes one of the following values:

1. the earlier data packets have all been judged to be non-SECS2 data packets, or could not be judged to be SECS2 data packets;

2. no HSMS header information has been captured in the earlier data packets, but a packet has been judged to be an SECS2 data segment;

3. a data packet carrying HSMS header information has been captured among the earlier data packets;

For a link in session state 1, the subsequent judgment process must be carried out. For links in session states 2 and 3, subsequent packets can be directly judged to be SECS2 packets, and the node is deleted when a session close message (e.g., FIN) is received.

Further, the step 2 comprises:

step 201: the unknown data packet is checked to see whether it has an HSMS header;

step 202: if an HSMS header is present, step 4 is entered directly and the HashMap node is updated according to the HashKey; otherwise, step 3 is entered;

the HSMS header is identified according to the HSMS message format shown in fig. 5. The message header occupies 10 bytes and comprises 6 fields: Session ID, Header Byte 2, Header Byte 3, PType, SType and System Bytes. At the beginning of a data message, the Stream and Function information of the message is carried in the header bytes (Header Byte 2 and Header Byte 3) of the message header, and control messages are distinguished from data messages by the Session ID and the SType.

Further, the step 3 comprises:

The specific method for finding the entry point enumeration value and for jump identification is as follows. The PAYLOAD portion of the packet is scanned byte by byte; the enumeration values of the SECS2 protocol are shown in fig. 2. Bits 3-8 of a single byte are extracted, and if they form an enumeration value, the byte is taken as an entry point. Bits 1-2 of the byte are extracted as the number of length bytes L, meaning that the following L bytes hold the length of the data. If the enumeration value indicates ASCII data, type identification is performed on the data of that length; otherwise the data are skipped without identification, because only ASCII data can, and need to, be identified correctly. When the ASCII type identification succeeds, a higher weight is given in the evaluation. The identification process is shown in fig. 4.

To illustrate the identification process of the above steps, several common data items are listed and interpreted below:
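The entry point search with jump identification and backtracking might be sketched in Python as follows. Several details are assumptions: `VALID_CODES` stands in for the enumeration table of fig. 2, ASCII verification is reduced to a printable-character check, and the weights (1 per item, 3 for a verified ASCII item) are hypothetical values chosen only to show that verified ASCII data counts for more. The names `follow_chain` and `scan_payload` are also hypothetical.

```python
# Valid format codes (bits 8-3 of a type byte), in octal.
# Assumption: this set stands in for the enumeration table of fig. 2.
VALID_CODES = {0o00, 0o10, 0o11, 0o20, 0o21, 0o22, 0o30, 0o31, 0o32, 0o34,
               0o40, 0o44, 0o50, 0o51, 0o52, 0o54}
LIST, ASCII = 0o00, 0o20


def follow_chain(payload, start):
    """Jump from item to item starting at `start`.

    Returns the accumulated weight if the jumps reach or pass the end of the
    payload, or None as soon as a byte that should be a type byte is not one.
    """
    pos, weight, n = start, 0, len(payload)
    while pos < n:
        code, len_bytes = payload[pos] >> 2, payload[pos] & 0x03
        if code not in VALID_CODES or len_bytes == 0:
            return None
        length = int.from_bytes(payload[pos + 1:pos + 1 + len_bytes], "big")
        pos += 1 + len_bytes
        if code == LIST:
            weight += 1          # a list header is followed directly by its items
            continue
        data = payload[pos:pos + length]
        if code == ASCII:
            if not all(0x20 <= c <= 0x7E for c in data):
                return None      # claimed ASCII data is not printable text
            weight += 3          # hypothetical bonus for verified ASCII
        else:
            weight += 1
        pos += length
    return weight


def scan_payload(payload):
    """Return (proportion of payload matching SECS2, weight) for one packet."""
    for entry in range(len(payload)):
        weight = follow_chain(payload, entry)
        if weight is not None:
            return (len(payload) - entry) / len(payload), weight
    return 0.0, 0
```

When a chain fails, the outer loop moves to the byte after the rejected entry point, which corresponds to the backtracking rule of step 303. The two returned values correspond to dimensions A and B of the evaluation in step 4.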

1. bit 87654321

00100001 Item, binary, 1 length byte

00000001 1 byte long

10101010 data byte

the above represents one binary data item, whose data is 10101010;

2. bit 87654321

01000001 Item, ASCII, 1 length byte

00000011 3 bytes long

01000001 ASCII A

01000010 ASCII B

01000011 ASCII C

the above represents three ASCII characters, whose data is ABC;

3. bit 87654321

01101001 Item, 2-byte integers

00000110 6 bytes long (total), 6/2=3 integers

XXXXXXXX MSByte number X

XXXXXXXX LSByte number X

YYYYYYYY MSByte number Y

YYYYYYYY LSByte number Y

ZZZZZZZZ MSByte number Z

ZZZZZZZZ LSByte number Z

the above represents three two-byte long integer data.

4. bit 87654321

10010001 Item, 4-byte floating point

00000100 4 bytes long (total), 4/4=1 floating point number

ffffffff floating point number in IEEE 754

ffffffff

ffffffff

ffffffff

The above represents one four-byte floating point value.

5. bit 87654321

00000001 List

00000011 3 Elements

00100001 Binary Item next byte length

00000001 1 byte long

00000100 Alarm set, category 4

01100101 Item, 1-byte integer, next byte length

00000001 1 byte long

00010001 Alarm 17

01000001 Item, ASCII, next byte length

00000111 7 characters

01010100 ASCII T

00110001 ASCII 1

00100000 ASCII space

01001000 ASCII H

01001001 ASCII I

01000111 ASCII G

01001000 ASCII H

The above represents a List of 3 items: one binary data item, one integer data item, and one ASCII string.
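The encodings listed above can be checked with a small decoder. The Python sketch below is illustrative only: it assumes the SECS-II format codes for the numeric types, handles just the list, binary, ASCII and fixed-width numeric formats, and flattens nested lists into a sequence of (type, value) pairs. The names `NUMERIC` and `decode_items` are hypothetical.

```python
import struct

# (name, struct format character) for fixed-width numeric formats, keyed by
# format code in octal. Assumption: codes follow the SECS-II standard.
NUMERIC = {0o30: ("I8", "q"), 0o31: ("I1", "b"), 0o32: ("I2", "h"),
           0o34: ("I4", "i"), 0o40: ("F8", "d"), 0o44: ("F4", "f"),
           0o50: ("U8", "Q"), 0o51: ("U1", "B"), 0o52: ("U2", "H"),
           0o54: ("U4", "I")}


def decode_items(data):
    """Decode consecutive items into a flat list of (type, value) pairs."""
    items, pos = [], 0
    while pos < len(data):
        code, len_bytes = data[pos] >> 2, data[pos] & 0x03
        length = int.from_bytes(data[pos + 1:pos + 1 + len_bytes], "big")
        pos += 1 + len_bytes
        if code == 0o00:                       # list: length counts elements
            items.append(("LIST", length))
            continue
        body = data[pos:pos + length]
        pos += length
        if code == 0o20:
            items.append(("ASCII", body.decode("ascii")))
        elif code == 0o10:
            items.append(("BINARY", body))
        elif code in NUMERIC:
            name, fmt = NUMERIC[code]
            size = struct.calcsize(">" + fmt)
            # Multi-byte values are big-endian (most significant byte first).
            values = struct.unpack(f">{length // size}{fmt}", body)
            items.append((name, list(values)))
        else:
            raise ValueError(f"unsupported format code {code:#o} in this sketch")
    return items


# Example 5 above: a list of a binary item, a 1-byte integer and an ASCII string.
example5 = bytes.fromhex("0103210104650111410754312048494748")
for item in decode_items(example5):
    print(item)
```

Running the decoder on example 5 yields a list header with 3 elements, the binary byte 0x04, the integer 17 and the string "T1 HIGH", in agreement with the byte-by-byte annotation.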

Further, the specific process of step 4 is as follows:

step 401: a two-dimensional probability function is designed for the comprehensive evaluation of two dimensions: A, the proportion of the PAYLOAD that can be judged to be SECS2 data; B, the weight obtained during the identification process. Because of the characteristics of these two dimensions, the following requirements are placed on the probability function:

assuming that values of A and B are Pa and Pb respectively, then:

P=k(Pa*Pb)/|Pa-Pb|
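As a numerical illustration of this formula, the following Python sketch evaluates P for two pairs of values. The function name `combined_score` and the value k = 0.1 are assumptions made only for the example; the source does not give a concrete k or threshold.

```python
def combined_score(pa, pb, k=1.0):
    """Evaluate P = k * (Pa * Pb) / |Pa - Pb|; undefined when Pa == Pb."""
    if pa == pb:
        raise ValueError("P is undefined for Pa == Pb")
    return k * (pa * pb) / abs(pa - pb)


# Both dimensions high and close together: large P.
close = combined_score(0.9, 0.8, k=0.1)
# One dimension high, the other low: much smaller P.
apart = combined_score(0.9, 0.2, k=0.1)
print(close, apart)
```

With the same k, the pair whose values are both high and close together scores far higher than the pair with a large gap, which is the behaviour the requirements ask for.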

An example of the identification process is given below.

A data packet is intercepted on the network. Assuming that the session link to which it belongs has not yet been determined to be a SECSII data session, HSMS header detection is performed on the packet, and no HSMS header is found. The packet is then preprocessed and its data extracted, giving the following byte array: raw = "0104410d5050315f466f726d6174746564410645714d444c4e4106312e312e303101030102a90200620102a50101410554657374310102a90200010102a50101a501020102a90200020109a50100a9020001b10400000001650164690203e87104000f424091043 dcccd 2501004132413132333435363738395f3132333435363738395f3132333435363738395f3132333435363738395f 233342333536373839".

With the identification method of the invention, a possible entry point is searched for first. The first byte, 01 (00000001), meets the entry-point requirement: its bits 3-8 are (000000), which according to Fig. 2 means that the following data is of type LIST, and its bits 1-2 are (01), which means that the data length is given by the next 1 byte; for the LIST type, this byte gives the number of elements in the LIST. Assuming the entry point is correct, the next byte is extracted and is 04, meaning that the LIST contains 4 elements. The next byte is 41 (01000001); its bits 3-8 are (010000), so the data type is ASCII, and the following 1 byte gives the ASCII data length. That byte is 0d, so the ASCII data is 13 bytes long. These 13 bytes are skipped and extraction continues; the next byte is again 41, indicating ASCII data, and so on. The resulting jump identification process is as follows:

Step	Data type	Number of length bytes	Jump length	Entry point
1	LIST	1		Yes
2	ASCII	1	13
3	ASCII	1	6
4	ASCII	1	6
5	LIST	1
6	LIST	1
7	U2	1	2
8	LIST	1
9	U1	1	1
10	ASCII	1	5
11	LIST	1
12	U2	1	2
13	LIST	1
14	U1	1	1
15	U1	1	1
16	LIST	1
17	U2	1	2
18	LIST	1
19	U1	1	1
20	U2	1	2
21	U4	1	4
22	I1	1	1
23	I2	1	2
24	I4	1	4
25	F4	1	4
26	BOOLEAN	1	1
27	ASCII	1	50

After the scan jumps past the end of the data packet, the result is fed back: the total length of the packet is 162 bytes, all 162 bytes are determined to be SECSII data, 19 jumps were made (one jump per 8.53 bytes on average), and the proportion of SECSII data is 100%. The decision for this packet is therefore that it is certainly a SECSII packet.
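The jump identification in this example can be reproduced with the following Python sketch. It is an illustration under stated assumptions, not the patented implementation: `ENUM`, `jump_scan` and `identify` are hypothetical names, the format codes are those of SEMI E5, the weight calculation is omitted, and a format byte whose length bytes would cross the packet boundary is treated as a rule violation. To stay on unambiguous input, the sketch runs on the first 101 bytes of raw only (steps 1 to 24 of the table).

```python
# SECS-II format codes (SEMI E5), keyed by the value of bits 3-8 in octal.
ENUM = {
    0o00: "LIST", 0o10: "BINARY", 0o11: "BOOLEAN", 0o20: "ASCII",
    0o21: "JIS8", 0o22: "C2", 0o30: "I8", 0o31: "I1", 0o32: "I2",
    0o34: "I4", 0o40: "F8", 0o44: "F4", 0o50: "U8", 0o51: "U1",
    0o52: "U2", 0o54: "U4",
}


def jump_scan(payload, start):
    """Follow items from `start` to the boundary; return the steps, or None on a violation."""
    steps, pos = [], start
    while pos < len(payload):
        code, n_len = payload[pos] >> 2, payload[pos] & 0b11
        if code not in ENUM or n_len == 0 or pos + 1 + n_len > len(payload):
            return None                       # not an enumerated value here
        length = int.from_bytes(payload[pos + 1:pos + 1 + n_len], "big")
        pos += 1 + n_len
        if code == 0o00:
            steps.append(("LIST", n_len, None))   # element count, nothing to skip
        else:
            steps.append((ENUM[code], n_len, length))
            pos += length                     # jump over the item body
    return steps


def identify(payload):
    """Try each offset as entry point; report the first one that scans cleanly."""
    for start in range(len(payload)):
        steps = jump_scan(payload, start)
        if steps is not None:
            jumps = sum(1 for s in steps if s[2] is not None)
            return {"entry": start, "steps": steps, "jumps": jumps,
                    "proportion": (len(payload) - start) / len(payload)}
    return None


# First 101 bytes of the raw array above (steps 1-24 of the table).
PREFIX = bytes.fromhex(
    "0104" "410d5050315f466f726d6174746564" "410645714d444c4e"
    "4106312e312e3031" "0103" "0102" "a9020062" "0102" "a50101"
    "41055465737431" "0102" "a9020001" "0102" "a50101" "a50102"
    "0102" "a9020002" "0109" "a50100" "a9020001" "b10400000001"
    "650164" "690203e8" "7104000f4240"
)

result = identify(PREFIX)
print(result["entry"], len(result["steps"]), result["jumps"], result["proportion"])
```

On this prefix the scan starts at offset 0, takes 24 steps of which 16 are jumps, and covers 100% of the bytes, matching the first 24 rows of the table.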

The invention improves and innovates on several existing traffic identification methods. It identifies data packets on the basis of the special data structure of SECSII data, so that SECS2 data packets whose HSMS header information has been lost can still be identified accurately. Efficiency is also taken into account: the identification process is designed around the SECS2 data structure, which ensures both the accuracy and the efficiency of identification.

Claims

1. A method for identifying SECS2 data packets whose HSMS header information is lost, characterized by comprising the following specific steps:

step 1: using a HashMap to store session connection information and state; reading in an unknown packet and searching the HashMap for information on its connection; if none exists, creating a new node and storing the connection; if it exists, checking its state: the subsequent data packet is determined to be an SECS2 data packet in either of the following situations: a previous data packet carried an HSMS header and the connection has not been closed; or a previous data packet has been determined to be an SECS2 data segment and the connection has not been closed; otherwise, entering the next step;

step 2: performing a preliminary check on the unknown packet to judge whether it has an HSMS header; if so, directly judging it to be an SECS2 data packet, recording this in the HashMap, and jumping directly to the result output step; otherwise, entering step 3 for judgment;

step 3: judging the format of the SECSII data packet;

intercepting the PAYLOAD of the unknown data packet, scanning it byte by byte, searching for the first enumerated value as the judgment entry point, extracting length information from it, and skipping that length to perform the next round of enumerated-value matching, until the boundary is reached or overshot; if a byte encountered during matching is not an enumerated value that should appear there, the previous entry-point byte was wrong, and the scan returns to that point and continues searching; after matching, feeding back the weight value and entering step 4; the specific process is as follows:

step 302: scanning the PAYLOAD of the data packet byte by byte, searching for the first byte that may be a type field in SECSII data, i.e. a byte whose value belongs to the set of enumerated values, and extracting length and type information from that byte;

step 304: if the enumerated-value rule is still satisfied when the scan jumps to or beyond the boundary, the entry point is considered to have been found correctly; if no data segment conforming to the set SECS2 rules has been found when the scan jumps to or beyond the boundary, the data is considered not to belong to SECS2 data; proportion statistics are gathered for the data conforming to SECS2, the weight is calculated, and the process proceeds to step 4;

the specific method for finding the entry-point enumerated value and performing jump identification is as follows: scanning the PAYLOAD portion of the packet byte by byte; extracting bits 3-8 of each byte and, if they match an enumerated value, taking that byte as an entry point and extracting bits 1-2 of the byte as the number of length bytes L, i.e. the following L bytes give the length of the data; if the enumerated value indicates ASCII data, performing type identification on the data of that length, and otherwise skipping the data without identification; when ASCII type identification succeeds, giving a higher weight in the evaluation;

2. The method according to claim 1, characterized in that in step 1:

said using the HashMap to store session link information and state comprises: designing a data structure for the HashMap node, which contains the following five-tuple of the data packet: source IP address, source port number, destination IP address, destination port number and session ID, which are combined by two-byte XOR to form the HashKey;

wherein, the session state information is defined as follows:

the session state information may be as follows:

among the above states, a link in session state (1) needs to go through the subsequent determination process; for links in session states (2) and (3), subsequent data packets are directly determined to be SECS2 data packets, and the node is deleted when session closing information is received.
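The session bookkeeping of this claim might be sketched as follows in Python. Everything named here is hypothetical: the claim text does not name states (1) to (3), so `SessionState` uses illustrative labels inferred from step 1, and `hash_key` is one possible reading of "two-byte XOR" (folding each field into 16-bit words). IPv4 addresses are assumed, and a bucket per key is added to tolerate collisions of the 16-bit key.

```python
import ipaddress
from enum import Enum


class SessionState(Enum):        # hypothetical labels for states (1)-(3)
    UNDETERMINED = 1             # still needs the identification process
    HSMS_HEADER_SEEN = 2         # an earlier packet carried an HSMS header
    SECS2_CONFIRMED = 3          # an earlier packet was judged SECS2 data


def hash_key(src_ip, src_port, dst_ip, dst_port, session_id):
    """Fold the five fields into 16 bits by XOR-ing them two bytes at a time."""
    key = src_port ^ dst_port ^ session_id
    for ip in (src_ip, dst_ip):
        packed = ipaddress.IPv4Address(ip).packed
        key ^= int.from_bytes(packed[:2], "big") ^ int.from_bytes(packed[2:], "big")
    return key & 0xFFFF


class SessionTable:
    def __init__(self):
        self._buckets = {}       # hash key -> {normalized five-tuple: state}

    @staticmethod
    def _normalize(five_tuple):
        # Order the two endpoints so both directions share one node.
        src_ip, src_port, dst_ip, dst_port, session_id = five_tuple
        a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
        return a + b + (session_id,)

    def state(self, five_tuple):
        """Return the stored state, creating an UNDETERMINED node if absent."""
        bucket = self._buckets.setdefault(hash_key(*five_tuple), {})
        return bucket.setdefault(self._normalize(five_tuple),
                                 SessionState.UNDETERMINED)

    def update(self, five_tuple, state):
        bucket = self._buckets.setdefault(hash_key(*five_tuple), {})
        bucket[self._normalize(five_tuple)] = state

    def close(self, five_tuple):
        """Delete the node when session-closing information is received."""
        bucket = self._buckets.get(hash_key(*five_tuple), {})
        bucket.pop(self._normalize(five_tuple), None)

    def needs_identification(self, five_tuple):
        return self.state(five_tuple) is SessionState.UNDETERMINED
```

Because XOR is order-independent, both directions of a connection produce the same key, which is why the endpoints are also ordered before being used inside the bucket.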

3. The method according to claim 2, wherein in step 2, said judging whether the unknown packet has an HSMS header is performed according to the HSMS message format, in which the message header occupies 10 bytes and includes 6 parts of information, namely: Session ID, Header Byte 2, Header Byte 3, PType, SType and System Bytes; for data messages, the Stream and Function information is contained in Header Byte 2 and Header Byte 3 of the message header, and the difference between control messages and data messages is reflected in the Session ID and the SType.
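A header check of this kind might look like the following Python sketch. The claim does not spell out the exact tests, so the checks here are assumptions: the 4-byte message length that precedes the 10-byte header, PType 0 for SECS-II encoding and the set of SType values are taken from SEMI E37 (HSMS), and `parse_hsms_header` is a hypothetical name.

```python
import struct

# SType values defined by SEMI E37: 0 data, 1/2 Select, 3/4 Deselect,
# 5/6 Linktest, 7 Reject, 9 Separate.
HSMS_STYPES = {0, 1, 2, 3, 4, 5, 6, 7, 9}


def parse_hsms_header(payload):
    """Return the header fields if payload starts with a plausible HSMS message, else None."""
    if len(payload) < 14:                 # 4-byte length + 10-byte header
        return None
    length, session_id, byte2, byte3, ptype, stype, system_bytes = struct.unpack(
        ">IHBBBBI", payload[:14])
    # Assumed plausibility checks: length covers the header, PType is
    # SECS-II (0), and SType is one of the defined message types.
    if length < 10 or ptype != 0 or stype not in HSMS_STYPES:
        return None
    fields = {"length": length, "session_id": session_id,
              "stype": stype, "system_bytes": system_bytes}
    if stype == 0:                        # data message: Stream and Function
        fields["w_bit"] = bool(byte2 & 0x80)
        fields["stream"] = byte2 & 0x7F
        fields["function"] = byte3
    return fields


# A header-only S1F1 data message with the W-bit set (illustrative values).
s1f1 = struct.pack(">IHBBBBI", 10, 1, 0x81, 1, 0, 0, 7)
print(parse_hsms_header(s1f1))
```

A packet for which this returns `None` would go on to the SECS2 format judgment of step 3.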

4. The method according to claim 3, wherein the specific process of step 4 is as follows:

step 401: designing a two-dimensional probability function that comprehensively evaluates, from two dimensions, the probability that a data packet is an SECS2 data packet; wherein dimension A is the proportion of the PAYLOAD judged to be SECS2 data, and dimension B is the weight obtained during the identification process; based on the characteristics of these two dimensions, the following requirements are placed on the probability function:

when A is small and B is large, the probability should not exceed the judgment threshold, which excludes the case in which an overly long packet causes a lot of irrelevant data to be misjudged as SECS2 data and so shows a higher weight;

in the above two cases, the larger the difference between A and B, the lower the probability should be; the smaller the difference, within a certain range, the higher the probability;

when A is large and B is large, it should be judged as a high probability SECS2 packet;

assuming that values of A and B are Pa and Pb respectively, then:

P=k(Pa*Pb)/|Pa-Pb|

wherein Pa ≠ Pb, and the value of k should be chosen with reference to the thresholds of A and B so that the result P falls within a suitable interval;

step 402: combining the proportion statistical result and the weight calculation result in the step 304, and judging the probability of whether an unknown data packet is an SECS2 data packet according to the two-dimensional probability function set in the step 401;

step 403: outputting the judgment result of step 402; the result is returned in the form of a function return value, one judgment result being returned for each data packet processed in the loop; depending on whether processing of the traffic data has finished, the remaining unprocessed data packets continue to be read and processed in the loop; after processing is finished, the HashMap node information is updated according to the result, which facilitates subsequent identification.