CN114024748B

CN114024748B - Efficient Ethereum traffic identification method combining an active node library and machine learning

Info

Publication number: CN114024748B
Application number: CN202111302612.2A
Authority: CN
Inventors: 胡晓艳; 舒卓卓; 童钟奇; 程光; 吴桦; 龚俭
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2021-11-04
Filing date: 2021-11-04
Publication date: 2024-04-30
Anticipated expiration: 2041-11-04
Also published as: CN114024748A

Abstract

The invention provides an efficient Ethereum traffic identification method combining an active node library and machine learning. The method has four parts. The first part is the construction of the active node library. The second part is the training of the identification model. The third part compares different machine learning algorithms and selects, as the identification model, the trained model of the algorithm best suited to classification. The fourth part is Ethereum traffic identification: traffic is screened by the active node library, divided into TCP and UDP traffic, and input into the identification model, while the node information in the Ethereum active node library is updated according to the identification result. The invention can effectively identify the Ethereum traffic present in the current network, with an identification accuracy of 99%, which makes it convenient for network administrators to monitor Ethereum network traffic.

Description

Efficient Ethereum traffic identification method combining an active node library and machine learning

Technical Field

The invention belongs to the technical field of cyberspace security, and relates to an efficient Ethereum traffic identification method combining an active node library and machine learning.

Background

A blockchain is a distributed ledger technology maintained jointly by multiple parties, which uses cryptography to secure transmission and access. It keeps the data in the ledger consistently stored, hard to tamper with, and non-repudiable. Blockchain technology offers a new approach to the trust, security, and efficiency problems of the Internet, and also brings new opportunities and challenges to industries such as finance.

Since blockchain technology was first proposed by Satoshi Nakamoto, blockchain industries of all kinds, including cryptocurrencies such as Bitcoin and Ethereum, have developed rapidly. According to statistics from the China Electronic Information Industry Development Institute, the domestic blockchain industry reached 4.85 billion yuan in 2020, a growth rate of 48.5 percent over the previous year. As the industry has grown, potential security and supervision problems in blockchains have also been exposed. First, blockchain digital currency provides a safe and stable channel for crimes such as money laundering and ransomware, and has greatly promoted the development of the dark web and the underground economy. Second, blockchain digital currency makes cross-border money transfers simpler, which affects the stability of national financial markets. Finally, because blockchains are decentralized and tamper-resistant, they are often used to store and spread sensitive information, which seriously harms the health of the network ecosystem. The abuse of blockchains not only jeopardizes national security and social stability but also poses great threats and challenges to network security supervision.

As a representative blockchain application, Bitcoin implements blockchain application development with a scripting engine. Bitcoin is therefore limited by the expressive power of its scripting language and struggles to support complex contract development, which greatly limits its capabilities. Ethereum, built on the Ethereum Virtual Machine (EVM), abstracts the blockchain system into a transaction-based state machine and supports recording arbitrary information and executing arbitrary functions in a Turing-complete programming language. In the 23rd issue of the Global Public Chain Technology Evaluation Index released by the China Electronic Information Industry Development Institute, Ethereum ranked first in applicability among 37 public chains. Compared with other blockchain implementations such as Bitcoin, Ethereum better supports the development of distributed blockchain applications and has greater research value and significance.

However, blockchain supervision technology lags behind the rapid development of the blockchain industry. Existing research on blockchain security mostly explores the blockchain technology itself, such as blockchain attack methods, design vulnerabilities, and application directions, and lacks analysis of blockchain security at the level of network traffic supervision. Ethereum is the most widely applicable blockchain platform, and it will continue to develop as blockchain technology matures. Measuring and analyzing Ethereum network traffic and exploring Ethereum security supervision schemes are therefore of great significance for the security of the Ethereum network and of blockchain networks in general.

Therefore, the invention collects the Ethereum traffic in the network by constructing a library of the active Ethereum nodes in the network. The traffic is then divided into TCP and UDP traffic, each with its own identification features. Finally, a random forest algorithm is used to distinguish Ethereum traffic from normal traffic.

Disclosure of Invention

To monitor Ethereum effectively and identify Ethereum traffic, the invention provides an efficient Ethereum traffic identification method combining an active node library and machine learning, which addresses the concealment of Ethereum traffic characteristics. First, exploiting the inherent "small world" property of the Ethereum network, an Ethereum core node library is used to initialize the active node library; the active node library is then built up from the core node library and contains the Ethereum nodes that are currently active. Next, features are extracted separately for the UDP-based Ethereum node discovery process and the TCP-based Ethereum data transmission process, so that Ethereum traffic can be identified by machine learning. Finally, using the selected features and the trained model, traffic is filtered through the active node library and input into the identification model to complete the identification of Ethereum traffic. To achieve the above purpose, the invention provides the following technical solution:

An efficient Ethereum traffic identification method combining an active node library and machine learning, comprising the following steps:

(1) Based on the assumption that the total number of Ethereum nodes in the supervision area tends to converge, the active node library stores the Ethereum node information of the current area. Ethereum core node information is collected, the active node library is initialized, and Ethereum traffic is acquired.

(2) Flow features are selected separately for Ethereum UDP traffic and TCP traffic, and additional flow features are extracted from the characteristics of the Ethereum NDP protocol and RLPx as a supplement.

(3) Accurate identification of Ethereum traffic is realized by a machine learning method, and a data set is constructed to test and evaluate the resulting model.

(4) Based on the constructed active node library and the obtained identification model, the input traffic is identified and the active node library is updated accordingly.

Further, the step (1) specifically includes the following sub-steps:

(1.1) acquiring all currently public Ethereum core node information through a web crawler, and storing it in a core node library in the form of IP addresses;

(1.2) initializing the active node library from the information collected in the core node library;

(1.3) dynamically updating the known node information using the information of the nodes in the active node library, so as to obtain the information of the Ethereum nodes in the whole supervision area;

(1.4) setting an expiration time and removing Ethereum nodes that have been inactive for a long time from the active node library, which keeps the library current and improves the efficiency of traffic screening;

(1.5) modifying NodeFinder, a tool for detecting Ethereum nodes, so that it communicates with the detected Ethereum nodes;

(1.7) capturing Ethereum traffic on the intermediate router.
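The library maintenance in sub-steps (1.2) to (1.4) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `ActiveNodeLibrary`, its method names, the IP-to-timestamp layout, and the example addresses and timings are all assumptions introduced here.

```python
import time


class ActiveNodeLibrary:
    """Hypothetical store: IP address -> time the node was last seen active."""

    def __init__(self, core_node_ips, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self.ttl_seconds = ttl_seconds
        # Sub-step (1.2): seed the library from the core node library.
        self.last_seen = {ip: now for ip in core_node_ips}

    def touch(self, ip, now=None):
        """Sub-step (1.3): add a node or refresh its last-seen time."""
        self.last_seen[ip] = time.time() if now is None else now

    def evict_expired(self, now=None):
        """Sub-step (1.4): drop nodes inactive for longer than the TTL."""
        now = time.time() if now is None else now
        expired = [ip for ip, seen in self.last_seen.items()
                   if now - seen > self.ttl_seconds]
        for ip in expired:
            del self.last_seen[ip]
        return expired

    def __contains__(self, ip):
        return ip in self.last_seen


# Example with made-up documentation addresses and timestamps in seconds.
library = ActiveNodeLibrary(["192.0.2.1", "192.0.2.2"], ttl_seconds=3600, now=0)
library.touch("198.51.100.7", now=3000)    # newly discovered node
library.touch("192.0.2.1", now=3500)       # core node seen again
removed = library.evict_expired(now=4000)  # 192.0.2.2 has been idle for 4000 s
```

Passing `now` explicitly keeps the example deterministic; a deployment would presumably rely on the wall clock instead.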

Further, the step (2) specifically includes the following sub-steps:

(2.1) using mutual information as the measure of feature relevance, and selecting, for Ethereum UDP traffic and TCP traffic respectively, the 10 features with the highest mutual information values;

(2.2) analyzing the packet structure of Ethereum UDP traffic to obtain UDP traffic features;

(2.3) analyzing the encryption handshake (ENCHANDSHAKE) procedure of the RLPx protocol to obtain Ethereum TCP traffic features.
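The mutual-information ranking in sub-step (2.1) can be illustrated with a small sketch. It assumes the features have already been discretized into bins, since mutual information over raw continuous values needs an estimator the text does not specify. The function names, feature names, and toy values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def top_k_features(columns, labels, k=10):
    """Rank feature columns (name -> discrete values) by MI with the labels."""
    scores = {name: mutual_information(values, labels)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: label 1 = Ethereum flow, label 0 = background flow.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "pkt_len_bin": [3, 3, 3, 1, 1, 1],  # separates the two classes exactly
    "iat_bin":     [0, 1, 0, 1, 0, 1],  # only weakly related to the label
    "flag_count":  [2, 2, 1, 1, 1, 1],  # partially informative
}
ranking = top_k_features(columns, labels, k=2)
```

With `k=10` and the candidate flow statistics as columns, the same call would return the ten highest-scoring features for one protocol; it would be run once for UDP flows and once for TCP flows.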

Further, the selected features of Ethereum TCP and UDP traffic are shown in Table 1 and Table 2.

Further, the step (3) specifically includes the following sub-steps:

(3.1) combining the acquired Ethereum traffic with the various application traffic in the public data set VPN-nonVPN to form the data set ETI required by the experiment;

(3.2) dividing the data set into a training set and a test set at a ratio of 8:2, and evaluating four machine learning algorithms (support vector machine, random forest, logistic regression, and K-nearest neighbor) on multiple metrics.
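The 8:2 split and the multi-metric evaluation in sub-step (3.2) can be sketched as below. The text names only "multiple metrics", so the choice of accuracy, precision, recall, and F1 is an assumption, as are the function names and the toy labels. Model training itself is omitted.

```python
import random


def split_8_2(samples, seed=0):
    """Shuffle a copy of the samples and split it 8:2 into train and test."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Toy usage: 100 placeholder samples, then made-up predictions for 5 flows.
train, test = split_8_2(list(range(100)))
scores = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Each of the four candidate algorithms would be trained on `train`, scored on `test` with `evaluate`, and compared on the resulting dictionaries.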

Further, the step (4) specifically includes the following sub-steps:

(4.1) screening unidentified traffic through the active node library, inputting it into the identification model, and outputting whether the traffic is Ethereum traffic;

(4.2) updating the node information of the Ethereum active node library according to the identification result.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) The invention can effectively identify the Ethereum traffic present in the current network, with an identification accuracy of 99%, which makes it convenient for network administrators to monitor Ethereum network traffic.

(2) The invention separates TCP and UDP flows and analyzes the packet structure of each, so as to obtain the features best suited to classification; combined with mutual-information-based feature selection, this effectively improves monitoring accuracy.

(3) The invention constructs an Ethereum active node library and uses it to screen out potential Ethereum traffic. Compared with traffic detection without filtering by the active node library, metrics such as detection accuracy and precision improve by 3% on average, and the time needed to detect Ethereum traffic for the same number of samples is less than 50% of the time needed without the filter.

(4) Screening traffic through the Ethereum active node library effectively avoids negative effects on identification performance.

Drawings

FIG. 1 is a schematic diagram of an experimental environment setup;

FIG. 2 is a schematic diagram of an identification framework;

FIG. 3 shows the performance of different machine learning algorithms on various performance metrics, before and after screening with the active node library, for the identification of UDP flows;

FIG. 4 shows the performance of different machine learning algorithms on various performance metrics, before and after screening with the active node library, for the identification of UDP flows;

FIG. 5 is a schematic diagram of identification time consumption, where (a) shows UDP traffic identification efficiency and (b) shows TCP traffic identification efficiency.

Detailed Description

The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.

Example 1: the invention provides an efficient Ethereum traffic identification method combining an active node library and machine learning. The identification framework is shown in FIG. 2 and is divided into four parts. The first part is the construction of the active node library: the core node library stores information on the core nodes that keep Ethereum running stably, and this information is used to initialize the active node library, which is then built up through operations such as searching for, adding, and deleting active nodes. A traffic collection unit is then deployed through the active node library to collect Ethereum traffic. The second part is the training of the identification model: the Ethereum traffic is divided into TCP and UDP traffic, the packet structure of each is analyzed, the Ethereum traffic identification features best suited to classification are obtained and verified on actual data, and mutual information is used as the metric for feature screening. After feature selection, the previously acquired Ethereum traffic and background traffic are used as the data set and divided into a training set and a test set. The third part compares different machine learning algorithms and selects, as the identification model, the trained model of the algorithm best suited to classification. The fourth part is Ethereum traffic identification: traffic is screened by the active node library, divided into TCP and UDP traffic, and input into the identification model for identification.

Specifically, the method comprises the following steps:

(1) Constructing an active node library, and building an experimental environment to collect the relevant Ethereum traffic.

The specific process of the step is as follows:

(1.1) acquiring all currently public Ethereum core node information with a web crawler, and storing it in a core node library in the form of IP addresses;

(1.2) initializing the active node library with the information from the core node library, continuously searching for the Ethereum nodes that are currently active, and dynamically updating the node information of the active node library;

(1.3) setting an expiration time for each active node and removing Ethereum nodes that have been inactive for a long time, which keeps the active node library current;

(1.4) modifying NodeFinder, a tool for detecting Ethereum nodes, so that it communicates with the detected Ethereum nodes;

(1.5) capturing Ethereum traffic on the intermediate router with Wireshark;

(1.6) using the various application traffic in the public data set VPN-nonVPN as background traffic.

(2) Dividing the original Ethereum traffic into TCP and UDP traffic, analyzing the packet structure of each to extract features usable for traffic identification and classification, performing feature selection with mutual information, and retaining the features usable for identification and classification.

The specific process in the step is as follows:

(2.1) dividing the original Ethereum traffic into TCP and UDP traffic;

(2.2) screening features with the mutual information metric, starting from the 80 common traffic statistics proposed by Draper et al., and selecting the ten features with the highest mutual information for TCP flows and for UDP flows respectively.

(2.3) Analyzing the packet structure of Ethereum UDP traffic: the packet lengths in a UDP flow follow a strict order, and each packet type has its own distinct and stable length distribution. The lengths of the first eight packets in the UDP flow are therefore extracted as features.

(2.3) The names and meanings of the 18 features extracted for Ethereum UDP traffic are shown in Table 3.

(2.4) Analyzing the Ethereum TCP interaction procedure shows that an Ethereum TCP flow contains many packets with equal payload lengths. The payloads of packets carrying headers typically appear as a combination of 32 B, 1 B, and 12 B packets. The features are therefore the average payload length of the two packets in the encryption handshake phase and the proportion of packets with payload lengths of 32 B, 1 B, and 12 B among all packets.

(2.5) The names and meanings of the 12 features extracted for Ethereum TCP traffic are shown in Table 4.
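The protocol-specific features of steps (2.3) and (2.4) can be sketched as two small extractors. Several details are assumptions not stated in the text: short UDP flows are zero-padded to eight values, the handshake packets are taken to be the first two payload-carrying packets of the TCP flow, and the function names and sample lengths are invented for illustration.

```python
def udp_length_features(packet_lengths, n=8):
    """Lengths of the first n packets of a UDP flow, zero-padded if shorter."""
    head = list(packet_lengths[:n])
    return head + [0] * (n - len(head))


def tcp_payload_features(payload_lengths, handshake_packets=2,
                         marker_lengths=(32, 1, 12)):
    """Mean handshake payload length and share of 32 B / 1 B / 12 B payloads."""
    handshake = payload_lengths[:handshake_packets]
    mean_handshake = sum(handshake) / len(handshake) if handshake else 0.0
    marked = sum(1 for length in payload_lengths if length in marker_lengths)
    ratio = marked / len(payload_lengths) if payload_lengths else 0.0
    return mean_handshake, ratio


# Made-up lengths in bytes, for illustration only.
udp_features = udp_length_features([10, 20, 30])
tcp_features = tcp_payload_features([300, 200, 32, 1, 12, 32, 500, 12])
```

The eight UDP values and the two TCP values would be appended to the ten mutual-information features of the respective protocol, which matches the totals of 18 and 12 features given above.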

(3) After feature selection, the previously acquired Ethereum traffic and background traffic are used as the data set and divided into a training set and a test set. Different machine learning algorithms are compared, and the trained model of the algorithm best suited to classification is selected as the identification model.

The specific process in the step is as follows:

(3.1) An Ethereum traffic data set is constructed from the data collected in step (1) and divided into a training set and a test set at a ratio of 8:2. By comparing metrics such as the accuracy of models including random forest, K-nearest neighbor, and naive Bayes, the random forest algorithm, which has the highest accuracy, is selected. The identification results before and after screening the traffic with the active node library are also compared. Identification accuracy after screening with the active node library is 3% higher than without it; the detailed results are shown in FIG. 3 and FIG. 4.

(3.2) The time consumed by Ethereum traffic identification combining the active node library with machine learning is compared with that of the traditional detection method. The combined method reduces the time consumed by more than 50%; the detailed results are shown in FIG. 5.

(4) Traffic is screened by the active node library, divided into TCP and UDP traffic, and input into the identification model for identification, and the information in the Ethereum active node library is updated according to the identification result.

The method specifically comprises the following steps:

(4.1) The source and destination IP addresses are extracted from the traffic to be detected and looked up in the active node library to determine whether the library contains them.

(4.2) If an IP address is contained, the traffic is divided into TCP and UDP traffic, the relevant features are extracted, and each kind of traffic is input into the corresponding identification model obtained in step (3) for identification.

(4.3) If the traffic is identified as Ethereum traffic and neither the source nor the destination IP address is in the active node library, the relevant IP address information is added to the active node library as a new active node.

(4.4) An active time is set for each active node, and a node that does not respond within the active time is deleted from the active node library.
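Steps (4.1) to (4.4) can be read as the following control flow. This is a hedged sketch: the flow dictionary layout, the function names, and the placeholder models are hypothetical. Because the screening in (4.1) already requires one endpoint to be known, the sketch records both endpoints of identified traffic, which is only one possible reading of step (4.3).

```python
def identify_flow(flow, active_nodes, models, now):
    """Screen one flow against the library, classify it, update the library.

    flow: dict with 'src', 'dst', 'proto' ('TCP' or 'UDP'), and 'features'.
    active_nodes: dict mapping IP address -> last-active timestamp.
    models: dict mapping 'TCP'/'UDP' -> callable(features) -> bool.
    """
    endpoints = (flow["src"], flow["dst"])
    # (4.1) Flows that touch no known active node are not examined further.
    if not any(ip in active_nodes for ip in endpoints):
        return False
    # (4.2) Dispatch to the protocol-specific identification model.
    is_ethereum = models[flow["proto"]](flow["features"])
    # (4.3) Assumed reading: record both endpoints of identified traffic.
    if is_ethereum:
        for ip in endpoints:
            active_nodes[ip] = now
    return is_ethereum


def expire_nodes(active_nodes, now, active_time):
    """(4.4) Remove nodes that have been silent for longer than active_time."""
    stale = [ip for ip, seen in active_nodes.items() if now - seen > active_time]
    for ip in stale:
        del active_nodes[ip]


# Placeholder models stand in for the trained classifiers of step (3).
models = {"TCP": lambda f: f["marker_ratio"] > 0.5, "UDP": lambda f: False}
nodes = {"192.0.2.1": 0, "192.0.2.2": 0}
hit = identify_flow({"src": "203.0.113.9", "dst": "192.0.2.1", "proto": "TCP",
                     "features": {"marker_ratio": 0.6}}, nodes, models, now=100)
miss = identify_flow({"src": "203.0.113.50", "dst": "203.0.113.51",
                      "proto": "UDP", "features": {}}, nodes, models, now=100)
expire_nodes(nodes, now=3650, active_time=3600)
```

Screening before classification is what saves time: the second flow is rejected by a dictionary lookup without ever reaching a model.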

The technical means disclosed by the scheme of the invention is not limited to the technical means disclosed by the embodiment, and also comprises the technical scheme formed by any combination of the technical features. It should be noted that modifications and adaptations to the invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims

1. An efficient Ethereum traffic identification method combining an active node library and machine learning, characterized by comprising the following steps:

(1) Based on the assumption that the total number of Ethereum nodes in the monitoring area tends to converge, storing the Ethereum node information of the current area in the active node library, and collecting Ethereum core node information to initialize the active node library so as to acquire Ethereum traffic;

(2) Selecting flow features separately for Ethereum UDP traffic and TCP traffic, and extracting additional flow features from the characteristics of the Ethereum NDP protocol and RLPx as a supplement;

(3) Realizing accurate identification of Ethereum traffic by a machine learning method, constructing a data set, and testing and evaluating the resulting model;

(4) Based on the constructed active node library and the obtained identification model, identifying the input traffic,

Step (1) collects Ethereum core node information to initialize the active node library and acquires Ethereum traffic; it specifically comprises the following sub-steps:

(1.4) setting an expiration time and removing Ethereum nodes that have been inactive for a long time from the active node library,

(1.7) capturing Ethereum traffic on the intermediate router;

Wherein, the step (2) specifically comprises the following sub-steps:

(2.1) analyzing the packet structure of Ethereum UDP traffic to obtain UDP traffic features;

(2.2) analyzing the encryption handshake procedure of the RLPx protocol to obtain Ethereum TCP traffic features;

(2.3) using mutual information as the measure of feature relevance, selecting, for Ethereum UDP traffic and TCP traffic respectively, the 10 features with the highest mutual information values, and adding the features obtained in steps (2.1) and (2.2) to form the final selected features;

wherein, the step (3) specifically comprises the following sub-steps:

(3.1) combining the acquired Ethereum traffic with the various application traffic in the public data set VPN-nonVPN to form the data set required by the experiment;

(3.2) dividing the data set into a training set and a test set at a ratio of 8:2, and evaluating four machine learning algorithms (support vector machine, random forest, logistic regression, and K-nearest neighbor) on multiple metrics;

the step (4) specifically comprises the following sub-steps:

(4.1) screening unidentified traffic through the active node library, inputting it into the identification model, and outputting whether the traffic is Ethereum traffic,