CN115514720A

CN115514720A - Programmable data plane-oriented user activity classification method and application

Info

Publication number: CN115514720A
Application number: CN202211135710.6A
Authority: CN
Inventors: 章玥; 朱信宇; 蒲戈光
Original assignee: Shanghai Industrial Control Safety Innovation Technology Co ltd; East China Normal University
Current assignee: Shanghai Industrial Control Safety Innovation Technology Co ltd; East China Normal University
Priority date: 2022-09-19
Filing date: 2022-09-19
Publication date: 2022-12-23
Anticipated expiration: 2042-09-19
Also published as: CN115514720B

Abstract

The invention discloses a programmable data plane-oriented user activity classification method built on three planes: a learning plane, a control plane, and a data plane. The method classifies user activities directly on the data plane: a suitable machine learning model is implemented in the P4 language and deployed to a programmable switch. The switch parses each passing packet, extracts network flow feature data, and stores the data in a storage structure composed of a sketch and a hash table. A machine learning classifier then analyzes the network flow features collected over a time interval, determines the user activity in that interval, and uploads the classification result.

Description

Programmable data plane-oriented user activity classification method and application

Technical Field

The invention belongs to the technical field of programmable data planes, machine learning and computer networks, and relates to a user activity classification method and application for the programmable data planes.

Background

The growing popularity of network terminals and the continuous emergence of new applications have brought exponential growth in network traffic and have made user activity analysis more complex. According to the 49th Statistical Report on China's Internet Development, issued by the China Internet Network Information Center (CNNIC) in 2022, as of December 2021 the number of Internet users in China had reached 1.032 billion, an increase of 42.96 million over December 2020, and Internet penetration had reached 73.0%. Average time online per week reached 28.5 hours, 2.3 hours more than in December 2020. Internet users also rely on an increasingly diverse set of terminals, including mobile phones, desktop computers, notebook computers, televisions, and tablet computers. The growth in transmitted data and the proliferation of applications lead to problems such as strained network resources, unbalanced resource distribution, and low utilization of forwarding equipment.

User activity classification sorts online network traffic by application type and offers an effective way to address these problems. By classifying user activities, Internet service providers (ISPs) can identify the traffic types that currently occupy most of the bandwidth, adjust the network architecture and deployment to support different operational goals, and provide higher quality of service (QoS) to users; they can also analyze current traffic conditions to obtain policy support for further network optimization. Enterprise or campus network administrators can manage and control non-critical traffic such as P2P, VoIP, and PPS during peak periods to keep the network running smoothly. Suspicious traffic can also be identified, for example by detecting intrusion traffic such as DoS, DDoS, and port scans, which improves network security. As a part of network traffic classification, user activity classification is an important task in network security management and is significant for network management, QoS policy application, traffic attack detection, and dynamic traffic control.

Traditional network traffic classification relies on transport-layer (UDP or TCP) port numbers. This approach is easy to implement and has low time complexity. However, as network protocols have diversified and port hopping and port masquerading techniques have emerged, the accuracy of port-based classification has steadily declined. Madhukar et al. showed experimentally that approximately 70% of network traffic could not be classified with this method. Deep packet inspection (DPI) subsequently attracted growing attention. DPI classifies traffic mainly by analyzing the payload of packets in network flows. Because DPI does not rely on packet port numbers, it is unaffected by techniques such as port hopping and masquerading. However, data encryption and privacy concerns make it difficult to classify today's network traffic with DPI.

In recent years, as machine learning has performed increasingly well on classification tasks, more and more work has applied it to traffic classification. In this approach, user traffic samples are collected from a real network, network flow features are extracted, and a traffic classifier is trained from the data set with a machine learning algorithm; the trained classifier can then be used to classify network traffic. Moore et al. designed 248 traffic features, including layer-4 (transport-layer) port numbers, packet sizes, and inter-packet time intervals, and combined them with different machine learning algorithms to classify traffic. Because most features in the Moore data set require information about the entire network flow, classification is only possible after a flow has ended, which is unsuitable for real-time or time-sliced classification scenarios. Víctor Labayen et al. proposed a three-layer model for classifying user activities in network traffic, with an average classification accuracy of 97.37%. The first two layers of the model use unsupervised learning algorithms, such as K-Means and Gaussian mixture models, while the last layer uses supervised learning algorithms, such as SVM and random forest.

Combining machine learning with the control plane strengthens the ability to control and manage traffic in a unified way. When a machine learning algorithm is deployed in the control plane, the controller's global view and hardware computing capacity allow finer management and more accurate control of network traffic. However, this approach requires the data plane to collect and upload a large number of traffic samples, which heavily occupies or even blocks the communication channel between the control plane and the data plane and introduces transmission and processing delay. Bruno et al. therefore proposed deploying a decision tree algorithm in the data plane for intrusion detection and built separate classification models for packets and for flows. In the per-packet scenario, detection is very fast and has almost no effect on packet forwarding, but the average classification accuracy is only 86%. In the per-flow scenario, classification accuracy improves, but information about the entire network flow is required, so the real-time classification requirement is not met. In addition, the method does not consider how to reduce switch memory usage in the per-flow scenario.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a programmable data plane-oriented user activity classification method and its application.

The invention explores the use of machine learning for classifying user activities by deploying it in a programmable data plane, and designs a machine learning-assisted user activity classification method for the programmable data plane. The method classifies user activities directly on the data plane: a suitable machine learning model is implemented in the P4 language and deployed to a programmable switch. The switch parses each passing packet, extracts network flow feature data, and stores the data in a storage structure composed of a sketch and a hash table. A machine learning classifier then analyzes the network flow features collected over a time interval, determines the user activity in that interval, and uploads the classification result.

The invention provides a programmable data plane-oriented user activity classification method, which is implemented on a software-defined network comprising a learning plane, a control plane, and a data plane. Specifically:

The learning plane is responsible for acquiring user activity traffic from an external data set, in-band network telemetry, or active measurement, and for extracting and recording feature data over a time sequence in the traffic. The feature data include the Ethernet type, IP protocol number, source IP address, destination IP address, source port number, destination port number, network flow duration, packet size range, packet count, and the like. A machine learning traffic classification model is then trained on the feature data and issued to the control plane.

The control plane connects the learning plane and the data plane. It converts the pre-trained traffic classification model into an application implemented in the P4 language and generates a corresponding set of match-action rules, which specify a predetermined action, such as inspecting, dropping, or retransmitting, to be performed on matching packets. The generated P4 application is compiled and deployed to the programmable switch in the data plane, the match-action rules are issued to the switch, and the control plane receives classification results from the data plane switch.

The data plane is a programmable data plane composed of programmable switches. Once the P4 program has been deployed, the switch parses the header fields of each passing packet and extracts the Ethernet type, IP protocol number, source and destination IP addresses, source and destination port numbers, packet timestamp, packet size, and the like. Network flow feature data for a time interval are recorded in a storage structure composed of a sketch and a hash table: the hash table stores the network flow features for the interval, and the sketch helps select which network flows to store. Based on the recorded feature data, a machine learning classifier classifies the user activity, uploads the classification result, and applies the corresponding forwarding or blocking policy.

Based on the characteristics of network user activity, the invention divides traffic into time windows for classification. Each time window may contain multiple network flows, where a network flow is defined as the set of all packets having the same source IP address, destination IP address, source port, destination port, and transport layer protocol.
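The windowing and flow definition above can be illustrated with a minimal Python sketch. The packet dictionaries, field names, and the `split_into_windows` helper are hypothetical and chosen only for illustration; the patent does not prescribe a packet representation or a window length.

```python
from collections import defaultdict

def flow_key(pkt):
    """Return the 5-tuple that identifies the network flow of a packet."""
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])

def split_into_windows(packets, window_len):
    """Map window index -> {flow key -> packets of that flow in the window}."""
    windows = defaultdict(lambda: defaultdict(list))
    if not packets:
        return windows
    t0 = min(p["ts"] for p in packets)  # the first window starts at the earliest packet
    for p in packets:
        idx = int((p["ts"] - t0) // window_len)
        windows[idx][flow_key(p)].append(p)
    return windows

# Hypothetical packets: flow A (port 5000 -> 80) and flow B (port 5001 -> 443).
flow_a = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 5000, "dst_port": 80, "proto": 6}
flow_b = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.3", "src_port": 5001, "dst_port": 443, "proto": 6}
packets = [
    {"ts": 0.2, "size": 60, **flow_a},
    {"ts": 1.0, "size": 1400, **flow_a},
    {"ts": 3.0, "size": 80, **flow_b},
    {"ts": 6.5, "size": 60, **flow_a},
]
windows = split_into_windows(packets, window_len=5.0)
```

With a 5-second window, the first three packets fall into window 0 as two flows, and the last packet opens window 1.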

User activity in the network is inferred using machine learning. In the learning plane, a two-layer traffic classification model is built with the scikit-learn machine learning toolkit:

Flow-level layer: based on the unsupervised K-Means algorithm, each flow is associated with one of K possible clusters, which indicates the type of behavior the network flow exhibits during the time interval. After clustering, the network flows in a window are accumulated into a one-dimensional array of size K that counts the occurrences of each behavior type in the time interval.

Window-level layer: based on a supervised decision tree algorithm, the window-level feature parameters and the one-dimensional array generated by the flow-level layer are combined as the input feature set of the model. Through this process, the user activity of each window is identified.
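As an illustration of how the two layers connect, the following Python sketch builds the size-K occurrence array from per-flow cluster labels and appends it to the window-level features. The function names, the feature ordering, and the example values are assumptions made for this sketch; the cluster labels are taken as given rather than computed.

```python
def cluster_histogram(flow_labels, k):
    """Count how many flows in the window fell into each of the k clusters."""
    counts = [0] * k
    for label in flow_labels:
        counts[label] += 1
    return counts

def window_feature_vector(window_features, flow_labels, k):
    """Concatenate window-level features with the flow-level cluster histogram."""
    return list(window_features) + cluster_histogram(flow_labels, k)

# Hypothetical window: five flows already assigned to clusters 0..3.
flow_labels = [0, 2, 2, 1, 2]
# Hypothetical window-level features: packets in window, total bytes, number of hosts.
window_features = [120, 48000, 3]
vec = window_feature_vector(window_features, flow_labels, k=4)
```

The resulting vector is what a window-level classifier, such as a decision tree, would take as input for one time window.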

In the control plane, the invention constructs a method for converting the machine learning model into the P4 language, comprising the following steps:

Step S1: parse the trained K-Means model and extract the K cluster centers. Each cluster center consists of m coordinate values, one per feature. The similarity between a given input x and a cluster center c is determined using the squared Euclidean distance:

$$d(x, c) = \sum_{i=1}^{m} (x_i - c_i)^2$$

Step S2: parse the trained decision tree model, traverse all nodes in the tree, and convert each node's test into an if-else statement. If the if condition is satisfied, execution continues in the left subtree of the current decision node; otherwise it enters the right subtree and an else branch is added. Traversal continues until a leaf node is reached, which marks the end of the classification condition.

The invention automatically completes the generation, compilation and deployment of the P4 program in the control plane, and maps the two-layer model into the programmable switch.

In the data plane, the method stores flow features in a storage structure built from a sketch and a hash table, where the hash table stores the network flow features for a time interval and the sketch helps select which network flows to store.

The invention also provides application of the user activity classification method in classification of user activities. The application comprises the following steps:

Step S1, extracting packet features:

Step S101: parse the Ethernet layer of the packet and determine whether the packet uses the IPv4 protocol. If not, no further processing is performed; if so, the IPv4 layer is parsed.

Step S102: parse the IPv4 layer of the packet and extract the source IP address, destination IP address, source port, destination port, packet size, transport layer protocol, and packet arrival timestamp of the current packet. A character string representing the network flow key is formed from the source IP address, destination IP address, source port, destination port, and transport layer protocol, in that order.

Step S2, storing and updating packet features:

Step S201: store the feature information extracted in step S102 in a storage structure composed of the sketch and the hash table. The sketch is a Count-Min sketch, that is, a two-dimensional array of d rows and w columns. The hash table is a one-dimensional array of n buckets, where each bucket is a structure that can hold multiple feature parameters of one network flow.

Step S202: taking the network flow key formed from the flow feature string as input, determine the storage position of the network flow in the hash table using the hash function f(key) % n.

Step S203: check whether the network flow already exists in the hash table. If the flow already exists or the storage position is empty, update the corresponding parameter information directly. If the storage position is occupied by a different network flow, a hash table collision has occurred.

Step S204: when a collision occurs, store the network flow in the sketch, then compare the size of the network flow in the sketch with the size of the network flow at the colliding position in the hash table to decide whether to replace it. If the flow size in the sketch is larger than that of the flow at the colliding position, the replacement is performed; otherwise it is not.

Step S205: the storage structure composed of the sketch and the hash table in step S201 is initialized at time t1. If the difference between the current time t2 and t1 exceeds a preset threshold, the data in the hash table are exported in serialized form and submitted to the classifier for classification, and the storage structure is re-initialized.
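The update logic of steps S201 to S204 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented P4 implementation: `zlib.crc32` stands in for the unspecified hash function f, the bucket field names are hypothetical, and the contents written into a bucket on replacement (packet count reset to 1, byte count taken from the sketch estimate) are a choice made for this sketch.

```python
import zlib

class FlowStore:
    """Hash table of flow buckets backed by a Count-Min sketch for collisions."""

    def __init__(self, d=3, w=64, n=8):
        self.d, self.w, self.n = d, w, n
        self.sketch = [[0] * w for _ in range(d)]  # d rows, w columns
        self.table = [None] * n                    # n buckets

    def _col(self, row, key):
        # One hash per row, derived by prefixing the row number (assumption).
        return zlib.crc32(f"{row}:{key}".encode()) % self.w

    def _sketch_add(self, key, size):
        """Add size to every row, then return the Count-Min estimate for key."""
        for r in range(self.d):
            self.sketch[r][self._col(r, key)] += size
        return min(self.sketch[r][self._col(r, key)] for r in range(self.d))

    @staticmethod
    def _new_bucket(key, size, ts, total):
        return {"key": key, "packets": 1, "first_ts": ts, "last_ts": ts,
                "min_size": size, "max_size": size, "bytes": total}

    def update(self, key, size, ts):
        slot = zlib.crc32(key.encode()) % self.n   # f(key) % n
        bucket = self.table[slot]
        if bucket is None:                         # empty slot: store the flow
            self.table[slot] = self._new_bucket(key, size, ts, size)
        elif bucket["key"] == key:                 # same flow: update in place
            bucket["packets"] += 1
            bucket["last_ts"] = ts
            bucket["min_size"] = min(bucket["min_size"], size)
            bucket["max_size"] = max(bucket["max_size"], size)
            bucket["bytes"] += size
        else:                                      # collision: consult the sketch
            estimate = self._sketch_add(key, size)
            if estimate > bucket["bytes"]:         # larger flow evicts the stored one
                self.table[slot] = self._new_bucket(key, size, ts, estimate)

store = FlowStore(d=3, w=64, n=1)   # n=1 forces every flow into the same slot
store.update("A", 100, 0.0)         # stored directly
store.update("B", 60, 0.1)          # collision: estimate 60 does not exceed 100
store.update("B", 60, 0.2)          # collision: estimate 120 exceeds 100, B replaces A
```

The usage lines force a collision with a single bucket so that the replacement rule is visible: flow B only takes over the slot once its accumulated size in the sketch exceeds that of flow A.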

Step S3, classifying user activities:

Step S301, acquiring feature data: traverse the serialized feature data and input the features of each network flow into the classifier.

Step S302, clustering based on flow-level features: input the features of a single network flow into the K-Means clustering model, determine which of the K possible clusters the flow is associated with, accumulate the clustering results into an array of size K, and count the occurrences of each behavior type.

Step S303, classification based on window-level features: after all network flows in the time interval have been traversed, combine the window-level feature parameters with the array generated at the flow level, input them into the decision tree classification model, and determine the user activity in the time interval.

Step S304, classification result statistics: after the classifier completes the traffic classification task for a time interval, the classification result is uploaded to the control plane for statistics.

The beneficial effects of the invention include: the classification of user activities on the programmable switch is realized, and data packets do not need to be sent to a control plane any more, so that the occupation of a channel between a controller and a data plane is eliminated; the memory occupation of the programmable switch by the proposed storage structure is reduced by 20%.

Drawings

FIG. 1 is a flow chart of the programmable data plane-oriented user activity classification method of the present invention.

FIG. 2 is a block diagram of the K-Means algorithm of the present invention.

FIG. 3 is a diagram illustrating the structure of the flow-level and window-level models of the present invention.

FIG. 4 is a structural diagram of the sketch of the present invention.

FIG. 5 is a structural diagram of the hash table of the present invention.

FIG. 6 is a diagram illustrating the deployment of the user activity classification method of the present invention in a programmable switch.

Detailed Description

The present invention is described in further detail below with reference to specific examples and the accompanying drawings. Except where specifically noted below, the procedures, conditions, and experimental methods used to carry out the invention are common general knowledge in the art, and the invention is not particularly limited in these respects.

The invention discloses a programmable data plane-oriented user activity classification method and its application. The method comprises three parts: a learning plane, a control plane, and a data plane. The learning plane acquires network traffic generated by user activity from an external data set, in-band network telemetry, or active measurement, records the feature data of the traffic in each time interval, and builds a machine learning traffic classification model that learns the useful field features in each interval. The control plane connects the learning plane and the data plane; it converts the pre-trained traffic classification model into an application implemented in the P4 language, generates a corresponding set of match-action rules, and deploys the generated P4 application and the match-action rules to the programmable switch in the data plane. After the P4 program is deployed, the switch records network flow feature data for a time interval in a storage structure composed of a sketch and a hash table, identifies the user activity category for that interval from the flow features, uploads the classification result, and applies the corresponding forwarding or blocking policy. The method follows a top-down design that covers data acquisition and feature extraction, model construction and conversion, P4 program deployment, and user activity classification.

The implementation method of the user activity classification method comprises three parts, namely selection and construction of a flow classification model, conversion and deployment from machine learning to P4 language, and application of the flow classification method, and specifically comprises the following steps:

1. Selection and construction of the traffic classification model

Step S1: selection and construction of the traffic classification model. Network flow features are extracted and the traffic classification model is built on the learning plane; for each of the two classification layers, a machine learning algorithm suited to implementation in the P4 language is selected, based on scikit-learn.

Further, the steps for building the classification model with scikit-learn in the learning plane are as follows:

Step S101: capture network traffic data from an external data set, in-band network telemetry, or active measurement. If the captured traffic consists of the same activity performed by the same user in different periods, it can be treated as having been performed in a single period and merged into one network flow to simplify subsequent segmentation. If the traffic is captured in a multi-user network, no merging step is needed. The output of this step is a single data stream generated by multiple users, so the subsequent steps are independent of the input data format.

Step S102: because a single network flow cannot be guaranteed to carry only one activity over the entire trace, the captured traffic must be segmented into time windows and classified per window. The length of the time window is positively correlated with the number of packets it contains, and the number of packets directly affects both classification accuracy and memory load. A parameter tuning process is therefore run to choose a window length that balances the two, reducing memory usage while preserving classification accuracy. Each time window may contain multiple network flows, so the segmented time windows are restructured into a two-layer data structure comprising a flow-level layer and a window-level layer. A network flow is defined as the set of all packets having the same source IP address, destination IP address, source port, destination port, and transport layer protocol.

Step S103: based on the two-layer data structure, extract for each window a feature set containing flow-level and window-level features. The flow-level layer includes features such as the Ethernet type, protocol number, network flow duration, number of packets in the flow, maximum and minimum packet size in the flow, and the size of the flow. The window-level layer includes features such as the number of packets in the window, the total size of the network flows it contains, and the number of hosts it contains. The flow-level feature parameters are used to build the flow-level model, and the window-level feature parameters are used to build the window-level model.

Step S104, flow-level model construction: based on the unsupervised K-Means algorithm, each flow is associated with one of K possible clusters, which indicates the type of behavior the network flow exhibits during the time interval. A suitable value of K is selected by cross-validation. The network flows in a window are clustered and then accumulated into an array of size K, which gives the number of occurrences of each behavior type in the time interval. The K-Means construction process is shown in FIG. 2. First, the sample set is read from a file and K cluster centers are selected at random. The distance from each sample to the K cluster centers is computed, and the sample is assigned to the cluster whose center is nearest. If unclassified samples remain in the sample set, K cluster centers are randomly selected again and the process is repeated. If no unclassified samples remain, the samples of each class are averaged and the cluster centers are updated. After several iterations the cluster centers stop changing; construction ends when two consecutive iterations produce no difference.

Step S105, window-level model construction: based on a supervised decision tree algorithm, the window-level feature parameters and the array generated by the flow-level layer are combined as the input feature set of the model. A suitable decision tree depth is selected by cross-validation. The two-layer model structure composed of the flow-level layer and the window-level layer is shown in FIG. 3.
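For reference, the K-Means construction loop of step S104 can be sketched in plain Python. This is the standard Lloyd iteration (random initial centers, nearest-center assignment, mean update, stop when the centers no longer change), written without scikit-learn so that it is self-contained; the sample data, the `seed` parameter, and the handling of empty clusters are assumptions of this sketch.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(samples, k, max_iter=100, seed=0):
    """Return k cluster centers found by Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = [list(s) for s in rng.sample(samples, k)]  # random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for s in samples:                                # assign to nearest center
            nearest = min(range(k), key=lambda j: sq_dist(s, centers[j]))
            clusters[nearest].append(s)
        new_centers = []
        for j in range(k):
            if clusters[j]:                              # mean of the assigned samples
                dim = len(clusters[j][0])
                new_centers.append(
                    [sum(s[i] for s in clusters[j]) / len(clusters[j]) for i in range(dim)]
                )
            else:                                        # empty cluster: keep old center
                new_centers.append(centers[j])
        if new_centers == centers:                       # no change between two rounds
            break
        centers = new_centers
    return centers

# Hypothetical one-dimensional flow feature with two well-separated groups.
samples = [[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]]
centers = kmeans(samples, k=2)
```

In practice the learning plane would use scikit-learn's K-Means implementation, as the text states; the loop above only makes the stopping rule and the update step explicit.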

2. Conversion and deployment of machine learning into P4 language

Step S2: conversion and deployment of the machine learning model to the P4 language. The pre-trained two-layer traffic classification model, consisting of K-Means and a decision tree, is parsed and converted into an application implemented in the P4 language, and a corresponding set of match-action rules is generated. The P4 application and the match-action rules are then deployed on a programmable switch in the data plane.

Further, the specific steps of implementing the conversion and deployment of the machine learning to the P4 language in the control plane are as follows:

Step S201: parse the K-Means and decision tree models and extract the K cluster centers. Each cluster center consists of an m-dimensional coordinate vector, and each coordinate value represents one feature parameter. The similarity between a given input x and a center c is determined using the squared Euclidean distance; the smaller the computed value, the higher the similarity. The cluster to which x belongs is determined by K such computations. The squared Euclidean distance is:

$$d(x, c) = \sum_{i=1}^{m} (x_i - c_i)^2$$

where x denotes a given input sample, c denotes a certain cluster center, and i denotes the ith characteristic parameter.
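A minimal Python sketch of this assignment rule follows. The function names and the example centers are hypothetical. Omitting the square root does not change which center is nearest, because the square root is monotonic.

```python
def squared_distance(x, c):
    """Sum over features i of (x_i - c_i)^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def assign_cluster(x, centers):
    """Return the index of the center with the smallest squared distance to x."""
    return min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))

# Hypothetical K = 3 cluster centers with m = 2 features each.
centers = [[0, 0], [10, 10], [0, 10]]
label = assign_cluster([1, 9], centers)
```

Here the input [1, 9] has squared distances 82, 82, and 2 to the three centers, so it is assigned to the third cluster (index 2).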

Step S202: using the tree module in scikit-learn, the trained decision tree is converted into corresponding classification code. The nodes of the decision tree are traversed recursively and each node's test is converted into an if-else statement. If the if condition is satisfied, execution continues in the left subtree of the current decision node; otherwise it enters the right subtree and an else branch is added. Traversal continues until a leaf node is reached, which marks the end of the classification condition.
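The recursive traversal can be illustrated with the following Python sketch, which emits nested if-else text from a small tree. The nested-dictionary tree format, the feature names, and the `activity` variable in the emitted text are hypothetical; a real converter would read the node arrays of the trained scikit-learn tree and emit valid P4 statements. The sketch follows the scikit-learn convention that samples with feature value less than or equal to the threshold go to the left child.

```python
def tree_to_if_else(node, indent=0):
    """Return a list of if-else text lines for the subtree rooted at node."""
    pad = "    " * indent
    if "label" in node:                       # leaf: end of the classification condition
        return [f"{pad}activity = {node['label']};"]
    lines = [f"{pad}if ({node['feature']} <= {node['threshold']}) {{"]
    lines += tree_to_if_else(node["left"], indent + 1)    # condition satisfied
    lines.append(f"{pad}}} else {{")
    lines += tree_to_if_else(node["right"], indent + 1)   # condition not satisfied
    lines.append(f"{pad}}}")
    return lines

# Hypothetical two-level tree over window-level features.
tree = {
    "feature": "pkt_count", "threshold": 10,
    "left": {"label": 0},
    "right": {
        "feature": "bytes", "threshold": 5000,
        "left": {"label": 1},
        "right": {"label": 2},
    },
}
lines = tree_to_if_else(tree)
print("\n".join(lines))
```

Each internal node contributes one if line, one else line, and one closing brace, so the depth of the tree bounds the nesting depth of the generated code.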

Step S203: the intrinsic metadata of the P4 program is extended to record and process packet features, which supports stateful processing in the data plane; this information can also be used as parameters of the match-action rule tables.

Step S204: the model conversion program does not prescribe a specific method for extracting flow features, so feature access must be customized before deployment. To compute the flow duration, the program reads the start time from the register indexed by the network flow key. An initial value of 0 indicates a new flow, in which case the start time of the flow is set to the timestamp of the packet and the register is updated. The flow duration at a given packet is the difference between the packet's timestamp and the flow's start time. The maximum and minimum packet sizes are maintained in the same way, by reading and comparing the values in the registers: when the packet size is larger than the stored maximum or smaller than the stored minimum, the register is updated.
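The register logic of step S204 can be modeled in Python as follows. The class and field names are hypothetical, Python lists stand in for P4 register arrays, and the sketch assumes packet timestamps are never 0 so that 0 can mark an unused entry, as the text describes.

```python
class FlowRegisters:
    """Per-flow start time and packet size extremes, indexed by flow slot."""

    def __init__(self, size):
        self.start = [0] * size     # 0 means no packet seen yet for this slot
        self.min_size = [0] * size
        self.max_size = [0] * size

    def update(self, idx, ts, pkt_size):
        """Record one packet and return the flow duration at this packet."""
        if self.start[idx] == 0:                 # new flow: initialize registers
            self.start[idx] = ts
            self.min_size[idx] = pkt_size
            self.max_size[idx] = pkt_size
        else:                                    # existing flow: compare and update
            if pkt_size > self.max_size[idx]:
                self.max_size[idx] = pkt_size
            if pkt_size < self.min_size[idx]:
                self.min_size[idx] = pkt_size
        return ts - self.start[idx]

regs = FlowRegisters(size=4)
d1 = regs.update(1, ts=1000, pkt_size=60)     # first packet of the flow
d2 = regs.update(1, ts=1500, pkt_size=1400)   # raises the maximum
d3 = regs.update(1, ts=1800, pkt_size=40)     # lowers the minimum
```

The three calls return durations of 0, 500, and 800 time units, and leave the slot with a minimum size of 40 and a maximum size of 1400.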

Step S205: the P4 program is compiled with p4c. The compiler generates a P4 information file and a JSON-format network device configuration file. All hosts are configured using the commands listed in the topology JSON file. Deployment completes successfully provided that the P4 program is syntactically and logically correct and that each endpoint defined in the topology corresponds to a forwarding target in the match-action tables.

3. Application of flow classification method

Step S3: generating the sketch and hash table storage structure. Because storage space in a data plane programmable switch is limited, the invention designs a storage structure composed of a sketch and a hash table, which is deployed in the programmable switch. As shown in FIG. 4, the sketch is a Count-Min sketch, that is, a two-dimensional array of d rows and w columns. As shown in FIG. 5, the hash table is a one-dimensional array of n buckets, where each bucket is a structure that can hold multiple feature parameters of one network flow, such as the network flow key, the number of packets in the flow, the arrival time of the first packet, the arrival time of the last packet, the maximum and minimum packet size in the flow, and the size of the flow. Further, the specific steps for packet feature extraction and storage in the programmable switch are as follows:

Step S301: extracting data packet features. The IPv4 layer of the data packet is parsed, and the source IP address, destination IP address, source port, destination port, data packet size, transport layer protocol and arrival timestamp of the current data packet are extracted. A character string representing the network flow key is formed in the order source IP address, destination IP address, source port, destination port, transport layer protocol.

Step S302: storing the network flow features. Taking the network flow key formed from the feature string as input, the storage position of the network flow in the hash table is determined by the hash function f(key) % n, where f(key) converts the character-type key into an integer and n is the size of the hash table.

Step S303: checking whether the network flow exists in the hash table. If the network flow is already stored at that position, or the position is empty, the corresponding parameter information is updated directly; if the position is not empty and holds a different network flow, a hash collision has occurred.

Step S304: when a hash collision occurs, the network flow is stored in the sketch, and the size of the network flow in the sketch is compared with the size of the network flow at the colliding position in the hash table to decide whether to perform a replacement. If the size of the network flow in the sketch is larger than that of the network flow at the colliding position in the hash table, the replacement is performed; otherwise it is not.

Step S305: the storage structure formed by the sketch and the hash table in step S301 is initialized at time t1. If the current time t2 satisfies t2 - t1 > threshold, where threshold is a preset value, the network flows within one time window have been captured; the data in the hash table is then serialized and exported, handed to the classifier for classification, and the storage structure is re-initialized.
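The update path of steps S301 to S304 can be illustrated with a minimal Python sketch. It is not the P4 implementation: the class `FlowStore`, the helpers `flow_key` and `new_bucket`, the use of CRC32 as the function f(key), and the contents written into a bucket on replacement are all assumptions made for illustration, since the text does not specify them.

```python
import zlib


def flow_key(src_ip, dst_ip, src_port, dst_port, proto):
    # Key string in the order given in step S301 (separator is an assumption).
    return f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}"


def new_bucket(key, size, ts):
    # One bucket holds the per-flow parameters described for fig. 5.
    return {"key": key, "packets": 1, "first_ts": ts, "last_ts": ts,
            "min_size": size, "max_size": size, "bytes": size}


class FlowStore:
    """Hash table of n buckets backed by a d x w Count-Min sketch."""

    def __init__(self, d, w, n):
        self.d, self.w, self.n = d, w, n
        self.sketch = [[0] * w for _ in range(d)]
        self.buckets = [None] * n  # None marks an empty bucket

    def _f(self, key, seed=0):
        # Stand-in for f(key): maps the string key to an integer.
        return zlib.crc32(f"{seed}:{key}".encode())

    def _sketch_add(self, key, size):
        # Add `size` to one counter per row; the minimum is the estimate.
        estimate = None
        for row in range(self.d):
            col = self._f(key, seed=row + 1) % self.w
            self.sketch[row][col] += size
            value = self.sketch[row][col]
            estimate = value if estimate is None else min(estimate, value)
        return estimate

    def update(self, key, size, ts):
        idx = self._f(key) % self.n                # step S302
        bucket = self.buckets[idx]
        if bucket is None:                         # step S303: empty slot
            self.buckets[idx] = new_bucket(key, size, ts)
            return "stored"
        if bucket["key"] == key:                   # step S303: same flow
            bucket["packets"] += 1
            bucket["last_ts"] = ts
            bucket["min_size"] = min(bucket["min_size"], size)
            bucket["max_size"] = max(bucket["max_size"], size)
            bucket["bytes"] += size
            return "updated"
        estimate = self._sketch_add(key, size)     # step S304: collision
        if estimate > bucket["bytes"]:
            # Assumption: the new bucket starts from the sketch estimate.
            replacement = new_bucket(key, size, ts)
            replacement["bytes"] = estimate
            self.buckets[idx] = replacement
            return "replaced"
        return "sketched"
```

With a single bucket (`FlowStore(d=2, w=8, n=1)`) every second flow collides, which makes the replacement rule easy to observe: a colliding flow displaces the stored one only after its accumulated size in the sketch exceeds the stored flow's size.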

Step S4: classifying user activities. As shown in fig. 6, the serialized hash table data is analyzed by a machine learning classifier deployed in the data plane to determine the user activity in the time interval.

Further, the specific steps for completing the classification of user activities in the programmable switch are as follows:

Step S401: acquiring feature data. The serialized feature data is traversed and the features of each network flow are input into the machine learning classifier.

Step S402: clustering based on flow-level features. The features of a single network flow are input into the K-Means clustering model, which associates the flow with one of K possible clusters; the clustering results are accumulated in an array of size K that counts the occurrences of each behavior type.

Step S403: classification based on window-level features. After all network flows in the time interval have been traversed, the window-level feature parameters and the array generated at the flow level are merged and input into the decision tree classification model, which determines the user's activity in the time interval.

Step S404: classification result statistics. After completing the classification task for a time interval, the classifier uploads the classification result to the control plane for statistics.
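The two-layer flow of steps S401 to S403 can be sketched in plain Python as follows. The function names, the toy cluster centers and the stand-in decision function are hypothetical; in the method described here the centers and the tree come from the trained K-Means and decision tree models.

```python
def nearest_cluster(x, centers):
    # Index of the center with the smallest squared Euclidean distance to x.
    best, best_dist = 0, float("inf")
    for j, c in enumerate(centers):
        dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def classify_window(flows, window_features, centers, tree):
    # Step S402: count how many flows fall into each of the K clusters.
    histogram = [0] * len(centers)
    for x in flows:
        histogram[nearest_cluster(x, centers)] += 1
    # Step S403: merge window-level features with the histogram and classify.
    return tree(list(window_features) + histogram), histogram


# Toy data: K = 2 clusters over 2 features (illustrative values only).
centers = [(0.0, 0.0), (10.0, 10.0)]
flows = [(1.0, 1.0), (9.0, 9.0), (11.0, 10.0)]


def toy_tree(v):
    # Stand-in for the trained decision tree: looks at the last histogram slot.
    return "streaming video" if v[-1] >= 2 else "idle"


label, histogram = classify_window(flows, (3, 460), centers, toy_tree)
```

The histogram is what lets a fixed-size decision tree input summarize a variable number of flows per window.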

Examples

The invention discloses a user activity classification method for a programmable data plane, and designs a machine-learning-assisted user activity classification method within the programmable data plane. The method classifies user activities on the data plane: a suitable machine learning model is implemented in the P4 language and deployed to a programmable switch; the switch parses passing data packets, extracts network flow feature data and stores it in a storage structure composed of a sketch and a hash table; a machine learning classifier then analyzes the network flow features within a time interval, determines the user activity in that interval and uploads the classification result.

The user activity classification method for a programmable data plane comprises three planes, shown in figure 1: a learning plane, a control plane and a data plane. Specifically:

the learning plane acquires user activity traffic from an external data set, in-band network telemetry or active measurement, extracts and records the feature data of each time series in the traffic, then learns the useful field features in each time series and constructs a machine-learned flow classification model;

the control plane connects the learning plane and the data plane. It converts the pre-trained flow classification model into an application program implemented in the P4 language and generates a corresponding set of match-action rules. The compiled P4 application is deployed to the programmable switch of the data plane, the match-action rules are issued to the switch, and the control plane receives classification results from the data plane switch;

the data plane: after the P4 program is deployed, the switch parses the header fields of each passing data packet and extracts the required network flow feature data. Network flow feature data over a period of time is recorded in the storage structure composed of the sketch and the hash table; based on this data, the machine learning classifier classifies user activities, uploads the classification results and executes the corresponding forwarding or interception policy.

The implementation method of the user activity classification method comprises three parts, namely selection and construction of a machine learning model, conversion and deployment from machine learning to P4 language, and application of a traffic classification method, and specifically comprises the following steps:

1. Selection and construction of flow classification model

Step S1: selecting and constructing the flow classification model. Network flow features are extracted and the flow classification model is built on the learning plane; based on scikit-learn, a machine learning algorithm suited to implementation in the P4 language is selected for each of the two layers of the classification model.

Step S101: a network probe is deployed on the router to which the user is connected; the router forwards network traffic while the user performs network activities, which include web browsing, video viewing, online chatting, sending mail, transferring files, and the like. A SPAN session is enabled on the router and the Linux command-line tool TShark is used to capture a copy of all traffic flowing through the router. Activities are simulated one at a time, each performed either by a single user or by multiple users simultaneously. The generated pcap files are parsed with a Python program and the network traffic data is exported to CSV files.

Step S102: since it cannot be assumed that the user performs only one activity during the entire trace, the traffic is divided into time intervals. The length of each time window is set to 3 seconds through parameter tuning. A window can contain several network flows, so the divided time windows can be restructured into a two-layer data structure comprising a flow-level layer and a window-level layer. A network flow is defined as the set of all packets having the same source IP address, destination IP address, source port, destination port and transport layer protocol.

Step S103: based on the two-layer data structure of the flow-level layer and the window-level layer, a feature set comprising flow-level and window-level features is extracted for each window; the extracted attributes are shown in Table 1. The flow-level features include the Ethernet type, protocol number, duration of the network flow, number of data packets in the network flow, maximum and minimum data packet sizes in the network flow, the size of the network flow, and so on; the window-level features include the number of packets in the window, the sum of the sizes of the network flows it contains, the number of hosts it contains, and so on.

Table 1. Attribute description of the two-layer data structure

Step S104: constructing the flow-level model. Based on the unsupervised K-Means algorithm, each flow is associated with one of K possible clusters, indicating the type of behavior the network flow exhibits over the time interval. The value of K was determined to be 12 by cross-validation. The network flows within the window are clustered and accumulated in an array of size 12, which records the number of occurrences of each behavior type within the time interval. The K-Means construction process is shown in FIG. 2.

Step S105: constructing the window-level model. Based on a supervised decision tree algorithm, the feature parameters of the window-level layer and the array generated in the flow-level layer are combined as the input feature set of the model. The decision tree depth was determined to be 12 by cross-validation.
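The two-layer data structure from steps S102 and S103 can be illustrated with a short Python sketch that groups packets into 3-second windows and, within each window, into flows keyed by the 5-tuple. The function name `extract_windows`, the packet tuple layout and the dictionary field names are assumptions for illustration; only a subset of the listed features is computed.

```python
from collections import defaultdict

WINDOW_SECONDS = 3.0  # window length stated in step S102


def extract_windows(packets, window=WINDOW_SECONDS):
    """packets: iterable of (ts, src_ip, dst_ip, src_port, dst_port, proto, size)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for ts, src, dst, sport, dport, proto, size in packets:
        # Window index first, then the 5-tuple that defines a network flow.
        grouped[int(ts // window)][(src, dst, sport, dport, proto)].append((ts, size))

    result = {}
    for index, flows in grouped.items():
        flow_features, hosts = {}, set()
        for key, pkts in flows.items():
            times = [t for t, _ in pkts]
            sizes = [s for _, s in pkts]
            flow_features[key] = {
                "duration": max(times) - min(times),
                "packets": len(pkts),
                "min_size": min(sizes),
                "max_size": max(sizes),
                "bytes": sum(sizes),
            }
            hosts.update((key[0], key[1]))
        result[index] = {
            "flows": flow_features,                       # flow-level layer
            "window": {                                   # window-level layer
                "packets": sum(f["packets"] for f in flow_features.values()),
                "bytes": sum(f["bytes"] for f in flow_features.values()),
                "hosts": len(hosts),
            },
        }
    return result


packets = [
    (0.25, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 100),
    (0.75, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 300),
    (1.00, "10.0.0.3", "10.0.0.2", 5555, 53, 17, 60),
    (3.25, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 40),
]
windows = extract_windows(packets)
```

In this toy trace the first three packets fall into window 0 as two flows, and the last packet opens window 1 even though it belongs to the same 5-tuple as the first flow.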

2. Conversion and deployment of machine learning into P4 language

Step S2: conversion and deployment of machine learning to the P4 language. The pre-trained two-layer flow classification model, consisting of K-Means and a decision tree, is converted into an application program implemented in the P4 language, and a corresponding set of match-action rules is generated. The P4 application and the match-action rules are then deployed to the programmable switch.

Step S201: the K-Means and decision tree models are analyzed and 12 cluster centers are extracted. Each cluster center consists of 7 coordinate values: the Ethernet type, the protocol number, the duration of the network flow, the number of data packets in the network flow, the maximum and minimum data packet sizes in the network flow, and the size of the network flow. The similarity between a given input x and a cluster c is determined using the squared Euclidean metric, as follows:

$$d(x, c) = \sum_{i=1}^{7} (x_i - c_i)^2$$

Step S202: using the tree package in scikit-learn, the generated decision tree is converted into corresponding classification code by recursively traversing the nodes of the decision tree and converting each node's test into an if-else statement. If the if condition is satisfied, execution continues in the left subtree of the current node; otherwise an else statement is added and execution enters the right subtree. This continues until a leaf node is reached, which marks the end of a classification condition.

Step S203: the intrinsic metadata of the P4 language is modified to record and process data packet features, which supports stateful processing in the data plane; this information can also be used as parameters of the match-action table.

Step S204: the model conversion program does not prescribe a specific method for extracting flow features, so feature access must be customized before deployment. To calculate the flow duration, the program reads the start time from the register corresponding to the network flow key. An initial value of '0' indicates a new flow, in which case the start time of the flow is set to the timestamp of the packet and the register is updated. The flow duration is calculated as the difference between the timestamp of the packet and the start time of the network flow. The maximum and minimum data packet sizes are likewise assigned by reading and comparing the values in the registers, and the register contents are updated when the comparison condition is met.
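The recursive traversal described in step S202 can be sketched in Python as below. The sketch assumes a hypothetical nested-dictionary representation of the trained tree rather than reading scikit-learn's internal arrays, and the emitted lines are P4-like text that has not been checked against a P4 compiler; the metadata names such as `meta.activity` are invented for illustration. The `<=` test mirrors scikit-learn's convention that samples satisfying a node's threshold test go to the left child.

```python
def tree_to_if_else(node, indent=0):
    """Return the if-else lines for a nested-dict decision tree."""
    pad = "    " * indent
    if "label" in node:
        # Leaf node: the classification condition ends here.
        return [f"{pad}meta.activity = {node['label']};"]
    lines = [f"{pad}if ({node['feature']} <= {node['threshold']}) {{"]
    lines += tree_to_if_else(node["left"], indent + 1)    # condition holds
    lines.append(f"{pad}}} else {{")
    lines += tree_to_if_else(node["right"], indent + 1)   # condition fails
    lines.append(f"{pad}}}")
    return lines


# Toy tree with two internal nodes and three leaves (illustrative values).
toy_tree = {
    "feature": "meta.win_pkts", "threshold": 10,
    "left": {"label": 0},
    "right": {
        "feature": "meta.hist_3", "threshold": 2,
        "left": {"label": 1},
        "right": {"label": 2},
    },
}
generated = "\n".join(tree_to_if_else(toy_tree))
```

Generating the nesting mechanically keeps the switch program in step with the trained model: retraining only requires regenerating the text, not editing it by hand.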

3. Application of flow classification method

Step S3: generating the sketch and hash table storage structure. Because storage space in a data-plane programmable switch is limited, the invention designs a storage structure composed of a sketch and a hash table, which is deployed in the programmable switch. As shown in fig. 4, the sketch data structure is a Count-Min sketch, set as a two-dimensional array of 3 rows and 150 columns; as shown in fig. 5, the hash table is a one-dimensional array of 20 buckets, where a bucket is a structure that stores several feature parameters of one network flow. The storage structure is initialized when data packet feature extraction begins: the key field of every bucket is set to NULL and the remaining fields are set to 0; in addition, all values in the sketch are set to 0.

Further, the specific steps of completing the data packet feature extraction and storage in the programmable switch are as follows:

Step S301: extracting data packet features. The IPv4 layer of the data packet is parsed, and the source IP address, destination IP address, source port, destination port, data packet size, transport layer protocol and arrival timestamp of the current data packet are extracted. A character string is formed in the order source IP address, destination IP address, source port, destination port, transport layer protocol, and serves as the network flow key.

Step S302: storing the network flow features. Taking the network flow key formed from the feature string as input, the storage position of the network flow in the hash table is determined by the hash function f(key) % 20, where f(key) converts the character-type key into an integer.

Step S303: checking whether the network flow exists in the hash table. If the network flow is already stored at that position, or the position is empty, the corresponding parameter information is updated directly; if the position is not empty and holds a different network flow, a hash collision has occurred.

Step S304: when a hash collision occurs, the network flow is stored in the sketch, and the size of the network flow in the sketch is compared with the size of the network flow at the colliding position in the hash table to decide whether to perform a replacement. If the size of the network flow in the sketch is larger than that of the network flow at the colliding position in the hash table, the replacement is performed; otherwise it is not.

Step S305: the storage structure formed by the sketch and the hash table in step S301 is initialized at time t1. If the current time t2 satisfies t2 - t1 > 3 seconds, the preset threshold, the data in the hash table is serialized and exported as feature vectors consisting of the Ethernet type, protocol number, network flow duration, minimum data packet size in the network flow, maximum data packet size in the network flow, number of data packets in the network flow, and size of the network flow. The vectors are handed to the classifier for classification, and the storage structure is re-initialized.
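The window rollover in step S305 can be sketched in Python with the embodiment's parameters (3 rows, 150 columns, 20 buckets, 3-second threshold). The dictionary layout and the names `new_store` and `export_if_expired` are hypothetical; the order of the exported vector follows the order listed in the step.

```python
WINDOW_SECONDS = 3.0  # preset threshold from the embodiment


def new_store(now, n=20, d=3, w=150):
    # Empty buckets are None here; the text uses a NULL key and zeroed fields.
    return {"t1": now,
            "buckets": [None] * n,
            "sketch": [[0] * w for _ in range(d)]}


def export_if_expired(store, now, window=WINDOW_SECONDS):
    """Return the serialized feature vectors once t2 - t1 exceeds the window."""
    if now - store["t1"] <= window:
        return None
    vectors = []
    for b in store["buckets"]:
        if b is None:
            continue
        vectors.append([b["eth_type"], b["proto"],
                        b["last_ts"] - b["first_ts"],   # flow duration
                        b["min_size"], b["max_size"],
                        b["packets"], b["bytes"]])
    # Re-initialize the storage structure for the next window.
    store["t1"] = now
    store["buckets"] = [None] * len(store["buckets"])
    for row in store["sketch"]:
        row[:] = [0] * len(row)
    return vectors


store = new_store(now=10.0)
store["buckets"][4] = {"eth_type": 0x0800, "proto": 6,
                       "first_ts": 10.5, "last_ts": 12.0,
                       "min_size": 60, "max_size": 1500,
                       "packets": 4, "bytes": 3000}
early = export_if_expired(store, now=12.5)   # 2.5 s elapsed: nothing exported
late = export_if_expired(store, now=13.5)    # 3.5 s elapsed: window exported
```

Each exported vector has the 7 dimensions expected by the cluster centers, so it can be passed to the flow-level clustering step without further reshaping.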

Step S4: classifying user activities. As shown in fig. 6, the serialized hash table data is analyzed by a machine learning classifier deployed in the data plane to determine the user activity in the time interval.

Step S401: acquiring feature data. The serialized feature data is traversed and the feature vector of each network flow is input into the machine learning classifier.

Step S402: clustering based on flow-level features. The features of a single network flow are input into the K-Means clustering model, which associates the flow with one of 12 possible clusters; the clustering results are accumulated in an array of size 12 that counts the occurrences of each behavior type.

Step S403: classification based on window-level features. After all network flows within the 3-second time interval have been traversed, the window-level feature parameters and the array generated at the flow level are merged and input into the decision tree classification model, which determines the user's activity in the time interval.

Step S404: classification result statistics. After completing the classification task for a time interval, the classifier uploads the classification result to the control plane for statistics.

The invention was validated on the Víctor Labayen dataset and the ISCX VPN-nonVPN 2016 dataset, both of which are public. The Víctor Labayen dataset is a set of network traffic traces in pcap/csv format captured from a single user, covering five activities: file transfer, streaming video, web browsing, text chat, and idle. The ISCX non-VPN traces cover a richer set of activities, but some activities have very few traces (for example, email has only 4 traces and P2P only one), so four activities were selected for verification: file transfer, streaming video, text chat and voice call. Classification performance was evaluated with the standard metrics for classification systems: precision, recall and F value. As shown in Table 2, on the Víctor Labayen dataset the average F value, precision and recall were 0.947, 95.12% and 94.36%, respectively; on the ISCX VPN-nonVPN 2016 dataset they were 0.957, 96.06% and 95.43%, respectively.
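As a consistency check on the reported figures, the F value can be recomputed as the harmonic mean of precision and recall. The sketch below assumes that the F value is the F1 score and that the percentage reported alongside recall is the precision; an average of per-class F values need not equal the F value of the averaged precision and recall, so agreement here is indicative only.

```python
def f_measure(precision, recall):
    # F1 score: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F value) taken from the text above.
reported = {
    "Labayen": (0.9512, 0.9436, 0.947),
    "ISCX": (0.9606, 0.9543, 0.957),
}

recomputed = {name: round(f_measure(p, r), 3)
              for name, (p, r, _) in reported.items()}
```

For both datasets the recomputed value rounds to the reported F value.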

Table 2. Recognition rate statistics

The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art are intended to be included within the present invention without departing from the spirit and scope of the inventive concept and are intended to be protected by the following claims.

Claims

1. A user activity classification method for a programmable data plane, characterized by comprising the following three planes:

the learning plane is used for acquiring network traffic generated by user activity from an external data set, in-band network telemetry or active measurement, extracting and recording feature data in the network traffic within a period of time, and then learning useful field features in each time interval; constructing a machine-learned flow classification model based on the field characteristics and issuing the flow classification model to a control plane;

the control plane is connected with the learning plane and the data plane and is responsible for converting the pre-trained flow classification model into an application program implemented in the P4 language and generating a corresponding set of match-action rules, which specify that preset actions are executed on matched data packets; after the generated application program is compiled, it is deployed to the programmable switch of the data plane, and the match-action rules are issued by the control plane to the switch;

the data plane is a programmable data plane consisting of programmable switches; after deployment of the application program is completed, the switch parses the header fields of each passing data packet and extracts network flow feature data; the switch records the network flow feature data over a period of time in a storage structure composed of a sketch and a hash table, wherein the hash table stores the network flow features within the period of time and the sketch assists in selecting which network flows are stored; based on the recorded network flow feature data, user activities are classified, the classification results are uploaded, and the corresponding forwarding or interception policy is executed.

2. The programmable data plane-oriented user activity classification method according to claim 1, characterized in that the characteristic data comprises ethernet type, IP protocol number, source and destination IP address, source and destination port number, duration of network flow, size range of packets, number of packets.

3. The programmable data plane-oriented user activity classification method according to claim 1, characterized in that, in view of the characteristics of network user activity, network traffic is divided into time windows; each of the time windows consists of a plurality of network flows, wherein one network flow is defined as the collection of all packets having the same source IP address, destination IP address, source port, destination port, and transport layer protocol.

4. The programmable data plane-oriented user activity classification method according to claim 1, characterized in that the activities of users in the network are inferred by machine learning, and a two-layer traffic classification model is constructed in the learning plane with the scikit-learn machine learning toolkit:

flow-level layer: each flow is associated with one of K possible clusters based on an unsupervised K-Means algorithm, indicating the type of behavior that the network flow exhibits over the time interval; network flows in the window are accumulated in a one-dimensional array of size K after being clustered, and the number of occurrences of each behavior type in the time interval is counted;

window level layer: based on a supervised decision tree algorithm, combining the characteristic parameters in the window level layer with the one-dimensional array generated in the flow level layer to serve as an input characteristic set of the model, and identifying the user activity of each window through the process.

5. The programmable data plane-oriented user activity classification method according to claim 1, characterized in that a machine learning to P4 language conversion method is constructed in the control plane, comprising the following steps:

S1, analyzing the trained K-Means model and extracting K cluster centers, wherein each cluster center consists of m coordinate values, one per feature, and determining the similarity between a given input x and a given cluster c using the squared Euclidean metric as follows:

$$d(x, c) = \sum_{i=1}^{m} (x_i - c_i)^2$$

wherein x represents a given input sample, c represents a certain cluster center, and i represents the ith characteristic parameter;

s2, analyzing the trained decision tree model, traversing all nodes in the tree, and converting the judgment of the nodes into if-else statements; if the if condition is satisfied, the left sub-tree of the current judging node is continuously executed, if the if condition is not satisfied, the right sub-tree area is entered, and else statements are added until the leaf nodes are traversed, which indicates the end of the classification condition.

6. The programmable data plane-oriented user activity classification method according to claim 1, characterized in that the generation, compilation and deployment of P4 programs are done automatically in the control plane, mapping the two-layer model into programmable switches.

7. Use of a programmable data plane oriented user activity classification method according to any of claims 1 to 6, characterized in that it comprises the following steps:

S1, extracting data packet features:

step S101, parsing the Ethernet layer of a data packet and determining whether the data packet conforms to the IPv4 protocol; if not, no subsequent processing is performed; if so, parsing the IPv4 layer;

step S102, parsing the IPv4 layer of the data packet, extracting the source IP address, destination IP address, source port, destination port, data packet size, transport layer protocol and arrival timestamp of the current data packet, and forming a character string representing the network flow key in the order source IP address, destination IP address, source port, destination port, transport layer protocol;

S2, storing and updating data packet features:

step S201, storing the feature information extracted in step S102 into a storage structure consisting of a sketch and a hash table; the sketch data structure is a Count-Min sketch, namely a two-dimensional array of d rows and w columns; the hash table is a one-dimensional array of n buckets, wherein a bucket is a structure capable of storing a plurality of feature parameters of one network flow;

step S202, taking the network flow key formed from the network flow feature string as input, determining the storage position of the network flow in the hash table through a hash function f(key) % n;

step S203, checking whether the network flow exists in the hash table; if the network flow is already stored at the position or the position is empty, directly updating the corresponding parameter information; if the position is not empty and holds a different network flow, a hash collision has occurred;

step S204, when a hash collision occurs, storing the network flow into the sketch, comparing the size of the network flow in the sketch with the size of the network flow at the colliding position in the hash table, and deciding whether to perform a replacement; if the size of the network flow in the sketch is larger than that of the network flow at the colliding position in the hash table, performing the replacement; otherwise, not performing the replacement;

step S205, initializing the storage structure formed by the sketch and the hash table in step S201 at time t1; if the current time t2 satisfies t2 - t1 > threshold, where threshold is a preset value, serializing and exporting the data in the hash table, handing the data to a classifier to perform classification, and re-initializing the storage structure;

S3, classifying user activities:

step S301, acquiring feature data: traversing the serialized feature data and inputting the features of each network flow into the classifier;

step S302, clustering based on flow-level features: inputting the features of a single network flow into the K-Means clustering model, associating the network flow with one of K possible clusters, accumulating the clustering results in an array of size K, and counting the number of occurrences of each behavior type;

step S303, classification based on window-level features: after traversing all network flows in the time interval, merging the window-level feature parameters with the array generated at the flow level, inputting the merged array into the decision tree classification model, and determining the activity of the user in the time interval;

step S304, classification result statistics: after completing the classification task for a time interval, the classifier uploads the classification result to the control plane for statistics.