CN107634874A

CN107634874A - P2P traffic detection method based on a BP neural network in an SDN environment

Info

Publication number: CN107634874A
Application number: CN201710785712.2A
Authority: CN
Inventors: 施佺; 刘德靖; 曹阳; 孙玲; 许致火; 邵叶秦; 朱森来; 沈琴琴
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2017-09-01
Filing date: 2017-09-01
Publication date: 2018-01-26

Abstract

The present invention relates to a P2P traffic detection method based on a BP neural network in an SDN environment, comprising the following steps: Step 1) a BP neural network model is built in the SDN environment, where the input of the model comprises features used to distinguish P2P traffic and the output of the model is the P2P traffic detection result; Step 2) the input features are passed through a set normalization formula and then fed directly into the neural network for training; Step 3) Step 2) is repeated, and as the number of training generations increases, the neural network is considered to have converged once the error between one output and the previous output approaches a certain value, at which point training of the BP neural network model is complete and the trained model is used to analyze and detect the remaining data. Beneficial effects: the P2P traffic detection method of this application works from a large body of data, mines and analyzes it in depth, and finds the information that distinguishes P2P traffic from non-P2P traffic, giving network administrators a basis for identifying P2P behavior.

Description

BP neural network-based P2P flow detection method in SDN environment

Technical Field

The invention relates to a P2P flow detection technology in an SDN environment, in particular to a P2P flow detection method based on a BP neural network in the SDN environment.

Background

Much research has been carried out on SDN networks, but work on P2P traffic detection in SDN remains very rare; P2P traffic detection in SDN is therefore chosen as the subject of this research.

P2P traffic traversing a network device consists of numerous small packets, and studying packets individually raises two problems. First, it is meaningless to assign a single packet to a class, since many packets look alike and a packet of a given size can be generated by any software. Second, a single packet rarely reflects the overall situation, because a packet can only be analyzed coherently when the packets before and after it are also known.

When network data is monitored by a traditional method, a large amount of flow information is usually stored, and the flow information is stored in a database in a certain format for people to check historical data. The traffic detection systems of many computer rooms usually have a lot of historical data, and although the detection systems can perform some simple analyses on the historical data, the detection systems do not perform further mining on a large number of data, and ignore a lot of valuable information hidden in the data.

The traditional approach is offline P2P traffic detection: it operates on flow information held in a storage device, performs statistical analysis and feature extraction on that information, and finally fits a model to the extracted features. The method can be applied to analyzing network usage. For example, when a network is congested and the suspected cause is that too many users are downloading files with P2P software, the flow features required by the offline P2P detection method can first be extracted and then analyzed with that method to determine which flows are P2P traffic and which are not. Once the P2P traffic has been identified, its sources and its share of total traffic can be analyzed, and the administrator can use the results to take measures against the users generating P2P traffic, such as reminding them to use P2P software less or changing the charging mode. The advantage of the method is that the model can quickly pick P2P traffic out of a large amount of mixed traffic and yield an analysis report. Typical application scenarios include:

(1) When a network failure occurs, the network's traffic data is already stored in the database, so the offline traffic detection method can be used to locate the failed network segment, the hosts that may have caused the failure, and so on.

(2) When historical data is reviewed, abnormal conditions over a past period can be counted and analyzed with the offline method.

Although the detection method suits the scenarios above, it has a shortcoming it cannot overcome. Most obviously, although it can separate P2P traffic from stored traffic, it cannot control the transmission of traffic in real time based on the detection result; it can only predict possible future behavior from the results of data analysis. It therefore cannot be applied to control scenarios, which greatly limits its use.

In conventional networks, many studies are devoted to analyzing flow information held in storage devices. The thesis "Dungli. P2P traffic detection research based on a genetic neural network [D]. University of Electronic Science and Technology, 2013" first uses half-open connections to pre-screen the P2P traffic information, then uses some characteristics of those half-open connections as the input of a neural network, and finally uses the fitted model to detect P2P and non-P2P traffic. The thesis "P2P traffic detection on large-scale NetFlow data [D]. Fudan University, 2008" filters P2P and non-P2P traffic purely according to characteristic information such as the number of half-open connections and characteristic strings. The paper "Schbo, Chen Ming, Hu Chao, Sun Rui. NetFlow P2P traffic analysis system [J]. Journal of Beijing University of Posts and Telecommunications, 2010, (02): 94-98" describes a system that makes decisions based on statistical information generated by NetFlow. In conventional networks, acquiring such network traffic data requires collection and analysis by a dedicated device. Many studies also address traffic in SDN: for example, building a traffic information matrix from the start and end points of the traffic and estimating the traffic by modeling the information in the matrix, or detecting network viruses and the like from the source and destination address information of the traffic. However, the existing literature on P2P traffic detection in SDN is still relatively sparse.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art and provides a P2P traffic detection method based on a BP neural network in an SDN environment, realized by the following technical scheme:

A P2P traffic detection method based on a BP neural network in an SDN environment comprises the following steps:

Step 1) constructing a BP neural network model in the SDN environment, where the input of the model comprises features for distinguishing P2P traffic and the output of the model is the P2P traffic detection result;

Step 2) passing the features through a set normalization formula and feeding them directly into the neural network for training;

Step 3) determining the final output value of each node of the neural network through a Sigmoid activation function;

Step 4) repeating Step 2) and Step 3); as the number of training generations increases, the neural network has converged once the error between the current output and the previous output approaches a constant, at which point training of the BP neural network model is complete and the trained model is used to analyze and detect the remaining data.

In a further design of the method, the features used in Step 1) to distinguish P2P traffic cover five dimensions: average packet size, average flow duration, traffic size, number of interacting IPs, and number of ports.

In a further design of the method, a host whose port numbers are distributed above 1024 is regarded as exhibiting P2P traffic behavior.

In a further design of the method, the P2P traffic detection result takes one of two coded outputs, 10 and 01: 10 indicates that the neural network output is not P2P traffic, and 01 indicates that it is P2P traffic.

In a further design of the method, the normalization formula is as shown in formula (1).

In the formula, Value denotes the value currently being processed, MinValue denotes the minimum of all values in the current attribute list, MaxValue denotes the maximum of all values in the current attribute list, and δ denotes the gradient of change.

In a further design of the method, the activation function of each neuron in the BP neural network model is a Sigmoid function, and the result computed by the activation function determines the input of the nodes in the next layer, as shown in formula (2):

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$

where e denotes the natural constant and x denotes the input value of a node of the neural network.

The hidden-layer output $H_o$ of the BP neural network is computed as in formula (3).

Each connection weight is initialized to a random number in (-1, 1). The hidden layer outputs the value $H_o$, where o denotes the o-th node of the hidden layer. For the input training set $\langle P_i, O_j \rangle$, $P_i$ denotes the numerical features of the input and $O_j$ denotes the final expected result; i ranges over 1, 2, 3, …, n, where n is the number of input features, and j indexes the coded bits of the output result. The connection weights between the input layer and the hidden layer, and between the hidden layer and the output layer, are denoted $W_{io}$, where i denotes the i-th node of the previous layer and o denotes a node of the next layer; i ranges over 1, 2, 3, …, n, where n is the number of kinds of input features, and j ranges over 1, 2, 3, …, m, where m is the number of coded bits of the output result.

The output value of an output-layer node is given by formula (4), where o denotes the o-th node of the hidden layer and on denotes the connection between the o-th node of the hidden layer and the n-th node of the output layer.

In formula (4), x denotes the output value of a hidden node.

The descent gradient is given by formula (5).

The weight update formula is:

$$w(t+1) = w(t) + \delta\,\varepsilon\,O_n \qquad (6)$$

where δ denotes the gradient of change, ε denotes the learning rate, and t denotes the training generation.
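For orientation, the standard three-layer BP relations that fit the symbols defined above can be written as follows. This is the textbook formulation, given here as an assumption and not necessarily the exact expressions of formulas (3) to (5): $Y_j$ (the actual output of output node j), the subscripts on $W$, the target-minus-output form of the error term, and the absence of bias terms are all illustrative choices. The last line reads formula (6) for a hidden-to-output weight, assuming that $O_n$ there denotes the output of the node feeding the weight.

```latex
\begin{aligned}
f(x) &= \frac{1}{1 + e^{-x}} \\
H_o &= f\Big(\sum_{i=1}^{n} W_{io}\, P_i\Big) \\
Y_j &= f\Big(\sum_{o} W_{oj}\, H_o\Big) \\
\delta_j &= (O_j - Y_j)\, Y_j\, (1 - Y_j) \\
W_{oj}(t+1) &= W_{oj}(t) + \delta_j\, \varepsilon\, H_o
\end{aligned}
```

The factor $Y_j (1 - Y_j)$ is the derivative of the Sigmoid function evaluated at the output node, which is why a Sigmoid activation makes the gradient cheap to compute.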

Advantageous Effects of the Invention

The method analyzes the flow information held in the storage device, extracts its useful features, and feeds those features into the neural network for training, yielding a model that can distinguish P2P traffic. The BP neural network model fitted by the method of the invention performs well in distinguishing P2P traffic from non-P2P traffic.

Drawings

Fig. 1 is a diagram illustrating average packet sizes.

Fig. 2 is a schematic diagram of port distribution.

Fig. 3 is a schematic of flow duration.

Fig. 4 is a graph showing the magnitude of the average flow rate.

Fig. 5 is a diagram illustrating the number of interactive IPs.

FIG. 6 is a diagram illustrating the training error of the BP neural network.

Fig. 7 is an overall network topology.

Detailed Description

The following describes the embodiments of the present invention in detail with reference to the accompanying drawings.

The P2P traffic detection method based on the BP neural network in the SDN environment provided in this embodiment includes the following steps:

Step 1) constructing a BP neural network model, where the input of the model comprises the features for distinguishing P2P traffic and the output of the model is the P2P traffic detection result.

Step 2) passing the input features through a set normalization formula and feeding them directly into the neural network for training.

Step 3) repeating Step 2); as the number of training generations increases, the neural network has converged once the error between one output and the previous output approaches a certain value, at which point training of the BP neural network model is complete and the trained model is used to analyze and detect the remaining data.
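The stopping rule in Step 3) can be sketched as a small helper. This is a minimal illustration, not the patented implementation: the function name `has_converged`, the tolerance, and the requirement that the change stay small for several consecutive generations (`patience`) are assumptions introduced here.

```python
def has_converged(error_history, tolerance=1e-4, patience=3):
    """Return True once the change in training error between successive
    generations has stayed below `tolerance` for `patience` generations.

    `error_history` holds one training-error value per generation, oldest
    first. Both `tolerance` and `patience` are illustrative defaults.
    """
    if len(error_history) < patience + 1:
        return False  # not enough generations to judge yet
    recent = error_history[-(patience + 1):]
    # Compare each error with the one from the previous generation.
    return all(abs(later - earlier) < tolerance
               for earlier, later in zip(recent, recent[1:]))
```

A training loop would append the error of each generation to `error_history` and stop as soon as the helper returns True.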

In this embodiment, the characteristics for distinguishing P2P traffic include five dimensions, which are an average packet size, an average flow duration, a traffic size, an interactive IP number, and a port number. The selection of the five dimensions is explained in detail below:

In network traffic information, each specific kind of traffic has its own value distribution on each attribute, and a kind of traffic can be identified by combining these distribution characteristics. The features are therefore selected by analyzing how the values of different kinds of traffic are distributed.

To analyze features, the analysis must start from the most primitive information, which reflects the most essential properties. As shown in Table 1, the original information contains attributes such as the number of packets, the number of bytes passing through the flow, the flow duration, the source port number, the destination port number, the source address, and the destination address. The number of packets is the number of packets that have passed through the flow since it was created; the number of bytes is the number of bytes that have passed through the flow from its creation to the current time; the flow duration is how long the flow has existed since its creation; the source and destination port numbers are the host port numbers the flow occupies on the local and remote hosts; and the source and destination addresses are the IP addresses of the source and destination hosts. The flow information in Table 1 is the rawest form of the data. Some of it, such as the source and destination addresses, is inconvenient to count, and for some of it, such as the source and destination port numbers, direct statistics carry no meaning. To obtain useful information, further analysis and processing are required to find regular patterns that can then be modeled.

TABLE 1

Analysis of the P2P protocol shows that a P2P client can download from many hosts in the network at the same time. Because each user's bandwidth is limited, only a small part of the file is fetched from each user's host at a time so as not to degrade that user's experience. In the traffic information, the average packet size of a P2P user is therefore relatively small; the average packet sizes obtained after analysis are shown in Fig. 1.

As Fig. 1 shows, the average packet size of the machine running the P2P program (10.0.0.1) is concentrated below 2 bytes, while that of the machines running non-P2P programs (all machines other than 10.0.0.1) is concentrated above 2 bytes. In this experiment the average packet size is therefore taken as one basis for distinguishing P2P traffic from non-P2P traffic.

The network port number is the identifier used to locate a particular program on a host. Early P2P programs ran on specific port numbers. As P2P programs developed, their user base grew and they occupied more and more network bandwidth, so many networks began to rate-limit P2P programs to protect user experience; the most basic approach was to limit them by the ports on which they ran. In recent years, to evade such limits, P2P traffic has tended to disguise itself as normal traffic, so the ports it uses are usually above 1024. Statistical analysis of the port numbers occupied by the flows gives the result shown in Fig. 2. Fig. 2 shows that the ports of the host running the P2P program are generally distributed above 1024, while hosts without P2P behavior show no such trend, so the port distribution can be used as a feature for distinguishing P2P traffic from non-P2P traffic.

The files downloaded by P2P programs are typically large, especially files such as high-definition movies. In the OpenFlow switch, therefore, when the P2P host interacts with the hosts serving the file, the corresponding flow stays in the switch for a long time; that is, the flow duration is long. Non-P2P hosts behave in bursts, so their flow durations may be either long or short. The result of the analysis is shown in Fig. 3.

In Fig. 3, the abscissa is the period and the ordinate is the time in milliseconds. The flow durations of the host with P2P behavior (IP address 10.0.0.1) generally fall between 600 and 1200, while those of the hosts without P2P behavior fall between 0 and 800. The flow duration can therefore be taken as a basis for detecting P2P traffic.

In the experiment, statistics were collected with a period of 5 s. Different applications move different amounts of data in each period. A browser, for example, behaves in bursts and may pass very little traffic in a given period, whereas a P2P program passes a large amount of traffic in every period of its stable phase, that is, in all but the first and last periods. The analysis gives the result shown in Fig. 4, where the abscissa is the period and the ordinate is the number of bytes passed. The figure shows that, in most periods, the host with P2P behavior (10.0.0.1) passes significantly more bytes per period than the hosts without P2P behavior (10.0.0.3 and 10.0.0.4). The traffic passing through a flow in each period can therefore be used to judge roughly whether a host communicating over that flow exhibits P2P behavior.

One of the main reasons P2P programs can download files quickly is that they can fetch different parts of the required file from many P2P server hosts at once. A distinctive feature of a P2P program is therefore that it interacts with many hosts at the same time. In the flow table of the OpenFlow switch, this appears as the P2P host occupying more flows within a period than a host without P2P behavior. The resulting numbers of interacting IPs are shown in Fig. 5.

In Fig. 5, the abscissa is the period and the ordinate is the number of hosts interacted with simultaneously. The results show that this number differs significantly between the host with P2P behavior (10.0.0.1) and the hosts without it, so it can serve as an important feature for distinguishing P2P hosts from non-P2P hosts.

In the above section, various features were analyzed, and the features selected to be useful for distinguishing between P2P traffic and non-P2P traffic were combined into a table, the final form of which is shown in table 2.

TABLE 2

Table 2 is the digitized form of the final statistical features: average packet size, average flow duration, traffic size, number of interacting IPs, and number of ports greater than 1024. In the table, the average packet size is the total number of bytes flowing through a flow during a period divided by the flow time; the average flow duration is the mean duration of the flows within a period; the traffic size is the total number of bytes passing through a host within a period; the number of interacting IPs is the total number of other hosts with which a host interacts within a period; the port number is the number of flows in a period whose port number is greater than 1024; the period identifies the collection period in which the row's data was gathered; and the tag indicates whether the row is P2P traffic, with 0 for non-P2P traffic and 1 for P2P traffic.
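As a rough illustration of how one row of Table 2 could be assembled from the raw attributes of Table 1, the sketch below aggregates the flows of one host within one collection period. The names `FlowRecord` and `build_feature_row` are hypothetical, as are the field names. Two readings are assumed: the first feature follows the wording above (bytes divided by flow time), although dividing by the packet count would be the more conventional reading of "average packet size"; and a flow is counted for the port feature only when both of its ports exceed 1024.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    """One flow-table entry with the raw attributes listed for Table 1."""
    packets: int
    byte_count: int
    duration_ms: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def build_feature_row(host_ip, flows, period, label):
    """Aggregate the flows of `host_ip` in one period into a feature row."""
    own = [f for f in flows if host_ip in (f.src_ip, f.dst_ip)]
    if not own:
        return None  # the host had no flows in this period
    total_bytes = sum(f.byte_count for f in own)
    total_time = sum(f.duration_ms for f in own)
    # Every distinct address on the other end of a flow is one peer.
    peers = {f.dst_ip if f.src_ip == host_ip else f.src_ip for f in own}
    # Assumption: both ports must be above 1024 for the flow to count.
    high_ports = sum(1 for f in own
                     if f.src_port > 1024 and f.dst_port > 1024)
    return {
        "avg_packet_size": total_bytes / total_time if total_time else 0.0,
        "avg_flow_duration": total_time / len(own),
        "traffic_size": total_bytes,
        "interacting_ips": len(peers),
        "ports_above_1024": high_ports,
        "period": period,
        "label": label,  # 1 = P2P traffic, 0 = non-P2P traffic
    }
```

Keeping the period and the label in the same row mirrors the layout described for Table 2, so a list of such rows can be used directly as a training set.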

In this embodiment, the two P2P traffic detection results are 10 and 01 coded outputs, where 10 indicates that the output of the neural network is not the P2P traffic, and 01 indicates that the output of the neural network is the P2P traffic.
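The two-bit coding can be illustrated with a pair of helpers. The function names are hypothetical, and the decoding rule (pick whichever output node has the larger activation) is an assumption, since the text only specifies the ideal codes 10 and 01.

```python
def encode_label(is_p2p):
    """Return the target code: (0, 1) for P2P traffic, (1, 0) otherwise."""
    return (0, 1) if is_p2p else (1, 0)


def decode_output(outputs):
    """Map the two output-node activations to a traffic class.

    Real network outputs are rarely exactly 0 or 1, so the node with the
    larger activation decides the class (an assumed decoding rule).
    """
    not_p2p_score, p2p_score = outputs
    return "P2P" if p2p_score > not_p2p_score else "non-P2P"
```

For example, an output of (0.1, 0.9) is closest to the code 01 and would be read as P2P traffic.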

When the input features were fed directly into the neural network for training, the final training error often failed to converge even after long training and instead oscillated. Analysis of the training results showed that the input features span different ranges and take large values; in particular, the attribute values of average flow duration, traffic size, number of interacting IPs, and port number are too large, which makes the training error jump widely and converge with difficulty. To solve this problem, the program normalizes the average flow duration, traffic size, number of interacting IPs, and port number separately, using the normalization formula (1).

In the formula, Value denotes the value currently being processed, MinValue denotes the minimum of all values in the current attribute list, MaxValue denotes the maximum of all values in the current attribute list, and δ denotes the gradient of change. The denominator in formula (1) is written as MaxValue - MinValue + 1 so that it cannot become 0 when the maximum and the minimum are equal, which rules out that illegal case. After this adjustment, the parameters are fed into the neural network for training and the training converges smoothly; the relationship between the training error and the number of training generations is shown in Fig. 6.
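A minimal sketch of such a normalization is shown below. Only the denominator MaxValue - MinValue + 1 is taken from the text; the numerator Value - MinValue is the usual min-max choice and is an assumption here, and δ is not used. The function name is hypothetical.

```python
def normalize_column(values):
    """Scale one attribute column into the range [0, 1).

    Assumed form: (value - min) / (max - min + 1). The "+ 1" keeps the
    denominator non-zero when every value in the column is the same.
    """
    lowest, highest = min(values), max(values)
    span = highest - lowest + 1
    return [(value - lowest) / span for value in values]
```

Each of the four large-valued attributes (average flow duration, traffic size, number of interacting IPs, and port number) would be passed through the function as its own column, so that one attribute's range does not influence another's.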

In this embodiment, the activation function of each neuron in the BP neural network model is a Sigmoid-type function, as shown in formula (2):

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$

The hidden-layer output $H_o$ of the BP neural network is computed as in formula (3).

Each connection weight is initialized to a random number in (-1, 1). The hidden layer outputs the value $H_o$, where o denotes the o-th node of the hidden layer. For the input training set $\langle P_i, O_j \rangle$, $P_i$ denotes the numerical features of the input and $O_j$ denotes the final output result; i ranges over 1, 2, 3, …, n, where n is the number of kinds of input features (n = 5 in this embodiment), and j indexes the coded bits of the output result (2 bits in this embodiment). The connection weights between the input layer and the hidden layer, and between the hidden layer and the output layer, are denoted $W_{ij}$, where i denotes a node of the previous layer and j denotes a node of the next layer. i ranges over 1, 2, 3, …, n, where n is the number of kinds of input features, so n = 5 for the five input features of this embodiment; j ranges over 1, 2, 3, …, m, where m is the number of coded bits of the output result, which is 2 here because the output is coded as 01 or 10.

The output value of an output-layer node is given by formula (4).

In formula (4), x denotes the output value of a hidden node.

The descent gradient is given by formula (5).

The weight update formula is:

$$w(t+1) = w(t) + \delta\,\varepsilon\,O_n \qquad (6)$$

where δ denotes the gradient of change and ε denotes the learning rate.
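The procedure above can be sketched in plain Python as a small three-layer network with five inputs and two outputs. This is an illustrative sketch under stated assumptions rather than the embodiment's code: the class name `BPNetwork`, the hidden-layer size of 6, the learning rate of 0.5, the squared-error loss, and the absence of bias terms are all choices made here, and formula (6) is read as w(t+1) = w(t) + δ·ε·O_n, with O_n taken to be the output of the node feeding the weight.

```python
import math
import random


def sigmoid(x):
    """Sigmoid activation, f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


class BPNetwork:
    def __init__(self, n_in=5, n_hidden=6, n_out=2, learning_rate=0.5, seed=0):
        rng = random.Random(seed)
        self.lr = learning_rate
        # Connection weights start as random numbers between -1 and 1.
        self.w_ih = [[rng.uniform(-1, 1) for _ in range(n_hidden)]
                     for _ in range(n_in)]
        self.w_ho = [[rng.uniform(-1, 1) for _ in range(n_out)]
                     for _ in range(n_hidden)]

    def forward(self, features):
        """Return (hidden outputs H_o, output-layer values)."""
        n_hidden, n_out = len(self.w_ho), len(self.w_ho[0])
        hidden = [sigmoid(sum(features[i] * self.w_ih[i][o]
                              for i in range(len(features))))
                  for o in range(n_hidden)]
        outputs = [sigmoid(sum(hidden[o] * self.w_ho[o][j]
                               for o in range(n_hidden)))
                   for j in range(n_out)]
        return hidden, outputs

    def train_step(self, features, target):
        """One backpropagation update; returns the squared error before it."""
        hidden, outputs = self.forward(features)
        # Gradient term of each output node (error times Sigmoid derivative).
        delta_out = [(target[j] - outputs[j]) * outputs[j] * (1 - outputs[j])
                     for j in range(len(outputs))]
        # Gradient term of each hidden node, propagated back from the outputs.
        delta_hid = [hidden[o] * (1 - hidden[o]) *
                     sum(delta_out[j] * self.w_ho[o][j]
                         for j in range(len(outputs)))
                     for o in range(len(hidden))]
        # w(t+1) = w(t) + gradient * learning rate * output of the source node
        for o in range(len(hidden)):
            for j in range(len(outputs)):
                self.w_ho[o][j] += self.lr * delta_out[j] * hidden[o]
        for i in range(len(features)):
            for o in range(len(hidden)):
                self.w_ih[i][o] += self.lr * delta_hid[o] * features[i]
        return sum((target[j] - outputs[j]) ** 2
                   for j in range(len(outputs))) / 2
```

Training consists of calling `train_step` on every normalized feature row, once per generation, with the target (0, 1) for P2P rows and (1, 0) for non-P2P rows, until the summed error stops changing.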

The intended scenario of the BP neural network P2P traffic detection method is the detection of flow information held in storage devices. That information must first be preprocessed into data that is easy to handle; the feature information is then fitted with the BP neural network, and a model is finally obtained. Because the model is fitted to data from a specific time period, whether it generalizes is critical to whether the method can be applied. This embodiment verifies the generality of the model mainly through the accuracy of model detection and the accuracy of tests on mixtures of different kinds of traffic. The test machine is a virtual machine with 4 cores and 8 GB of memory, the operating system is Ubuntu 14.04, the network emulator is Mininet, and the SDN controller is POX. The overall network topology is shown in Fig. 7.

In Fig. 7, the network consists of four hosts, three OpenFlow switches, and one controller. The four hosts run different programs, including a P2P program, a mail program, and a browser. All four hosts are connected to OpenFlow switch 1, which carries a built-in module that forwards the traffic passing through the switch to the external network, allowing the applications behind the switch to interact with the external network. The controller collects flow information from OpenFlow switch 1: because this switch acts as an aggregation switch through which all flows pass, the controller can obtain the flow information of all hosts from it.

The BP-neural-network-based P2P traffic detection model extracts the P2P traffic information of a given period, obtains its features, and then makes a judgment. Because the features are extracted from one particular period, it is unknown how well detection generalizes, and the effectiveness of the trained model can only be established by testing data from multiple periods. The experiment was therefore repeated with traffic data from different periods; the average results are shown in Table 3.

TABLE 3

The data in Table 3 are averages of test results on data collected on different days, and the figures in the table are proportions. The results show that, across the multiple sets of experiments, the detection rate of the model is about 80%, demonstrating that the trained model can stably detect the P2P traffic within the overall traffic.
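The text does not define the detection rate explicitly. One plausible reading, assumed in the sketch below, is the fraction of true P2P samples that the model also flags as P2P; the function name is hypothetical, and labels follow the tag convention of 1 for P2P traffic and 0 for non-P2P traffic.

```python
def detection_rate(predicted, actual):
    """Fraction of true P2P samples (label 1) that were predicted as P2P.

    Assumed definition; returns 0.0 when the data contains no P2P sample.
    """
    p2p_total = sum(1 for label in actual if label == 1)
    if p2p_total == 0:
        return 0.0
    hits = sum(1 for guess, label in zip(predicted, actual)
               if label == 1 and guess == 1)
    return hits / p2p_total
```

Averaging this value over test sets collected on different days would give a single proportion of the kind reported in Table 3.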

Much existing download software exploits the fast-download characteristic of P2P. Several commonly used P2P download programs on Linux were selected, and their detection rates were measured with the BP-neural-network-based model; the results are shown in Table 4.

TABLE 4

Name of software	Detection rate
qBittorrent	79%
aMule	61%
Deluge	82%

Table 4 covers qBittorrent, aMule, and Deluge. The detection rates of qBittorrent and Deluge reach 79% and 82% respectively, while that of aMule is 61%. The likely reason is that qBittorrent and Deluge are built on the BitTorrent protocol: many resources are available on the Internet and downloads are fast, so the detection rate is relatively high. aMule is built on the eDonkey protocol, and its downloads were slow in these tests, which affects features such as the number of interacting IPs and the average flow duration. Its detection rate nevertheless still reaches about 60%, which indicates that the method can, to a certain extent, also detect eDonkey-based P2P traffic.

In the P2P traffic detection method based on the BP neural network in the SDN environment of this embodiment, the characteristics that set P2P traffic apart from other traffic are analyzed, and a P2P traffic detection model is obtained by fitting these characteristics with the BP neural network. The test part examined two questions: first, whether the method correctly detects traffic from different time periods; second, whether it also works for traffic generated by different P2P software. After sufficient testing on both questions, the BP neural network model fitted by the method was verified to perform well in distinguishing P2P traffic from non-P2P traffic.

The P2P traffic detection method based on the BP neural network in the SDN environment provided by the present invention is described in detail above, so as to facilitate understanding of the present invention and its core idea. For a person skilled in the art, many modifications and alterations are possible in the concrete implementation according to the core idea of the invention. In view of the above, this description should not be taken in a limiting sense.

Claims

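1. A P2P traffic detection method based on a BP neural network in an SDN environment, characterized by comprising the following steps:

step 1) constructing a BP neural network model in the SDN environment, wherein the input of the BP neural network model comprises features for distinguishing P2P traffic, and the output of the BP neural network model is the P2P traffic detection result;

step 2) passing the features through a set normalization formula and feeding them directly into the neural network for training;

step 3) determining the final output value of each node of the neural network through a Sigmoid activation function;

step 4) repeating step 2) and step 3), wherein, as the number of training generations increases, the neural network has converged once the error between the current output and the previous output approaches a constant, training of the BP neural network model is then complete, and the trained BP neural network model is used to analyze and detect the remaining data.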

2. The method of claim 1, wherein the characteristics used in step 1) to distinguish P2P traffic comprise five dimensions, which are average packet size, average flow duration, traffic size, number of interacting IPs and number of ports.

3. The method of claim 2, wherein the distribution of host port numbers greater than 1024 is set as a P2P traffic behavior.

4. The method of claim 1, wherein the P2P traffic detection result takes one of two coded outputs, 10 and 01, where 10 indicates that the output of the neural network is not P2P traffic, and 01 indicates that the output of the neural network is P2P traffic.

5. The method of claim 1, wherein the normalization formula is given by formula (1):

6. The method of claim 1, wherein the activation function of each neuron in the BP neural network model is a Sigmoid-type function, and the result of the activation function determines the input of the next-layer nodes, as shown in formula (2):

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$

the hidden-layer output $H_o$ of the BP neural network is computed as in formula (3);

each connection weight is initialized to a random number in (-1, 1), and the hidden layer outputs the value $H_o$, where o denotes the o-th node of the hidden layer; for the input training set $\langle P_i, O_j \rangle$, $P_i$ denotes the numerical features of the input and $O_j$ denotes the final expected result; i ranges over 1, 2, 3, …, n, where n is the number of input features, and j indexes the coded bits of the output result; the connection weights between the input layer and the hidden layer, and between the hidden layer and the output layer, are denoted $W_{io}$, where i denotes the i-th node of the previous layer and o denotes a node of the next layer; i ranges over 1, 2, 3, …, n, where n is the number of kinds of input features, and j ranges over 1, 2, 3, …, m, where m is the number of coded bits of the output result;

in formula (4), x denotes the output value of a hidden node;

the descent gradient is given by formula (5);

the weight update formula is:

$$w(t+1) = w(t) + \delta\,\varepsilon\,O_n \qquad (6)$$