CN103780501B

CN103780501B - Peer-to-peer network traffic identification method based on a non-separable wavelet support vector machine

Info

Publication number: CN103780501B
Application number: CN201410017016.3A
Authority: CN
Inventors: 汪绪彪; 王文彬; 王宏昕; 孙媛; 伍又云; 任艳梅
Original assignee: Puyang Vocational and Technical College
Current assignee: Puyang Vocational and Technical College
Priority date: 2014-01-03
Filing date: 2014-01-03
Publication date: 2017-02-15
Anticipated expiration: 2034-01-03
Also published as: CN103780501A

Abstract

The invention relates to a peer-to-peer (P2P) network traffic identification method based on a non-separable wavelet support vector machine (SVM), comprising the following steps: (1) selecting a feature vector, using the three-dimensional feature vector Vector = <v1, v2, v3>, where v1 is the mean square deviation of the change in packet size, v2 is the ratio of uplink to downlink speed at the node, and v3 is the ratio of the number of IP addresses to the number of ports; (2) selecting an appropriate kernel function; (3) selecting an incremental training algorithm; (4) applying the Boosting algorithm for wavelet-SVM P2P traffic identification, which obtains a strong classifier H(x) by weighted voting and uses it to identify P2P traffic. The invention can identify P2P network traffic efficiently, so that countermeasures can be taken in time and P2P network traffic can be controlled effectively.

Description

A peer-to-peer network traffic identification method based on a non-separable wavelet support vector machine

Technical Field

The invention relates to a peer-to-peer network traffic identification method based on a non-separable wavelet support vector machine and belongs to the technical field of computer peer-to-peer networks.

Background

Peer-to-peer computing (P2P) is developing rapidly. As a new network communication model, P2P has been listed as one of the technologies that will shape the future development of the Internet. Alongside grid computing and cloud computing, it has become a research focus in distributed computing and attracts growing attention from researchers. P2P has no exact definition yet, but its ideas have changed how people understand the Internet. The biggest difference between a P2P network and a traditional network is that P2P lets two users connect to each other directly to transfer and share files, replacing the server/client transmission model of traditional networks. A node that requests a resource is also a provider of that resource, so the more nodes request the same resource, the faster it downloads, which markedly improves the speed and efficiency of data transmission.

The rapid development of P2P technology has also brought many problems, in the following respects: (1) Heavy bandwidth consumption: P2P applications such as video sharing and high-definition video occupy a large share of network bandwidth and consume excessive network resources. The resulting congestion prevents other normal network services from running, infringes on the rights of users of non-P2P applications, and harms the interests of ISPs. (2) Network security: as P2P applications have spread, large numbers of viruses, Trojan horse programs, and unhealthy content have exploited them to propagate rapidly across the Internet, giving hackers and criminals an opening and endangering users' interests and security. (3) Copyright in P2P file sharing: according to statistics, more than 80% of content downloaded over P2P is suspected of piracy or infringement, which harms the interests of original authors. With the spread of 3G networks, in 2009 the State Administration of Radio, Film and Television stepped up its crackdown on pornographic content, piracy, and similar problems on P2P download sites.

Network security, manageability, and the availability of traditional applications are therefore all under challenge. Strengthening network traffic monitoring requires a thorough understanding and analysis of P2P traffic and network behavior, which in turn provides technical support for managing and monitoring P2P networks. Unlike traditional Web traffic, P2P traffic is difficult to manage and control: (1) No fixed protocol standard: P2P applications use proprietary protocols, so ordinary firewall technology cannot filter P2P traffic completely. (2) Dynamic ports: to evade detection based on fixed ports, P2P applications use dynamic ports. Typical examples are PPlive and Skype, which let the user change the default port; this more flexible port configuration makes it harder to identify P2P traffic correctly. (3) Disguise as normal traffic: P2P applications such as Kazaa disguise their packet format as HTTP traffic during transmission, which makes them even harder to identify. (4) Traffic encryption: applications such as Skype encrypt their packets, so methods based on matching application-layer signatures cannot identify the encrypted P2P traffic.

Therefore, managing P2P traffic first requires identifying it. Studying the characteristics of P2P network traffic in depth and choosing a suitable identification model, so that P2P traffic can be identified efficiently, countermeasures taken in time, and the traffic controlled effectively, has significant theoretical and practical value.

Summary of the Invention

The purpose of the invention is to provide a P2P network traffic identification method based on a non-separable wavelet support vector machine that finds the optimal classification result when only small samples with limited information are available. This avoids two drawbacks: many machine learning methods need large sample data sets, and nonlinear methods need a model built for each specific problem. P2P network traffic can then be identified efficiently, countermeasures taken in time, and the traffic controlled effectively.

To achieve this purpose, the technical solution of the invention is as follows.

A peer-to-peer network traffic identification method based on a non-separable wavelet support vector machine comprises the following steps:

1. Select the feature vector:

Selecting a suitable feature vector is an important part of identifying P2P network traffic. Feature selection for P2P traffic follows two principles: (1) Traffic from nodes that have different functions and provide different services shows different behavioral characteristics, so the behavioral characteristics of node traffic should be selected wherever possible. (2) The selected features must reflect the difference between P2P and non-P2P traffic, so as to shorten training time and improve identification accuracy. A sufficient number of features gives the classifier a more accurate identification rate, but too many features lengthen training and increase computational complexity.

For these reasons, the invention analyzes the feature vector at three levels: data packets, network flows, and node connections:

(1) Packet-level features: statistical features such as the average, maximum, and minimum packet length and the variance.

(2) Flow-level features: from the raw statistics of a flow, such as start time, end time, and service type, flow-related statistical features are derived: average flow duration, average transmission rate, average number of bytes per flow, packet inter-arrival time and its variance, and so on.

(3) Node-connection-level features: statistics on node connections gathered from the TCP connection state, including the symmetry of the connections and IP address and port characteristics.

The invention uses the following three-dimensional feature vector:

Vector = <v1, v2, v3>;

Here v1 is the mean square deviation of the change in packet size, v2 is the ratio of uplink to downlink speed at the node, and v3 is the ratio of the number of IP addresses to the number of ports. When identifying network traffic, the three-dimensional feature vector is the input vector, and the decision function generated by the SVM model then identifies the P2P sample data.
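A minimal Python sketch of how the three-dimensional feature vector might be assembled for one node over one observation window. The function name, the argument names, the use of the population standard deviation for the "mean square deviation", and the choice of uplink as the numerator of v2 are assumptions made for illustration; the text does not fix them.

```python
from statistics import pstdev


def feature_vector(packet_sizes, uplink_rate, downlink_rate, n_peer_ips, n_peer_ports):
    """Build <v1, v2, v3> for one node (hypothetical helper, not from the patent)."""
    if downlink_rate == 0 or n_peer_ports == 0:
        raise ValueError("downlink_rate and n_peer_ports must be non-zero")
    # v1: spread of packet sizes (assumed: population standard deviation)
    v1 = pstdev(packet_sizes)
    # v2: ratio of uplink speed to downlink speed at the node (assumed direction)
    v2 = uplink_rate / downlink_rate
    # v3: number of distinct peer IP addresses divided by number of distinct peer ports
    v3 = n_peer_ips / n_peer_ports
    return (v1, v2, v3)
```

The returned tuple is what would be handed to the SVM decision function as the input vector.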

2. Select an appropriate kernel function:

P2P network traffic is bursty and shows uncertain, nonlinear characteristics. Wavelet analysis suits local analysis of signals and detection of abrupt changes. By introducing multi-scale wavelet basis functions to construct the SVM kernel function and building a wavelet-SVM identification algorithm, the identification accuracy of the SVM can be improved substantially. A wavelet basis function used to construct an SVM kernel for P2P traffic identification must satisfy two conditions: (1) It must meet the conditions for constructing an SVM kernel function. (2) Its computational complexity must not be too high, because too many parameters lengthen sample training time.
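The text does not give the wavelet basis function itself, so the following Python sketch uses, purely as an assumed stand-in, the product-form Morlet wavelet kernel that is common in the wavelet-SVM literature: K(x, z) = Π_i h((x_i - z_i)/a) with h(u) = cos(1.75u)·exp(-u²/2). It is a separable (tensor-product) construction with a single dilation parameter a, so it only illustrates the two stated conditions (a valid kernel, few parameters) and is not the non-separable wavelet of the invention. The function name and the default value of a are hypothetical.

```python
import math


def morlet_wavelet_kernel(x, z, a=1.0):
    """Product-form Morlet wavelet kernel between two feature vectors.

    Assumed illustration only; `a` is the dilation (scale) parameter.
    """
    k = 1.0
    for xi, zi in zip(x, z):
        u = (xi - zi) / a
        # Mother wavelet h(u) = cos(1.75 u) * exp(-u^2 / 2), applied per dimension
        k *= math.cos(1.75 * u) * math.exp(-u * u / 2.0)
    return k
```

With only one parameter to tune, a kernel of this shape keeps the training cost close to that of an RBF kernel.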

3. Select the incremental training algorithm:

The idea behind SVM incremental training is that the decision function is determined by the support vectors: all support vectors in the training set are retained and the non-support vectors are discarded, so that the result of incremental training is consistent with the result obtained without incremental learning.

The incremental training algorithm is as follows:

Step 1: Train on the initial training set to obtain the initial SVM classifier f(x); SVs¹ denotes the support vector set of f(x).

Step 2: Merge SVs¹ with the newly added sample set into a new training set; training on it yields a new classifier f'(x) and a new support vector set SVs².

Step 3: Set SVs¹ = SVs² and return to Step 2.

Because each round of incremental learning keeps only the support vectors and discards the non-support vectors, and in practice the non-support vectors also carry information useful for classifying the data set, identification accuracy can suffer.
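Steps 1 to 3 can be sketched in Python as a loop over batches of new samples. The sketch assumes a caller-supplied `train` function that returns a classifier together with its support vectors; `incremental_train` and `train_1d_margin` are hypothetical names, and the one-dimensional hard-margin trainer is a toy stand-in for a real SVM solver, included only so the example runs.

```python
def incremental_train(initial_set, new_batches, train):
    """Incremental training: keep only support vectors between rounds.

    `train(samples)` must return (classifier, support_vectors).
    """
    # Step 1: train on the initial set and keep its support vectors
    classifier, support_vectors = train(list(initial_set))
    for batch in new_batches:
        # Step 2: merge the retained support vectors with the new samples and retrain
        merged = list(support_vectors) + list(batch)
        # Step 3: the new support vectors replace the old ones for the next round
        classifier, support_vectors = train(merged)
    return classifier, support_vectors


def train_1d_margin(samples):
    """Toy 1-D hard-margin trainer; assumes every -1 sample lies below every +1 sample."""
    lo = max(x for x, y in samples if y == -1)  # negative sample closest to the boundary
    hi = min(x for x, y in samples if y == 1)   # positive sample closest to the boundary
    threshold = (lo + hi) / 2.0
    return (lambda x: 1 if x > threshold else -1), [(lo, -1), (hi, 1)]
```

In this toy setting the incremental result matches training on all samples at once, which is the consistency property the text describes; with a soft-margin SVM on real traffic the discarded non-support vectors can change the outcome, as noted above.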

4. Boosting algorithm for wavelet-SVM P2P traffic identification:

Boosting is an iterative ensemble-learning algorithm designed to handle misclassified samples. Its core idea is to train different classifiers (weak classifiers) on the same training set and then combine them into a stronger final classifier (strong classifier). It works by changing the data distribution: the weight of each sample is set according to whether that sample was classified correctly in the current round and according to the overall accuracy of the previous round. The re-weighted data set is passed to the next classifier for training, and the classifiers from all rounds are finally fused into the final decision classifier. This focuses the classifiers on the misclassified samples, which are the key training data, and so improves identification accuracy.

The Boosting algorithm for wavelet SVM uses the wavelet SVM as the base classifier. First, M samples are selected by weight from the whole P2P and non-P2P sample set S to form a training subset S_j, and training on it yields a base classifier WSVM_j. WSVM_j is then tested on the sample set S to obtain its classification accuracy, and the misclassified samples are given higher weights. Finally, M samples are selected from S again according to the adjusted weights to form a new training subset S_{j+1}; if S_{j+1} = S the procedure exits, otherwise the steps above are repeated. After t rounds of training (t ≤ T, where T is the number of iterations), a sequence of WSVM-based identification functions WSVM₁, ..., WSVM_t is obtained. Each WSVM_j is also assigned a weight, namely its accuracy on the sample set S. A strong classifier H(x) is finally obtained by weighted voting and used to identify P2P traffic.

The algorithm is described as follows:

(1) Input. Base classifier WSVM; P2P and non-P2P traffic training sample set S = {(x₁, y₁), ..., (x_n, y_n)}, where x_i ∈ R^d, y_i ∈ {-1, 1}, 1 ≤ i ≤ n; number of training iterations T; initial sample weights D_i = 0 (1 ≤ i ≤ n);

(2) for (j = 1; j ≤ T; j++)

{

1) Select M samples from the training sample set S in order of weight to obtain the training sample subset S_j;

2) If S_j = S_{j-1} (j > 1), exit the loop;

3) Train on S_j with the WSVM algorithm to obtain a base classifier WSVM_j;

4) Classify the sample set S with WSVM_j to obtain the error rate e_j;

5) Record the weight of WSVM_j as α_j = 1 - e_j;

6) Adjust the weights of the support vectors and of the misclassified samples in S to D_i = D_1+j;

}

(3) Output. Decision function sequence h = {WSVM₁, ..., WSVM_t} with weights α = {α₁, ..., α_t}, t ≤ T. The final decision function is:

H(x) = sign(Σ_{i=1}^{t} α_i h_i(x)).
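The loop and the weighted vote can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented procedure: the base trainer is passed in as a generic `train` callable instead of a wavelet SVM, the weight update in step 6 is taken to be "add 1 to each misclassified sample" because the update rule in the text is not legible, support-vector weights are not adjusted because the generic trainer does not expose them, and a zero vote is resolved to +1. The names `boost` and `strong_classifier` are hypothetical.

```python
def boost(samples, train, M, T):
    """samples: list of (x, y) with y in {-1, +1}; train(subset) returns h with h(x) in {-1, +1}."""
    n = len(samples)
    D = [0] * n                      # initial sample weights D_i = 0
    classifiers, alphas = [], []
    prev_subset = None
    for _ in range(T):
        # 1) take the M samples with the largest weights (ties keep original order)
        order = sorted(range(n), key=lambda i: -D[i])
        subset = sorted(order[:M])
        # 2) stop once the training subset no longer changes
        if subset == prev_subset:
            break
        prev_subset = subset
        # 3) train a base classifier on the subset
        h = train([samples[i] for i in subset])
        # 4) error rate of the base classifier on the whole sample set S
        wrong = [i for i, (x, y) in enumerate(samples) if h(x) != y]
        e = len(wrong) / n
        # 5) classifier weight alpha_j = 1 - e_j
        classifiers.append(h)
        alphas.append(1.0 - e)
        # 6) ASSUMPTION: raise the weight of each misclassified sample by 1
        for i in wrong:
            D[i] += 1
    return classifiers, alphas


def strong_classifier(classifiers, alphas):
    """Weighted vote H(x) = sign(sum_i alpha_i * h_i(x)); a zero sum maps to +1 (assumption)."""
    def H(x):
        s = sum(a * h(x) for h, a in zip(classifiers, alphas))
        return 1 if s >= 0 else -1
    return H
```

Passing the trainer as a parameter keeps the sketch independent of any particular SVM library; a wavelet-kernel SVM trainer could be substituted for `train` without changing the loop.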

The beneficial effects of the invention are as follows: by providing a P2P network traffic identification method based on a non-separable wavelet support vector machine, the invention finds the optimal classification result when only small samples with limited information are available. This avoids the need of many machine learning methods for large sample data sets and the need of nonlinear methods for a model built for each specific problem, so that P2P network traffic can be identified efficiently, countermeasures taken in time, and the traffic controlled effectively.

Description of the Drawings

Fig. 1 is a diagram of the P2P application connection implementation mode in an embodiment of the invention.

Fig. 2 is a diagram of the P2P application connection implementation mode in an embodiment of the invention.

Fig. 3 is a flowchart of the Boosting algorithm for wavelet SVM in an embodiment of the invention.

Detailed Description

Specific embodiments of the invention are described below with reference to the accompanying drawings, so that the invention can be better understood.

Embodiment

A peer-to-peer network traffic identification method based on a non-separable wavelet support vector machine, comprising:

1. Select the feature vector:

Selecting a suitable feature vector is an important part of identifying P2P network traffic. Feature selection for P2P traffic follows two principles: (1) Traffic from nodes that have different functions and provide different services shows different behavioral characteristics, so the behavioral characteristics of node traffic should be selected wherever possible. (2) The selected features must reflect the difference between P2P and non-P2P traffic, so as to shorten training time and improve identification accuracy. A sufficient number of features gives the classifier a more accurate identification rate, but too many features lengthen training and increase computational complexity. According to statistics, a machine-learning-based algorithm that uses all flow feature attributes to identify network traffic is only 2% more accurate than one that uses a selected subset of feature attributes, while the algorithm with the selected subset executes far more efficiently. Selecting the feature vector well, while preserving classifier performance, is therefore an important step in P2P traffic identification.

Different nodes in a real network have different functions. Some act as servers, providing resource transmission services to other nodes; others act as clients, receiving the services that servers provide. A node in a P2P network can act both as a server that provides services to other peers and as a client that receives services from them. Traffic from nodes with different functions and services therefore shows different behavioral characteristics, which are analyzed below.

Fig. 1 shows the P2P application connection mode. In a P2P network, peers connect differently than in the traditional server/client model: every node plays a dual role and is called a peer. P2P applications transfer data over random ports in the range 1024-65535. Under TCP, one source node connects to many peer nodes. Seen from the source node, the peers have many IP addresses and each uses a random port, so the ratio of the number of peer IP addresses to the number of peer ports is close to 1. This differs from applications in the traditional connection model and therefore serves as an identifying feature of P2P traffic.
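The contrast between the two connection models can be illustrated with a short Python sketch that counts distinct peer IP addresses and distinct peer ports for one source node. The function name and the sample addresses and ports are hypothetical.

```python
def ip_port_ratio(peer_endpoints):
    """Ratio of distinct peer IPs to distinct peer ports for one source node.

    `peer_endpoints` is an iterable of (ip, port) pairs seen on the node's connections.
    """
    endpoints = list(peer_endpoints)
    if not endpoints:
        raise ValueError("at least one connection is required")
    ips = {ip for ip, _ in endpoints}
    ports = {port for _, port in endpoints}
    return len(ips) / len(ports)


# P2P-like pattern: every peer listens on its own random high port
p2p = [("10.0.0.1", 40001), ("10.0.0.2", 51234), ("10.0.0.3", 23456)]
# Client/server-like pattern: many servers share a few well-known ports
web = [("10.0.1.1", 80), ("10.0.1.2", 80), ("10.0.1.3", 443), ("10.0.1.4", 80)]
```

For the P2P-like list the ratio is 3/3, while the client/server-like list gives 4/2, which is the kind of separation the feature relies on.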

The behavioral analysis above, covering data packets, network flows, and node connections, shows that the selected features reflect the difference between traffic in P2P networks and in traditional networks and capture how P2P traffic behaves in real networks, which meets the requirements for identification. Based on these three aspects, the invention uses the following three-dimensional feature vector:

Vector = <v1, v2, v3>;

2. Select an appropriate kernel function:

SVMs use kernel functions to address the dimensionality of the training-sample feature space while retaining good generalization, and they handle nonlinear problems through the choice of kernel. P2P traffic identification currently relies mostly on the radial basis function (RBF) kernel, because the RBF kernel has fewer parameters than other kernels, is less costly to compute, and can be applied to samples of any distribution. P2P network traffic is bursty and shows uncertain, nonlinear characteristics. Wavelet analysis suits local analysis of signals and detection of abrupt changes. By introducing multi-scale wavelet basis functions to construct the SVM kernel function and building a wavelet-SVM identification algorithm, the identification accuracy of the SVM can be improved substantially. A wavelet basis function used to construct an SVM kernel for P2P traffic identification must satisfy two conditions: (1) It must meet the conditions for constructing an SVM kernel function. (2) Its computational complexity must not be too high, because too many parameters lengthen sample training time. See Fig. 2.

3. Select the incremental training algorithm:

A traditional SVM completes training and classification in a single pass and must solve a quadratic programming problem during training. When the training set is large, this takes a large amount of memory and converges slowly. As network data keeps changing, the data sets also become imbalanced and diverse, so classifiers trained with existing single-classifier and incremental algorithms do not reach satisfactory identification accuracy. The main factor limiting accuracy is the presence of many misclassified samples, and the Boosting algorithm in ensemble learning is a classification method aimed specifically at misclassified samples. The invention proposes a Boosting algorithm based on wavelet SVM for P2P traffic identification, which improves the generalization ability of the learner, and hence identification accuracy, by concentrating training on misclassified samples.

In an SVM, the support vectors describe the characteristics of the whole sample data set. For an SVM with a fixed kernel function, the optimal separating surface depends only on the support vectors, so classifying the whole data set is equivalent to classifying the support vectors. In other words, if all vectors other than the support vectors are removed from the training set and training is repeated, the result is consistent with the result obtained on the whole training set. The idea behind SVM incremental training is that the decision function is determined by the support vectors: all support vectors in the training set are retained and the non-support vectors are discarded, so that the result of incremental training is consistent with the result obtained without incremental learning.

The incremental training algorithm is as follows:

Step 2: Merge SVs¹ with the newly added sample set into a new training set; training on it yields a new classifier f'(x) and a new support vector set SVs²;

Fig. 3 is a flowchart of the Boosting algorithm for wavelet SVM in an embodiment of the invention. The wavelet SVM is used as the base classifier. First, M samples are selected by weight from the whole P2P and non-P2P sample set S to form a training subset S_j, and training on it yields a base classifier WSVM_j. WSVM_j is then tested on the sample set S to obtain its classification accuracy, and the misclassified samples are given higher weights. Finally, M samples are selected from S again according to the adjusted weights to form a new training subset S_{j+1}; if S_{j+1} = S the procedure exits, otherwise the steps above are repeated. After t rounds of training (t ≤ T, where T is the number of iterations), a sequence of WSVM-based identification functions WSVM₁, ..., WSVM_t is obtained. Each WSVM_j is also assigned a weight, namely its accuracy on the sample set S. A strong classifier H(x) is finally obtained by weighted voting and used to identify P2P traffic.

The algorithm is described as follows:

(2) for (j = 1; j ≤ T; j++)

{

6) Adjust the weights of the support vectors and of the misclassified samples in S to D_i = D_1+i;

}

(3) Output. Decision function sequence h = {WSVM₁, ..., WSVM_t} with weights α = {α₁, ..., α_t}, t ≤ T. The final decision function is:

H(x) = sign(Σ_{i=1}^{t} α_i h_i(x)).
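As a worked illustration of the weighted vote, take t = 3 base classifiers with hypothetical weights α = (0.9, 0.8, 0.6) whose outputs for some sample x are h(x) = (+1, -1, -1). These numbers are invented for the example and are not results from the patent.

```latex
\begin{aligned}
H(x) &= \operatorname{sign}\Bigl(\sum_{i=1}^{3} \alpha_i h_i(x)\Bigr) \\
     &= \operatorname{sign}\bigl(0.9 \cdot (+1) + 0.8 \cdot (-1) + 0.6 \cdot (-1)\bigr) \\
     &= \operatorname{sign}(-0.5) = -1
\end{aligned}
```

The single most accurate classifier votes +1, but the two others together outweigh it, so the sample is assigned to class -1.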

The above is a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and such improvements and refinements are also regarded as falling within the scope of protection of the invention.

Claims

1. A peer-to-peer network traffic identification method based on a non-separable wavelet support vector machine, characterized in that it comprises the following steps:

(1) Selecting a feature vector: two principles are followed: (a) traffic from nodes that have different functions and provide different services shows different behavioral characteristics, so the behavioral characteristics of node traffic are selected wherever possible; (b) the selected features must reflect the difference between P2P traffic and non-P2P traffic, so as to shorten training time and improve identification accuracy; features are selected at three levels, namely data packets, network flows, and node connections; the packet-level features include the average packet length, the maximum packet length, the minimum packet length, and variance statistics; the flow-level features are the flow-related statistical features derived from the raw statistics of a flow (start time, end time, service type): average flow duration, average transmission rate, average number of bytes per flow, and packet inter-arrival time and its variance; the node-connection-level features are statistics on node connections gathered from the TCP connection state, including the symmetry of the connections and IP address and port characteristics; the following three-dimensional feature vector is used: Vector = <v1, v2, v3>, where v1 is the mean square deviation of the change in packet size, v2 is the ratio of uplink to downlink speed at the node, and v3 is the ratio of the number of IP addresses to the number of ports; when network traffic is identified, the three-dimensional feature vector is used as the input vector;

(2) Selecting a suitable kernel function: a wavelet basis function is introduced to construct the kernel function of the SVM and is used for P2P network traffic identification, and must satisfy two conditions: (a) it meets the conditions for constructing an SVM kernel function; (b) the computational complexity of the selected wavelet basis function is low, since too many parameters lengthen sample training time;

(3) Selecting an incremental training algorithm: the idea behind SVM incremental training is that the decision function is determined by the support vectors; all support vectors in the training set are retained and the non-support vectors are discarded, so that the result of incremental training is consistent with the result obtained without incremental learning;

(4) Boosting algorithm for wavelet-SVM P2P traffic identification: the wavelet SVM is used as the base classifier to train on the samples; first, M samples are selected by weight from the whole P2P and non-P2P sample set S to form a training subset S_j, and training yields a base classifier WSVM_j; WSVM_j is then tested on the sample set S to obtain its classification accuracy; the misclassified samples are then given higher weights; finally, M samples are selected from S again according to the adjusted weights to form a new training subset S_{j+1}; if S_{j+1} = S the procedure exits, otherwise the above steps are repeated; after t rounds of training (t ≤ T, where T is the number of iterations), a sequence of WSVM-based identification functions WSVM₁, ..., WSVM_t is obtained, and each WSVM_j is also assigned a weight, namely its accuracy on the sample set S; a strong classifier H(x) is finally obtained by weighted voting and used to identify P2P traffic.

2. The peer-to-peer network traffic identification method based on a non-separable wavelet support vector machine according to claim 1, characterized in that the incremental training algorithm in step (3) comprises the following steps:

Step 1: Train on the initial training set to obtain the initial SVM classifier f(x); SVs¹ denotes the support vector set of f(x);

Step 2: Merge SVs¹ with the newly added sample set into a new training set; training on it yields a new classifier f'(x) and a new support vector set SVs²;

Step 3: Set SVs¹ = SVs² and return to Step 2.

3. The peer-to-peer network traffic identification method based on a non-separable wavelet support vector machine according to claim 1, characterized in that the Boosting algorithm for wavelet-SVM P2P traffic identification in step (4) is described as follows:

(1) Input: base classifier WSVM; P2P and non-P2P traffic training sample set S = {(x₁, y₁), ..., (x_n, y_n)}, where x_i ∈ R^d, y_i ∈ {-1, 1}, 1 ≤ i ≤ n; number of training iterations T; initial sample weights D_i = 0 (1 ≤ i ≤ n);

(2) for (j = 1; j ≤ T; j++)

{

1) Select M samples from the training sample set S in order of weight to obtain the training sample subset S_j;

2) If S_j = S_{j-1} (j > 1), exit the loop;

3) Train on S_j with the WSVM algorithm to obtain a base classifier WSVM_j;

4) Classify the sample set S with WSVM_j to obtain the error rate e_j;

5) Record the weight of WSVM_j as α_j = 1 - e_j;

6) Adjust the weights of the support vectors and of the misclassified samples in S to D_i = D_1+i;

}

(3) Output: decision function sequence h = {WSVM₁, ..., WSVM_t} with weights α = {α₁, ..., α_t}, t ≤ T; the final decision function is:

H(x) = sign(Σ_{i=1}^{t} α_i h_i(x)).