CN101795214B - Behavior-based P2P detection method under large traffic environment - Google Patents

Info

Publication number
CN101795214B
Authority
CN
China
Prior art keywords
node
hash table
traffic
detection
flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2010101005288A
Other languages
Chinese (zh)
Other versions
CN101795214A (en)
Inventor
李芝棠
刘峰
周丽娟
柳斌
涂浩
马晓静
周智昊
彭晓天
王世福
黄立辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN2010101005288A
Publication of CN101795214A
Application granted
Publication of CN101795214B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a traffic-behavior-based P2P detection method for large traffic environments, belonging to the field of Internet technology. The method samples traffic in time and space under a large traffic environment to filter out redundant traffic and reduce the time and space complexity of detection. It then performs role identification on network nodes to find those with server characteristics, builds feature vectors from four traffic features (connection duration, connection payload, ratio of request connections to response connections, and service ratio), and identifies P2P nodes with a multi-support-vector-machine. The method is suitable for detecting P2P under large traffic environments and can detect P2P file sharing, P2P streaming media and other mainstream P2P applications.

Description

Behavior-based P2P detection method under large traffic environment
Technical field
The invention belongs to the field of Internet technology and specifically relates to a P2P detection method.
Background technology
With the rapid development of the Internet, peer-to-peer (P2P) technology has become one of its important applications. Traditional file download and streaming media services use the client/server (C/S) model: the user connects to a server, and the server pushes data to the user by unicast. Under this model all clients connect to the same server, so the load on the server can be very high, which degrades the user's audio-visual experience. Content Delivery Network (CDN) technology speeds up data transmission to some extent, but its core is still an architecture based on centralized servers, and during peak periods it still has shortcomings in adapting to burst traffic and in fault tolerance. The introduction of P2P technology brought a new opportunity. In the P2P model each peer is both a provider and a consumer of services. The load is distributed among many peers, which effectively reduces the server's load and network bandwidth usage. Research on P2P-based data transmission has gradually attracted attention, and related technologies and prototype systems keep appearing, such as BitTorrent, eMule, PPStream and PPLive. Driven by venture capital, several commercial systems have been launched, and China Central Television (CCTV) began using P2P streaming media technology to broadcast its 2006 Spring Festival Gala live to the whole world. Using P2P technology has become a trend in the evolution of network applications.
However, while P2P technology brings entertainment and convenience, it also brings hidden concerns. Because many peers replace the original centralized server in providing services, supervision by administrative departments becomes very difficult. At the same time, the growing traffic generated by P2P applications occupies a large amount of Internet bandwidth and threatens the quality of service of other applications. Identifying and monitoring P2P applications has therefore become an urgent problem.
A survey of P2P traffic detection technologies worldwide shows that they fall into three classes: port-based detection (Port based Identification), deep packet inspection (DPI, Deep Packet Inspection), and detection based on traffic behavior features (Traffic Behavior based Identification).
Port-based detection determines the application category from the port numbers in use. Its drawback is that if a P2P application transmits data over dynamically changing ports, port-based detection is powerless. Although this method is efficient, its accuracy is too poor, and it is not widely used.
Deep packet inspection identifies P2P traffic mainly by matching features in the application-layer data of packets. The technique is now quite mature. In early 2004, Subhabrata Sen et al. proposed a P2P traffic detection method based on application signatures, which is in fact a form of deep packet inspection. The method divides payload features into fixed-offset features and variable-offset features. It checks the fixed offsets first and the variable offsets second, and it achieved satisfactory results in both performance and precision. However, deep packet inspection fails for P2P applications that encrypt their content. In addition, it is not suited to large traffic environments.
Detection based on traffic behavior features performs P2P detection using traffic feature information (such as IP addresses and ports). It needs no information about the application-layer protocol. It mainly detects and analyzes statistical features of network traffic over a period of time and detects P2P on that basis. In 2004, after a careful study of the transport-layer characteristics of P2P traffic, Thomas Karagiannis et al. proposed a P2P traffic detection method based on those characteristics. The method relies on two general characteristics that P2P traffic exhibits at the transport layer and, combined with traditional port-based detection, can effectively detect new and encrypted P2P applications. However, it is too complex and does not suit the domestic P2P application environment. In China, although research on and application of P2P technology are in full swing, the detection methods proposed so far are mostly based on deep packet inspection. Some domestic network equipment manufacturers have released P2P traffic monitoring products, such as Huawei's SecPath 1800F firewall and Eudemon500/1000 firewall and the network management software Network Insight from CAPTECH. All of these products use deep packet inspection.
In summary, practical P2P identification and detection at present relies mainly on port-based detection or deep packet inspection, and the accuracy and performance of existing methods based on traffic behavior features do not meet requirements. Research on P2P identification based on traffic behavior features under large traffic environments (above 1 Gbit/s), and the design of a practical real-time detection algorithm based on those features, therefore has significant theoretical and practical value.
Summary of the invention
The object of the invention is to overcome the shortcomings of existing port-based and deep-packet-inspection methods, as well as the poor real-time performance of current behavior-based methods and their unsuitability for large traffic environments, by providing a new P2P node detection method based on traffic behavior features that uses a feedback algorithm.
A P2P detection method based on traffic behavior comprises the following steps:
(1) sample the network traffic;
(2) build a connection (Connection) information hash table from the sampled traffic, then build a node (IP) information hash table from the connection information hash table;
(3) perform role identification on the nodes (IPs) in the node information hash table to find the nodes with server characteristics;
(4) apply a multi-support-vector-machine (Multi-SVM) algorithm to those nodes with server characteristics to detect the P2P nodes.
Further, the Multi-SVM algorithm is based on several traffic feature vectors. The traffic features are connection duration, connection payload, the ratio of request connections to response connections, and service ratio. Each traffic feature vector consists of the statistical attributes of the corresponding feature: maximum, minimum, mean and quantile.
Further, detection with the Multi-SVM algorithm comprises:
(A) Offline training stage: run various P2P applications in a real network and capture the corresponding P2P traffic to generate a training set. Build one support vector machine (SVM) for each traffic feature vector, measure its false positive rate and false negative rate on the training set, and determine the weight parameter of that SVM from those rates;
(B) Online detection stage: use the SVMs and weight parameters determined in the offline training stage to build the Multi-SVM $F(x_1, x_2, x_3, x_4)$:
$$F(x_1, x_2, x_3, x_4) = \pi_1 F_1(x_1) + \pi_2 F_2(x_2) + \pi_3 F_3(x_3) + \pi_4 F_4(x_4)$$
When $F(x_1, x_2, x_3, x_4) > 0$, the node is a P2P node. Here $x_i$ is a traffic feature vector, $F_i(x_i)$ is the SVM built for that feature vector, $i = 1, 2, 3, 4$, and $\pi_i$ is the weight parameter of each SVM.
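The weighted decision of the online detection stage can be illustrated with a minimal Python sketch. It only shows how four per-feature outputs are combined as $\pi_1 F_1 + \dots + \pi_4 F_4$ and compared with 0. The threshold classifiers, the weight values and the sample feature vectors are hypothetical stand-ins chosen for illustration; in the method described here each $F_i$ would be an SVM trained offline.

```python
from typing import Callable, Sequence

FeatureVector = Sequence[float]
Classifier = Callable[[FeatureVector], float]


def multi_svm_decision(
    classifiers: Sequence[Classifier],
    weights: Sequence[float],
    features: Sequence[FeatureVector],
) -> float:
    """Return F(x1..x4) = sum of pi_i * F_i(x_i)."""
    if not (len(classifiers) == len(weights) == len(features)):
        raise ValueError("need one classifier and one weight per feature vector")
    return sum(w * f(x) for w, f, x in zip(weights, classifiers, features))


def is_p2p(score: float) -> bool:
    # The text classifies a node as P2P when F(x1, x2, x3, x4) > 0.
    return score > 0


def make_threshold_classifier(threshold: float) -> Classifier:
    # Hypothetical stand-in for a trained SVM: +1 (P2P) or -1 (non-P2P),
    # decided on the mean, which is element 2 of (max, min, mean, quantile).
    return lambda x: 1.0 if x[2] > threshold else -1.0


classifiers = [make_threshold_classifier(t) for t in (10.0, 500.0, 0.5, 0.5)]
weights = [0.4, 0.3, 0.2, 0.1]  # illustrative values only
features = [
    (120.0, 1.0, 35.0, 30.0),     # connection duration
    (900.0, 10.0, 300.0, 250.0),  # connection payload
    (1.0, 0.2, 0.8, 0.7),         # request/response connection ratio
    (1.0, 0.1, 0.3, 0.2),         # service ratio
]
score = multi_svm_decision(classifiers, weights, features)
print(score, is_p2p(score))
```

Because the per-feature outputs are weighted before summing, two confident classifiers with large weights can outvote two others, which is the effect the weight parameters are meant to achieve.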
Further, the connection information hash table is built as follows: first create an empty hash table ConHashTable, then update it by repeating the following steps:
(I) read a packet from the sampled traffic;
(II) look up the connection information for this packet in ConHashTable; if it exists go to step (III), otherwise go to step (IV);
(III) update the corresponding connection information in ConHashTable and go to step (I);
(IV) create a new connection information record, insert it into ConHashTable, and go to step (I).
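Steps (I) to (IV) can be sketched in Python with a plain dictionary standing in for ConHashTable. The record fields, the packet dictionary layout and the direction-independent key are assumptions made for this sketch, as is treating the endpoint seen sending first as the initiator.

```python
from dataclasses import dataclass


@dataclass
class ConnectionRecord:
    initiator: tuple   # (ip, port) of the endpoint seen sending first
    responder: tuple
    proto: str
    first_seen: float
    last_seen: float
    packets: int = 0
    payload_bytes: int = 0


def connection_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent key so both flows of a connection share a record."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)


def update_con_hash_table(table, packet):
    """Steps (II)-(IV): look up the packet's connection, then update or insert."""
    key = connection_key(packet["sip"], packet["sport"],
                         packet["dip"], packet["dport"], packet["proto"])
    record = table.get(key)
    if record is None:  # step (IV): create and insert a new record
        record = ConnectionRecord(
            initiator=(packet["sip"], packet["sport"]),
            responder=(packet["dip"], packet["dport"]),
            proto=packet["proto"],
            first_seen=packet["ts"],
            last_seen=packet["ts"],
        )
        table[key] = record
    # step (III): update the record (also applied to a freshly created one)
    record.last_seen = packet["ts"]
    record.packets += 1
    record.payload_bytes += packet["length"]
    return record


con_hash_table = {}
sampled_packets = [
    {"sip": "10.0.0.1", "sport": 5000, "dip": "10.0.0.2", "dport": 80,
     "proto": "TCP", "ts": 0.0, "length": 100},
    {"sip": "10.0.0.2", "sport": 80, "dip": "10.0.0.1", "dport": 5000,
     "proto": "TCP", "ts": 0.5, "length": 1400},
    {"sip": "10.0.0.1", "sport": 5001, "dip": "10.0.0.3", "dport": 6881,
     "proto": "UDP", "ts": 1.0, "length": 60},
]
for pkt in sampled_packets:  # step (I): read one packet at a time
    update_con_hash_table(con_hash_table, pkt)
```

The first two packets travel in opposite directions of the same connection, so they land in one record, and the third packet creates a second record.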
Further, the node hash table is built as follows: first take a copy of the connection information hash table covering a certain period of time and create an empty hash table IPHashTable, then update IPHashTable by repeating the following steps until the node hash table is complete:
1) take a connection information record from the copy;
2) get the source IP and destination IP from the record;
3) check whether the source IP exists in IPHashTable; if so go to step 4), otherwise go to step 5);
4) update the corresponding node information record in IPHashTable and go to step 6);
5) create a new node information record and insert it into IPHashTable;
6) check whether the destination IP exists in IPHashTable; if so go to step 7), otherwise go to step 8);
7) update the corresponding node information record in IPHashTable and go to step 9);
8) create a new node information record and insert it into IPHashTable;
9) check whether the copy has been fully traversed; if so go to step 10), otherwise go to step 2);
10) the node hash table is complete.
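The traversal in steps 1) to 10) amounts to aggregating connection records per IP address. The Python sketch below uses a dictionary for IPHashTable. The two counters kept per node and the dictionary form of a connection record are assumptions for illustration, since the text does not list the fields of a node information record.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    ip: str
    requests_sent: int = 0      # connections in which this IP was the source
    requests_received: int = 0  # connections in which this IP was the destination


def build_ip_hash_table(connection_snapshot):
    """Walk a snapshot of connection records and aggregate them per IP."""
    ip_hash_table = {}
    for conn in connection_snapshot:               # steps 1) and 9)
        src_ip, dst_ip = conn["sip"], conn["dip"]  # step 2)
        # steps 3)-5): update or create the record for the source IP
        src = ip_hash_table.setdefault(src_ip, NodeRecord(src_ip))
        src.requests_sent += 1
        # steps 6)-8): update or create the record for the destination IP
        dst = ip_hash_table.setdefault(dst_ip, NodeRecord(dst_ip))
        dst.requests_received += 1
    return ip_hash_table                           # step 10)


snapshot = [
    {"sip": "10.0.0.1", "dip": "10.0.0.2"},
    {"sip": "10.0.0.3", "dip": "10.0.0.2"},
    {"sip": "10.0.0.2", "dip": "10.0.0.3"},
]
ip_hash_table = build_ip_hash_table(snapshot)
```

Working on a snapshot rather than the live connection table means packet capture can keep updating ConHashTable while the per-node statistics are computed.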
Further, sampling of the network traffic comprises time-based traffic sampling and space-based traffic sampling, which filter out redundant traffic and reduce the time and space complexity of P2P detection.
Further, time-based traffic sampling uses interval sampling, so that P2P traffic can be detected while collecting less data. The sampling interval $t$ is determined by the following formula:
$$R(t) = P_0 \cdot \frac{\sum_{i=0}^{t/t_0 - 1} \lambda_i \, G(t - i t_0)}{\sum_{i=0}^{t/t_0 - 1} \lambda_i}$$
where $R(t)$ is the system detection rate as a function of the detection interval $t$, $G(X)$ is the complementary cumulative distribution function of P2P node online time, $\lambda_i$ is the number of nodes arriving within a unit time interval $t_0$, and $P_0$ is the recognition rate for P2P nodes in the captured traffic.
Further, space-based traffic sampling means that, in the spatial dimension, hash tables are used to reduce the extra cost of data comparison. Specifically:
A non-P2P node hash table and a P2P node hash table are created, recording respectively the non-P2P nodes and the P2P nodes detected in the previous detection cycle. Data to be detected is looked up in both tables. A successful lookup means the node was already identified as a non-P2P node or a P2P node in an earlier detection, so no repeated detection is needed and the packet is discarded. Only IP packets found in neither table are passed on for detection.
Further, the roles are: acting as a client, acting as a server, or acting as both client and server. A network node with server characteristics is one acting as a server or as both client and server.
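The three roles and the notion of "server characteristics" can be written down as a small Python sketch. The decision rule used here (a node that initiated connections acts as a client, a node that answered connections acts as a server) is an assumption made for illustration, because the text names the roles but does not spell out the rule.

```python
from enum import Enum


class Role(Enum):
    CLIENT = "Client"
    SERVER = "Server"
    SERVER_AND_CLIENT = "Server & Client"


def identify_role(initiated: int, answered: int) -> Role:
    """Assumed rule: initiating marks a client role, answering a server role."""
    if initiated <= 0 and answered <= 0:
        raise ValueError("node has no connections to classify")
    if initiated > 0 and answered > 0:
        return Role.SERVER_AND_CLIENT
    if answered > 0:
        return Role.SERVER
    return Role.CLIENT


def has_server_characteristics(role: Role) -> bool:
    # Nodes acting as server, or as both server and client, go on to detection.
    return role in (Role.SERVER, Role.SERVER_AND_CLIENT)


candidates = {
    "10.0.0.1": identify_role(initiated=12, answered=0),
    "10.0.0.2": identify_role(initiated=0, answered=40),
    "10.0.0.3": identify_role(initiated=7, answered=9),
}
to_detect = [ip for ip, role in candidates.items()
             if has_server_characteristics(role)]
```

Only the second and third nodes remain in `to_detect`, which is how role identification shrinks the set of nodes that the later classification step has to examine.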
Further, to improve detection performance, the method also includes a feedback step: after P2P nodes are detected in one round, the results are fed back into the non-P2P node hash table and the P2P node hash table to update them, and then a new round of detection begins.
Detecting from the server role greatly reduces the number of nodes that need to be examined, which helps improve monitoring efficiency. The Multi-SVM based method improves detection accuracy.
Description of drawings
Fig. 1: dual hash table filtering
Fig. 2: dual hash table update method
Fig. 3: structure of the filter node hash table
Fig. 4: node arrival queue update process in the dual hash table structure
Fig. 5: overall system structure
Fig. 6: schematic of network data mirroring
Fig. 7: construction and update of the connection (Connection) hash table
Fig. 8: timing diagram of connection (Connection) hash table copy creation and analysis
Fig. 9: flow chart of node (IP) hash table construction
Fig. 10: Multi-SVM training process
Fig. 11: Multi-SVM detection process
Embodiment
The invention is further described below with reference to the drawings and specific embodiments.
The most basic idea of P2P, and its most significant difference from the client/server (C/S) model, is that a node in the network can obtain resources or services from other nodes while also providing resources or services itself. That is, it has the dual identity of client (Client) and server (Server). First, data from various P2P applications is captured by actually running various P2P software, and the traffic features of P2P applications are analyzed on that data. The traffic features are then represented as random variables, and the probability distribution functions of these variables are obtained by statistical modeling. In a typical P2P network the rights and obligations of each node are equal, covering communication, service and resource consumption. It is therefore only necessary to find the nodes with server characteristics among the many nodes and then distinguish server nodes of the C/S model from P2P nodes by their different features, which makes P2P nodes easier to detect. Accordingly, working from the server side and on the basis of these traffic statistics, the Multi-SVM algorithm judges the behavior features of traffic passing through the gateway device to decide whether a node is a P2P node or a non-P2P node. In addition, the invention takes scalability into account: so that the system can be used in large traffic environments, it adopts a new sampling algorithm based on time and space.
The object of the invention is achieved by the following measures:
1. Statistical modeling of P2P traffic features
For each traffic feature, such as connection duration (Connection Duration) and connection payload (Connection Payload), distribution fitting is used: the experimental cumulative distribution function (CDF) curve is fitted against theoretical CDF curves to obtain the feature's probability distribution function. First, the CDFs of various theoretical distributions (such as Weibull, Pareto and Log-Logistic) are fitted to the experimental CDF in turn, with the parameters of each theoretical CDF computed by maximum likelihood. Finally, the Kolmogorov-Smirnov test (K-S test) is used to select the theoretical distribution that best fits the experimental CDF as the theoretical CDF for that feature. If the significance level of the K-S test for a theoretical distribution model is less than 5%, the model is considered to fit the experimental distribution. To examine the difference between the theoretical and experimental distributions, the two are compared by drawing a quantile-quantile (Q-Q) plot. If the theoretical CDF fits the experimental CDF perfectly, the Q-Q plot should be roughly a straight line.
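The fitting procedure can be illustrated with a short, dependency-free Python sketch that fits candidate distributions by maximum likelihood and compares each with the empirical CDF through the K-S statistic (the largest vertical gap between the two curves). Several choices here are assumptions for illustration: only Pareto and exponential candidates are used because their maximum-likelihood estimates have closed forms (the exponential is a stand-in, not one of the distributions named in the text), the sample durations are invented, and the candidate with the smallest K-S statistic is picked instead of running a full significance test.

```python
import math


def fit_pareto(samples):
    """Closed-form maximum-likelihood Pareto fit; returns the fitted CDF.

    Requires positive samples that are not all equal.
    """
    x_m = min(samples)
    alpha = len(samples) / sum(math.log(x / x_m) for x in samples)
    return lambda x: 0.0 if x < x_m else 1.0 - (x_m / x) ** alpha


def fit_exponential(samples):
    """Closed-form maximum-likelihood exponential fit; returns the fitted CDF."""
    rate = len(samples) / sum(samples)
    return lambda x: 0.0 if x < 0 else 1.0 - math.exp(-rate * x)


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of samples and a theoretical CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Invented connection durations in seconds, for illustration only.
durations = [1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 9.0, 20.0, 60.0, 200.0]
candidates = {
    "Pareto": fit_pareto(durations),
    "Exponential": fit_exponential(durations),
}
distances = {name: ks_statistic(durations, cdf)
             for name, cdf in candidates.items()}
best_fit = min(distances, key=distances.get)
print(distances, best_fit)
```

A production version would more likely rely on a statistics library for the Weibull and Log-Logistic fits and for the K-S significance level; the sketch only shows the shape of the comparison.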
2. Improving system speed with the traffic sampling algorithm and the feedback algorithm, to support real-time P2P detection in Gbit-level traffic environments.
Current P2P detection based on traffic behavior features basically captures all data for P2P analysis, which seriously harms system performance and real-time operation. The invention filters traffic by sampling in both time and space and also introduces feedback, which greatly improves the processing efficiency of the system.
(1) Time-based traffic sampling
Because P2P flows last a long time, the invention uses interval sampling, so that the P2P detection algorithm can detect P2P traffic while collecting less data. The basic idea of time-based traffic sampling is as follows. Under continuous sampling and analysis, let $A_i$ denote the set of P2P nodes detected in the i-th detection. Suppose n detections are performed in total, giving the sets $(A_1, A_2, A_3, \dots, A_n)$. Comparing the P2P nodes recorded in each set, if their overlap is such that $A_1 \cup A_2 \cup \dots \cup A_n = A_1 \cup A_n$, where $\cup$ denotes set union, then measurements 2 through n-1 are redundant. The invention therefore chooses a suitable traffic sampling interval, based on the distribution of P2P node online time and the node arrival rate, to reduce redundant computation and improve system performance.
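The redundancy condition on the detection sets can be checked directly with Python sets. This is a small sketch; the function name and the example rounds are made up for illustration.

```python
def middle_rounds_redundant(rounds):
    """True when A_1 ∪ A_2 ∪ ... ∪ A_n equals A_1 ∪ A_n.

    rounds is a list of sets, one set of detected P2P node IPs per round.
    """
    if len(rounds) < 2:
        return False
    union_all = set().union(*rounds)
    return union_all == rounds[0] | rounds[-1]


long_lived = [{"a", "b"}, {"b"}, {"a", "c"}]   # every node shows up in round 1 or 3
short_lived = [{"a"}, {"d"}, {"c"}]            # node "d" is seen only in round 2
print(middle_rounds_redundant(long_lived))
print(middle_rounds_redundant(short_lived))
```

In the first example the middle round finds nothing that the first and last rounds miss, so skipping it loses no nodes. In the second, a short-lived node is visible only in the middle round, which is the case a longer sampling interval would miss.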
Let the complementary cumulative distribution function (CCDF) of node online time be $G(x) = P(X > x)$. Let $\lambda_i$ be the number of nodes arriving within a unit time interval $t_0$ (30 s in practice), and let $P_0$ be the recognition rate of the P2P detection algorithm for P2P nodes in the captured traffic. The invention gives the following relation between the system detection rate (the ratio of detected P2P nodes to P2P nodes actually present) and the detection interval $t$:
$$R(t) = P_0 \cdot \frac{\sum_{i=0}^{t/t_0 - 1} \lambda_i \, G(t - i t_0)}{\sum_{i=0}^{t/t_0 - 1} \lambda_i} \qquad (1)$$
In particular, when the interval between two successive samples is 0, $G(0) = 1$ by the definition of the CCDF, so $R_0 = P_0$. This means that under continuous sampling all P2P traffic is captured and the detection rate $R(t)$ of P2P nodes depends only on the recognition rate $P_0$ of the specific detection algorithm. Formula (1) shows that, with statistical modeling of P2P traffic features, knowing the statistical distribution of P2P node online time and the P2P node arrival rate is enough to choose a suitable sampling interval from formula (1).
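As a sketch of how formula (1) might be used to pick an interval, the Python code below evaluates $R(t)$ as $P_0$ times the arrival-weighted average of $G(t - i t_0)$ over the unit intervals inside one sampling gap, then searches for the longest interval that still meets a target detection rate. The exponential CCDF, the mean online time, the constant arrival counts and the target value are all assumptions made for illustration, not values from the text.

```python
import math


def detection_rate(t, t0, arrivals, ccdf, p0):
    """R(t) = P0 * sum(lambda_i * G(t - i*t0)) / sum(lambda_i), i = 0..t/t0 - 1."""
    n = int(t // t0)  # number of unit intervals inside one sampling gap
    if n == 0:
        return p0     # continuous sampling: G(0) = 1, so R = P0
    if len(arrivals) < n:
        raise ValueError("need an arrival count for every unit interval")
    lam = arrivals[:n]
    weighted = sum(l * ccdf(t - i * t0) for i, l in enumerate(lam))
    return p0 * weighted / sum(lam)


def longest_interval(target, t0, arrivals, ccdf, p0):
    """Largest multiple of t0 whose predicted detection rate still meets target."""
    best = 0
    for n in range(1, len(arrivals) + 1):
        if detection_rate(n * t0, t0, arrivals, ccdf, p0) >= target:
            best = n * t0
    return best


MEAN_ONLINE = 600.0  # seconds, illustrative


def G(x):
    # Assumed exponential CCDF of node online time.
    return math.exp(-x / MEAN_ONLINE)


T0 = 30               # unit interval in seconds, as in the text
arrivals = [20] * 40  # assumed constant arrival count per unit interval
P0 = 0.95             # assumed recognition rate on captured traffic

interval = longest_interval(0.8, T0, arrivals, G, P0)
print(interval, detection_rate(interval, T0, arrivals, G, P0))
```

With a heavier-tailed online-time distribution the predicted rate falls more slowly, so a longer sampling interval becomes acceptable for the same target.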
(2) Space-based traffic sampling
During P2P node detection, traffic filtering can be applied in the spatial dimension to remove redundant traffic and reduce the time complexity of detection. The invention uses hash (Hash) tables to reduce the extra cost of data comparison. Specifically, two kinds of traffic are filtered: a) known non-P2P traffic, that is, traffic of other applications already identified in the network, such as Mail, FTP, HTTP and DNS; b) previously detected P2P traffic.
Dual hash table traffic filtering is shown in Fig. 1. The invention uses two hash tables, the non-P2P node table (Non-P2P Table) and the P2P node table (Known P2P Table), which record respectively the non-P2P nodes and the P2P nodes detected in the previous detection cycle. When a packet arrives, its IP address is hashed and looked up in the first table, Non-P2P Table. A hit means the node was identified as non-P2P in an earlier detection, so no repeated detection is needed and the packet is discarded directly. Otherwise the same hash value is looked up in the second table, Known P2P Table. A hit there means the IP was identified as a P2P node in an earlier detection, so again no repeated detection is needed and the packet is discarded. Finally, IP packets found in neither table enter the P2P detection algorithm. Known non-P2P nodes and the P2P nodes found during detection are numerous, so not all of them can be kept in Non-P2P Table. Likewise, as the detection system runs and more P2P nodes are detected, Known P2P Table would also store a large amount of information, increasing memory usage and the time complexity of hash matching. Therefore, following the principle of locality, both Non-P2P Table and Known P2P Table keep only the IP values seen in the most recent period.
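The two-stage lookup can be sketched in Python with two sets standing in for Non-P2P Table and Known P2P Table. The packet layout and the choice of a single `ip` field per packet are assumptions for illustration, since the text only says that the packet's IP address is hashed.

```python
def should_inspect(ip, non_p2p_table, known_p2p_table):
    """Return True only for IPs that are missing from both filter tables."""
    if ip in non_p2p_table:    # already classified as non-P2P: drop
        return False
    if ip in known_p2p_table:  # already classified as P2P: drop
        return False
    return True


def filter_packets(packets, non_p2p_table, known_p2p_table):
    """Yield only the packets that still need P2P detection."""
    for pkt in packets:
        if should_inspect(pkt["ip"], non_p2p_table, known_p2p_table):
            yield pkt


non_p2p_table = {"10.0.0.10", "10.0.0.11"}  # e.g. known mail or DNS servers
known_p2p_table = {"10.0.0.20"}
packets = [
    {"ip": "10.0.0.10", "length": 60},
    {"ip": "10.0.0.20", "length": 1400},
    {"ip": "10.0.0.30", "length": 900},
]
remaining = list(filter_packets(packets, non_p2p_table, known_p2p_table))
```

Only the packet from the unclassified address `10.0.0.30` reaches the detector, so the more expensive classification runs on a fraction of the mirrored traffic.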
3. Multi-SVM detection method based on server features
Most current behavior-based P2P detection algorithms identify a P2P node from its behavior as a client, whereas the invention identifies it from its server-side features, which gives better detection results. In a typical P2P network the rights and obligations of each node are equal, covering communication, service and resource consumption. It is therefore only necessary to find the nodes with server characteristics among the many nodes and then distinguish server nodes of the C/S model from P2P nodes by their different features, which makes P2P easier to detect.
Because server nodes are far fewer than client nodes in the C/S structure, pure client nodes are excluded first. The invention proposes a new node role identification method to determine the current role of a network node: server (Server), client (Client), or both server and client (Server & Client). Network nodes acting purely as clients are excluded by this method. Next, for the network nodes with server features, the connection-based (Connection) traffic features are analyzed: connection duration (Connection Duration), connection payload (Connection Payload), the ratio of request connections to response connections (Ratio of Request Connections and Response Connections, RQP), and service ratio (Service Ratio). These four traffic features reflect the server role of a node well. Each of the four features is then represented by a traffic feature vector made up of that feature's statistical attributes (maximum, minimum, mean, quantile). Finally, the invention proposes a new P2P node identification method based on Multi-SVM. The method has two stages, a training stage and an identification stage. In the training stage, a training set and a test set of P2P traffic are built. One SVM is built for each traffic feature vector, its false positive rate and false negative rate are measured on the training set, and its weight parameter is determined from those rates, so that an SVM with lower false positive and false negative rates has more influence in Multi-SVM detection. The result is a detection method based on a Multi-SVM with different weights.
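The feature vectors and the weighting idea can be sketched in Python. Two points are assumptions for illustration: the text does not say which quantile is used, so the median is taken here, and it does not give the formula that turns error rates into weights, so the sketch uses one simple choice (weight proportional to one minus the average of the two error rates, normalized to sum to 1) that satisfies the stated property that lower error rates give larger weights.

```python
from statistics import mean, median


def feature_vector(values):
    """Statistical attributes of one traffic feature: (max, min, mean, quantile).

    The median is used as the quantile; this is an assumption.
    """
    return (max(values), min(values), mean(values), median(values))


def weights_from_error_rates(error_rates):
    """Hypothetical weighting: lower error rates give a larger, normalized weight.

    error_rates is a list of (false_positive_rate, false_negative_rate) pairs,
    one pair per single-feature SVM.
    """
    scores = [1.0 - (fp + fn) / 2.0 for fp, fn in error_rates]
    total = sum(scores)
    return [s / total for s in scores]


# Invented per-connection observations for one node.
durations = [2, 4, 9]
x1 = feature_vector(durations)

# Invented error rates for the four single-feature SVMs.
rates = [(0.05, 0.10), (0.20, 0.30), (0.10, 0.10), (0.40, 0.40)]
weights = weights_from_error_rates(rates)
print(x1, weights)
```

The first SVM has the lowest error rates and receives the largest weight, while the last one, with the highest error rates, contributes least to the combined decision.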
The specific implementation is as follows:
The overall structure of the system is shown in Fig. 5. The system is divided into three main modules: the traffic sampling module, the P2P node identification module and the feedback module. The traffic sampling module uses the time-based and space-based traffic sampling methods; the P2P node identification module uses the Multi-SVM detection method based on server features; the feedback module feeds the non-P2P nodes and P2P nodes detected in the previous detection cycle back to the traffic sampling module.
Step 1. Network traffic mirroring
So that monitoring does not affect the network itself, the system detects P2P traffic by passive measurement (traffic mirroring). In addition, to simplify traffic filtering and hash table construction in the next step, the system captures traffic from the core switch by mirroring the outbound and inbound (TX, RX) directions separately, as shown in Fig. 6. The core switch sits between the internal network and the Internet, and traffic in both the TX and RX directions passes through it.
Step 2. Network traffic sampling and filtering
During P2P detection, traffic sampling based on time and space can be used to filter out redundant traffic and reduce the time and space complexity of detection.
Step 2.1 Time-based sampling
While mirroring the data traffic, network traffic is captured at intervals. The interval can be set with reference to formula (1) and the actual network traffic conditions. For example, where traffic is relatively light and the hardware is powerful, the interval can be set to 0, that is, continuous capture; where traffic is heavier, a longer interval can be set according to formula (1) to meet the processing requirements of the algorithm.
Step 2.2 Space-based traffic sampling
For each mirrored batch of traffic, the invention builds two hash filter tables (Non-P2P Table, Known P2P Table) to filter out part of the network traffic and, from the spatial perspective, reduce the extra cost of data comparison. Specifically, two kinds of traffic are filtered: (1) known non-P2P traffic, that is, traffic of other non-P2P applications already identified in the network, such as Mail, FTP, HTTP and DNS; (2) previously detected P2P traffic. See Fig. 1.
Dual hash table filtering and update process: after the network traffic to be detected passes through the two hash tables, it enters the P2P detection module for P2P node detection, the detection result is output, and finally the two hash tables are updated according to that result. This process is shown in Fig. 2.
In order to prevent that two Hash tables to the change of the unlimited expansion of EMS memory occupation and node state (for example, become normal node of a normal node or the right side from the P2P node and become a P2P node), the present invention has set a upper limit with the node number of storage in two hash tables and simultaneously each node of each Hash table has been set overtime shifting out the time limit.According to the principle of first in first out, when detected P2P and non-P2P node upgrade the Hash table in next round detects, only replace that part of node that enters the Hash table at first.In specific implementation, can adopt chain structure that node in the Hash table is together in series, and keep head, tail pointer, as shown in Figure 3.Simultaneously, be trapped in for a long time in the hash table for fear of node, node is set timeout mechanism, the node that promptly surpasses a specified time need shift out two hash table filtrations.Concrete renewal process:
1), forms a nodal information memory block (PeerInfo) at the detected P2P node of detection module.PeerInfo is mainly by recognition time, compositions such as node IP address and node release mark position.
2) this PeerInfo is inserted in the Hash table filtration.
3) it is joined node and arrive rear of queue.
4) judge whether node number in the chained list has surpassed the upper limit that is provided with,, return 1 if not have then the node number in the Hash table is added 1), if above the upper limit then delete the node of team's head, but wouldn't Free up Memory, return 1).
Renewal process as shown in Figure 4.Especially, shift out in the process at node, directly the memory space of this nodes records is not discharged, but adopt a node release mark position, it is labeled as needs release condition, carries out batch release then behind the certain interval of time.
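The bounded, first-in-first-out filter table described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `FilterTable`, its parameters `max_nodes` and `timeout_s`, and the choice to move a re-detected node to the tail are hypothetical and do not come from the patent text; an `OrderedDict` stands in for the hash table plus head/tail chain.

```python
import time
from collections import OrderedDict


class FilterTable:
    """Bounded FIFO table of node IPs with timeout and deferred release."""

    def __init__(self, max_nodes, timeout_s):
        self.max_nodes = max_nodes
        self.timeout_s = timeout_s
        self.nodes = OrderedDict()   # ip -> recognition time, oldest first
        self.pending_release = []    # removed entries awaiting batch release

    def insert(self, ip, now=None):
        now = time.time() if now is None else now
        self.nodes.pop(ip, None)     # assumption: a re-detected node moves to the tail
        self.nodes[ip] = now
        if len(self.nodes) > self.max_nodes:
            # FIFO: remove the head, but only mark it for later release.
            self.pending_release.append(self.nodes.popitem(last=False))

    def contains(self, ip, now=None):
        now = time.time() if now is None else now
        seen = self.nodes.get(ip)
        if seen is None:
            return False
        if now - seen > self.timeout_s:
            # Timed-out nodes leave the filter and will be examined again.
            self.pending_release.append((ip, self.nodes.pop(ip)))
            return False
        return True

    def release_batch(self):
        """Free all entries marked for release; returns how many were freed."""
        freed = len(self.pending_release)
        self.pending_release.clear()
        return freed
```

A packet whose IP is found by `contains` in either table would be skipped; `release_batch` would be called periodically rather than on every removal.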
Step 3: Construction of the connection-based hash table (detection algorithm)
The invention uses a custom definition of a connection, so that a connection carries more information and is better targeted, allowing the data to be analyzed more effectively.
A flow (Flow) is represented by source IP (SIP), destination IP (DIP), source port (Sport), destination port (Dport), and protocol (Proc), that is: Flow(SIP, Sport, DIP, Dport, Proc).
By this definition a flow consists of five elements and is directional. We call the flow that initiates a connection the request flow (reqFlow) and the flow that answers it the response flow (repFlow).
Given two flows Flow1 and Flow2, if Flow1.SIP = Flow2.DIP & Flow1.Sport = Flow2.Dport & Flow1.DIP = Flow2.SIP & Flow1.Dport = Flow2.Sport & Flow1.Proc = Flow2.Proc, then the one of Flow1 and Flow2 that initiated the exchange is called the request flow reqFlow, and the other, which answers it, is called the response flow repFlow. The pair (reqFlow, repFlow) is called a connection (Connection), that is, Connection(reqFlow, repFlow).
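As a rough illustration of the flow and connection definitions, the matching condition can be written as a predicate on two five-tuples. The names `Flow`, `is_reverse`, and `connection_key` are hypothetical helpers introduced here, and the direction-independent key is just one possible way to make both flows of a connection land in the same hash table entry.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    sip: str     # source IP
    sport: int   # source port
    dip: str     # destination IP
    dport: int   # destination port
    proto: str   # protocol


def is_reverse(f1: Flow, f2: Flow) -> bool:
    """True if f2 runs in the opposite direction of f1 between the same endpoints."""
    return (f1.sip == f2.dip and f1.sport == f2.dport
            and f1.dip == f2.sip and f1.dport == f2.sport
            and f1.proto == f2.proto)


def connection_key(f: Flow) -> tuple:
    """Direction-independent key: both flows of a connection map to the same key."""
    a, b = (f.sip, f.sport), (f.dip, f.dport)
    return (min(a, b), max(a, b), f.proto)
```

For example, `Flow("10.0.0.1", 5000, "10.0.0.2", 80, "TCP")` and its mirror image satisfy `is_reverse` and share one `connection_key`.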
Step 3.1: Construction of the connection hash table (ConHashTable)
To build the connection information hash table, first create an empty connection information hash table (ConHashTable), then update it by looping over the following steps:
1) Read one packet from the sampled traffic.
2) Look up in ConHashTable whether connection information for this packet exists; if so, go to 3), otherwise go to 4).
3) Update this connection information, such as connection duration, upstream traffic, and downstream traffic; go to 1).
4) Create a new connection information record and insert it into ConHashTable; go to 1).
The construction of the connection-based hash table is shown in Figure 7.
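The loop above might look roughly like the following sketch, with a plain dictionary standing in for ConHashTable. The packet field names, the record layout, and the rule that the sender of the first observed packet is treated as the initiator (so "upstream" means initiator to responder) are assumptions made for illustration, not details given in the text.

```python
def update_con_table(table, pkt):
    """Process one packet.

    pkt: dict with sip, sport, dip, dport, proto, ts (seconds), size (bytes).
    """
    src, dst = (pkt["sip"], pkt["sport"]), (pkt["dip"], pkt["dport"])
    key = (min(src, dst), max(src, dst), pkt["proto"])
    rec = table.get(key)
    if rec is None:
        # Step 4: create a new connection record for an unseen connection.
        rec = table[key] = {"initiator": src, "begin": pkt["ts"],
                            "end": pkt["ts"], "up_bytes": 0, "down_bytes": 0}
    # Step 3: update the end time and the per-direction traffic counters.
    rec["end"] = max(rec["end"], pkt["ts"])
    if src == rec["initiator"]:
        rec["up_bytes"] += pkt["size"]
    else:
        rec["down_bytes"] += pkt["size"]
    return rec
```

Calling `update_con_table` once per sampled packet yields one record per connection, from which the duration is `end - begin`.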
Step 3.2: Copying the connection hash table (ConHashTable)
The invention analyzes data in five-minute cycles. Once five minutes of data have been captured in real time and the connection hash table has been built, a new thread is started to copy the connection hash table and analyze it further, while the main thread continues to capture data in real time and builds the connection hash table for the next round. As long as copying the connection hash table and analyzing it finish within five minutes, the system keeps running in real time even under continuous traffic capture, as shown in Figure 8.
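One way to picture the capture/analysis split is the sketch below. Note that it swaps in a fresh table instead of making a physical copy, which is a substitution for the copy described in the text; the names `ConTableBuffer`, `add_bytes`, `rotate`, and `analyze_in_background` are hypothetical, and the per-connection value is reduced to a byte counter to keep the example short.

```python
import threading


class ConTableBuffer:
    """Capture-side table that can be handed over for analysis each period."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}

    def add_bytes(self, key, size):
        with self._lock:
            self._table[key] = self._table.get(key, 0) + size

    def rotate(self):
        """Return the filled table and start a fresh, empty one."""
        with self._lock:
            snapshot, self._table = self._table, {}
        return snapshot


def analyze_in_background(snapshot, analyze):
    """Run the analysis in its own thread so that capture can continue."""
    worker = threading.Thread(target=analyze, args=(snapshot,))
    worker.start()
    return worker
```

In this sketch the main thread would call `rotate` every five minutes and pass the snapshot to `analyze_in_background`; the scheme stays real-time only if each analysis finishes before the next rotation.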
Step 4: Building the node hash table (IPHashTable)
Because P2P detection is performed at the node level, analysis is also needed at the node level. The system stores the information of each node in a node hash table, built as follows:
First create an empty node information hash table (IPHashTable), then update it by looping over the following steps until the node hash table is complete:
1) Take one connection information record from the copy.
2) Get the source IP and destination IP from that record.
3) Check whether the source IP exists in IPHashTable; if so, go to step 4), otherwise go to step 5).
4) Update the corresponding node information record in IPHashTable; go to step 6).
5) Create a new node information record and insert it into IPHashTable.
6) Check whether the destination IP exists in IPHashTable; if so, go to step 7), otherwise go to step 8).
7) Update this node information record; go to step 9).
8) Create a new IP information record and insert it into IPHashTable.
9) Check whether the whole copy has been traversed; if so, go to step 10), otherwise go to step 2).
10) The node hash table is complete.
The construction process is shown in Figure 9.
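The node table construction can be sketched as a single pass over the connection records. The record fields (`src_ip`, `dst_ip`, `duration`, `payload`) and the per-node layout are assumptions chosen for illustration; the patent does not specify what a node information record contains.

```python
def build_ip_table(con_records):
    """Build a per-node table from connection records.

    con_records: iterable of dicts with src_ip, dst_ip, duration, payload.
    """
    ip_table = {}
    for con in con_records:
        # Steps 3-8: update or create the entry for the source IP, then the destination IP.
        for ip, role in ((con["src_ip"], "initiated"), (con["dst_ip"], "received")):
            node = ip_table.setdefault(ip, {"initiated": 0, "received": 0,
                                            "durations": [], "payloads": []})
            node[role] += 1
            node["durations"].append(con["duration"])
            node["payloads"].append(con["payload"])
    return ip_table
```

Here `setdefault` covers both the "exists, update" and the "missing, create" branches of the step list.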
Step 5: Node role identification
The invention uses a new method to identify the role of a node: server, client, or both server and client at the same time.
For a node (identified by its IP address), let $N_{sport}$ and $N_{dport}$ denote the number of ports opened by the node itself and the number of ports on other IPs that it is connected to, respectively, and let $N_{IP}$ denote the total number of distinct IPs that have a connection with this node. Two ratios $R_s$ and $R_d$ are defined by:
$$R_s = N_{sport} / N_{IP}$$
$$R_d = N_{dport} / N_{IP}$$
In the traditional C/S architecture, a server usually opens specific ports, such as 80 or 25, and waits for clients to connect, while a client opens a series of consecutive ports to initiate connections to servers. In practice, for a server node under the C/S model, $R_s$ generally lies in (0, 1) and $R_d$ is usually greater than 1; for a client node, $R_s$ is usually greater than 1 and $R_d$ lies in (0, 1). For P2P applications, $R_s$ and $R_d$ are usually both greater than 1 or both less than 1. In practical use, we first select the nodes with server characteristics from the candidate nodes, and then use the differences between P2P nodes and ordinary server nodes to decide whether a node is a server node of the ordinary C/S model or a P2P node. In practice, only nodes with more than 10 connections (Connection) are analyzed. In addition, the invention filters out pure client nodes by means of thresholds. The client-node filtering thresholds are set as follows:
$$R_s > 10 \ \text{or}\ R_d < 5 \ \text{or}\ (R_s > 1 \ \text{and}\ R_d < 0)$$
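The two ratios and the coarse role pattern described in the prose can be illustrated as follows. This sketch assumes that $N_{dport}$ counts distinct remote port numbers and that each connection of a node is available as a `(local_port, remote_ip, remote_port)` triple; the function names and the label strings are hypothetical, and the threshold expression above is not implemented here.

```python
def role_ratios(connections):
    """Return (R_s, R_d) for one node.

    connections: iterable of (local_port, remote_ip, remote_port) triples.
    """
    local_ports, remote_ports, remote_ips = set(), set(), set()
    for local_port, remote_ip, remote_port in connections:
        local_ports.add(local_port)
        remote_ports.add(remote_port)
        remote_ips.add(remote_ip)
    if not remote_ips:
        raise ValueError("node has no connections")
    n_ip = len(remote_ips)
    return len(local_ports) / n_ip, len(remote_ports) / n_ip


def coarse_role(r_s, r_d):
    """Map the ratios to the pattern described in the text (ties count as mixed)."""
    if r_s < 1 and r_d > 1:
        return "server-like"
    if r_s > 1 and r_d < 1:
        return "client-like"
    return "mixed (P2P candidate)"
```

A web server reached by five clients using two ports each gives $R_s = 0.2$ and $R_d = 2$, which the sketch labels server-like.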
Step 6: Detection algorithm based on multiple support vector machines (Multi-SVM)
On the basis of the node hash table, the invention uses a P2P node discovery algorithm based on multiple support vector machines (Multi-SVM) over server-side features. It consists of: 1) selecting the server-side traffic features of a node; 2) using Multi-SVM to decide whether the node is a P2P node.
Step 6.1: Selection of the node's server traffic features (Features)
The invention selects the following four traffic features:
(1) Connection duration (Connection Duration)
The connection duration is the lifetime of a connection. connection.repFlow.Endtime is the time at which the response flow (repFlow) of the connection ends, and connection.reqFlow.Endtime is the time at which the request flow (reqFlow) ends. connection.reqFlow.Begintime is the time at which the request flow begins, which is also the time at which the connection begins. With max denoting the maximum, the connection duration is computed as:
max(connection.repFlow.Endtime, connection.reqFlow.Endtime) - connection.reqFlow.Begintime
(2) Connection payload (Connection Payload)
The connection payload is the total amount of data transferred over a connection.
(3) Ratio of request connections to response connections (RQP)
A network node has two types of connections: 1) connections it initiates itself (request connections); 2) connections initiated by other nodes (response connections). Let $N_{\text{request connections}}$ be the number of request connections and $N_{\text{response connections}}$ the number of response connections. The ratio for a host is then defined as:
$$RQP = \frac{N_{\text{request connections}}}{N_{\text{response connections}}}$$
(4) Service ratio (Service Ratio)
The service ratio describes how a host provides service to other nodes, mainly through how it responds to the connections initiated by other hosts (response connections). Let $N_{\text{responded response connections}}$ be the number of response connections that this host actually answered, that is, all response connections on this host for which a connection was established, and let $N_{\text{total response connections}}$ be the number of all response connections of this host. The service ratio of a host is then defined as:
$$SR = \frac{N_{\text{responded response connections}}}{N_{\text{total response connections}}}$$
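The duration formula and the two ratios can be sketched as below. The record fields `initiator`, `responder`, and `answered`, and the handling of a node with no incoming connections (RQP set to infinity, SR set to 0), are assumptions added for the example; the text does not say how a zero denominator is treated.

```python
def connection_duration(req_begin, req_end, rep_end):
    """max(repFlow end, reqFlow end) minus reqFlow begin."""
    return max(rep_end, req_end) - req_begin


def node_ratios(node_ip, connections):
    """Return (RQP, SR) for one node.

    connections: dicts with initiator (IP), responder (IP), answered (bool).
    """
    n_request = sum(1 for c in connections if c["initiator"] == node_ip)
    incoming = [c for c in connections if c["responder"] == node_ip]
    n_responded = sum(1 for c in incoming if c["answered"])
    # Assumption: with no incoming connections, RQP is infinite and SR is 0.
    rqp = n_request / len(incoming) if incoming else float("inf")
    sr = n_responded / len(incoming) if incoming else 0.0
    return rqp, sr
```

A node that opened two connections and received two, of which it answered one, gets RQP = 1.0 and SR = 0.5.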
Step 6.2: Using Multi-SVM to decide whether a node is a P2P node
Formulas (2) and (3) describe the Multi-SVM, and formulas (4) and (5) give the method for computing the weight of each SVM.
$$F(x_1, x_2, x_3, x_4) = \pi_1 F_1(x_1) + \pi_2 F_2(x_2) + \pi_3 F_3(x_3) + \pi_4 F_4(x_4), \quad (\pi_1 + \pi_2 + \pi_3 + \pi_4 = 1) \tag{2}$$
$$F(x_1, x_2, x_3, x_4) = \sum_{i=1}^{4} \pi_i F_i(x_i) \; \begin{cases} > 0, & \text{P2P} \\ < 0, & \text{not P2P} \end{cases} \tag{3}$$
$$\pi_i = \frac{1 - \gamma_i}{\sum_{j=1}^{4} (1 - \gamma_j)} \tag{4}$$
$$\gamma_i = \frac{\alpha_i + \beta_i}{2} \tag{5}$$
Here $x_i$ denotes the four feature vectors chosen by the invention, $F_i(x_i)$ the SVM built on each feature vector, $\pi_i$ the weight of each SVM, and $\alpha_i$ and $\beta_i$ the false-alarm rate and the miss rate, respectively, of each SVM when it detects P2P nodes on its own, $i = 1, 2, 3, 4$.
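A small numeric sketch of formulas (2) to (5) follows. The function names are hypothetical, and the per-SVM outputs are passed in as plain numbers so the example does not depend on any SVM library.

```python
def svm_weights(false_alarm, miss):
    """Formulas (4) and (5): weight each SVM by its average accuracy."""
    gammas = [(a + b) / 2 for a, b in zip(false_alarm, miss)]
    total = sum(1 - g for g in gammas)
    return [(1 - g) / total for g in gammas]


def multi_svm_decision(weights, scores):
    """Formulas (2) and (3): weighted sum of the SVM outputs, P2P if positive."""
    f = sum(w * s for w, s in zip(weights, scores))
    return f, f > 0
```

An SVM with lower false-alarm and miss rates receives a larger weight, and the weights always sum to 1.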
The Multi-SVM-based P2P node decision method has two phases: 1) a training phase and 2) a detection phase.
The training phase of Multi-SVM uses a training set to determine each SVM $F_i(x_i)$ ($i = 1, 2, 3, 4$) and its weight $\pi_i$ ($i = 1, 2, 3, 4$). The training set is generated by actually running various P2P applications in the network and capturing the related traffic, and the SVM weights are computed by formulas (4) and (5). This yields the Multi-SVM $F(x_1, x_2, x_3, x_4)$ of formula (2). The main flow of the training phase is:
1) Capture the corresponding traffic by running real P2P software.
2) Compute the four traffic feature vectors $x_i$ ($i = 1, 2, 3, 4$). Specifically, the connection duration statistics consist of the durations of all connections of a node within 5 minutes; the connection payload statistics consist of the payloads of all connections related to the node within 5 minutes; the RQP statistics are obtained by computing the ratio for the node every 30 s within 5 minutes, forming one group of data; the service ratio statistics are likewise computed for the node every 30 s within 5 minutes, forming one group of data. Each traffic feature vector consists of the statistical attributes (maximum, minimum, mean, quantile) of the statistics of its traffic feature; for example, the maximum, minimum, mean, and quantile of the connection duration statistics form the corresponding traffic feature vector, and the other three groups of statistics form their feature vectors in the same way.
3) Train the corresponding SVM $F_i(x_i)$ ($i = 1, 2, 3, 4$) from each feature vector.
4) Compute the weight $\pi_i$ ($i = 1, 2, 3, 4$) of each SVM.
The flow of the training phase is shown in Figure 10.
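Step 2) of the training flow, turning a group of per-node statistics into a feature vector, could be sketched as follows. The text does not say which quantile is used, so the default of the median (`q=0.5`) and the simple rank-based quantile rule are assumptions; the function name is hypothetical.

```python
import statistics


def feature_vector(samples, q=0.5):
    """Return [max, min, mean, q-quantile] of one group of statistics."""
    s = sorted(samples)
    # Assumption: the quantile is taken by rank; the source does not specify it.
    idx = min(len(s) - 1, int(q * len(s)))
    return [s[-1], s[0], statistics.fmean(s), s[idx]]
```

For the RQP and service ratio features, `samples` would hold the ten values computed every 30 s over one 5-minute window; for duration and payload it would hold one value per connection.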
In the detection phase of Multi-SVM, an unknown node to be examined is processed by steps 1 to 5 to generate the four feature vectors, which are fed into the Multi-SVM. If $F(x_1, x_2, x_3, x_4) > 0$ the node is a P2P node; otherwise it is a non-P2P node, as in formula (3). The main flow of the detection phase is:
1) Obtain the traffic feature vectors $x_i$ ($i = 1, 2, 3, 4$) of each node under test according to the preceding steps.
2) Feed each node under test into the Multi-SVM: if $F(x_1, x_2, x_3, x_4) > 0$ it is judged to be a P2P node; if $F(x_1, x_2, x_3, x_4) < 0$ it is judged to be a non-P2P node.
The flow of the detection phase is shown in Figure 11.
Note that the training phase takes place outside the detection process: each SVM and its weight are determined offline, and online detection of P2P nodes is carried out afterwards.
Step 7: System feedback
The invention exploits the principle of locality and uses feedback in P2P node detection to improve the performance of the detection system. The flow is shown in Figure 5: after P2P nodes have been detected in the previous round, the detection results are fed back into the non-P2P node hash table and the P2P node hash table to update them, and then a new round of detection begins.

Claims (8)

1. A P2P detection method based on traffic behavior, comprising the following steps:
(1) sampling network traffic;
(2) building a connection information hash table from the sampled traffic, and then building a node information hash table from the connection information hash table;
(3) performing role identification on the nodes in the node information hash table to find the nodes with server characteristics; wherein the roles comprise client, server, or both client and server at the same time, and a node with server characteristics is a node acting as a server or as both client and server;
(4) using a multiple support vector machine algorithm to examine the nodes with server characteristics and find the P2P nodes among them, completing the detection;
wherein the multiple support vector machine algorithm is based on a plurality of traffic feature vectors, and the detection comprises:
(A) an offline training phase: various P2P applications are actually run in the network and the corresponding P2P traffic is captured to generate a training set; a support vector machine is built for each traffic feature vector; the false-alarm rate and miss rate of this support vector machine are tested on the training set; and the weight parameter of this support vector machine is determined from the miss rate and false-alarm rate;
(B) an online detection phase: the support vector machines and weight parameters determined in the offline training phase are used to build the multiple support vector machine $F(x_1, x_2, x_3, x_4)$:
$$F(x_1, x_2, x_3, x_4) = \pi_1 F_1(x_1) + \pi_2 F_2(x_2) + \pi_3 F_3(x_3) + \pi_4 F_4(x_4),$$
and when $F(x_1, x_2, x_3, x_4) > 0$ is satisfied, the node is a P2P node,
wherein $x_i$ denotes a traffic feature vector, $F_i(x_i)$ the support vector machine built on each traffic feature vector, $i = 1, 2, 3, 4$, $\pi_i$ the weight parameter of each support vector machine, and
$$\sum_{i=1}^{4} \pi_i = 1.$$
2. The method according to claim 1, characterized in that each traffic feature vector consists of the statistical attributes of the corresponding traffic feature, namely maximum, minimum, mean, and quantile, and the traffic features comprise: connection duration, connection payload, ratio of request connections to response connections, and service ratio.
3. The method according to claim 1, characterized in that the connection information hash table is built as follows: first create an empty hash table ConHashTable, then update ConHashTable by looping over the following steps to construct the connection information hash table:
(I) read one packet from the sampled traffic;
(II) look up in ConHashTable whether connection information for this packet exists; if so, go to step (III), otherwise go to step (IV);
(III) update the corresponding connection information in ConHashTable; go to step (I);
(IV) create a new connection information record and insert it into ConHashTable; go to step (I).
4. The method according to claim 3, characterized in that the node information hash table is built as follows: first make a copy of the connection information hash table built within a given period and create an empty hash table IPHashTable, then update IPHashTable by looping over the following steps to construct the node information hash table:
1) take one connection information record from the copy;
2) get the source IP and destination IP from that connection information record;
3) check whether the source IP exists in IPHashTable; if so, go to step 4), otherwise go to step 5);
4) update the corresponding node information record in IPHashTable; go to step 6);
5) create a new node information record and insert it into IPHashTable;
6) check whether the destination IP exists in IPHashTable; if so, go to step 7), otherwise go to step 8);
7) update the corresponding node information record in IPHashTable; go to step 9);
8) create a new IP information record and insert it into IPHashTable;
9) check whether the whole copy has been traversed; if so, go to step 10), otherwise go to step 2);
10) the node information hash table is complete.
5. The method according to any one of claims 1 to 4, characterized in that the sampling of network traffic comprises time-based traffic sampling and space-based traffic sampling, thereby filtering out redundant traffic and reducing the time and space complexity of P2P detection.
6. The method according to claim 5, characterized in that the time-based traffic sampling uses interval sampling, so that P2P traffic can still be detected while the amount of collected data is reduced, and the interval time t is determined by the following formula:
$$R(t) = P_0 \cdot \frac{\sum_{i=0}^{t/t_0 - 1} \lambda_i \, G(t - i \cdot t_0)}{\sum_{i=0}^{t/t_0 - 1} \lambda_i}$$
wherein $R(t)$ is the detection rate of the system, that is, the ratio of detected P2P nodes to the P2P nodes actually present, $G(x)$ is the CCDF of the online time of P2P nodes, $\lambda_i$ is the number of nodes arriving within an arbitrarily chosen unit time $t_0$, and $P_0$ is the recognition rate for P2P nodes in the traffic that has been captured.
7. The method according to claim 5, characterized in that the space-based traffic sampling means that, at the space level, hash tables are set up to reduce the extra cost of data comparison, specifically:
a non-P2P node hash table and a P2P node hash table are set up, which respectively record the non-P2P nodes detected within the detection cycle and the P2P nodes detected in the previous cycle; the data to be examined is then looked up in each of the two hash tables; a successful lookup means that the node was already identified as a non-P2P node or a P2P node in an earlier detection, so no repeated detection is needed and the packet is discarded; IP packets found in neither hash table are passed on for detection.
8. The method according to claim 7, characterized in that, to improve detection performance, the method further comprises a feedback step: after P2P nodes have been detected in the previous round, the detection results are fed back into the non-P2P node hash table and the P2P node hash table to update them, and then a new round of detection begins.
CN2010101005288A 2010-01-22 2010-01-22 Behavior-based P2P detection method under large traffic environment Expired - Fee Related CN101795214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101005288A CN101795214B (en) 2010-01-22 2010-01-22 Behavior-based P2P detection method under large traffic environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101005288A CN101795214B (en) 2010-01-22 2010-01-22 Behavior-based P2P detection method under large traffic environment

Publications (2)

Publication Number Publication Date
CN101795214A CN101795214A (en) 2010-08-04
CN101795214B true CN101795214B (en) 2011-11-30

Family

ID=42587646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101005288A Expired - Fee Related CN101795214B (en) 2010-01-22 2010-01-22 Behavior-based P2P detection method under large traffic environment

Country Status (1)

Country Link
CN (1) CN101795214B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102075388A (en) * 2011-01-13 2011-05-25 华中科技大学 Behavior-based peer-to-peer (P2P) streaming media node identification method
CN102143083A (en) * 2011-04-02 2011-08-03 南京邮电大学 Method for designing and realizing double buffer in Ares protocol analysis system
CN102857474A (en) * 2011-06-29 2013-01-02 句容博通科技咨询服务有限公司 Method for identifying and classifying P2P (peer-to-peer) traffic on basis of SVM (support vector machine) technology
CN102404396B (en) * 2011-11-14 2014-04-02 北京星网锐捷网络技术有限公司 Method, device and system for identifying peer-to-peer (P2P) flow and equipment
CN106100781B (en) * 2016-05-20 2018-02-13 中国南方电网有限责任公司电网技术研究中心 Clock tracing method and system based on E1 passages
CN106713371B (en) * 2016-12-08 2020-04-21 中国电子科技网络信息安全有限公司 Fast Flux botnet detection method based on DNS abnormal mining
CN109190664A (en) * 2018-07-28 2019-01-11 天津大学 The classification method of radar electromagnetic interference echo based on GASVM algorithm
CN110417801B (en) * 2019-08-06 2022-02-01 北京智维盈讯网络科技有限公司 Server side identification method and device, equipment and storage medium

Also Published As

Publication number Publication date
CN101795214A (en) 2010-08-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111130

Termination date: 20160122

EXPY Termination of patent right or utility model