CN108494620B - Network service flow characteristic selection and classification method

Info

Publication number: CN108494620B
Application number: CN201810169202.7A
Authority: CN
Inventors: 董育宁; 张咪
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2018-02-28
Filing date: 2018-02-28
Publication date: 2021-07-27
Anticipated expiration: 2038-02-28
Also published as: CN108494620A

Abstract

The invention discloses a service flow feature selection and classification method based on a multi-objective adaptive evolutionary algorithm. Adaptive crossover and mutation preserve the diversity of the population and ensure the convergence of the algorithm. The invention also uses a purpose-designed three-layer KNN classifier model to classify six multimedia service flows: online standard-definition live video, web browsing (Baidu), online audio, web browsing (sina), network voice chat, and online standard-definition non-live video. Experimental results show that the method achieves higher classification accuracy than existing methods.

Description

Network service flow characteristic selection and classification method

Technical Field

The invention belongs to the technical field of pattern recognition and classification, and particularly relates to a network service flow feature selection and classification method based on a multi-target adaptive evolution algorithm.

Background

In recent years, with the rapid development of the Internet, accurate and efficient network traffic classification has become an important basis for network management. The diversity of network multimedia traffic types makes their classification and identification a significant challenge. Traditional traffic classification falls into three categories: port-based methods, deep packet inspection methods, and methods based on the statistical characteristics of multimedia streams. However, with the advent of data encryption, new applications, and dynamic ports, the first two categories are no longer applicable. Today, most researchers focus on machine learning classification methods, including decision trees, SVM (support vector machine), and C5.0.

In practical applications, the feature dimension is often very high, and irrelevant and redundant features make model training slow and complex, which hinders wider adoption. Feature selection filters out irrelevant and redundant features, which reduces the dimension quickly and improves model accuracy. Depending on the evaluation function, feature selection algorithms are classified into filter, wrapper, and embedded types. Filter-type feature selection is independent of any specific classifier. The wrapper type combines feature selection with the design of a classifier and uses classification accuracy to evaluate candidate features and select the optimal subset. The embedded type treats feature selection as part of classifier training and selects a subset by analyzing the classification results of the trained model. Common feature selection criteria include the information gain ratio (GR), the Pearson correlation coefficient, and chi-square statistics. When the feature dimension is too high, a search algorithm is needed to improve efficiency, and in recent years many search algorithms have been applied to feature selection, such as sequential forward selection (SFS), sequential backward selection (SBS), and the plus-L minus-R selection algorithm. Intelligent optimization search algorithms, such as the evolutionary algorithm (EA) and particle swarm optimization, have become a research hotspot and are widely applied to feature selection. However, these methods consider only a single criterion when searching for feature subsets and ignore the cardinality of the selected subset; they are all single-objective feature selection methods.

Multi-objective optimization evaluates the quality of feature subsets from several angles and optimizes the evaluation indexes simultaneously as objective functions. Inspired by natural biological evolution, researchers have proposed multi-objective evolutionary algorithms, such as the evolutionary non-dominated radial slots based algorithm (ENORA), to solve multi-objective optimization problems. However, when the feature dimension is high, irrelevant and redundant features increase the time complexity of multi-objective optimization. For an evolutionary algorithm, poorly chosen population initialization, crossover probability, and mutation probability reduce both the final classification accuracy and the convergence of the algorithm. In addition, most existing multi-objective feature selection algorithms use classifier accuracy as one of the objective functions, so they converge slowly and run for a long time.

Disclosure of Invention

The technical problem to be solved by the invention is as follows:

To overcome the shortcomings of the algorithms described above, the invention provides a network service flow feature selection and classification method based on a multi-objective adaptive evolutionary algorithm.

The invention adopts the following technical scheme for solving the technical problems:

the invention provides a network service flow characteristic selection and classification method based on a multi-target self-adaptive evolution algorithm, which comprises the following steps:

(1) data collection and preprocessing: collecting data flow samples of various multimedia services on the Internet, and then carrying out preprocessing operation;

(2) feature selection and analysis: analyzing the statistical characteristics of the network data flow samples, and selecting the characteristic combination which effectively distinguishes the service flows;

(3) service flow classification and verification: carrying out classification experiments on the network multimedia service flows with the three-layer KNN classifier to obtain the classification results and calculate the overall classification accuracy.

Further, in the method for selecting and classifying network multimedia service stream features based on the multi-objective adaptive evolution algorithm provided by the invention, the data collection and preprocessing step specifically comprises the following steps:

(2.1) capturing required multimedia service flow data through network packet analysis software WireShark in an open internet environment, and then converting the original data into a standard five-tuple text format, wherein the five-tuple text format comprises the arrival time of a data packet, a source IP address, a destination IP address, a protocol and the packet size of the data packet;

(2.2) performing basic statistical feature calculation on a standard five-tuple file of the original multimedia service stream, wherein the statistical features comprise: uplink/downlink packet size, entropy of uplink/downlink packet size information, overall packet size, uplink/downlink packet arrival time interval, downlink data packet rate, downlink byte rate, and ratio of uplink and downlink byte number.
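The statistics listed in (2.2) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `client_ip` argument used to separate uplink from downlink packets, and the particular subset of features computed are all assumptions made for illustration. It expects a flow with at least one downlink packet and a positive duration.

```python
import math
from collections import Counter
from statistics import mean

def size_entropy(sizes):
    """Shannon entropy (in bits) of the empirical packet-size distribution."""
    total = len(sizes)
    return -sum((c / total) * math.log2(c / total) for c in Counter(sizes).values())

def flow_features(packets, client_ip):
    """packets: list of (arrival_time, src_ip, dst_ip, protocol, size) tuples.

    Assumption: a packet whose source is client_ip is uplink, any other is downlink.
    """
    packets = sorted(packets, key=lambda p: p[0])
    up = [p for p in packets if p[1] == client_ip]
    down = [p for p in packets if p[1] != client_ip]
    duration = packets[-1][0] - packets[0][0]
    if not down or duration <= 0:
        raise ValueError("need downlink packets and a positive flow duration")
    up_bytes = sum(p[4] for p in up)
    down_bytes = sum(p[4] for p in down)
    # Inter-arrival times between consecutive downlink packets.
    down_gaps = [b[0] - a[0] for a, b in zip(down, down[1:])]
    return {
        "down_size_max": max(p[4] for p in down),
        "down_size_mean": mean(p[4] for p in down),
        "down_size_entropy": size_entropy([p[4] for p in down]),
        "down_gap_mean": mean(down_gaps) if down_gaps else 0.0,
        "down_packet_rate": len(down) / duration,
        "down_byte_rate": down_bytes / duration,
        "up_bytes": up_bytes,
        "up_down_byte_ratio": up_bytes / down_bytes,
    }

example = [
    (0.0, "10.0.0.1", "1.1.1.1", "TCP", 100),
    (1.0, "1.1.1.1", "10.0.0.1", "TCP", 1000),
    (2.0, "1.1.1.1", "10.0.0.1", "TCP", 500),
]
features = flow_features(example, client_ip="10.0.0.1")
```

In this toy flow the two downlink packets have different sizes, so the downlink size entropy is one bit.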

Further, in the method for selecting and classifying network multimedia service stream features based on the multi-objective adaptive evolution algorithm provided by the invention, the feature selection and analysis step specifically comprises the following steps:

(3.1) ranking all features by information gain ratio and filtering out the features whose value is below the correlation threshold;

(3.2) code selection: a binary code whose length is the number of features N is selected, and each coded individual consists of a string of bits; each bit takes one of two values, where 1 indicates that the feature is selected and 0 indicates that it is not selected; each individual is represented as:

$$I = (b_1^I, b_2^I, \ldots, b_N^I, c_I, m_I)$$

wherein b_i^I ∈ {0, 1} for i = 1, …, N, c_I ∈ {0, 1}, and m_I ∈ {0, 1}; c_I and m_I respectively represent the discrete parameters that control adaptive crossover and mutation in each coded individual;

(3.3) population initialization: initialize an empty population P₀; while the number of individuals in the population is less than the population size popsize, repeat the following: randomly initialize the value of q within the range [1, N]; the individual selects the q features with the highest information gain ratio, that is, the first q bits are set to 1 and bits q+1 to N are set to 0; add the individual to the population P₀;

(3.4) each individual I has two fitness functions f₁(I) and f₂(I), corresponding to the two objective functions of the multi-objective optimization; wherein f₁(I) is the inconsistency rate and f₂(I) is the number of selected features;

(3.5) selecting a parent: selecting a parent based on the crowding distance of the individual;

(3.6) adaptive crossover:

fix the crossover probability p_c; for any two individuals I and J of generation t, if a Bernoulli random variable takes the value 1 with probability p_c, c_J is randomly set to 0 or 1, and the value of c_J is assigned to c_I; if c_J is 0, no crossover is performed, and if it is 1, uniform crossover is performed;

the new individuals generated by crossover are added to the auxiliary population Q_t;

(3.7) adaptive mutation:

fix the mutation probability p_m; for an individual I of generation t, if a Bernoulli random variable takes the value 1 with probability p_m, m_I is randomly set to 0 or 1; if m_I is 0, no mutation is performed, and if it is 1, single-point flip mutation is performed;

the new individuals generated by mutation are added to Q_t, and the parent population P_t and Q_t are combined into an auxiliary population R_t;

all individuals in the population R_t are sorted by objective-function rank and crowding distance, and the first popsize individuals survive to the next generation P_{t+1};

set t = t + 1;

(3.8) if the maximum iteration number gen is met or the inconsistency rate is kept unchanged in the iteration process, outputting an optimal feature subset; otherwise, repeating the steps (3.4) to (3.7).
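The adaptive crossover and mutation of steps (3.6) and (3.7) can be sketched in Python as follows. This is a hedged reading of the text rather than the patented code: the `Individual` class and the function names are hypothetical, and the sketch assumes that c_J is copied to c_I whether or not the Bernoulli trial succeeded, which the text leaves open.

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    bits: list   # bits[i] == 1 means feature i is selected
    c: int = 1   # crossover switch c_I (0 = no crossover, 1 = uniform crossover)
    m: int = 1   # mutation switch m_I (0 = no mutation, 1 = single-point flip)

def adaptive_crossover(I, J, p_c, rng):
    """Return two children; parents are left untouched."""
    a = Individual(list(I.bits), I.c, I.m)
    b = Individual(list(J.bits), J.c, J.m)
    if rng.random() < p_c:          # Bernoulli(p_c) trial
        b.c = rng.randint(0, 1)     # re-draw c_J
    a.c = b.c                       # assumption: c_J is always copied to c_I
    if b.c == 1:                    # uniform crossover: swap each bit with prob. 1/2
        for i in range(len(a.bits)):
            if rng.random() < 0.5:
                a.bits[i], b.bits[i] = b.bits[i], a.bits[i]
    return a, b

def adaptive_mutation(I, p_m, rng):
    """Return a mutated copy of I."""
    child = Individual(list(I.bits), I.c, I.m)
    if rng.random() < p_m:          # Bernoulli(p_m) trial
        child.m = rng.randint(0, 1) # re-draw m_I
    if child.m == 1:                # flip exactly one randomly chosen bit
        k = rng.randrange(len(child.bits))
        child.bits[k] ^= 1
    return child

rng = random.Random(0)
parent_i = Individual([1, 1, 1, 0, 0], c=1, m=1)
parent_j = Individual([0, 0, 1, 1, 1], c=1, m=1)
child_a, child_b = adaptive_crossover(parent_i, parent_j, p_c=0.1, rng=rng)
mutant = adaptive_mutation(parent_i, p_m=0.1, rng=rng)
```

Because the switches c_I and m_I travel with the individual, an individual whose switch is 0 passes through the operator unchanged until a later Bernoulli trial re-draws the switch.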

Further, in the method for selecting and classifying network multimedia service stream features based on the multi-objective adaptive evolution algorithm provided by the invention, the inconsistency rate is defined as follows: a feature combination in a sample instance is called a pattern; for each pattern of a feature subset, the inconsistency count is the total number of samples in which the pattern occurs minus the number of those samples that carry the most frequent class label; the inconsistency rate equals the sum of the inconsistency counts over all patterns divided by the total number of samples.
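The inconsistency rate can be computed directly from this definition. The Python sketch below is illustrative only; it assumes the feature values have already been discretized so that identical patterns can be grouped, and the function name and the 0/1 `mask` argument are hypothetical.

```python
from collections import Counter, defaultdict

def inconsistency_rate(samples, labels, mask):
    """f1 for the feature subset encoded by mask (1 = feature selected)."""
    label_counts = defaultdict(Counter)   # pattern -> Counter of class labels
    for x, y in zip(samples, labels):
        pattern = tuple(v for v, keep in zip(x, mask) if keep)
        label_counts[pattern][y] += 1
    # Per pattern: samples with that pattern minus those with the majority label.
    inconsistent = sum(sum(c.values()) - max(c.values())
                       for c in label_counts.values())
    return inconsistent / len(samples)

samples = [(0, 1), (0, 1), (0, 0), (1, 0)]
labels = ["a", "b", "a", "b"]
rate_both = inconsistency_rate(samples, labels, mask=(1, 1))    # 0.25
rate_second = inconsistency_rate(samples, labels, mask=(0, 1))  # 0.5
```

With both features selected, only the pattern (0, 1) is ambiguous (one sample of each class), giving 1/4; dropping the first feature makes both remaining patterns ambiguous, giving 2/4.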

Further, in the method for selecting and classifying network multimedia service stream features based on the multi-objective adaptive evolution algorithm provided by the invention, the correlation threshold in step (3.1) is 0.4; N in step (3.2) is 25, 26 and 13 for the three classifier layers, respectively; the crossover probability p_c in step (3.6) and the mutation probability p_m in step (3.7) are both 0.1; popsize in step (3.7) is 100; and the maximum number of iterations gen in step (3.8) is 10.
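The survivor selection in step (3.7), which keeps the best popsize individuals of R_t by rank and crowding distance, can be sketched as below. The text does not spell out the ranking procedure, so this sketch assumes the familiar non-dominated rank and crowding distance used in NSGA-II-style algorithms, with both objectives (inconsistency rate and number of features) minimized. All function names are hypothetical.

```python
def dominates(a, b):
    """a, b are (f1, f2) tuples; both objectives are minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_fronts(objs):
    """Split indices into fronts; front 0 holds the non-dominated individuals."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(objs[j], objs[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(objs, front):
    """Larger distance = less crowded; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for k in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][k])
        lo, hi = objs[ordered[0]][k], objs[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][k] - objs[prev][k]) / (hi - lo)
    return dist

def select_survivors(objs, popsize):
    """Indices of the individuals that survive to the next generation."""
    survivors = []
    for front in non_dominated_fronts(objs):
        dist = crowding_distance(objs, front)
        ranked = sorted(front, key=lambda i: -dist[i])
        survivors.extend(ranked[: popsize - len(survivors)])
        if len(survivors) == popsize:
            break
    return survivors

# (inconsistency rate, number of selected features) for five individuals
objs = [(0.1, 5), (0.2, 3), (0.3, 2), (0.2, 5), (0.4, 6)]
fronts = non_dominated_fronts(objs)
kept = select_survivors(objs, popsize=4)
```

Here individuals 0, 1 and 2 form the first front, so they all survive when popsize is 4, and the fourth slot goes to individual 3 from the second front.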

Further, the invention provides a method for selecting and classifying network multimedia service flow characteristics based on a multi-target self-adaptive evolution algorithm, wherein the service flow classification step specifically comprises the following steps:

(5.1) selecting the characteristics of the original multimedia service flow by adopting a characteristic selection method, classifying the multimedia flow into 4 types by a first-layer KNN: c1, C2, C3, C4; wherein C1 is online audio, C2 is online video, C3 is web browsing, and C4 is network voice chat;

(5.2) carrying out feature selection on the video stream features of the C2 obtained by the previous layer of classification again by using a feature selection method, and carrying out KNN classification of a second layer to obtain classification results C21 and C22;

(5.3) carrying out feature selection on the data stream features of the classification result C3 in the step (5.1) by using a feature selection method again, and carrying out second KNN classification of a second layer to obtain classification results C31 and C32;

and (5.4) counting the output result of the classification and calculating the accuracy of the whole classification.
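The cascade in steps (5.1) to (5.4) can be sketched with a small pure-Python KNN. This is an illustration under stated assumptions, not the patented classifier: the class and function names are hypothetical, Euclidean distance and majority voting are assumed, and the per-layer feature index lists stand in for the subsets that the feature selection method would return.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority label among the k training points closest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda t: math.dist(t[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def project(x, idx):
    """Keep only the features selected for one layer."""
    return tuple(x[i] for i in idx)

class CascadeKNN:
    """Layer 1 separates C1..C4; further layers split C2 and C3."""

    def __init__(self, feats_top, feats_c2, feats_c3, k=3):
        self.feats = {"top": feats_top, "C2": feats_c2, "C3": feats_c3}
        self.k = k
        self.data = {}

    def fit(self, X, y):
        # Fine labels are "C1", "C21", "C22", "C31", "C32", "C4";
        # the first two characters give the coarse class.
        self.data["top"] = ([project(x, self.feats["top"]) for x in X],
                            [lab[:2] for lab in y])
        for coarse in ("C2", "C3"):
            rows = [(project(x, self.feats[coarse]), lab)
                    for x, lab in zip(X, y) if lab[:2] == coarse]
            self.data[coarse] = ([r[0] for r in rows], [r[1] for r in rows])
        return self

    def predict_one(self, x):
        coarse = knn_predict(*self.data["top"], project(x, self.feats["top"]), self.k)
        if coarse in ("C2", "C3"):
            return knn_predict(*self.data[coarse],
                               project(x, self.feats[coarse]), self.k)
        return coarse

def overall_accuracy(model, X, y):
    return sum(model.predict_one(x) == lab for x, lab in zip(X, y)) / len(y)

# Toy data: feature 0 separates the coarse classes, 1 splits C2, 2 splits C3.
X = [(0, 0, 0), (10, 0, 0), (10, 5, 0), (20, 0, 0), (20, 0, 5), (30, 0, 0)]
y = ["C1", "C21", "C22", "C31", "C32", "C4"]
model = CascadeKNN([0], [1], [2], k=1).fit(X, y)
prediction = model.predict_one((10.2, 4.8, 0))
```

Each layer sees only its own feature subset, which mirrors the way feature selection is repeated for the C2 and C3 branches.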

Further, in the method for selecting and classifying the characteristics of the network multimedia service stream based on the multi-target adaptive evolution algorithm, the classification result C21 is an online live video, and C22 is an online non-live video; the content of the C31 webpage is characters and pictures, and the content of the C32 webpage is characters, pictures and videos.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

1. compared with a single-target feature selection algorithm, the multi-target feature selection method adopting the multi-target self-adaptive evolution algorithm not only considers the classification accuracy rate, but also considers the number of the selected features; compared with the existing multi-target feature selection algorithm, the method has the advantages of lower calculation complexity, higher convergence rate, capability of effectively reducing the time and space overhead in the feature selection process and improvement on the efficiency of feature selection.

2. The invention adopts a multi-layer classification method for multimedia services, designs a three-layer KNN cascade classifier, firstly selects effective characteristic combinations by using the characteristic selection method of the invention, and then classifies by using the three-layer classifier of the invention. Compared with the existing multi-layer SVM classification method, the method has better classification accuracy.

3. The method selects characteristics of six multimedia service flows of online standard definition live video, webpage browsing (Baidu), online audio, webpage browsing (sina), network voice chat and online standard definition non-live video, and then classifies the service flows by using a three-layer KNN classifier. The experimental result shows that the method has higher recognition rate compared with GR, EA and ENORA feature selection algorithms, and the total accuracy is 98.6%.

Drawings

FIG. 1 is a flow chart of the classification method of the present invention.

Fig. 2 is a diagram verifying the validity of the feature combinations selected in the present invention, in which (a) is a two-dimensional distribution of the four network service flows over the maximum downlink packet size and the uplink byte count, (b) is a two-dimensional distribution of the two video types over the maximum downlink packet size and the uplink byte count, and (c) is a two-dimensional distribution of the two web browsing types over the maximum downlink packet size and the downlink byte rate skewness.

Fig. 3 is a chart comparing the classification accuracy of the present invention with that of GR, EA and ENORA on the six multimedia services.

Detailed Description

The technical scheme of the invention is further explained in detail by combining the attached drawings:

it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As shown in fig. 1, the present invention provides a method for selecting and classifying multimedia service stream features based on a multi-objective adaptive evolution algorithm, the method comprises data acquisition and preprocessing of multimedia service streams, multimedia service stream feature selection based on the multi-objective adaptive evolution algorithm, three-layer KNN cascade classification output statistical results, and the like, and comprises the following steps:

Step 1: data collection and preprocessing, with the following specific steps:

(1) in an open internet environment, capturing required multimedia service stream data through network packet analysis software WireShark, and then converting original data into a standard five-tuple text format, namely arrival time of a data packet, a source IP address, a destination IP address, a protocol and packet size of the data packet;

(2) performing basic statistical feature calculation on a standard five-tuple file of an original multimedia service stream, wherein the features comprise: uplink/downlink packet size, entropy of uplink/downlink packet size information, overall packet size, uplink/downlink packet arrival time interval, downlink data packet rate, downlink byte rate, and ratio of uplink and downlink byte number.

Step 2: feature selection and analysis, with the following specific steps:

(1) rank all features by information gain ratio and filter out the features whose value is below the correlation threshold of 0.4;

(2) coding selection: a binary code is chosen, and each individual consists of a string of bits whose length is the number of features N. Each bit takes one of two values, where 1 indicates that the feature is selected and 0 indicates that it is not selected;

(3) population initialization: randomly initialize the value of q within the range [1, N], and select the q features with the highest information gain ratio to form the initial population P₀, that is, set the first q bits to 1 and bits q+1 to N to 0;

(4) each individual I has two fitness functions f₁(I) and f₂(I), corresponding to the two objective functions of the multi-objective optimization. f₁(I) is the inconsistency rate: a feature combination in a sample instance is called a pattern; for each pattern of the feature subset, the inconsistency count equals the total number of samples in which the pattern occurs minus the number of those samples that carry the most frequent class label; the inconsistency rate equals the sum of the inconsistency counts over all patterns divided by the total number of samples. f₂(I) is the number of selected features;

(5) selecting a parent: selecting a parent based on the crowding distance of the individual;

(6) adaptive crossover: first fix the crossover probability p_c = 0.1; then, for any two individuals I and J of generation P_t, if a Bernoulli random variable takes the value 1 with probability p_c, c_J is randomly set to 0 or 1, and the value of c_J is assigned to c_I. If c_J is 0, no crossover is performed; if it is 1, uniform crossover is performed. The new individuals generated by crossover are added to the population of generation P_{t+1};

(7) adaptive mutation: first fix the mutation probability p_m = 0.1; for an individual I of generation P_t, if a Bernoulli random variable takes the value 1 with probability p_m, m_I is randomly set to 0 or 1. If m_I is 0, no mutation is performed; if it is 1, single-point flip mutation is performed. The new individuals generated by mutation are added to the population of generation P_{t+1}, and t = t + 1 is executed;

(8) if the maximum number of iterations gen = 10 is reached, or the inconsistency rate stays unchanged during the iterations, output the optimal feature subset; otherwise, repeat steps (4) to (7).
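Steps (1) and (3) above, the gain-ratio filter and the ranked initialization, can be sketched as follows. The sketch assumes discretized feature values and the usual definition of the gain ratio (information gain divided by the split information of the feature); the function names are hypothetical. In `init_population`, bit position 0 corresponds to the top-ranked feature, so "the first q bits" means the q features with the highest gain ratio.

```python
import math
import random
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(column, labels):
    """Information gain of a discrete feature divided by its split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(column)
    if split_info <= 0:              # constant feature carries no information
        return 0.0
    return (entropy(labels) - conditional) / split_info

def rank_and_filter(X, labels, threshold=0.4):
    """Feature indices with gain ratio >= threshold, best first."""
    scores = [gain_ratio([row[j] for row in X], labels) for j in range(len(X[0]))]
    kept = [j for j in range(len(scores)) if scores[j] >= threshold]
    return sorted(kept, key=lambda j: -scores[j])

def init_population(n_features, popsize, rng):
    """Each individual selects the q top-ranked features, q drawn from [1, N]."""
    population = []
    while len(population) < popsize:
        q = rng.randint(1, n_features)
        population.append([1] * q + [0] * (n_features - q))
    return population

X = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
labels = ["a", "a", "b", "b"]
ranked = rank_and_filter(X, labels)          # only feature 0 passes the filter
population = init_population(n_features=5, popsize=10, rng=random.Random(1))
```

Seeding the population with prefixes of the ranking biases the search toward informative features from the first generation, rather than starting from uniformly random bit strings.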

In the experiment, a three-layer KNN cascade classifier model is designed; using the feature combinations selected by the method, each classifier level identifies specific types of application service. The first-layer KNN classifier identifies online audio (QQ Music), online video (live and non-live), web browsing, and network voice chat (Skype); its optimal feature combination is the maximum downlink packet size and the uplink byte count. For ease of observation, a logarithm is applied to the data in (a) of fig. 2. As can be seen from (a) of fig. 2, Skype is interactive audio, so its uplink byte count is higher than that of web browsing and QQ Music but lower than that of online live video; Skype and QQ Music can therefore be identified efficiently from the maximum downlink packet size and the uplink byte count.

The second-layer KNN classifier further divides the video flows identified by the first layer into online live video and online non-live video. Its optimal feature combination is the uplink byte count. CBox is a live video service, and the data exchanged between its client and server is clearly greater than for the non-live video service Youku; as can be seen from (b) of fig. 2, the uplink byte count alone completely separates live video from non-live video.

The third-layer KNN classifier further divides the web browsing flows identified by the first layer into Baidu (web content of text and pictures) and sina (web content of text, pictures and videos). Its optimal feature combination is the maximum downlink packet size and the downlink byte rate skewness. Because the data packets of video services are larger than those of other service flows and sina pages include video content, the maximum downlink packet size of sina is slightly larger than that of Baidu. As shown in (c) of fig. 2, the maximum downlink packet size and the downlink byte rate skewness accurately distinguish sina and Baidu service flows.

Step 3: service flow classification and verification, with the following specific steps:

(1) performing feature selection on an original multimedia service stream by adopting a feature selection method, performing first-layer KNN classification, and classifying the multimedia stream into 4 types of C1, C2, C3 and C4; wherein, C1 is online audio (QQ Music), C2 is online video (live broadcast and non-live broadcast), C3 is web browsing, C4 is network voice chat (Skype);

(2) carrying out feature selection on the video stream features of the C2 obtained by the previous layer of classification by using a feature selection method again, and carrying out KNN classification of the second layer to obtain classification results C21 and C22; wherein, C21 is an online live video, and C22 is an online non-live video;

(3) performing feature selection on the data stream features of the classification result C3 in the step (1) again by using a feature selection method, and performing second KNN classification on a second layer to obtain classification results C31 and C32; wherein, C31 is Baidu (the webpage content is characters and pictures), C32 is sina (the webpage content is characters, pictures and videos);

(4) and counting the classification output result and calculating the integral classification accuracy.

In the experiment, two-fold cross validation is adopted, and the classification result of the invention is compared with the results of GR, EA and ENORA. As can be seen from FIG. 3, the method of the present invention has the highest overall classification accuracy, which is as high as 98.6%.
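Two-fold cross-validation can be sketched as follows. The text does not say how the two folds were formed, so the random, unstratified split is an assumption, as are the function name and the `fit_predict(train_X, train_y, test_X)` callback, which stands in for any classifier.

```python
import random

def two_fold_accuracy(X, y, fit_predict, seed=0):
    """Overall accuracy when each half of the data is tested once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)          # assumption: random, unstratified split
    half = len(idx) // 2
    folds = [idx[:half], idx[half:]]
    correct = 0
    # First pass tests on fold 0 and trains on fold 1; second pass swaps them.
    for test_idx, train_idx in (folds, folds[::-1]):
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)

def sign_rule(train_X, train_y, test_X):
    """Toy stand-in classifier that ignores the training data."""
    return ["pos" if x[0] > 0 else "neg" for x in test_X]

X = [(1,), (2,), (-1,), (-2,)]
y = ["pos", "pos", "neg", "neg"]
accuracy = two_fold_accuracy(X, y, sign_rule)
```

Every sample is predicted exactly once, so the returned value is the overall accuracy over the whole data set rather than an average of per-fold accuracies.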

The foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims

1. The network service flow characteristic selection and classification method based on the multi-target self-adaptive evolution algorithm is characterized by comprising the following steps of:

(2) feature selection and analysis: analyzing the statistical characteristics of the network data flow samples, and selecting a characteristic combination which effectively distinguishes the service flows, specifically comprising the following steps:

(3.2) code selection: a binary code whose length is the number of features N is selected, and each coded individual consists of a string of bits; each bit takes one of two values, where 1 indicates that the feature is selected and 0 indicates that it is not selected; each individual is represented as:

$$I = (b_1^I, b_2^I, \ldots, b_N^I, c_I, m_I)$$

wherein b_i^I ∈ {0, 1} for i = 1, …, N, c_I ∈ {0, 1}, and m_I ∈ {0, 1}; c_I and m_I respectively represent the discrete parameters that control adaptive crossover and mutation in each coded individual;

(3.6) adaptive crossover:

fix the crossover probability p_c; for any two individuals I and J of generation t, if a Bernoulli random variable takes the value 1 with probability p_c, c_J is randomly set to 0 or 1, and the value of c_J is assigned to c_I; if c_J is 0, no crossover is performed, and if it is 1, uniform crossover is performed;

the new individuals generated by crossover are added to the auxiliary population Q_t;

(3.7) adaptive mutation:

fix the mutation probability p_m; for an individual I of generation t, if a Bernoulli random variable takes the value 1 with probability p_m, m_I is randomly set to 0 or 1; if m_I is 0, no mutation is performed, and if it is 1, single-point flip mutation is performed;

set t = t + 1;

(3.8) if the maximum iteration number gen is met or the inconsistency rate is kept unchanged in the iteration process, outputting an optimal feature subset; otherwise, repeating the step (3.4) to the step (3.7);

2. The method for selecting and classifying network traffic flow characteristics based on multi-objective adaptive evolution algorithm according to claim 1, wherein the data collection and preprocessing operation specifically comprises:

3. The method for selecting and classifying network traffic flow characteristics based on multi-objective adaptive evolution algorithm according to claim 1, wherein the inconsistency rate is defined as follows: a feature combination in a sample instance is called a pattern; for each pattern of a feature subset, the inconsistency count is the total number of samples in which the pattern occurs minus the number of those samples that carry the most frequent class label; the inconsistency rate equals the sum of the inconsistency counts over all patterns divided by the total number of samples.

4. The method for selecting and classifying network traffic flow characteristics based on multi-objective adaptive evolution algorithm according to claim 1, wherein the correlation threshold in step (3.1) is 0.4; N in step (3.2) is 25, 26 and 13 for the three classifier layers, respectively; the crossover probability p_c in step (3.6) and the mutation probability p_m in step (3.7) are both 0.1; popsize in step (3.7) is 100; and the maximum number of iterations gen in step (3.8) is 10.

5. The method for selecting and classifying network traffic flow characteristics based on multi-objective adaptive evolution algorithm according to claim 1, wherein the traffic flow classification step specifically comprises:

6. The method for selecting and classifying characteristics of network traffic streams based on multi-objective adaptive evolution algorithm according to claim 5, wherein the classification result C21 is an online live video and C22 is an online non-live video; the content of the C31 webpage is characters and pictures, and the content of the C32 webpage is characters, pictures and videos.