CN115051955B - Online flow classification method based on triple feature selection and incremental learning - Google Patents

Publication number
CN115051955B
CN115051955B CN202210714868.2A CN202210714868A CN115051955B CN 115051955 B CN115051955 B CN 115051955B CN 202210714868 A CN202210714868 A CN 202210714868A CN 115051955 B CN115051955 B CN 115051955B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210714868.2A
Other languages
Chinese (zh)
Other versions
CN115051955A (en)
Inventor
王兴伟
赵伟莨
王卓楠
贾杰
Original Assignee
东北大学

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour

Abstract

The application relates to an online flow classification method based on triple feature selection and incremental learning, comprising the following steps: based on pre-collected network traffic sample data, performing feature selection with a triple feature selection scheme and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method; based on real-time traffic, acquiring a feature data set for online training, processing the feature data in that set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree based on the processed feature data set, to obtain a decision model for online classification of network traffic. The method achieves real-time, fine-grained classification of long and short flows into cheetah flows, tortoise flows, porcupine flows and elephant flows, and provides a general online flow classification framework.

Description

Online flow classification method based on triple feature selection and incremental learning
Technical Field
The application belongs to the technical field of data processing, and particularly relates to an online flow classification method based on triple feature selection and incremental learning.
Background
Network flows exhibit a pronounced heavy-tailed distribution: a small number of long flows account for a large share of network traffic. Knowledge of the long flows therefore gives an overall picture of all flows passing through a link, which facilitates the management, monitoring and analysis of network traffic and is valuable in engineering applications such as traffic billing, security detection and traffic regulation. Identifying long flows also reduces the volume of data that must be processed and stored, improving the processing efficiency and resource utilization of the system.
First, among conventional long and short flow classification schemes, sampling-based methods are simple to implement but have large errors. LRU-based schemes rely on the fact that long flows last a long time, are large and access the cache frequently, so that long flows remain in the LRU cache; in actual measurement, however, when a large number of bursty short flows arrive, they fill the LRU cache space and long flow entries are evicted from it. Hash-table-based schemes may suffer hash collisions, and maintaining the hash table incurs significant overhead.
Second, flow classification schemes based on artificial intelligence have certain advantages for fast and accurate flow identification. However, the efficient sampling and flow classification scheme, the elephant and mouse flow classification algorithms based on naive Bayes and C4.5, and the long and short flow classification based on gated recurrent units all obtain their decision model through offline training. These methods require all training examples to be held in memory at the same time, which severely limits the number of training samples. As the data scale grows, the existing model cannot be updated in real time, and a new decision model can only be obtained by retraining. If the network environment changes substantially and the model generalizes poorly, the original model may no longer identify and classify traffic effectively.
In view of this, the present application provides an online flow classification method based on triple feature selection and incremental learning, which classifies traffic online in real time using a constructed decision model, so as to meet quality of service requirements.
Disclosure of Invention
(I) Technical problem to be solved
In view of the foregoing drawbacks and deficiencies of the prior art, the present application provides an online flow classification method based on triple feature selection and incremental learning.
(II) Technical solution
In order to achieve the above purpose, the present application adopts the following technical solution:
In a first aspect, the present application provides an online model acquisition method based on triple feature selection and incremental learning, including:
S10, based on pre-collected network traffic sample data, performing feature selection with a triple feature selection scheme, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method;
S20, acquiring a feature data set for online training based on real-time traffic, processing the feature data in the feature data set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree based on the processed feature data set, to obtain a decision model for online classification of network traffic;
wherein the network traffic sample data is processed in the offline mode of the software tool FNP-flowmeter to construct an offline feature data set;
and the feature data set for online training is an online feature data set extracted by analyzing real-time traffic in the FNP-flowmeter online mode within a preset timing period.
Optionally, the step S10 includes:
S11, based on the pre-collected network traffic sample data, processing the data in the FNP-flowmeter offline mode to construct an offline feature data set, and labeling each flow with a flow type determined by the flow's byte count and the flow's duration;
S12, based on the offline feature data labeled with flow types, constructing a labeled feature data set according to a triple feature selection algorithm, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method.
Optionally, constructing the labeled feature data set according to the triple feature selection algorithm in S12 includes:
S12-1, preprocessing the offline feature data labeled with flow types;
the preprocessing comprises: calculating the correlation between each feature and the category using symmetric uncertainty, to obtain a subset of features that are relevant to the category;
and removing redundant features from the subset of features relevant to the category, to obtain a feature data subset without redundant features;
S12-2, reducing the initial d-dimensional feature space of the feature data subset without redundancy to a k-dimensional feature subspace using a sequential feature selector, to obtain a dimension-reduced feature data subset;
S12-3, using a feature occurrence frequency selector to screen all feature sequences of the dimension-reduced feature data subset that fall within a threshold m, counting how often each feature occurs across the selected feature sequences, returning the feature sequence SF, sorted in descending order and containing the features that occur more than once, together with the occurrence frequency FQ of each feature, labeling the features obtained by the screening, and constructing a labeled feature data set based on the labeled features.
Optionally, after S11 and before S12, the method further includes: balancing the offline feature data set.
Optionally, the step S20 includes:
S21: initializing a timer;
S22: analyzing the real-time traffic in the FNP-flowmeter online mode and extracting online feature data;
S23: based on the online feature data, labeling each flow with a flow type determined by the flow's byte count and the flow's duration, to obtain labeled online feature data;
S24: determining whether the timer has exceeded its time limit; if not, repeating steps S22 to S23; otherwise, executing S25;
S25: updating the initial decision model with the labeled online feature data through the online mode of the Hoeffding Anytime Tree, to obtain a decision model for online classification of network traffic.
Optionally, the flow type is determined as follows:
if $S_l < S$ and $T_l < T$, the flow type is cheetah flow;
if $S_l < S$ and $T_l \ge T$, the flow type is tortoise flow;
if $S_l \ge S$ and $T_l < T$, the flow type is porcupine flow;
if $S_l \ge S$ and $T_l \ge T$, the flow type is elephant flow;
where $S_l$ is the flow byte count in the online feature data, $S$ is the flow byte count threshold, $T_l$ is the flow duration in the online feature data, and $T$ is the flow duration threshold.
In a second aspect, the present application provides an online flow classification method based on triple feature selection and incremental learning, including:
A01, acquiring a trained decision model for online classification of network traffic, the model being obtained from pre-collected network traffic and online network traffic;
A02, capturing traffic in real time, and grouping the captured traffic into flows according to a preset five-tuple;
the five-tuple comprises: source IP address, destination IP address, source port number, destination port number, transport layer protocol;
A03, performing feature preprocessing on the grouped data packets to obtain preprocessed feature data;
A04, classifying the preprocessed feature data with the decision model to obtain the flow type of the traffic captured in real time, and outputting the five-tuple and the category information of each group;
the decision model for online classification of network traffic is a decision model obtained by the online model acquisition method provided in the first aspect.
Optionally, the step A03 includes:
A03-1, for each flow that meets a preset condition, extracting features according to the feature sequence selected by the triple feature selection algorithm, to obtain a feature data entry;
the preset condition is: for traffic captured in real time, the number of packets in the flow reaches a preset value N;
and A03-2, taking the feature data entry as the input variable of the decision model, and determining the flow type of the traffic captured in real time.
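The trigger in A03 (classify a flow once N of its packets have been seen) can be sketched as below. This is only an illustration under stated assumptions: the class name `EarlyFlowClassifier` and the `extract` and `predict` callables are hypothetical stand-ins for the feature extraction and the decision model, and are not named in the application.

```python
from collections import defaultdict


class EarlyFlowClassifier:
    """Classify a flow once its first n_packets packets have been seen."""

    def __init__(self, n_packets, extract, predict):
        self.n_packets = n_packets
        self.extract = extract   # hypothetical: packets -> feature data entry
        self.predict = predict   # hypothetical: feature data entry -> flow type
        self.buffers = defaultdict(list)
        self.labels = {}

    def on_packet(self, key, packet):
        """key is the flow's five-tuple; returns the flow type or None."""
        if key in self.labels:
            return self.labels[key]      # flow was already classified
        buf = self.buffers[key]
        buf.append(packet)
        if len(buf) < self.n_packets:
            return None                  # preset condition (N packets) not met yet
        self.labels[key] = self.predict(self.extract(buf))
        del self.buffers[key]            # the packet buffer is no longer needed
        return self.labels[key]
```

Buffering only until N packets arrive keeps per-flow memory bounded, which is one plausible reason for such a preset condition.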
(III) Beneficial effects
The method performs fine-grained classification of long and short flows based on an online decision tree (conventional schemes distinguish only elephant flows and mouse flows, whereas the fine-grained scheme distinguishes cheetah flows, tortoise flows, porcupine flows and elephant flows). Through the triple feature selection algorithm, irrelevant and redundant features in the network traffic data are deleted, the d-dimensional feature sequences obtained from the network traffic data are reduced in dimension, the occurrence counts and frequencies of each feature across the different feature sequences that meet the prediction accuracy threshold are tallied, and the feature sequence sorted in descending order and meeting the preset condition is returned. As a result, the training time on network traffic data can be greatly shortened, high-accuracy prediction can be performed with as few features as possible, the curse of dimensionality can be effectively avoided, the generalization capability of the resulting decision model can be enhanced, and overfitting can be prevented.
Drawings
The application is described with the aid of the following figures:
FIG. 1 is a flow diagram of an online model acquisition method based on triple feature selection and incremental learning;
FIG. 2 is a schematic diagram of a construction flow of an initial decision model;
FIG. 3 is a flow chart of processing a pcap file based on FNP-flowmeter;
FIG. 4 is a schematic diagram of the FNP-flowmeter offline mode;
FIG. 5 is a schematic diagram of a triple feature selection algorithm;
FIG. 6 is a schematic diagram of a flow chart for constructing an online classification decision model;
FIG. 7 is a schematic diagram of the FNP-flowmeter on-line mode;
FIG. 8 is a flow diagram of an online flow classification method based on triple feature selection and incremental learning.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings. It is to be understood that the specific embodiments described below are merely illustrative of the related invention, and not restrictive of the invention. In addition, it should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other; for convenience of description, only parts related to the invention are shown in the drawings.
Embodiment one provides an online model acquisition method based on triple feature selection and incremental learning, as shown in FIG. 1; the method comprises steps S10 and S20:
S10, performing feature selection with a triple feature selection scheme based on pre-collected network traffic sample data, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method.
S20, acquiring a feature data set for online training based on real-time traffic, processing the feature data in the feature data set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree based on the processed feature data set, to obtain a decision model for online classification of network traffic.
In this embodiment, the network traffic sample data is processed in the offline mode of the software tool FNP-flowmeter provided by the application, to construct an offline feature data set.
In this embodiment, the feature data set for online training is an online feature data set extracted by analyzing real-time traffic in the FNP-flowmeter online mode within a preset timing period.
In embodiment one of the application, long and short flows are classified at fine granularity based on the Hoeffding Anytime Tree algorithm, and features are selected through the triple feature selection algorithm, so that the training time on network traffic data is greatly shortened, high-accuracy prediction is performed with as few features as possible, the curse of dimensionality is effectively avoided, the generalization capability of the resulting decision model is enhanced, and overfitting is prevented.
Embodiment two provides an online model acquisition method based on triple feature selection and incremental learning, with the following specific steps:
S10, performing feature selection with a triple feature selection scheme based on pre-collected network traffic sample data, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method.
As shown in FIG. 2, step S10 is implemented through steps S11 to S12 to construct the initial decision model, as follows:
S11, based on the pre-collected network traffic sample data, processing the data in the FNP-flowmeter offline mode to construct an offline feature data set, and labeling each flow with a flow type determined by the flow's byte count and the flow's duration.
Regarding the flow feature extraction tool FNP-flowmeter used in step S11, the following should be noted:
As the flow feature extraction software tool provided by the application, FNP-flowmeter has an offline mode and an online mode. The offline mode obtains the feature data of flows from a pcap file captured by a traffic capture tool such as Wireshark and constructs a feature data set. FIG. 3 is a schematic flow chart of processing a pcap file with FNP-flowmeter.
In this embodiment, as shown in FIG. 3, empty tcpflowly and udpflowly structures are first constructed to store flow information; data packets are read one by one from the pcap file, and the traffic is filtered so that only packets whose transport layer protocol is TCP or UDP are kept; it is then determined whether the packet just read is empty: if so, processing ends; if not, it is further determined whether the transport layer is TCP.
In this embodiment, as shown in FIG. 3, if the transport layer is TCP, it is further determined whether the ordered five-tuple is in tcpflowly; if the transport layer is not TCP, it is further determined whether the ordered five-tuple is in udpflowly.
In this embodiment, as shown in FIG. 3, if the ordered five-tuple is in tcpflowly, the corresponding five-tuple entry is looked up, the packet information is added to it, and it is determined whether the TCP flow has completed the four-way connection teardown or the RST flag is 1; if the ordered five-tuple is not in tcpflowly, the ordered five-tuple is added to tcpflowly together with the packet information, and processing returns to reading data packets one by one from the pcap file and filtering the traffic to keep TCP or UDP packets.
In this embodiment, as shown in FIG. 3, if the TCP flow has completed the four-way connection teardown or the RST flag is 1, packet-level and flow-level feature statistics are computed and the feature sequence is returned, after which processing returns to reading data packets one by one from the pcap file and filtering the traffic to keep TCP or UDP packets; if the TCP flow has neither completed the four-way teardown nor set the RST flag to 1, processing returns directly to reading and filtering packets.
In this embodiment, as shown in FIG. 3, if the ordered five-tuple is in udpflowly, the corresponding five-tuple entry is looked up, the packet information is added to it, and it is determined whether the UDP flow has timed out; if the ordered five-tuple is not in udpflowly, the ordered five-tuple is added to udpflowly together with the packet information, and processing returns to reading data packets one by one from the pcap file and filtering the traffic to keep TCP or UDP packets.
In this embodiment, as shown in FIG. 3, if the UDP flow has timed out, packet-level and flow-level feature statistics are computed and the feature sequence is returned, after which processing returns to reading data packets one by one from the pcap file and filtering the traffic to keep TCP or UDP packets; if the UDP flow has not timed out, processing returns directly to reading and filtering packets.
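The loop described for FIG. 3 can be sketched as follows. This is a simplified illustration and not the FNP-flowmeter implementation: packets are modelled as plain records rather than parsed from a pcap file, the `Packet` fields and the `UDP_TIMEOUT` value are hypothetical, a TCP flow is treated as closed as soon as a FIN has been seen from both endpoints or an RST arrives, and a UDP flow is treated as timed out when the gap since its previous packet exceeds the timeout, checked before the new packet is stored.

```python
from collections import namedtuple

# Hypothetical packet record; a real tool would parse these fields from a pcap file.
Packet = namedtuple("Packet", "ts src dst sport dport proto length flags")

UDP_TIMEOUT = 60.0  # seconds; assumed value, the text does not state one


def flow_key(pkt):
    """Order the two endpoints so both directions map to one five-tuple key."""
    lo, hi = sorted([(pkt.src, pkt.sport), (pkt.dst, pkt.dport)])
    return (lo[0], hi[0], lo[1], hi[1], pkt.proto)


def flow_stats(pkts):
    """Tiny stand-in for the packet-level and flow-level feature statistics."""
    return {
        "packets": len(pkts),
        "bytes": sum(p.length for p in pkts),
        "duration": pkts[-1].ts - pkts[0].ts,
    }


def process(packets):
    """Group packets into flows and emit (key, stats) for each finished flow."""
    tcp_flows, udp_flows, finished = {}, {}, []
    for pkt in packets:
        if pkt.proto not in ("TCP", "UDP"):
            continue  # keep only TCP and UDP traffic
        table = tcp_flows if pkt.proto == "TCP" else udp_flows
        key = flow_key(pkt)
        if (pkt.proto == "UDP" and key in table
                and pkt.ts - table[key][-1].ts > UDP_TIMEOUT):
            finished.append((key, flow_stats(table.pop(key))))  # idle too long
        table.setdefault(key, []).append(pkt)
        if pkt.proto == "TCP":
            fin_senders = {p.src for p in table[key] if "F" in p.flags}
            if "R" in pkt.flags or len(fin_senders) == 2:
                finished.append((key, flow_stats(table.pop(key))))
    return finished
```

Sorting the endpoints before building the key is one way to obtain an "ordered five-tuple", so that request and response packets land in the same flow entry.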
In this embodiment, the offline feature data set is constructed in the FNP-flowmeter offline mode; a schematic diagram of the FNP-flowmeter offline mode is shown in FIG. 4.
In this embodiment, the offline feature data set also needs to be balanced, after which it is divided into a training set and a test set.
In this embodiment, the balancing method may be SMOTE, but is not limited to SMOTE.
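To illustrate what SMOTE-style balancing does, the following is a minimal pure-Python sketch: each synthetic sample is placed at a random position on the segment between a minority-class sample and one of its k nearest minority-class neighbours. The function name, its parameters and the default values are assumptions made for this example; in practice a library implementation such as the SMOTE class in imbalanced-learn would more likely be used.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    minority: list of feature vectors (lists of numbers), at least two of them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies between two existing minority samples, the minority class grows without simply duplicating rows.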
S12, based on the offline feature data labeled with flow types, constructing a labeled feature data set according to the triple feature selection algorithm, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method.
Regarding the triple feature selection algorithm used in step S12, the following should be noted:
In this embodiment, the triple feature selection algorithm comprises a fast correlation-based filter (Fast Correlation-Based Filter, FCBF), a sequential feature selector (Sequential Feature Selector, SFS), and a feature occurrence frequency selector (Feature Occurrence Frequency Based Selector, FOFBS).
In this embodiment, irrelevant features and redundant features are removed by the FCBF algorithm.
In this embodiment, the initial d-dimensional feature space is reduced to a k-dimensional feature subspace by the sequential feature selector SFS, where k is smaller than d; specifically, the sequential feature selector is built on greedy search algorithms such as sequential forward selection, sequential backward selection, sequential floating forward selection and sequential floating backward selection.
In this embodiment, the feature occurrence frequency selector FOFBS counts how often each feature occurs across the feature sequences and sorts the features in descending order of frequency.
The application process of the triple feature selection algorithm is described in detail below, as shown in FIG. 5:
In this embodiment, DS is the feature data set obtained by feature extraction and labeling of the pre-collected network traffic samples. Before any screening, the feature data set contains relevant features $X_{relevant}$, irrelevant features $X_{irrelevant}$ and redundant features $X_{redundancy}$, and the total number of features is $N_{initial}$; that is, $DS = \{X_{relevant}, X_{irrelevant}, X_{redundancy}\}$.
First, regarding feature selection by the FCBF algorithm, the following should be noted:
Feature selection by the FCBF algorithm removes the irrelevant features $X_{irrelevant}$ and the redundant features $X_{redundancy}$, yielding the feature data subset without redundancy $F_{relevant}$, whose number of features is $N_{FCBF}$ and satisfies $N_{FCBF} \le N_{initial}$. The feature data subset without redundancy is expressed as follows:
$F_{relevant} = \{X_{relevant}\}$
Second, regarding feature selection by the sequential feature selector SFS, the following should be noted:
Features are selected by executing the sequential feature selector SFS n times, and ACCfeaturelist stores the feature sequence obtained from each execution together with the corresponding test accuracy. Specifically, $Acc_n$ is the test accuracy obtained in the n-th execution of SFS, and $Fk_n$ is the feature sequence formed by the k features selected from $F_{relevant}$ in that execution; ACCfeaturelist therefore holds one pair $(Acc_i, Fk_i)$ per execution, for $i = 1, \ldots, n$.
Third, regarding feature selection by the feature occurrence frequency selector FOFBS, the following should be noted:
Taking the highest test accuracy as the benchmark, the feature occurrence frequency selector FOFBS screens all feature sequences whose accuracy lies within the threshold m of that benchmark, counts how often each feature occurs across the selected feature sequences, and returns the feature sequence SF, sorted in descending order and containing the features that occur more than once, together with the occurrence frequency FQ of each feature.
Regarding the construction of the labeled feature data set according to the triple feature selection algorithm in step S12, the following should be noted:
S12-1, preprocessing the offline feature data labeled with flow types.
In this embodiment, the preprocessing includes: calculating the correlation between each feature and the category using symmetric uncertainty, to obtain a subset of features that are relevant to the category; and removing redundant features from that subset, to obtain a feature data subset without redundant features.
Regarding the preprocessing of the offline feature data in step S12-1, the following should be noted:
First, regarding the removal of irrelevant features:
In this embodiment, symmetric uncertainty (Symmetric Uncertainty, SU) is used to compute a correlation value SU between each feature and the category; a threshold is set for the correlation value, the features whose SU exceeds the threshold are sorted in descending order of SU, and the resulting sorted set is taken as the relevant feature data subset.
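As an illustration of this relevance filter, the sketch below computes symmetric uncertainty with its usual definition, $SU(X, Y) = 2 \cdot IG(X; Y) / (H(X) + H(Y))$, for discrete value sequences and keeps the features above a threshold in descending order of SU. The function names and the dictionary layout of `columns` are assumptions for this example, and continuous flow features would have to be discretized first, a step the text does not describe.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)); ranges from 0 to 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both sequences are constant
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    info_gain = h_x + h_y - h_xy     # mutual information between X and Y
    return 2.0 * info_gain / (h_x + h_y)


def relevant_features(columns, labels, threshold):
    """Return (name, SU) pairs with SU above the threshold, highest SU first."""
    scored = [(name, symmetric_uncertainty(col, labels))
              for name, col in columns.items()]
    kept = [s for s in scored if s[1] > threshold]
    return sorted(kept, key=lambda s: s[1], reverse=True)
```

SU normalizes information gain by the two entropies, which avoids favouring features that simply take many distinct values.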
Second, regarding the removal of redundant features:
In this embodiment, starting from the first feature $F_1$ in the relevant feature data subset, each feature $F_i$ that follows $F_1$ is examined: if the correlation value $SU_i$ between $F_i$ and the category is less than the correlation value $SU_1$ between $F_i$ and $F_1$, then $F_i$ is removed from the relevant feature data subset. Once the pass with $F_1$ as the reference is complete, the pass is repeated with the second feature $F_2$ of the relevant feature data subset as the reference, and so on, until no further features are removed from the relevant feature subset or all features have been examined.
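The redundancy pass can be sketched as below, following the removal rule as stated in this section. The inputs are assumptions for the example: `ranked` is the list of feature names already sorted by SU with the category, `su_class` maps each feature to that SU value, and `su_pair` is a hypothetical callable returning the SU between two features.

```python
def remove_redundant(ranked, su_class, su_pair):
    """Drop features that are more correlated with an earlier feature than with the class.

    ranked: feature names sorted by SU with the class, descending.
    su_class: dict mapping feature name -> SU between that feature and the class.
    su_pair: callable (f, g) -> SU between features f and g.
    """
    kept = list(ranked)
    ref_idx = 0
    while ref_idx < len(kept):
        ref = kept[ref_idx]
        # Keep a later feature only if its class correlation is not smaller
        # than its correlation with the current reference feature.
        survivors = [f for f in kept[ref_idx + 1:] if su_class[f] >= su_pair(f, ref)]
        kept = kept[:ref_idx + 1] + survivors
        ref_idx += 1  # the next surviving feature becomes the reference
    return kept
```

Because the list is sorted by relevance, each surviving feature is only ever compared against features that are at least as relevant as itself.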
S12-2, reducing the initial d-dimensional feature space of the feature data subset without redundancy to a k-dimensional feature subspace using a sequential feature selector, to obtain a dimension-reduced feature data subset.
Regarding the acquisition of the dimension-reduced feature data subset in step S12-2, the following should be noted:
The initial d-dimensional feature space is reduced to a k-dimensional feature subspace by executing the sequential feature selector n times, and the feature sequence and corresponding accuracy obtained from each execution are added to ACCfeaturelist.
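A minimal sketch of this step, using sequential forward selection as the greedy search. The text does not say how the n executions differ from one another; the sketch assumes one scoring function per execution (for example, one per data split), and the names `score`, `scorers` and `acc_feature_list` are hypothetical.

```python
def sequential_forward_selection(features, k, score):
    """Greedily grow a subset of size k; score(subset) returns e.g. test accuracy."""
    selected = []
    remaining = list(features)
    while len(selected) < k and remaining:
        # Add the single feature that scores best together with the current subset.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected, score(selected)


def run_sfs_n_times(features, k, scorers):
    """Run SFS once per scorer and collect (accuracy, feature sequence) pairs."""
    acc_feature_list = []
    for score in scorers:
        subset, acc = sequential_forward_selection(features, k, score)
        acc_feature_list.append((acc, subset))
    return acc_feature_list
```

In a real pipeline `score` would train and evaluate a classifier on the candidate subset, which is what makes wrapper methods like SFS costly and motivates filtering with FCBF first.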
S12-3, using a feature occurrence frequency selector to screen all feature sequences of the dimension-reduced feature data subset that fall within a threshold m, counting how often each feature occurs across the selected feature sequences, returning the feature sequence SF, sorted in descending order and containing the features that occur more than once, together with the occurrence frequency FQ of each feature, labeling the features obtained by the screening, and constructing a labeled feature data set based on the labeled features.
Regarding the feature screening in step S12-3, the following should be noted:
The highest accuracy is determined from the accuracies of the feature sequences obtained in step S12-2; a threshold m is set relative to that highest accuracy, all feature sequences within the threshold m are selected, and the number of times each feature occurs across the selected feature sequences is counted.
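The screening can be sketched as follows. The sketch assumes that "within the threshold m" means an accuracy no more than m below the highest accuracy, and that ties in frequency are broken by feature name; neither detail is stated in the text, and the function name is hypothetical.

```python
from collections import Counter


def feature_occurrence_selector(acc_feature_list, m):
    """Return (SF, FQ): features occurring more than once, and their counts.

    acc_feature_list: list of (accuracy, feature_sequence) pairs from the SFS runs.
    m: allowed accuracy gap below the best run (assumed interpretation).
    """
    best_acc = max(acc for acc, _ in acc_feature_list)
    # Keep the sequences whose accuracy lies within m of the best accuracy.
    chosen = [feats for acc, feats in acc_feature_list if best_acc - acc <= m]
    counts = Counter(f for feats in chosen for f in feats)
    # Features seen more than once, most frequent first (ties broken by name).
    sf = sorted((f for f, c in counts.items() if c > 1),
                key=lambda f: (-counts[f], f))
    fq = [counts[f] for f in sf]
    return sf, fq
```

For instance, with runs `[(0.95, ["a", "b", "c"]), (0.94, ["a", "b", "d"]), (0.93, ["a", "e", "f"]), (0.80, ["x", "y", "z"])]` and `m = 0.03`, the last run is discarded and only `a` and `b` occur more than once.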
S20, acquiring a feature data set for online training based on real-time traffic, processing the feature data in the feature data set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree based on the processed feature data set, to obtain a decision model for online classification of network traffic.
As shown in FIG. 6, step S20 is implemented through steps S21 to S25 to construct the online classification decision model, as follows:
S21: initializing a timer.
S22: analyzing the real-time traffic in the FNP-flowmeter online mode and extracting online feature data.
In this embodiment, the implementation of the FNP-flowmeter online mode is shown in FIG. 7.
S23: based on the online feature data, labeling each flow with a flow type determined by the flow's byte count and the flow's duration, to obtain labeled online feature data.
Regarding step S23, the flow type is determined as follows:
If $S_l < S$ and $T_l < T$, the flow type is cheetah flow.
If $S_l < S$ and $T_l \ge T$, the flow type is tortoise flow.
If $S_l \ge S$ and $T_l < T$, the flow type is porcupine flow.
If $S_l \ge S$ and $T_l \ge T$, the flow type is elephant flow.
In this embodiment, $S_l$ is the flow byte count in the online feature data, $S$ is the flow byte count threshold, $T_l$ is the flow duration in the online feature data, and $T$ is the flow duration threshold.
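The two-threshold labeling rule amounts to a small function. The sketch below uses the type names from the abstract (cheetah, tortoise, porcupine, elephant); the function and parameter names are chosen for illustration, and the threshold values are left to the caller because the text does not fix them.

```python
def classify_flow(flow_bytes, duration, byte_threshold, duration_threshold):
    """Label a flow from its byte count S_l and duration T_l."""
    is_large = flow_bytes >= byte_threshold     # S_l >= S
    is_long = duration >= duration_threshold    # T_l >= T
    if not is_large and not is_long:
        return "cheetah"     # small and short
    if not is_large and is_long:
        return "tortoise"    # small but long-lived
    if is_large and not is_long:
        return "porcupine"   # large but short-lived
    return "elephant"        # large and long-lived
```

Using `>=` for both comparisons means a flow that sits exactly on a threshold falls into the larger or longer class, matching the inequalities above.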
S24: Check whether the timer has exceeded the time limit; if not, repeat steps S22 to S23; otherwise, execute S25.
S25: Update the initial decision model with the labeled online feature data through the online mode of the Hoeffding Anytime Tree, obtaining a decision model for online classification of network traffic.
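The control flow of steps S21 to S25 can be sketched as a timed collection loop followed by an incremental update. Everything named below is an assumption made for illustration: `MajorityClassLearner` is a trivial stand-in where a Hoeffding Anytime Tree implementation would go, `learn_one` is a hypothetical per-sample update method, and the injectable `clock` exists only so the loop can be exercised deterministically.

```python
import time
from collections import Counter


class MajorityClassLearner:
    """Stand-in incremental learner (hypothetical); not a Hoeffding tree."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


def online_update(model, flow_source, label_fn, time_limit_s, clock=time.monotonic):
    """Collect labeled flows until the timer expires, then update the model."""
    deadline = clock() + time_limit_s              # S21: initialize the timer
    labeled = []
    for features in flow_source:                   # S22: extract online features
        labeled.append((features, label_fn(features)))  # S23: label the flow
        if clock() >= deadline:                    # S24: time limit reached?
            break
    for x, y in labeled:                           # S25: incremental update
        model.learn_one(x, y)
    return model
```

Because the update is incremental, the same function could be called again on the returned model for the next timing period.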
In the second embodiment of the application, long and short flows are classified at fine granularity using the Hoeffding Anytime Tree algorithm, and features are selected by the triple feature selection algorithm. Specifically, the d-dimensional feature sequences obtained from the network traffic data are reduced in dimension by deleting irrelevant and redundant features; the occurrence counts and frequencies of each feature across the feature sequences that meet the prediction accuracy threshold are counted; and the feature sequences that satisfy the preset condition are returned in descending order. As a result, the training time on network traffic data is greatly shortened, high-accuracy prediction is achieved with as few features as possible, the curse of dimensionality is effectively avoided, the generalization ability of the resulting decision model is enhanced, and overfitting is prevented.
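Claim 1 names symmetric uncertainty as the measure used to filter out features that are irrelevant to the class. Its standard definition is SU(X, Y) = 2 · I(X; Y) / (H(X) + H(Y)), where H is entropy and I is mutual information. The sketch below computes it for discrete values; the function names are hypothetical, and continuous traffic features would first need to be discretized, a step not shown here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant; no information to share
    joint = entropy(list(zip(x, y)))
    mutual_info = hx + hy - joint
    return 2.0 * mutual_info / (hx + hy)
```

A value near 1 means the feature almost determines the class, while a value near 0 marks the feature as a candidate for removal.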
The third embodiment provides an online flow classification method based on triple feature selection and incremental learning. As shown in FIG. 8, the steps of the method are as follows:
A01, acquiring a decision model for online classification of network traffic, trained on pre-collected network traffic and online network traffic.
A02, capturing traffic in real time, and grouping the captured traffic into flows by a preset five-tuple.
In this embodiment, the five-tuple includes: source IP address, destination IP address, source port number, destination port number, transport layer protocol.
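Grouping captured packets by the five-tuple can be sketched as below. The packet representation (a dictionary with the field names shown) is a hypothetical choice for illustration, and the key is unidirectional; whether the two directions of a connection are merged into one flow is not stated in this passage.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def group_by_five_tuple(packets):
    """Map each five-tuple to the list of packets that carry it."""
    flows = defaultdict(list)
    for pkt in packets:
        key = FiveTuple(
            pkt["src_ip"], pkt["dst_ip"],
            pkt["src_port"], pkt["dst_port"],
            pkt["protocol"],
        )
        flows[key].append(pkt)
    return flows
```

Using a named tuple as the key keeps the flow identifier hashable and lets later steps output it alongside the predicted class.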
A03, performing feature preprocessing on the grouped packets to obtain preprocessed feature data.
In step A03, for each flow that satisfies a preset condition, the features in the feature sequence selected by the triple feature selection algorithm are extracted to obtain a feature data entry; the feature data entry is then used as the input variable of the decision model to determine the flow type of the traffic captured in real time.
In this embodiment, the preset condition is that the number of packets in a flow of the real-time captured traffic reaches a preset value N.
A04, classifying the preprocessed feature data with the decision model to obtain the flow type of the traffic captured in real time, and outputting the five-tuple of each flow together with its class information.
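Steps A03 and A04 can be sketched as a function that classifies only the flows that have reached N packets. All names below are hypothetical: `basic_stats` is a toy extractor, `PlaceholderModel` and its rule are invented stand-ins for the trained decision model, and using only the first N packets of a flow is an assumption about how the preset condition is applied.

```python
def basic_stats(packets):
    """Hypothetical extractor: a few per-flow statistics over packet sizes."""
    sizes = [p["size"] for p in packets]
    return {
        "total_bytes": sum(sizes),
        "mean_size": sum(sizes) / len(sizes),
        "max_size": max(sizes),
    }


class PlaceholderModel:
    """Placeholder for the trained decision model; the rule is invented."""

    def predict_one(self, entry):
        return "elephant" if entry["total_bytes"] >= 3000 else "cheetah"


def classify_ready_flows(flows, n_packets, selected_features, extract, model):
    """Classify every flow that has accumulated at least n_packets packets."""
    results = {}
    for five_tuple, packets in flows.items():
        if len(packets) < n_packets:
            continue  # preset condition not met yet
        stats = extract(packets[:n_packets])
        # Keep only the features chosen by the feature selection stage.
        entry = {name: stats[name] for name in selected_features}
        results[five_tuple] = model.predict_one(entry)
    return results
```

The returned mapping pairs each five-tuple with its predicted class, which corresponds to the output described in A04.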
In this embodiment, the decision model for online classification of network traffic is the decision model obtained by the online model acquisition method of the first or second embodiment.
In the third embodiment, the method based on triple feature selection and incremental learning described in steps A01 to A04 was tested on the ISCXVPN2016 dataset processed by the FNP-flowmeter. Without feature selection by the triple feature selection algorithm, the flow classification accuracy on the real-time traffic was 93.49%; with feature selection by the triple feature selection algorithm, the accuracy rose to 96.04%, and the prediction accuracy for cheetah flows and elephant flows in particular improved markedly.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Furthermore, it should be noted that in the description of the present specification, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to a specific feature, structure, material, or characteristic described in connection with the embodiment or example being included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art upon learning the basic inventive concepts. Therefore, the appended claims should be construed to include preferred embodiments and all such variations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, the present invention should also include such modifications and variations provided that they come within the scope of the following claims and their equivalents.

Claims (7)

1. An online model acquisition method based on triple feature selection and incremental learning, characterized by comprising the following steps:
S10, performing feature selection on pre-collected network traffic sample data using a triple feature selection scheme, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method;
wherein S10 comprises the following steps:
S11, processing the pre-collected network traffic sample data using the offline mode of the FNP-flowmeter to construct an offline feature data set, and labeling each flow with a flow type determined by the flow's byte count and duration;
S12, constructing a labeled feature data set from the labeled offline feature data according to a triple feature selection algorithm, and constructing the initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method;
wherein constructing the labeled feature data set according to the triple feature selection algorithm in S12 comprises:
S12-1, preprocessing the offline feature data labeled with flow types;
the preprocessing comprises: calculating the relevance between each feature and the class using symmetric uncertainty to obtain a subset of feature data relevant to the class;
removing redundant features from the subset of feature data relevant to the class to obtain a feature data subset without redundant features;
S12-2, reducing the initial d-dimensional feature space of the redundancy-free feature data subset to a k-dimensional feature subspace using a sequential feature selector to obtain a dimension-reduced feature data subset;
S12-3, using a feature occurrence frequency selector, screening all feature sequences of the dimension-reduced feature data subset that fall within a threshold m, counting the number of occurrences of each feature in the screened feature sequences, returning, in descending order, the feature sequence SF of features whose number of occurrences is greater than 1 together with the occurrence frequency FQ of each feature, labeling the features so selected, and constructing the labeled feature data set from the labeled features;
S20, acquiring a feature data set for online training from real-time traffic, processing the feature data in the feature data set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree, using the processed feature data set, to obtain a decision model for online classification of network traffic;
the feature data set for online training is an online feature data set extracted by analyzing real-time traffic within a preset timing period using the online mode of the FNP-flowmeter;
wherein S20 comprises:
S21: initializing a timer;
S22: analyzing the real-time traffic using the online mode of the FNP-flowmeter, and extracting online feature data;
S23: based on the online feature data, labeling each flow with a flow type determined by the flow's byte count and duration to obtain labeled online feature data;
S24: checking whether the timer has exceeded the time limit; if not, repeating steps S22 to S23; otherwise, executing S25;
S25: updating the initial decision model with the labeled online feature data through the online mode of the Hoeffding Anytime Tree to obtain a decision model for online classification of network traffic.
2. The online model acquisition method according to claim 1, further comprising, after S11 and before S12: balancing the offline feature data set.
3. The online model acquisition method according to claim 1, wherein determining the flow type from the flow's byte count and duration comprises:
if $S_l < S$ and $T_l < T$, the flow type is cheetah flow;
if $S_l < S$ and $T_l \geq T$, the flow type is tortoise flow;
if $S_l \geq S$ and $T_l < T$, the flow type is porcupine flow;
if $S_l \geq S$ and $T_l \geq T$, the flow type is elephant flow;
wherein $S_l$ denotes the byte count of the flow in the online feature data, $S$ the byte-count threshold, $T_l$ the duration of the flow in the online feature data, and $T$ the duration threshold.
4. An online flow classification method based on triple feature selection and incremental learning, characterized by comprising the following steps:
A01, acquiring a decision model for online classification of network traffic, trained on pre-collected network traffic and online network traffic;
A02, capturing traffic in real time, and grouping the captured traffic into flows by a preset five-tuple;
the five-tuple comprises: source IP address, destination IP address, source port number, destination port number, and transport layer protocol;
A03, performing feature preprocessing on the grouped packets to obtain preprocessed feature data;
A04, classifying the preprocessed feature data with the decision model to obtain the flow type of the traffic captured in real time, and outputting the five-tuple of each flow together with its class information;
the decision model for online classification of network traffic is a decision model obtained by the online model acquisition method according to any one of claims 1 to 3.
5. The online flow classification method according to claim 4, wherein A03 comprises:
A03-1, for each flow that satisfies a preset condition, extracting the features in the feature sequence selected by the triple feature selection algorithm to obtain a feature data entry;
the preset condition is that the number of packets in a flow of the real-time captured traffic reaches a preset value N;
A03-2, using the feature data entry as the input variable of the decision model to determine the flow type of the traffic captured in real time.
6. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1-5.
7. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the method of any of claims 1-5.
CN202210714868.2A 2022-06-22 2022-06-22 Online flow classification method based on triple feature selection and incremental learning Active CN115051955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210714868.2A CN115051955B (en) 2022-06-22 2022-06-22 Online flow classification method based on triple feature selection and incremental learning

Publications (2)

Publication Number Publication Date
CN115051955A CN115051955A (en) 2022-09-13
CN115051955B true CN115051955B (en) 2023-12-19

Family

ID=83163531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210714868.2A Active CN115051955B (en) 2022-06-22 2022-06-22 Online flow classification method based on triple feature selection and incremental learning

Country Status (1)

Country Link
CN (1) CN115051955B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102315974A (en) * 2011-10-17 2012-01-11 北京邮电大学 Stratification characteristic analysis-based method and apparatus thereof for on-line identification for TCP, UDP flows
CN107609147A (en) * 2017-09-20 2018-01-19 珠海金山网络游戏科技有限公司 A kind of method and system that feature is automatically extracted from log stream
CN109871872A (en) * 2019-01-17 2019-06-11 西安交通大学 A kind of flow real-time grading method based on shell vector mode SVM incremental learning model
CN111144459A (en) * 2019-12-16 2020-05-12 重庆邮电大学 Class-unbalanced network traffic classification method and device and computer equipment
CN112307762A (en) * 2020-12-24 2021-02-02 完美世界(北京)软件科技发展有限公司 Search result sorting method and device, storage medium and electronic device
CN113505826A (en) * 2021-07-08 2021-10-15 西安电子科技大学 Network flow abnormity detection method based on joint feature selection
CN113591950A (en) * 2021-07-19 2021-11-02 中国海洋大学 Random forest network traffic classification method, system and storage medium
CN114116669A (en) * 2021-11-25 2022-03-01 燕山大学 Hough tree-based multi-label stream data classification method
CN114510732A (en) * 2022-01-28 2022-05-17 上海大学 Encrypted traffic classification method based on incremental learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210209514A1 (en) * 2020-01-06 2021-07-08 Electronics And Telecommunications Research Institute Machine learning method for incremental learning and computing device for performing the machine learning method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
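Computer-Aided Diagnosis Based on Extreme Learning Machine: A Review; Zhiqiong Wang et al.; IEEE Access; full text *
Research on a multi-label data stream classification method based on kernel extreme learning machine; 张海翔; China Master's Theses Full-text Database; full text *
A multi-granularity processing mechanism for Internet-oriented SDN traffic; 卢向敏 et al.; Scientia Sinica Informationis; full text *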

Also Published As

Publication number Publication date
CN115051955A (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN109726744B (en) Network traffic classification method
CN109284606B (en) Data flow anomaly detection system based on empirical features and convolutional neural networks
CN107577688B (en) Original article influence analysis system based on media information acquisition
CN113612749B (en) Intrusion behavior-oriented tracing data clustering method and device
CN110689368B (en) Method for designing advertisement click rate prediction system in mobile application
CN103631787A (en) Webpage type recognition method and webpage type recognition device
CN112463859B (en) User data processing method and server based on big data and business analysis
Ju et al. Point-level temporal action localization: Bridging fully-supervised proposals to weakly-supervised losses
CN103995828B (en) A kind of cloud storage daily record data analysis method
CN115794803A (en) Engineering audit problem monitoring method and system based on big data AI technology
CN111831706A (en) Mining method and device for association rules among applications and storage medium
CN115795329A (en) Power utilization abnormal behavior analysis method and device based on big data grid
CN116150191A (en) Data operation acceleration method and system for cloud data architecture
CN115051955B (en) Online flow classification method based on triple feature selection and incremental learning
CN113705215A (en) Meta-learning-based large-scale multi-label text classification method
CN107133321B (en) Method and device for analyzing search characteristics of page
CN109740750B (en) Data collection method and device
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
CN113746707B (en) Encrypted traffic classification method based on classifier and network structure
CN109739840A (en) Data processing empty value method, apparatus and terminal device
CN115184674A (en) Insulation test method and device, electronic terminal and storage medium
CN110336817B (en) Unknown protocol frame positioning method based on TextRank
CN114328479A (en) Anomaly detection method oriented to financial stream data
CN113282686A (en) Method and device for determining association rule of unbalanced sample
CN113726558A (en) Network equipment flow prediction system based on random forest algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant