CN115051955A - Online flow classification method based on triple feature selection and incremental learning - Google Patents


Info

Publication number
CN115051955A
Authority
CN
China
Prior art keywords
feature
online
flow
stream
feature data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210714868.2A
Other languages
Chinese (zh)
Other versions
CN115051955B (en
Inventor
王兴伟
赵伟莨
王卓楠
贾杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN202210714868.2A
Publication of CN115051955A
Application granted
Publication of CN115051955B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/24: Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441: Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/147: Network analysis or design for predicting network behaviour

Abstract

The application relates to an online flow classification method based on triple feature selection and incremental learning, which comprises the following steps: based on pre-collected network traffic sample data, performing feature selection with a triple feature selection scheme, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method; obtaining a feature data set for online training from real-time traffic, processing the feature data in the feature data set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree based on the processed feature data set, to obtain a decision model for online classification of network traffic. With this method, long and short flows can be classified at fine granularity in real time into panther flows, tortoise flows, porcupine flows and elephant flows, and a general online flow classification framework is provided.

Description

Online flow classification method based on triple feature selection and incremental learning
Technical Field
The application belongs to the technical field of data processing, and particularly relates to an online flow classification method based on triple feature selection and incremental learning.
Background
Network flows exhibit a pronounced heavy-tailed distribution: a small number of long flows account for a large share of network traffic. Grasping the long-flow information therefore gives an overall understanding of all network flows passing through a link and facilitates the management, monitoring and analysis of network flows. It plays an important role in engineering applications such as traffic billing, security detection and traffic regulation; identifying long flows also effectively reduces the volume of data to be processed and stored, improving the processing efficiency and resource utilization of the system.
First, among conventional long and short flow classification schemes, sampling-based methods are simple to implement but have large errors. LRU-based schemes keep long flows in the LRU cache by exploiting their long duration, large length and frequent cache accesses; in actual measurement, however, when a large burst of short flows arrives, the short flows fill the LRU cache and the long-flow entries are evicted from it. Hash-table-based schemes may suffer hash collisions, and maintaining the hash table also incurs a large overhead.
Second, flow classification schemes based on artificial intelligence have certain advantages in identifying flows quickly and accurately. However, in the schemes proposed so far, namely the efficient sampling and flow classification scheme, the elephant and mouse flow classification algorithms based on naive Bayes and C4.5, and the long and short flow classification based on the gated recurrent unit, the decision model is obtained through offline training, which requires all training examples to be held in memory at the same time and therefore severely limits the number of training samples. As the data size grows, the existing model cannot be updated in real time, and a new decision model can only be obtained through retraining. If the network environment changes greatly and the generalization capability of the model is insufficient, the original model may be unable to identify and classify traffic effectively.
In view of this, the present application provides an online flow classification method based on triple feature selection and incremental learning, which can perform real-time online classification on flow based on a constructed decision model to meet the requirement of service quality.
Disclosure of Invention
(I) Technical problem to be solved
In view of the above-mentioned shortcomings and drawbacks of the prior art, the present application provides an online flow classification method based on triple feature selection and incremental learning.
(II) technical scheme
In order to achieve the purpose, the technical scheme is as follows:
In a first aspect, the present application provides an online model obtaining method based on triple feature selection and incremental learning, including:
S10, based on pre-collected network traffic sample data, performing feature selection with a triple feature selection scheme, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method;
S20, acquiring a feature data set for online training from real-time traffic, processing the feature data in the feature data set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree based on the processed feature data set, to obtain a decision model for online classification of network traffic;
wherein the network traffic sample data is an offline feature data set constructed by processing with the offline mode of the software tool FNP-flowmeter;
the feature data set for online training is an online feature data set extracted by analyzing real-time traffic within a preset timing period using the online mode of the FNP-flowmeter.
Optionally, the S10 includes:
S11, based on pre-collected network traffic sample data, constructing an offline feature data set by processing with the offline mode of the FNP-flowmeter, dividing the flow type according to the number of bytes of the flow and the duration of the flow, and labeling the flow type;
S12, based on the offline feature data labeled with flow types, constructing a labeled feature data set according to a triple feature selection algorithm, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method.
Optionally, the constructing a labeled feature data set according to a triple feature selection algorithm in S12 includes:
S12-1, preprocessing the offline feature data labeled with flow types;
the preprocessing comprises: calculating the correlation between each feature and the category using symmetric uncertainty to obtain a subset of feature data relevant to the category;
removing redundant features from the subset of feature data relevant to the category to obtain a feature data subset without redundant features;
S12-2, reducing the initial d-dimensional feature space of the redundancy-free feature data subset to a k-dimensional feature subspace using a sequential feature selector, to obtain a dimension-reduced feature data subset;
S12-3, using a feature occurrence frequency selector, screening all feature sequences of the dimension-reduced feature data subset that lie within a threshold m, counting the number of occurrences of each feature in all the selected feature sequences, returning the feature sequence SF, sorted in descending order, of features whose number of occurrences is greater than 1, together with the frequency FQ of each feature, labeling the features obtained by screening, and constructing a labeled feature data set based on the labeled features.
Optionally, after S11 and before S12, the method further includes: balancing the offline feature data set.
Optionally, the S20 includes:
S21: initializing a timer;
S22: analyzing the real-time traffic through the online mode of the FNP-flowmeter, and extracting online feature data;
S23: based on the online feature data, dividing the flow type according to the number of bytes of the flow and the duration of the flow, to obtain labeled online feature data;
S24: judging whether the timer has exceeded the time limit; if not, repeating steps S22 to S23; otherwise, executing S25;
S25: updating the initial decision model with the labeled online feature data through the online mode of the Hoeffding Anytime Tree, to obtain a decision model for online classification of network traffic.
Optionally, dividing the stream type according to the number of bytes of the stream and the duration of the stream includes:
if S_l < S and T_l < T, the flow type is a panther flow;
if S_l < S and T_l ≥ T, the flow type is a tortoise flow;
if S_l ≥ S and T_l < T, the flow type is a porcupine flow;
if S_l ≥ S and T_l ≥ T, the flow type is an elephant flow;
wherein S_l represents the number of bytes of a flow in the online feature data, S represents the flow byte-count threshold, T_l represents the duration of the flow in the online feature data, and T represents the flow duration threshold.
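The four-way division above can be illustrated with a minimal Python sketch. The enum, the function name and the threshold values used in the usage note are assumptions made for illustration; the application does not fix concrete values for S and T.

```python
from enum import Enum


class FlowType(Enum):
    PANTHER = "panther"      # few bytes, short duration
    TORTOISE = "tortoise"    # few bytes, long duration
    PORCUPINE = "porcupine"  # many bytes, short duration
    ELEPHANT = "elephant"    # many bytes, long duration


def classify_flow(size_bytes: int, duration_s: float,
                  size_threshold: int, duration_threshold: float) -> FlowType:
    """Map (S_l, T_l) to a flow type using the thresholds (S, T)."""
    is_large = size_bytes >= size_threshold      # S_l >= S
    is_long = duration_s >= duration_threshold   # T_l >= T
    if not is_large and not is_long:
        return FlowType.PANTHER
    if not is_large and is_long:
        return FlowType.TORTOISE
    if is_large and not is_long:
        return FlowType.PORCUPINE
    return FlowType.ELEPHANT
```

For example, with hypothetical thresholds S = 1000 bytes and T = 1.0 second, a flow of 5000 bytes lasting 0.2 seconds would be labeled a porcupine flow.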
In a second aspect, the present application provides an online flow classification method based on triple feature selection and incremental learning, including:
A01, acquiring a trained decision model for online classification of network traffic based on pre-collected network traffic and online network traffic;
A02, capturing traffic in real time, and grouping the captured traffic by a preset five-tuple;
the five-tuple comprises: a source IP address, a destination IP address, a source port number, a destination port number, and a transport layer protocol;
A03, performing feature preprocessing on the grouped data packets to obtain preprocessed feature data;
A04, classifying the preprocessed feature data with the decision model, obtaining the flow type of the traffic captured in real time, and outputting the five-tuple and the class information of each group;
wherein the decision model for online classification of network traffic is obtained by the online model obtaining method provided in the first aspect.
Optionally, the a03 includes:
A03-1, for each flow meeting a preset condition, extracting features according to the feature sequence selected by the triple feature selection algorithm to obtain feature data items;
the preset condition is: for a flow captured in real time, the number of packets in the flow reaches a preset value N;
A03-2, using the feature data items as input variables of the decision model, and dividing the flow types of the traffic captured in real time.
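One way to picture steps A02 to A04 is a per-flow packet buffer that triggers classification once a flow has accumulated N packets. The following Python sketch is illustrative only: the class name, the `extract` and `predict` callables and the packet representation are assumptions, not part of the application.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List, Optional


class OnlineFlowClassifier:
    """Buffers packets per five-tuple and classifies a flow after N packets."""

    def __init__(self, n_packets: int,
                 extract: Callable[[List[Any]], Dict[str, float]],
                 predict: Callable[[Dict[str, float]], str]) -> None:
        self.n_packets = n_packets   # preset value N
        self.extract = extract       # computes the selected features
        self.predict = predict       # decision model's prediction function
        self.buffers: Dict[Hashable, List[Any]] = defaultdict(list)
        self.labels: Dict[Hashable, str] = {}

    def on_packet(self, five_tuple: Hashable, packet: Any) -> Optional[str]:
        """Return the flow type when the flow reaches N packets, else None."""
        if five_tuple in self.labels:
            return None              # flow was already classified
        buf = self.buffers[five_tuple]
        buf.append(packet)
        if len(buf) < self.n_packets:
            return None              # preset condition not yet met
        label = self.predict(self.extract(buf))
        self.labels[five_tuple] = label
        del self.buffers[five_tuple]
        return label
```

Classifying on the first N packets, rather than waiting for the flow to end, is what allows the flow type to be reported while the flow is still active.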
(III) advantageous effects
The method classifies long and short flows at fine granularity on the basis of an online decision tree (the traditional approach distinguishes only elephant flows and mouse flows, whereas the fine-grained approach distinguishes panther flows, tortoise flows, porcupine flows and elephant flows). The triple feature selection algorithm deletes irrelevant and redundant features from the network traffic data, reduces the dimension of the d-dimensional feature sequence obtained from the network traffic data, counts how often each feature occurs in the different feature sequences that meet the prediction accuracy threshold, and returns the features that meet the preset condition, sorted in descending order. As a result, the training time on network traffic data can be greatly shortened, high-precision prediction can be achieved with as few features as possible, the curse of dimensionality can be effectively avoided, the generalization capability of the resulting decision model can be enhanced, and over-fitting can be prevented.
Drawings
The application is described with the aid of the following figures:
FIG. 1 is a schematic flow diagram of an online model acquisition method based on triple feature selection and incremental learning;
FIG. 2 is a schematic diagram of a process for constructing an initial decision model;
FIG. 3 is a schematic flow chart of processing a pcap file based on an FNP-flowmeter;
FIG. 4 is a schematic diagram of an FNP-flowmeter offline mode;
FIG. 5 is a schematic diagram of a triple feature selection algorithm;
FIG. 6 is a schematic diagram of a process for constructing an online classification decision model;
FIG. 7 is a schematic diagram of an FNP-flowmeter online mode;
FIG. 8 is a flow chart diagram of an online flow classification method based on triple feature selection and incremental learning.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings. It is to be understood that the following specific examples are illustrative of the invention only and are not to be construed as limiting the invention. In addition, it should be noted that, in the case of no conflict, the embodiments and features in the embodiments in the present application may be combined with each other; for convenience of description, only portions related to the invention are shown in the drawings.
The first embodiment provides an online model obtaining method based on triple feature selection and incremental learning, as shown in fig. 1; the method comprises steps S10 and S20:
S10, based on pre-collected network traffic sample data, performing feature selection with a triple feature selection scheme, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method.
S20, acquiring a feature data set for online training from real-time traffic, processing the feature data in the feature data set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree based on the processed feature data set, to obtain a decision model for online classification of network traffic.
In this embodiment, the network traffic sample data is an offline feature data set constructed by processing with the offline mode of the software tool FNP-flowmeter provided by the application.
In this embodiment, the feature data set for online training is an online feature data set extracted by analyzing real-time traffic within a preset timing period using the online mode of the FNP-flowmeter.
In the first embodiment of the application, long and short flows are classified at fine granularity based on the Hoeffding Anytime Tree algorithm, and features are selected by the triple feature selection algorithm, so that the training time on network traffic data is greatly shortened, high-precision prediction is achieved with as few features as possible, the curse of dimensionality is effectively avoided, the generalization capability of the resulting decision model is enhanced, and over-fitting is prevented.
The second embodiment provides an online model obtaining method based on triple feature selection and incremental learning, which comprises the following specific steps:
S10, based on pre-collected network traffic sample data, performing feature selection with a triple feature selection scheme, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method.
Step S10 is carried out through steps S11 to S12, as shown in fig. 2, so as to construct the initial decision model. The steps are described as follows:
S11, based on pre-collected network traffic sample data, constructing an offline feature data set by processing with the offline mode of the FNP-flowmeter, dividing the flow type according to the number of bytes of the flow and the duration of the flow, and labeling the flow type.
Regarding the flow feature extraction tool FNP-flowmeter used in step S11, it should be noted that:
The FNP-flowmeter, a flow feature extraction software tool provided by the present application, has an offline mode and an online mode. The offline mode of the FNP-flowmeter obtains flow feature data from a pcap file captured by a traffic capture tool such as Wireshark and constructs a feature data set; fig. 3 is a schematic flow diagram of processing a pcap file based on the FNP-flowmeter.
In this embodiment, as shown in fig. 3, empty dictionaries TCPflowdict and UDPflowdict are first constructed to store flow information. Data packets are read one by one from the pcap file, and traffic filtering retains only the packets whose transport layer protocol is TCP or UDP. Whether the packet just read is empty is then judged: if it is empty, processing ends; if not, whether the transport layer is TCP is further judged.
In this embodiment, as shown in fig. 3, if the transport layer is TCP, whether the sorted five-tuple is in TCPflowdict is further judged; if the transport layer is not TCP, whether the sorted five-tuple is in UDPflowdict is further judged.
In this embodiment, as shown in fig. 3, if the sorted five-tuple is in TCPflowdict, the corresponding entry is looked up, the packet information is added to it, and whether the TCP flow has completed the four-way handshake that closes the connection, or has the RST flag bit set to 1, is judged. If the sorted five-tuple is not in TCPflowdict, the sorted five-tuple is added to TCPflowdict together with the packet information, and the next data packet is then read from the pcap file, again retaining only TCP or UDP packets through traffic filtering.
In this embodiment, as shown in fig. 3, if the TCP flow has completed the four-way handshake or the RST flag bit is 1, packet-level and flow-level feature statistics are computed and a feature sequence is returned, after which the next data packet is read from the pcap file, retaining only TCP or UDP packets through traffic filtering. If the TCP flow has neither completed the four-way handshake nor has the RST flag bit set to 1, the next data packet is read from the pcap file directly, retaining only TCP or UDP packets through traffic filtering.
In this embodiment, as shown in fig. 3, if the sorted five-tuple is in UDPflowdict, the corresponding entry is looked up, the packet information is added to it, and whether the UDP flow has timed out is judged. If the sorted five-tuple is not in UDPflowdict, the sorted five-tuple is added to UDPflowdict together with the packet information, and the next data packet is then read from the pcap file, retaining only TCP or UDP packets through traffic filtering.
In this embodiment, as shown in fig. 3, if the UDP flow has timed out, packet-level and flow-level feature statistics are computed and a feature sequence is returned, after which the next data packet is read from the pcap file, retaining only TCP or UDP packets through traffic filtering. If the UDP flow has not timed out, the next data packet is read from the pcap file directly, retaining only TCP or UDP packets through traffic filtering.
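The packet-processing loop of fig. 3 can be sketched in Python as follows. This is a simplified illustration and not the FNP-flowmeter itself: the `Packet` record, the three features computed, the idle-gap interpretation of the UDP timeout, the 60-second default and the final flush of unfinished flows are all assumptions. TCP termination is approximated as "both sides have sent FIN and a further packet follows, or RST is set".

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class Packet:            # hypothetical minimal packet record
    ts: float            # capture timestamp in seconds
    src: str
    dst: str
    sport: int
    dport: int
    proto: str           # "TCP" or "UDP"; anything else is filtered out
    length: int
    fin: bool = False
    rst: bool = False


def flow_key(p: Packet) -> tuple:
    """Sorted five-tuple, so both directions map to the same flow."""
    a, b = sorted([(p.src, p.sport), (p.dst, p.dport)])
    return (a, b, p.proto)


def flow_features(pkts: List[Packet]) -> Dict[str, float]:
    return {"packets": len(pkts),
            "bytes": sum(p.length for p in pkts),
            "duration": pkts[-1].ts - pkts[0].ts}


def assemble_flows(packets: Iterable[Packet], udp_timeout: float = 60.0
                   ) -> Iterator[Tuple[tuple, Dict[str, float]]]:
    tcp_flows: Dict[tuple, dict] = {}          # role of TCPflowdict
    udp_flows: Dict[tuple, List[Packet]] = {}  # role of UDPflowdict
    for p in packets:
        if p.proto not in ("TCP", "UDP"):
            continue                           # traffic filtering
        key = flow_key(p)
        if p.proto == "TCP":
            flow = tcp_flows.setdefault(key, {"packets": [], "fin_from": set()})
            flow["packets"].append(p)
            if p.fin:
                flow["fin_from"].add((p.src, p.sport))
            # Closed on RST, or on the packet that follows FINs from both sides.
            if p.rst or (len(flow["fin_from"]) == 2 and not p.fin):
                yield key, flow_features(tcp_flows.pop(key)["packets"])
        else:
            pkts = udp_flows.get(key)
            if pkts and p.ts - pkts[-1].ts > udp_timeout:
                yield key, flow_features(udp_flows.pop(key))  # timed out
                pkts = None
            if pkts is None:
                pkts = udp_flows[key] = []
            pkts.append(p)
    # Assumption: flows still open at the end of the capture are also emitted.
    for key, flow in tcp_flows.items():
        yield key, flow_features(flow["packets"])
    for key, pkts in udp_flows.items():
        yield key, flow_features(pkts)
```

Sorting the two (address, port) endpoints before building the key is what lets packets of both directions accumulate in one dictionary entry.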
In this embodiment, an FNP-flowmeter offline mode is adopted to implement the construction of the offline feature data set, where a schematic diagram of the FNP-flowmeter offline operating mode is shown in fig. 4.
In this embodiment, the off-line feature data set needs to be balanced, and then the feature data set is divided into a training set and a testing set.
In this embodiment, the balancing method may be SMOTE, but is not limited to SMOTE.
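To make the balancing step concrete, the core idea of SMOTE (synthesizing minority-class samples by interpolating between a sample and one of its nearest minority-class neighbours) can be sketched with the standard library alone. The function name and parameters below are illustrative assumptions; in practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used instead.

```python
import random
from typing import List, Sequence


def smote_like(minority: Sequence[Sequence[float]], n_new: int,
               k: int = 5, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic samples from a minority class (needs >= 2 samples)."""
    rng = random.Random(seed)
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x by squared Euclidean distance.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic
```

The synthetic samples would be appended to the minority flow type until the class counts are comparable, after which the data set is split into training and test sets.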
S12, based on the offline feature data labeled with flow types, constructing a labeled feature data set according to a triple feature selection algorithm, and constructing an initial decision model using the offline training mode of the Hoeffding Anytime Tree incremental learning method.
Regarding the triple feature selection algorithm used in step S12, it should be noted that:
In this embodiment, the triple feature selection algorithm includes a Fast Correlation-Based Filter (FCBF), a Sequential Feature Selector (SFS), and a Feature Occurrence Frequency Selector (FOFBS).
In this embodiment, irrelevant features and redundant features are removed by the fast correlation filtering algorithm FCBF.
In this embodiment, the initial d-dimensional feature space is reduced to a k-dimensional feature subspace by the sequential feature selector SFS, where k is less than d; specifically, the sequential feature selector is built on greedy search algorithms such as sequential forward selection, sequential backward selection, sequential floating forward selection and sequential floating backward selection.
In this embodiment, the feature occurrence frequency selector FOFBS counts how often each feature occurs across the feature sequences and sorts the features in descending order of that count.
The application process of the triple feature selection algorithm is detailed below, as shown in fig. 5:
In this embodiment, DS is the feature data set obtained by extracting and labeling features of the pre-collected network traffic samples. Before any screening, the feature data set contains relevant features X_relevant, irrelevant features X_irrelevant and redundant features X_redundancy, and the total number of features is N_initial. The feature data set DS is expressed as follows:
DS = {X_relevant, X_irrelevant, X_redundancy}
First, regarding feature extraction by the fast correlation filtering algorithm FCBF, it should be noted that:
Feature selection by the fast correlation filtering algorithm FCBF removes the irrelevant features X_irrelevant and the redundant features X_redundancy, yielding a redundancy-free feature data subset F_relevant with N_FCBF features, which satisfies N_FCBF ≤ N_initial. The redundancy-free feature data subset is expressed as follows:
F_relevant = {X_relevant}
Second, regarding feature extraction by the sequential feature selector SFS, it should be noted that:
Features are selected by executing the sequential feature selector SFS n times; ACCFeatureList stores the feature sequence obtained from each execution of SFS together with the corresponding test accuracy. Specifically, acc_n is the test accuracy obtained from the n-th execution of SFS, and Fk_n is the feature sequence consisting of k features selected from F_relevant in that execution. ACCFeatureList is expressed as follows:
ACCFeatureList = [(acc_1, Fk_1), (acc_2, Fk_2), ..., (acc_n, Fk_n)]
third, for extracting the features by the feature occurrence frequency selector FOFBS, it should be noted that:
the feature occurrence frequency selector FOFBS takes the obtained highest test accuracy as a reference, screens all feature sequences within the range of a threshold value m, counts the occurrence frequency of each feature in all the selected feature sequences, and returns the feature sequences SF which are sorted in descending order and have the occurrence frequency greater than 1 and the frequency FQ of each feature.
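The selection rule just described can be sketched in Python. The sketch assumes that "within the threshold m" means a test accuracy no more than m below the highest accuracy, that ACCFeatureList is a list of (accuracy, feature sequence) pairs, and that FQ is the raw occurrence count; none of these details is fixed by the text, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def fofbs(acc_feature_list: Sequence[Tuple[float, Sequence[str]]],
          m: float) -> Tuple[List[str], Dict[str, int]]:
    """Return (SF, FQ): features occurring more than once, most frequent first."""
    best = max(acc for acc, _ in acc_feature_list)
    # Keep only the feature sequences whose accuracy is within m of the best.
    kept = [feats for acc, feats in acc_feature_list if best - acc <= m]
    counts = Counter(f for feats in kept for f in set(feats))
    ranked = sorted(((f, c) for f, c in counts.items() if c > 1),
                    key=lambda fc: (-fc[1], fc[0]))  # ties broken by name
    sf = [f for f, _ in ranked]
    fq = {f: c for f, c in ranked}
    return sf, fq
```

Features that appear in only one of the retained sequences are dropped, which favours features that SFS selects consistently across runs.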
Based on the foregoing step S12, for constructing the labeled feature data set according to the triple feature selection algorithm, it should be noted that:
S12-1, preprocessing the offline feature data labeled with flow types.
In this embodiment, the preprocessing includes: calculating the correlation between each feature data and the category by adopting the symmetry uncertainty to obtain a correlated feature data subset; and carrying out redundant feature removal processing on the feature data subsets related to the categories to obtain the feature data subsets without redundant features.
Based on the foregoing step S12-1, for the preprocessing process of the offline feature data, it should be noted that:
first, for the removal of irrelevant features, it should be noted that:
In this embodiment, the correlation value SU between each feature and the category is calculated using Symmetric Uncertainty (SU), a threshold for the correlation value is set, the features whose correlation value exceeds the threshold are sorted in descending order of SU, and the resulting sorted set of features is taken as the relevant feature data subset.
Second, for removing the redundant features, it should be noted that:
In this embodiment, starting from the first feature F_1 in the relevant feature data subset, if for a feature F_i after F_1 the correlation value SU_i between F_i and the category is less than the correlation value between F_i and F_1, the feature F_i is removed from the relevant feature data subset. After the judgment with the first feature F_1 as the reference is finished, the judgment continues with the second feature F_2 in the relevant feature data subset as the reference, and so on, until no further feature can be removed from the relevant feature data subset or all features have been judged.
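The two preprocessing stages can be illustrated with a short Python sketch. It uses the standard definition of symmetric uncertainty, SU(X, Y) = 2 [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)], and assumes discrete (or already discretized) feature values; the function names and the dictionary-of-columns input format are illustrative assumptions.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence


def entropy(xs: Sequence[Hashable]) -> float:
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())


def symmetric_uncertainty(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))  # H(X) + H(Y) - H(X, Y)
    return 2 * mutual_info / (hx + hy)


def fcbf(features: Dict[str, Sequence[Hashable]],
         labels: Sequence[Hashable], threshold: float) -> List[str]:
    """Drop irrelevant features, then redundant ones, as in steps one and two."""
    su_class = {n: symmetric_uncertainty(col, labels)
                for n, col in features.items()}
    # Stage 1: keep features above the threshold, sorted by SU descending.
    ranked = sorted((n for n in features if su_class[n] > threshold),
                    key=lambda n: -su_class[n])
    selected: List[str] = []
    # Stage 2: with each surviving feature as reference, remove later features
    # whose SU with the class is less than their SU with the reference.
    while ranked:
        ref = ranked.pop(0)
        selected.append(ref)
        ranked = [n for n in ranked
                  if not su_class[n] <
                  symmetric_uncertainty(features[n], features[ref])]
    return selected
```

Here a feature that merely duplicates a stronger feature is removed in stage 2, while a feature that is statistically independent of the class is removed in stage 1.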
S12-2, reducing the initial d-dimensional feature space in the feature data subset without redundancy into a k-dimensional feature subspace by adopting a sequential feature selector to obtain a feature data subset after dimension reduction.
Based on the foregoing step S12-2, for obtaining the feature data subset after the dimension reduction, it should be noted that:
The initial d-dimensional feature space is reduced to a k-dimensional feature subspace by executing the sequential feature selector a preset number n of times, and the feature sequence obtained from each execution, together with the corresponding accuracy, is added to ACCFeatureList.
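A single execution of a sequential feature selector can be sketched as greedy forward selection. The sketch below is illustrative: the `score` callable stands in for training and testing a classifier on a candidate feature subset, and how the n executions differ from one another (for example, different data splits) is not specified in the text and is left to the caller.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_forward_selection(
        candidates: Sequence[str], k: int,
        score: Callable[[List[str]], float]) -> Tuple[float, List[str]]:
    """Greedily grow a feature subset to size k; returns (accuracy, features)."""
    selected: List[str] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining and len(selected) < k:
        # Try adding each remaining feature and keep the best-scoring one.
        scored = [(score(selected + [f]), f) for f in remaining]
        best_score, best_feat = max(scored, key=lambda sf: sf[0])
        selected.append(best_feat)
        remaining.remove(best_feat)
    return best_score, selected
```

Each returned pair corresponds to one entry of ACCFeatureList, so collecting the results of n executions yields the list that the feature occurrence frequency selector consumes.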
S12-3, using the feature occurrence frequency selector, screening all feature sequences of the dimension-reduced feature data subset that lie within a threshold m, counting the number of occurrences of each feature in all the selected feature sequences, returning the feature sequence SF, sorted in descending order, of features whose number of occurrences is greater than 1, together with the frequency FQ of each feature, labeling the features obtained by screening, and constructing a labeled feature data set based on the labeled features.
Regarding the feature screening in step S12-3, it should be noted that:
Based on the accuracies corresponding to the feature sequences obtained in step S12-2, the highest accuracy is obtained, a threshold m is set relative to the highest accuracy, all feature sequences within the threshold m are screened, and the number of occurrences of each feature in all the screened feature sequences is counted.
S20, acquiring a feature data set for online training from real-time traffic, processing the feature data in the feature data set, and updating the initial decision model through the online mode of the Hoeffding Anytime Tree based on the processed feature data set, to obtain a decision model for online classification of network traffic.
Based on the foregoing step S20, as shown in fig. 6, the construction of the online classification decision model is realized based on steps S21 to S25, specifically describing the steps as follows:
s21: a timer is initialized.
S22: and analyzing the real-time flow through an FNP-flowmeter online mode, and extracting online characteristic data.
In this embodiment, the implementation process of the FNP-flowmeter online mode is shown in fig. 7.
S23: and obtaining marked online characteristic data according to the byte number of the stream and the type of the continuous long-time division stream of the stream based on the online characteristic data.
Based on the foregoing step S23, the type division process of the stream includes:
If S_l < S and T_l < T, the flow type is a panther flow.
If S_l < S and T_l ≥ T, the flow type is a turtle flow.
If S_l ≥ S and T_l < T, the flow type is a porcupine flow.
If S_l ≥ S and T_l ≥ T, the flow type is an elephant flow.
In this embodiment, S_l denotes the number of bytes of the flow in the online feature data, S denotes the flow byte-count threshold, T_l denotes the duration of the flow in the online feature data, and T denotes the flow duration threshold.
S24: judging whether the timer has exceeded the time limit; if not, repeating steps S22 to S23; otherwise, executing S25.
S25: updating the initial decision model with the marked online feature data through the online mode of the Hoeffding Anytime Tree, to obtain a decision model for online classification of network traffic.
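Steps S21 to S25 can be sketched as a timed collect-then-update loop. The sketch below rests on several assumptions that are not in the source: flows arrive as dictionaries with hypothetical keys `"bytes"` and `"duration"`, the thresholds S and T are passed in by the caller (the source gives no values), the incremental learner exposes a hypothetical `learn_one(features, label)` method, and `MajorityLearner` is only a stand-in for a Hoeffding Anytime Tree so that the example runs without third-party libraries.

```python
import time
from collections import Counter


def label_flow(size_bytes, duration_s, size_threshold, duration_threshold):
    """S23: map (S_l, T_l) to one of the four flow types using S and T."""
    small = size_bytes < size_threshold
    short = duration_s < duration_threshold
    if small and short:
        return "panther"
    if small:
        return "turtle"
    if short:
        return "porcupine"
    return "elephant"


class MajorityLearner:
    """Stand-in incremental learner; NOT a Hoeffding Anytime Tree."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, features, label):
        self.counts[label] += 1


def online_update(model, flows, size_threshold, duration_threshold,
                  time_limit_s, clock=time.monotonic):
    """Collect marked flows until the timer expires, then update the model."""
    start = clock()                                    # S21: initialize timer
    marked = []
    for flow in flows:                                 # S22: next online record
        label = label_flow(flow["bytes"], flow["duration"],
                           size_threshold, duration_threshold)  # S23
        marked.append((flow, label))
        if clock() - start > time_limit_s:             # S24: time limit reached
            break
    for features, label in marked:                     # S25: incremental update
        model.learn_one(features, label)
    return marked
```

The `clock` parameter exists only so the loop can be driven by a fake clock in tests; with the default `time.monotonic` the collection window is measured in real seconds.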
The second embodiment of the application performs fine-grained classification of long and short flows based on the Hoeffding Anytime Tree algorithm, with features selected by the triple feature selection algorithm. Specifically, irrelevant features and redundant features in the network traffic data are deleted; the d-dimensional feature sequences obtained from the network traffic data are reduced in dimension; the occurrence counts and frequencies of the individual features across the different feature sequences that meet the prediction accuracy threshold are counted; and the feature sequences that meet the preset condition are returned in descending order. As a result, the training time on network traffic data is greatly shortened, high-accuracy prediction is achieved with as few features as possible, the curse of dimensionality is effectively avoided, the generalization ability of the resulting decision model is enhanced, and overfitting is prevented.
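The first stage of the triple feature selection removes irrelevant features, and the claims state that the correlation between each feature and the class is measured with symmetric uncertainty. A minimal sketch of that measure is given below, using the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). It assumes discrete (or already discretized) feature values; the function names are illustrative, and the source does not say how continuous features are discretized or which cutoff separates relevant from irrelevant features.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant; define SU as 0
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2.0 * mutual_information / (hx + hy)
```

A value near 1 means the feature almost determines the class, and a value near 0 means the feature carries little information about it. The same measure can be applied between pairs of features when looking for redundancy.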
The third embodiment provides an online flow classification method based on triple feature selection and incremental learning; as shown in Fig. 8, the specific steps are as follows:
A01: acquiring a trained decision model for online classification of network traffic, based on the pre-collected network traffic and the online network traffic.
A02: capturing traffic in real time, and classifying the captured traffic according to a preset quintuple.
In this embodiment, the quintuple comprises: source IP address, destination IP address, source port number, destination port number, and transport layer protocol.
A03: performing feature preprocessing on the classified data packets to obtain preprocessed feature data.
Regarding A03, it should be noted that features are extracted, according to the feature sequence selected by the triple feature selection algorithm, from the flows that meet the preset condition, to obtain feature data entries; the feature data entries are used as input variables of the decision model to divide the real-time captured traffic into flow types.
In this embodiment, the preset condition is: for the traffic captured in real time, the number of packets in the flow reaches a preset value N.
A04: classifying the preprocessed feature data with the decision model, obtaining the flow type of the real-time captured traffic, and outputting the classified quintuples and the class information of each group.
In this embodiment, the decision model for online classification of network traffic is a decision model obtained by the online model acquisition method of either the first or the second embodiment.
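Steps A02 to A04 can be sketched as follows: packets are grouped by quintuple, and a flow is classified once it has accumulated N packets. This is an illustration under stated assumptions: packets arrive as `(quintuple, length, timestamp)` tuples, and `extract_features` and `predict` are hypothetical callables standing in for the feature extraction over the selected feature sequence and for the trained decision model, respectively.

```python
from collections import defaultdict
from typing import NamedTuple


class Quintuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def classify_captured(packets, n_packets, extract_features, predict):
    """Return {quintuple: flow type} for every flow that reaches n_packets."""
    buffers = defaultdict(list)
    results = {}
    for key, length, timestamp in packets:             # A02: group by quintuple
        if key in results:
            continue                                   # flow already classified
        buffers[key].append((length, timestamp))
        if len(buffers[key]) == n_packets:             # preset condition: N packets
            features = extract_features(buffers.pop(key))  # A03
            results[key] = predict(features)           # A04
    return results
```

Flows that never reach N packets stay unclassified in this sketch; how such flows are handled in practice is not specified in the source.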
In the third embodiment, the method based on triple feature selection and incremental learning described in steps A01 to A04 was tested on the ISCXVPN2016 dataset processed by the FNP-flowmeter. Without feature selection by the triple feature selection algorithm, the flow classification accuracy on real-time traffic is 93.49%; with feature selection by the triple feature selection algorithm, the accuracy rises to 96.04%, and the prediction accuracy for panther flows and elephant flows in the real-time traffic in particular improves markedly.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Furthermore, it should be noted that in the description of the present specification, the description of the term "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims (10)

1. An online model acquisition method based on triple feature selection and incremental learning, characterized by comprising the following steps:
S10: based on pre-collected network traffic sample data, performing feature selection with a triple feature selection scheme, and constructing an initial decision model through the offline training mode of the Hoeffding Anytime Tree incremental learning method;
S20: acquiring a feature data set for online training based on real-time traffic, processing the feature data in the feature data set, and updating the initial decision model with the processed feature data set through the online mode of the Hoeffding Anytime Tree, to obtain a decision model for online classification of network traffic;
wherein the network traffic sample data is an offline feature data set constructed by processing with the offline mode of the software tool FNP-flowmeter;
and the feature data set for online training is an online feature data set extracted by analyzing real-time traffic within a preset timing period using the FNP-flowmeter online mode.
2. The online model acquisition method according to claim 1, wherein S10 comprises:
S11: based on the pre-collected network traffic sample data, constructing an offline feature data set by processing with the FNP-flowmeter offline mode, dividing the flows into types according to the number of bytes and the duration of each flow, and marking the flow types;
S12: based on the offline feature data marked with flow types, constructing a marked feature data set according to the triple feature selection algorithm, and constructing the initial decision model through the offline training mode of the Hoeffding Anytime Tree incremental learning method.
3. The online model acquisition method of claim 1, wherein constructing the marked feature data set according to the triple feature selection algorithm in S12 comprises:
S12-1: preprocessing the offline feature data marked with flow types;
the preprocessing comprising: calculating the correlation between each feature and the class using symmetric uncertainty, to obtain a class-related feature data subset;
and performing redundant feature removal on the class-related feature data subset, to obtain a feature data subset without redundant features;
S12-2: reducing the initial d-dimensional feature space of the feature data subset without redundant features to a k-dimensional feature subspace using a sequential feature selector, to obtain a dimension-reduced feature data subset;
S12-3: using a feature occurrence frequency selector, screening all feature sequences of the dimension-reduced feature data subset that fall within a threshold m; counting how often each feature occurs across the selected feature sequences; returning, in descending order of frequency, the feature sequence SF of the features that occur more than once together with the frequency FQ of each feature; marking the features obtained by screening; and constructing a marked feature data set based on the marked features.
4. The online model acquisition method of claim 1, wherein after S11 and before S12 the method further comprises: balancing the offline feature data set.
5. The online model acquisition method according to claim 1, wherein S20 comprises:
S21: initializing a timer;
S22: analyzing the real-time traffic through the FNP-flowmeter online mode, and extracting online feature data;
S23: based on the online feature data, dividing the flows into types according to the number of bytes and the duration of each flow, to obtain marked online feature data;
S24: judging whether the timer has exceeded the time limit; if not, repeating steps S22 to S23; otherwise, executing S25;
S25: updating the initial decision model with the marked online feature data through the online mode of the Hoeffding Anytime Tree, to obtain a decision model for online classification of network traffic.
6. The online model acquisition method according to claim 2 or 5, wherein dividing the flows into types according to the number of bytes and the duration of each flow comprises:
if S_l < S and T_l < T, the flow type is a panther flow;
if S_l < S and T_l ≥ T, the flow type is a turtle flow;
if S_l ≥ S and T_l < T, the flow type is a porcupine flow;
if S_l ≥ S and T_l ≥ T, the flow type is an elephant flow;
wherein S_l denotes the number of bytes of the flow in the online feature data, S denotes the flow byte-count threshold, T_l denotes the duration of the flow in the online feature data, and T denotes the flow duration threshold.
7. An online flow classification method based on triple feature selection and incremental learning, characterized by comprising the following steps:
A01: acquiring a trained decision model for online classification of network traffic, based on pre-collected network traffic and online network traffic;
A02: capturing traffic in real time, and classifying the captured traffic according to a preset quintuple;
the quintuple comprising: a source IP address, a destination IP address, a source port number, a destination port number, and a transport layer protocol;
A03: performing feature preprocessing on the classified data packets to obtain preprocessed feature data;
A04: classifying the preprocessed feature data with the decision model, obtaining the flow type of the real-time captured traffic, and outputting the classified quintuples and the class information of each group;
wherein the decision model for online classification of network traffic is obtained by the online model acquisition method of any one of claims 1 to 6.
8. The online flow classification method according to claim 7, characterized in that A03 comprises:
A03-1: extracting features, according to the feature sequence selected by the triple feature selection algorithm, from the flows that meet a preset condition, to obtain feature data entries;
the preset condition being: for the traffic captured in real time, the number of packets in the flow reaches a preset value N;
A03-2: using the feature data entries as input variables of the decision model, and dividing the real-time captured traffic into flow types.
9. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 8.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out a method according to any one of claims 1 to 8.
CN202210714868.2A 2022-06-22 2022-06-22 Online flow classification method based on triple feature selection and incremental learning Active CN115051955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210714868.2A CN115051955B (en) 2022-06-22 2022-06-22 Online flow classification method based on triple feature selection and incremental learning

Publications (2)

Publication Number Publication Date
CN115051955A true CN115051955A (en) 2022-09-13
CN115051955B CN115051955B (en) 2023-12-19

Family

ID=83163531

Country Status (1)

Country Link
CN (1) CN115051955B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102315974A (en) * 2011-10-17 2012-01-11 北京邮电大学 Stratification characteristic analysis-based method and apparatus thereof for on-line identification for TCP, UDP flows
CN107609147A (en) * 2017-09-20 2018-01-19 珠海金山网络游戏科技有限公司 A kind of method and system that feature is automatically extracted from log stream
CN109871872A (en) * 2019-01-17 2019-06-11 西安交通大学 A kind of flow real-time grading method based on shell vector mode SVM incremental learning model
CN111144459A (en) * 2019-12-16 2020-05-12 重庆邮电大学 Class-unbalanced network traffic classification method and device and computer equipment
CN112307762A (en) * 2020-12-24 2021-02-02 完美世界(北京)软件科技发展有限公司 Search result sorting method and device, storage medium and electronic device
US20210209514A1 (en) * 2020-01-06 2021-07-08 Electronics And Telecommunications Research Institute Machine learning method for incremental learning and computing device for performing the machine learning method
CN113505826A (en) * 2021-07-08 2021-10-15 西安电子科技大学 Network flow abnormity detection method based on joint feature selection
CN113591950A (en) * 2021-07-19 2021-11-02 中国海洋大学 Random forest network traffic classification method, system and storage medium
CN114116669A (en) * 2021-11-25 2022-03-01 燕山大学 Hough tree-based multi-label stream data classification method
CN114510732A (en) * 2022-01-28 2022-05-17 上海大学 Encrypted traffic classification method based on incremental learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
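Zhiqiong Wang et al.: "Computer-Aided Diagnosis Based on Extreme Learning Machine: A Review", IEEE Access *
Lu Xiangmin et al.: "A multi-granularity SDN traffic processing mechanism for the Internet" [面向互联网的SDN流量多粒度处理机制], Scientia Sinica Informationis *
Zhang Haixiang: "Research on multi-label data stream classification methods based on kernel extreme learning machine" [基于核极限学习机的多标签数据流分类方法研究], China Master's Theses Full-text Database *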

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant