CN109309630B - Network traffic classification method and system and electronic equipment - Google Patents


Info

Publication number
CN109309630B
Authority
CN
China
Prior art keywords
network
data
address
network traffic
flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811113686.XA
Other languages
Chinese (zh)
Other versions
CN109309630A (en
Inventor
叶可江
赵世林
须成忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201811113686.XA priority Critical patent/CN109309630B/en
Priority to PCT/CN2018/112401 priority patent/WO2020062390A1/en
Publication of CN109309630A publication Critical patent/CN109309630A/en
Application granted granted Critical
Publication of CN109309630B publication Critical patent/CN109309630B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]

Abstract

The application relates to a network traffic classification method, a network traffic classification system and electronic equipment. The method comprises the following steps: step a: collecting network traffic data and labeling the network traffic data; step b: extracting a bidirectional flow feature set from the labeled network traffic data; step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network traffic data through the classification model. Because the network traffic is classified using the bidirectional flow features in the network traffic data, a large number of new applications on the internet can be identified and classified accurately, which improves classification accuracy and helps guarantee high-precision, high-performance network traffic classification.

Description

Network traffic classification method and system and electronic equipment
Technical Field
The present application relates to the field of network traffic classification technologies, and in particular, to a method, a system, and an electronic device for classifying network traffic.
Background
With the rapid spread of the internet and the emergence of a large number of new applications, modern network environments have become increasingly complex and diverse. Traffic classification and network application identification play an important role in network management services and security systems, such as quality of service, intrusion detection systems, and traffic management systems. If the traffic in a network system can be accurately classified and the applications that generate it identified, network security and the efficiency of network management services improve greatly, and system time and memory overhead can be reduced.
At present, the existing network traffic classification method mainly includes:
First, network traffic classification based on representation learning: the acquired network traffic data is preprocessed, features are extracted from the preprocessed data with a representation learning algorithm, network flow vectors are generated from the network traffic data, and the network traffic data is classified according to those vectors, so that network traffic can be classified efficiently.
Second, network traffic classification based on semi-supervised learning: network flows of labeled and unlabeled types are acquired, and a preset fixed number of flow features is extracted from each network flow to obtain a network flow feature vector; according to the labeled types of the network flows, the information gain of each of the flow features is calculated, and each flow feature is weighted by its information gain; the labeled and unlabeled network flows are mixed, and the mixed network flows are clustered with the k-means algorithm to obtain k clusters; the number of labeled network flow feature vectors in each of the k clusters is obtained, and the proportion of each type in each cluster is determined, where the proportion equals the ratio of the number of labeled network flow feature vectors of that type to the total number of labeled network flow feature vectors in the cluster; when the total number of labeled network flow feature vectors in a cluster is smaller than a preset network flow threshold, the cluster is judged to be an unknown protocol cluster, and otherwise it is assigned the type with the largest proportion among its labeled network flow feature vectors; the preceding two steps are repeated until the traffic type of each of the k clusters has been determined; and the flow clusters whose traffic type has been determined are used as training data to train an online traffic classifier. By exploiting semi-supervised learning, the method achieves better accuracy and stability than a traditional supervised learning algorithm that trains the model on labeled data only.
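As an illustration of the cluster-typing rule in the second prior-art method, the following minimal Python sketch assumes that k-means has already assigned every flow to a cluster. The function name label_clusters, the parameter min_labeled (standing in for the preset network flow threshold), and the sample data are hypothetical and are not taken from the cited method.

```python
from collections import Counter

def label_clusters(cluster_ids, labels, min_labeled):
    """Assign a traffic type to each cluster from its labeled members.

    cluster_ids[i] is the cluster of flow i; labels[i] is its type, or None
    when the flow is unlabeled. min_labeled is a hypothetical threshold.
    """
    per_cluster = {}
    for cid, lab in zip(cluster_ids, labels):
        counts = per_cluster.setdefault(cid, Counter())
        if lab is not None:
            counts[lab] += 1

    result = {}
    for cid, counts in per_cluster.items():
        total = sum(counts.values())  # labeled vectors in this cluster
        if total < min_labeled:
            # Too few labeled vectors: treat as an unknown protocol cluster.
            result[cid] = ("unknown", {})
        else:
            # Proportion of each type among the labeled vectors of the cluster.
            proportions = {lab: n / total for lab, n in counts.items()}
            best = max(proportions, key=proportions.get)
            result[cid] = (best, proportions)
    return result

# Illustrative data only: nine flows in three clusters, some unlabeled.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2]
labels = ["web", "web", "video", None, "video", "video", None, None, "web"]
cluster_types = label_clusters(cluster_ids, labels, min_labeled=2)
```

In this sample, cluster 2 contains only one labeled vector, which is below the assumed threshold of 2, so it is treated as an unknown protocol cluster.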
Third, adaptive semi-supervised network traffic classification: network flows of labeled and unlabeled types are acquired, and a preset fixed number of flow features is extracted from each network flow to obtain a network flow feature vector; the centroid of the set of network flow feature vectors of each type is calculated from the labeled network flow feature vectors to obtain a vector set M; with the vector set M as the initial center points of k-means clustering, adaptive semi-supervised k-means clustering is performed on the mixed set X of labeled and unlabeled network flow feature vectors, and the k-means clusters are output; the network flows in each output cluster are mapped to a traffic type according to the maximum posterior probability of the labeled network flow feature vectors in that cluster, giving flow clusters of known type; and the flow clusters of known type are used as training data to train an online traffic classifier.
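The seeding idea of the third prior-art method can be sketched as follows. This is a minimal illustration that covers only the computation of the per-type centroids (the vector set M) and a single nearest-centroid assignment; the full adaptive k-means iteration and the posterior-probability mapping are not reproduced. The function names and sample vectors are assumptions.

```python
import math

def class_centroids(vectors, labels):
    """Centroid of the labeled feature vectors of each type (vector set M)."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        if lab is None:
            continue  # unlabeled flows do not contribute to the seeds
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def assign_to_nearest(vectors, centroids):
    """One k-means assignment step using Euclidean distance to each seed."""
    return [
        min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))
        for vec in vectors
    ]

# Illustrative data only: four labeled and two unlabeled feature vectors.
vectors = [[1, 1], [3, 1], [10, 10], [12, 10], [2, 2], [11, 9]]
labels = ["web", "web", "video", "video", None, None]
centroids = class_centroids(vectors, labels)
assignments = assign_to_nearest(vectors, centroids)
```

Seeding k-means with labeled centroids fixes k to the number of known types, which is why such a method needs at least one labeled flow per type.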
In summary, the existing network traffic classification methods mainly focus on network traffic classification at the algorithm level, and all kinds of optimization and improvement algorithms are proposed for the classification algorithm part in the training phase, but the problem of how to extract a large number of relevant effective feature sets from network data packets is not solved, and a large number of new applications in the internet cannot be accurately identified and classified.
Disclosure of Invention
The application provides a network traffic classification method, a network traffic classification system and electronic equipment, and aims to solve at least one of the technical problems in the prior art to a certain extent.
In order to solve the above problems, the present application provides the following technical solutions:
a network traffic classification method comprises the following steps:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: and constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model.
The technical scheme adopted by the embodiment of the application further comprises the following steps: in the step a, the acquiring network traffic data and the labeling the network traffic data specifically include:
step a 1: selecting an application category in the network traffic;
step a 2: collecting a network flow data packet corresponding to each application and a system network log of a corresponding time period;
step a 3: analyzing the network flow data packet, and finding out the natural attribute of each application and the IP address and the transmission protocol communicated with other applications;
step a 4: and extracting the IP end points and the transmission packet number associated with each application in the system network log, and performing association fusion by combining an IP address and a transmission protocol to finish the labeling processing of the network flow data.
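Steps a 3 and a 4 can be pictured with the following minimal Python sketch, which joins parsed packet records to system network log entries on the pair (IP address, transmission protocol). The record layout (the keys src, dst, proto, app, ip, packets) and the sample addresses are hypothetical; the application does not specify a data format.

```python
from collections import Counter

def label_packets(packets, log_entries):
    """Attach an application label to each packet via (IP, protocol) matching."""
    # Index the log by (endpoint IP, protocol) for constant-time lookup.
    endpoint_to_app = {(e["ip"], e["proto"]): e["app"] for e in log_entries}
    labeled = []
    for pkt in packets:
        # A packet belongs to an application if either endpoint matches.
        app = (endpoint_to_app.get((pkt["src"], pkt["proto"]))
               or endpoint_to_app.get((pkt["dst"], pkt["proto"])))
        labeled.append({**pkt, "label": app})
    return labeled

# Illustrative log: endpoint and packet count associated with each application.
log_entries = [
    {"app": "video", "ip": "203.0.113.5", "proto": "UDP", "packets": 2},
    {"app": "web", "ip": "198.51.100.7", "proto": "TCP", "packets": 1},
]
# Illustrative packets, as they might look after parsing a capture file.
packets = [
    {"src": "10.0.0.2", "dst": "203.0.113.5", "proto": "UDP"},
    {"src": "203.0.113.5", "dst": "10.0.0.2", "proto": "UDP"},
    {"src": "10.0.0.2", "dst": "198.51.100.7", "proto": "TCP"},
    {"src": "10.0.0.2", "dst": "192.0.2.9", "proto": "TCP"},
]
labeled = label_packets(packets, log_entries)

# Cross-check the labeled packet counts against the counts in the log.
packet_counts = Counter(p["label"] for p in labeled if p["label"])
consistent = all(packet_counts[e["app"]] == e["packets"] for e in log_entries)
```

Packets that match no logged endpoint keep the label None, so they can be excluded from training rather than mislabeled.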
The technical scheme adopted by the embodiment of the application further comprises the following steps: in the step b, the extracting a bidirectional flow feature set according to the labeled network traffic data specifically includes:
step b 1: analyzing the labeled network traffic data, and respectively counting bidirectional network flow information between each pair of { source IP address -> destination IP address } and { destination IP address -> source IP address } based on different port numbers in the network traffic data;
step b 2: finding out forward network flows between each pair of { source IP address -> destination IP address }, and extracting all forward network flow feature sets from the forward network flows;
step b 3: finding out reverse network flows between each pair of { destination IP address -> source IP address }, and extracting all reverse network flow feature sets from the reverse network flows;
step b 4: combining the forward and reverse network flow feature sets between each pair of { source IP address, destination IP address } to form a bidirectional flow feature set of M-dimensional features.
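Steps b 1 to b 4 can be sketched as follows. This minimal Python illustration assumes that the direction of the first packet seen between two endpoints is the forward direction and uses three hypothetical per-direction features (packet count, total bytes, mean packet size), giving M = 6. The application itself does not enumerate the features, so both choices are assumptions.

```python
def bidirectional_features(packets):
    """Group packets into bidirectional flows and build forward+reverse features."""
    flows = {}
    for p in packets:
        a = (p["src"], p["sport"])
        b = (p["dst"], p["dport"])
        if (a, b) in flows:
            key, direction = (a, b), "fwd"
        elif (b, a) in flows:
            key, direction = (b, a), "rev"  # same flow, opposite direction
        else:
            key, direction = (a, b), "fwd"  # first packet defines "forward"
            flows[key] = {"fwd": [], "rev": []}
        flows[key][direction].append(p["size"])

    def stats(sizes):
        n, total = len(sizes), sum(sizes)
        return [n, total, total / n if n else 0.0]

    # F = {F1, F2}: forward features followed by reverse features.
    return {key: stats(d["fwd"]) + stats(d["rev"]) for key, d in flows.items()}

# Illustrative packets only.
packets = [
    {"src": "10.0.0.2", "sport": 5000, "dst": "203.0.113.5", "dport": 443, "size": 100},
    {"src": "203.0.113.5", "sport": 443, "dst": "10.0.0.2", "dport": 5000, "size": 1500},
    {"src": "10.0.0.2", "sport": 5000, "dst": "203.0.113.5", "dport": 443, "size": 200},
    {"src": "203.0.113.5", "sport": 443, "dst": "10.0.0.2", "dport": 5000, "size": 500},
    {"src": "10.0.0.2", "sport": 5001, "dst": "198.51.100.7", "dport": 53, "size": 60},
]
features = bidirectional_features(packets)
```

A flow with no reverse packets still yields a full M-dimensional vector, with zeros in the reverse half, so every flow has the same feature layout.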
The technical scheme adopted by the embodiment of the application further comprises the following steps: the step b further comprises the following steps: and optimizing the bidirectional flow feature set by using a maximum variance interpretation mechanism.
The technical scheme adopted by the embodiment of the application further comprises the following steps: the optimizing the bidirectional flow feature set by using the maximum variance interpretation mechanism specifically comprises:
step b 5: performing standard normalization on the network traffic data;
step b 6: on the network flow data, calculating the average value of each feature on the bidirectional flow feature set;
step b 7: subtracting the average value corresponding to each feature from the normalized network flow data to obtain a new result of each feature, and performing variance normalization on the new result of each feature;
step b 8: calculating a covariance matrix of the bidirectional flow feature set, and sequencing the features from small to large according to the variance value of each feature on a main diagonal in the covariance matrix to obtain the N-dimensional features with the highest and closest association degree in the bidirectional flow feature set;
step b 9: calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to sizes, and selecting eigenvectors corresponding to the first N optimized bidirectional flow characteristics;
step b 10: projecting the network traffic data onto the N eigenvectors;
step b 11: and optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
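The normalization and covariance steps (b 5 to b 8) can be illustrated with the minimal, standard-library Python sketch below. The function names and sample values are hypothetical. As an assumption of the sketch, the covariance matrix is computed on mean-centered but unscaled data, because the diagonal entries would all equal 1 if every feature were first scaled to unit variance.

```python
import statistics

def standardize(column):
    """Scale one feature column to mean 0 and variance 1: x' = (x - u) / delta."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]

def covariance_matrix(rows):
    """Return (mean-centered rows, sample covariance matrix of the features)."""
    n, q = len(rows), len(rows[0])
    means = [statistics.fmean([r[j] for r in rows]) for j in range(q)]
    centered = [[r[j] - means[j] for j in range(q)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(q)]
        for i in range(q)
    ]
    return centered, cov

# Illustrative data only: three samples with two features.
samples = [[1, 10], [2, 20], [3, 30]]
z_first = standardize([row[0] for row in samples])
centered, cov = covariance_matrix(samples)

# Feature indices ordered from small to large variance (main diagonal of cov).
ranking = sorted(range(len(cov)), key=lambda j: cov[j][j])
```

Here the main diagonal of cov holds the variance of each feature and the off-diagonal entries hold the covariance between the two features.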
Another technical scheme adopted by the embodiment of the application is as follows: a network traffic classification system comprising:
a data acquisition module: used for collecting network traffic data;
a data preprocessing module: used for labeling the network traffic data;
a feature extraction module: used for extracting a bidirectional flow feature set from the labeled network traffic data;
a model construction module: used for constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network traffic data through the classification model.
The technical scheme adopted by the embodiment of the application further comprises the following steps:
the data acquisition module specifically acquires network traffic data and comprises: selecting application types in network flow, and collecting a network flow data packet corresponding to each application and a system network log corresponding to a time period;
the data preprocessing module is used for labeling the network traffic data and specifically comprises the following steps: analyzing the network flow data packet, and finding out the natural attribute of each application and the IP address and the transmission protocol communicated with other applications; and extracting the IP end points and the transmission packet number associated with each application in the system network log, and performing association fusion by combining an IP address and a transmission protocol to finish the labeling processing of the network flow data.
The technical scheme adopted by the embodiment of the application further comprises the following steps: the feature extraction module specifically extracts a bidirectional flow feature set according to the labeled network traffic data, and includes:
analyzing the labeled network traffic data, and respectively counting bidirectional network flow information between each pair of { source IP address -> destination IP address } and { destination IP address -> source IP address } based on different port numbers in the network traffic data;
finding out forward network flows between each pair of { source IP address -> destination IP address }, and extracting all forward network flow feature sets from the forward network flows;
finding out reverse network flows between each pair of { destination IP address -> source IP address }, and extracting all reverse network flow feature sets from the reverse network flows;
combining the forward and reverse network flow feature sets between each pair of { source IP address, destination IP address } to form a bidirectional flow feature set of M-dimensional features.
The technical scheme adopted by the embodiment of the application further comprises a feature optimization module, wherein the feature optimization module is used for optimizing the bidirectional flow feature set by utilizing a maximum variance interpretation mechanism.
The technical scheme adopted by the embodiment of the application further comprises the following steps: the feature optimization module specifically optimizes the bidirectional flow feature set by using a maximum variance interpretation mechanism, and comprises the following steps:
performing standard normalization on the network traffic data;
on the network flow data, calculating the average value of each feature on the bidirectional flow feature set;
subtracting the average value corresponding to each feature from the normalized network flow data to obtain a new result of each feature, and performing variance normalization on the new result of each feature;
calculating a covariance matrix of the bidirectional flow feature set, and sequencing the features from small to large according to the variance value of each feature on a main diagonal in the covariance matrix to obtain the N-dimensional features with the highest and closest association degree in the bidirectional flow feature set;
calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to sizes, and selecting eigenvectors corresponding to the first N optimized bidirectional flow characteristics;
projecting the network traffic data onto the N eigenvectors;
and optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
The embodiment of the application adopts another technical scheme that: an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the following operations of the network traffic classification method described above:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: and constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model.
Compared with the prior art, the embodiment of the application has the advantages that: the network traffic classification method, the network traffic classification system and the electronic equipment in the embodiment of the application classify the network traffic by using the bidirectional flow characteristics in the network traffic data, and can accurately identify and classify a large number of new applications in the internet; meanwhile, the method of the maximum variance interpretation mechanism is used for carrying out optimization association on the bidirectional flow characteristics, so that the high cohesion of the bidirectional flow characteristics is guaranteed, the classification accuracy is improved, and the high precision and the high performance of network flow classification can be effectively guaranteed.
Drawings
Fig. 1 is a flowchart of a network traffic classification method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the process of collecting and labeling network traffic data;
Fig. 3 is a schematic diagram of the bidirectional flow feature set extraction and optimization process;
Fig. 4 is a schematic structural diagram of a network traffic classification system according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a hardware device for the network traffic classification method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Please refer to fig. 1, which is a flowchart illustrating a network traffic classification method according to an embodiment of the present application. The network traffic classification method of the embodiment of the application comprises the following steps:
step 100: collecting network flow data and labeling the network flow data;
in step 100, the process of collecting and labeling network traffic data is shown in fig. 2, and the specific steps are as follows:
step 101: selecting an application category in the network traffic;
step 102: continuously capturing fixed application traffic through high-performance network monitoring software;
step 103: collecting a network flow data packet corresponding to each application type and a system network log of a corresponding time period;
step 104: analyzing the network flow data packet, and finding out the natural attribute of each application and key information communicated with other applications, such as an IP address, a transmission protocol and the like;
step 105: and extracting the IP end points and the transmission packet number associated with each application in the system network log, and performing association fusion by combining the IP address and the transmission protocol to finish the labeling processing of the network flow data.
Step 200: extracting a bidirectional flow characteristic set from the labeled network flow data, and optimizing the bidirectional flow characteristic set by using a maximum variance interpretation mechanism;
in step 200, the process of extracting and optimizing the bidirectional flow feature set is shown in fig. 3, and specifically includes the following steps:
step 201: analyzing the labeled network traffic data, and respectively counting bidirectional (forward and reverse) network flow information between each pair of { source IP address -> destination IP address } and { destination IP address -> source IP address } based on different port numbers in the network traffic data, wherein each pair of { source IP address, destination IP address } has two pieces of network flow information in opposite directions;
step 202: finding forward network flows between each pair of { source IP address -> destination IP address }, and extracting all forward network flow feature sets F1 from each forward network flow;
step 203: finding out reverse network flows between each pair of { destination IP address -> source IP address }, and extracting all reverse network flow feature sets F2 from each reverse network flow;
step 204: combining the forward and reverse network flow feature sets { F1, F2 } between each pair of { source IP address, destination IP address } to form a bidirectional flow feature set F of M-dimensional features, denoted as F = { F1, F2 };
in step 204, all forward and reverse network flow feature sets are combined so that they can be optimized uniformly.
Step 205: performing standard normalization on the network traffic data, normalizing the network traffic data set into a data set with a mean of 0 and a variance of 1; the normalization formula is: x' = (x - u) / δ, where u is the mean of all network traffic data and δ is the standard deviation of all network traffic data;
step 206: on the network flow data, the average value of each feature on a bidirectional flow feature set F is obtained;
step 207: subtracting the average value corresponding to each feature from the normalized network flow data to obtain a new result of each feature, and performing variance normalization on the new result of each feature;
step 208: calculating a covariance matrix of a bidirectional flow feature set F, and sequencing the covariance matrix from small to large according to a variance value of each feature on a main diagonal in the covariance matrix to obtain an N-dimensional feature with the highest and closest relevance in the bidirectional flow feature set F;
in step 208, the main diagonal of the covariance matrix holds the variance of each feature, and the off-diagonal entries hold the covariance between every two features. A covariance greater than 0 indicates that the two features are positively correlated; a covariance less than 0 indicates that they are negatively correlated; a covariance equal to 0 indicates that the two features are linearly uncorrelated; the larger the absolute value of the covariance, the tighter the connection between the two features, and vice versa. From these 5 conditions, the N-dimensional features with the highest and closest relevance in the bidirectional flow feature set F can be calculated. The application uses the maximum variance interpretation mechanism to preferentially combine the most closely associated features of the bidirectional network flow feature sets in the network traffic data, and to screen out the feature set that best reflects the network traffic categories.
Step 209: calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to sizes, and selecting eigenvectors corresponding to the first N optimized bidirectional flow characteristics;
step 210: projecting the network traffic data onto the selected N eigenvectors: assuming that the number of samples of the network traffic data is p and the number of features is q, that the sample matrix obtained by subtracting the feature means from the network traffic data is DataTransform (p × q), that the covariance matrix of the bidirectional flow feature set is q × q, and that the matrix formed by the N selected eigenvectors is EigenVectors (q × N), the projected network traffic data is: OptimizeData (p × N) = DataTransform (p × q) × EigenVectors (q × N);
in step 210, projecting the network traffic data onto the eigenvectors corresponding to the optimized bidirectional flow features makes the data more cohesive, reduces the influence of noisy data, and improves classification accuracy.
Step 211: and optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
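Steps 209 and 210 can be sketched in Python as follows. To stay free of external dependencies, the sketch substitutes power iteration with deflation for a library eigen-solver; this is an illustrative stand-in, not the solver used in the application. The function names and sample data are assumptions, while DataTransform, EigenVectors and OptimizeData correspond to the matrices named in step 210.

```python
import math

def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def top_eigenpairs(cov, n_components, iterations=500):
    """Largest eigenvalues/eigenvectors of a symmetric matrix (power iteration)."""
    q = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(n_components):
        v = [1.0 / (i + 1) for i in range(q)]  # arbitrary non-zero start vector
        for _step in range(iterations):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found before the next search.
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(q)]
                for i in range(q)]
    return pairs

# DataTransform (p x q): illustrative mean-centered samples, p = 4, q = 2.
data_transform = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
p, q = len(data_transform), len(data_transform[0])

# Covariance matrix (q x q) of the features.
cov = [[sum(r[i] * r[j] for r in data_transform) / (p - 1) for j in range(q)]
       for i in range(q)]

# Keep the N = 1 eigenvector with the largest eigenvalue.
pairs = top_eigenpairs(cov, n_components=1)
eigen_vectors = [vec for _, vec in pairs]

# OptimizeData (p x N) = DataTransform (p x q) x EigenVectors (q x N).
optimize_data = [[sum(x * e for x, e in zip(row, vec)) for vec in eigen_vectors]
                 for row in data_transform]
```

In this sample nearly all of the variance lies along the first feature, so the single retained eigenvector points along that axis and the projection keeps the first coordinate of each sample.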
Step 300: based on the optimized bidirectional flow characteristic set, a classification model is constructed by adopting a random forest algorithm of supervised learning, and a classification result of the network flow data is output through the classification model;
in step 300, modeling uses a supervised random forest algorithm: the optimized bidirectional flow feature set is input into the classification model for training, and the model is tuned according to its performance evaluation. In the verification stage, the trained classification model is tested on the test data set; according to the test results, the classification model built on the optimized bidirectional flow feature set achieves high classification precision, so classification efficiency can be improved while classification accuracy remains high, improving overall performance.
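The performance evaluation mentioned in step 300 is not specified further in the application. As one plausible reading, the minimal Python sketch below computes overall accuracy and per-class precision and recall from the predictions of an already trained classifier on a test set. The choice of metrics, the function name evaluate, and the sample labels are assumptions; training the random forest itself is not shown.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, per-class precision/recall) for predicted labels."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)  # flows predicted as class c
        actual = sum(t == c for t in y_true)     # flows truly of class c
        report[c] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return accuracy, report

# Illustrative test-set labels and hypothetical classifier output.
y_true = ["web", "web", "video", "video", "chat"]
y_pred = ["web", "video", "video", "video", "chat"]
accuracy, report = evaluate(y_true, y_pred)
```

Per-class figures matter for traffic classification because a rare application type can be misclassified consistently while overall accuracy stays high.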
Please refer to fig. 4, which is a block diagram of a network traffic classification system according to an embodiment of the present application. The network traffic classification system comprises a data acquisition module, a data preprocessing module, a feature extraction module, a feature optimization module and a model construction module.
A data acquisition module: used for collecting network traffic data; the network traffic data is acquired as follows: selecting application types in the network traffic, continuously capturing traffic of the fixed application types through high-performance network monitoring software, and collecting the network traffic data packets corresponding to each application type and the system network logs of the corresponding time period.
A data preprocessing module: used for labeling the network traffic data; the labeling process specifically comprises the following steps: analyzing the network traffic data packets, and finding out the natural attributes of each application and the key information it uses to communicate with other applications, such as the IP address and the transmission protocol; and extracting the IP endpoints and the number of transmitted packets associated with each application from the system network log, and performing association fusion with the IP address and the transmission protocol to complete the labeling of the network traffic data.
A feature extraction module: used for extracting the bidirectional flow feature set from the labeled network traffic data; specifically, the bidirectional flow feature set extraction method includes:
a. analyzing the labeled network traffic data, and respectively counting bidirectional (forward and reverse) network flow information between each pair of { source IP address -> destination IP address } and { destination IP address -> source IP address } based on different port numbers in the network traffic data, wherein each pair of { source IP address, destination IP address } has two pieces of network flow information in opposite directions;
b. finding forward network flows between each pair of { source IP address -> destination IP address }, and extracting all forward network flow feature sets F1 from each forward network flow;
c. finding out reverse network flows between each pair of { destination IP address -> source IP address }, and extracting all reverse network flow feature sets F2 from each reverse network flow;
d. combining the forward and reverse network flow feature sets { F1, F2 } between each pair of { source IP address, destination IP address } to form a bidirectional flow feature set F of M-dimensional features, denoted as F = { F1, F2 }.
A feature optimization module: used for optimizing the extracted bidirectional flow feature set by utilizing a maximum variance interpretation mechanism; specifically, the bidirectional flow feature set optimization method includes:
a. performing standard normalization on the network traffic data, normalizing the network traffic data set into a data set with a mean of 0 and a variance of 1; the normalization formula is: x' = (x - u) / δ, where u is the mean of all network traffic data and δ is the standard deviation of all network traffic data;
b. on the network flow data, the average value of each feature on a bidirectional flow feature set F is obtained;
c. subtracting the average value corresponding to each feature from the normalized network flow data to obtain a new result of each feature, and performing variance normalization on the new result of each feature;
d. calculating a covariance matrix of the bidirectional flow feature set F, and sorting the features from small to large according to the variance value of each feature on the main diagonal of the covariance matrix to obtain the N-dimensional features with the highest and closest relevance in the bidirectional flow feature set F; the main diagonal of the covariance matrix holds the variance of each feature, and the off-diagonal entries hold the covariance between every two features. A covariance greater than 0 indicates that the two features are positively correlated; a covariance less than 0 indicates that they are negatively correlated; a covariance equal to 0 indicates that the two features are linearly uncorrelated; the larger the absolute value of the covariance, the tighter the connection between the two features, and vice versa. From these 5 conditions, the N-dimensional features with the highest and closest relevance in the bidirectional flow feature set F can be calculated. The application uses the maximum variance interpretation mechanism to preferentially combine the most closely associated features of the bidirectional network flow feature sets in the network traffic data, and to screen out the feature set that best reflects the network traffic categories.
e. Calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to sizes, and selecting eigenvectors corresponding to the first N optimized bidirectional flow characteristics;
f. projecting the network flow data onto the selected N eigenvectors: assuming that the number of samples of the network traffic data is p and the number of features is q, the sample matrix obtained by subtracting the feature means from the network traffic data is DataTransform (p × q), the covariance matrix of the bidirectional flow feature set is q × q, and the matrix formed by the N selected eigenvectors is EigenVectors (q × N); the projected network traffic data is then: OptimizeData (p × N) = DataTransform (p × q) × EigenVectors (q × N);
g. and optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
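Steps a to g resemble a principal component analysis of the standardized feature matrix. The following is a minimal sketch of that reading, not the applicant's implementation: the function names (`standardize`, `covariance`, `top_eigenpairs`, `optimize_features`) are hypothetical, the population variance (division by p) is an assumption, and power iteration with deflation is swapped in as a simple way to obtain the leading eigenvectors without external libraries. The separate ranking by diagonal variance in step d is not reproduced.

```python
import math
import random


def standardize(rows):
    """Steps a-c: column-wise x' = (x - u) / delta (mean 0, variance 1)."""
    p, q = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / p for j in range(q)]
    # Population standard deviation; constant columns fall back to 1.0.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / p) or 1.0
            for j in range(q)]
    return [[(r[j] - means[j]) / stds[j] for j in range(q)] for r in rows]


def covariance(z):
    """Step d: q x q covariance matrix of data that is already centred."""
    p, q = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / p for j in range(q)]
            for i in range(q)]


def mat_vec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def top_eigenpairs(cov, n, iters=500, seed=0):
    """Step e: the n largest eigenvalues and their eigenvectors, found by
    power iteration with deflation (adequate for small symmetric matrices)."""
    rng = random.Random(seed)
    q = len(cov)
    a = [row[:] for row in cov]          # work on a copy
    values, vectors = [], []
    for _ in range(n):
        v = [rng.random() + 0.1 for _ in range(q)]
        for _ in range(iters):
            w = mat_vec(a, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:             # remaining matrix is numerically zero
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        lam = sum(x * y for x, y in zip(v, mat_vec(a, v)))
        values.append(lam)
        vectors.append(v)
        # Deflation: remove the component that was just found.
        for i in range(q):
            for j in range(q):
                a[i][j] -= lam * v[i] * v[j]
    return values, vectors


def optimize_features(rows, n):
    """Steps f-g: project p x q data onto n eigenvectors, giving p x n data."""
    z = standardize(rows)
    values, vectors = top_eigenpairs(covariance(z), n)
    projected = [[sum(x * y for x, y in zip(r, v)) for v in vectors]
                 for r in z]
    return projected, values
```

Calling `optimize_features(rows, n)` on a list of p rows with q numeric features returns the p × n projected data together with the n eigenvalues, which corresponds to OptimizeData (p × N) in step f.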
A model construction module: used for constructing a classification model with a supervised-learning random forest algorithm based on the optimized bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model. The optimized bidirectional flow feature set is input into the classification model for classification training, and the performance of the classification model is optimized through performance evaluation of the classification model. In the verification stage, the trained classification model is tested with the test data set; the test results show that the classification model constructed on the optimized bidirectional flow feature set achieves high classification precision, so that classification efficiency can be improved while high classification accuracy is maintained, improving overall performance.
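The random forest modeling described for the model construction module can be illustrated with a small, self-contained sketch. It is a toy implementation under stated assumptions rather than the classifier used in the application: the tree depth, the number of trees, the Gini split criterion, and the names `train_forest` and `predict` are all hypothetical, and class labels are assumed to be strings. A production system would more likely rely on a library implementation.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, feature_ids):
    """Best (impurity, feature, threshold) over the given feature subset."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(X, y, rng, max_depth, n_sub):
    """Grow one tree; internal nodes are tuples, leaves are class labels."""
    if max_depth == 0 or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]
    split = best_split(X, y, rng.sample(range(len(X[0])), n_sub))
    if split is None:
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    li = [i for i, row in enumerate(X) if row[f] <= t]
    ri = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in li], [y[i] for i in li],
                       rng, max_depth - 1, n_sub),
            build_tree([X[i] for i in ri], [y[i] for i in ri],
                       rng, max_depth - 1, n_sub))


def predict_tree(node, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(X, y, n_trees=25, max_depth=4, seed=0):
    """Bagging plus a random feature subset at every split."""
    rng = random.Random(seed)
    n_sub = max(1, int(len(X[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 rng, max_depth, n_sub))
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]
```

Here `X` stands for the optimized N-dimensional feature vectors and `y` for the application labels. Holding back part of the labeled data and comparing `predict` against the known labels gives the kind of performance evaluation the module describes.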
Fig. 5 is a schematic structural diagram of a hardware device for the network traffic classification method according to an embodiment of the present application. As shown in fig. 5, the device includes one or more processors and a memory. Taking one processor as an example, the device may further include an input system and an output system.
The processor, memory, input system, and output system may be connected by a bus or other means, as exemplified by the bus connection in fig. 5.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor executes various functional applications and data processing of the electronic device, i.e., implements the processing method of the above-described method embodiment, by executing the non-transitory software program, instructions and modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processing system over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input system may receive input numeric or character information and generate a signal input. The output system may include a display device such as a display screen.
The one or more modules are stored in the memory and, when executed by the one or more processors, perform the following operations of any of the above method embodiments:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: and constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model.
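Step a can be sketched as follows, based on the labeling described in the claims, where packets are fused with the system network log by IP address and transmission protocol. This is only an illustration under assumptions: the record types `Packet` and `LogEntry`, their field names, and the rule that a packet takes the label of the first matching log endpoint are hypothetical and not taken from the application.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: str      # transmission protocol, e.g. "TCP" or "UDP"
    size: int          # bytes


class LogEntry(NamedTuple):
    app: str           # application name recorded in the system network log
    endpoint_ip: str   # IP end point associated with the application
    protocol: str
    packets: int       # number of packets the log reports for this end point


def build_label_index(log: List[LogEntry]) -> Dict[Tuple[str, str], str]:
    """Map (IP end point, protocol) to the application that used it."""
    return {(e.endpoint_ip, e.protocol): e.app for e in log}


def label_packet(pkt: Packet,
                 index: Dict[Tuple[str, str], str]) -> Optional[str]:
    """Return the application label, or None if no log entry matches."""
    for ip in (pkt.dst_ip, pkt.src_ip):
        app = index.get((ip, pkt.protocol))
        if app is not None:
            return app
    return None


def label_packets(packets: List[Packet],
                  log: List[LogEntry]) -> List[Tuple[Packet, Optional[str]]]:
    """Fuse captured packets with the log to produce labeled traffic."""
    index = build_label_index(log)
    return [(p, label_packet(p, index)) for p in packets]
```

Packets that match no log entry keep the label `None`, so they can be dropped or inspected before feature extraction.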
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
Embodiments of the present application provide a non-transitory (non-volatile) computer storage medium having stored thereon computer-executable instructions that may perform the following operations:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: and constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model.
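Step b can be sketched along the lines of the extraction described in the claims: packets are grouped per pair of end points and port numbers, split into a forward and a reverse direction, and the per-direction statistics are merged into one bidirectional feature vector. The sketch makes several assumptions that are not in the application: packets are dictionaries with the hypothetical keys `src_ip`, `src_port`, `dst_ip`, `dst_port`, and `size`; the forward direction is the direction of the first packet seen; and only three statistics per direction are computed, giving M = 6.

```python
from collections import defaultdict


def flow_key(pkt):
    """Direction-independent key: the two (IP, port) end points, sorted."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (a, b) if a <= b else (b, a)


def direction_features(sizes):
    """Packet count, total bytes, and mean packet size for one direction."""
    n = len(sizes)
    total = sum(sizes)
    return [n, total, total / n if n else 0.0]


def bidirectional_features(packets):
    """Return {flow key: [forward features..., reverse features...]}."""
    first_src = {}                      # who sent the first packet of a flow
    forward = defaultdict(list)
    reverse = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt)
        src = (pkt["src_ip"], pkt["src_port"])
        first_src.setdefault(key, src)
        bucket = forward if src == first_src[key] else reverse
        bucket[key].append(pkt["size"])
    # Merge the forward and reverse feature sets into one M-dimensional vector.
    return {key: direction_features(forward[key]) +
                 direction_features(reverse[key])
            for key in first_src}
```

Each resulting vector holds the forward statistics followed by the reverse statistics, so a flow that never receives a reply still yields a vector of the same length, with zeros in the reverse half.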
Embodiments of the present application provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the following:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: and constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model.
The network traffic classification method, the network traffic classification system and the electronic equipment in the embodiment of the application classify the network traffic by using the bidirectional flow characteristics in the network traffic data, and can accurately identify and classify a large number of new applications in the internet; meanwhile, the method of the maximum variance interpretation mechanism is used for carrying out optimization association on the bidirectional flow characteristics, so that the high cohesion of the bidirectional flow characteristics is guaranteed, the classification accuracy is improved, and the high precision and the high performance of network flow classification can be effectively guaranteed.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A network traffic classification method is characterized by comprising the following steps:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network traffic data through the classification model;
the step b further comprises the following steps: optimizing the bidirectional flow feature set by using a maximum variance interpretation mechanism;
the optimizing the bidirectional flow feature set by using the maximum variance interpretation mechanism specifically comprises:
step b5: performing standard normalization on the network traffic data;
step b6: on the network flow data, calculating the average value of each feature on the bidirectional flow feature set;
step b7: subtracting the average value corresponding to each feature from the normalized network flow data to obtain a new result of each feature, and performing variance normalization on the new result of each feature;
step b8: calculating a covariance matrix of the bidirectional flow feature set, and sorting the features from small to large according to the variance value of each feature on the main diagonal of the covariance matrix to obtain the N-dimensional features with the highest and closest association degree in the bidirectional flow feature set;
step b9: calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to size, and selecting the eigenvectors corresponding to the first N optimized bidirectional flow features;
step b10: projecting the network traffic data onto the N eigenvectors;
step b11: and optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
2. The method for classifying network traffic according to claim 1, wherein in the step a, the collecting network traffic data and labeling the network traffic data specifically include:
step a1: selecting an application category in the network traffic;
step a2: collecting a network flow data packet corresponding to each application and a system network log of a corresponding time period;
step a3: analyzing the network flow data packet, and finding out the natural attribute of each application and the IP address and the transmission protocol communicated with other applications;
step a4: and extracting the IP end points and the transmission packet number associated with each application in the system network log, and performing association fusion by combining an IP address and a transmission protocol to finish the labeling processing of the network flow data.
3. The method according to claim 2, wherein in the step b, the extracting a bidirectional flow feature set according to the labeled network traffic data specifically includes:
step b1: analyzing the labeled network traffic data, and respectively counting bidirectional network flow information between each pair of { source IP address -> destination IP address } and { destination IP address -> source IP address } based on different port numbers in the network traffic data;
step b2: finding out forward network flows between each pair of { source IP address -> destination IP address }, and extracting all forward network flow feature sets from the forward network flows;
step b3: finding out reverse network flows between each pair of { destination IP address -> source IP address }, and extracting all reverse network flow feature sets from the reverse network flows;
step b4: and combining the forward and reverse network flow feature sets between each pair of { source IP address, destination IP address } to form a bidirectional flow feature set of M-dimensional features.
4. A network traffic classification system, comprising:
a data acquisition module: used for collecting network flow data;
a data preprocessing module: used for labeling the network flow data;
a feature extraction module: used for extracting a bidirectional flow feature set according to the labeled network flow data;
a model construction module: used for constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model;
the system also comprises a feature optimization module, wherein the feature optimization module is used for optimizing the bidirectional flow feature set by utilizing a maximum variance interpretation mechanism;
the feature optimization module specifically optimizes the bidirectional flow feature set by using a maximum variance interpretation mechanism, and comprises the following steps:
performing standard normalization on the network traffic data;
on the network flow data, calculating the average value of each feature on the bidirectional flow feature set;
subtracting the average value corresponding to each feature from the normalized network flow data to obtain a new result of each feature, and performing variance normalization on the new result of each feature;
calculating a covariance matrix of the bidirectional flow feature set, and sequencing the features from small to large according to the variance value of each feature on a main diagonal in the covariance matrix to obtain the N-dimensional features with the highest and closest association degree in the bidirectional flow feature set;
calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to sizes, and selecting eigenvectors corresponding to the first N optimized bidirectional flow characteristics;
projecting the network traffic data onto the N eigenvectors;
and optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
5. The network traffic classification system of claim 4,
the data acquisition module specifically acquires network traffic data and comprises: selecting application types in network flow, and collecting a network flow data packet corresponding to each application and a system network log corresponding to a time period;
the data preprocessing module is used for labeling the network traffic data and specifically comprises the following steps: analyzing the network flow data packet, and finding out the natural attribute of each application and the IP address and the transmission protocol communicated with other applications; and extracting the IP end points and the transmission packet number associated with each application in the system network log, and performing association fusion by combining an IP address and a transmission protocol to finish the labeling processing of the network flow data.
6. The network traffic classification system according to claim 5, wherein the extracting, by the feature extraction module, the bidirectional flow feature set according to the labeled network traffic data specifically includes:
analyzing the labeled network traffic data, and respectively counting bidirectional network flow information between each pair of { source IP address -> destination IP address } and { destination IP address -> source IP address } based on different port numbers in the network traffic data;
finding out forward network flows between each pair of { source IP address -> destination IP address }, and extracting all forward network flow feature sets from the forward network flows;
finding out reverse network flows between each pair of { destination IP address -> source IP address }, and extracting all reverse network flow feature sets from the reverse network flows;
and combining the forward and reverse network flow feature sets between each pair of { source IP address, destination IP address } to form a bidirectional flow feature set of M-dimensional features.
7. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of classifying network traffic of any of claims 1 to 3.
CN201811113686.XA 2018-09-25 2018-09-25 Network traffic classification method and system and electronic equipment Active CN109309630B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811113686.XA CN109309630B (en) 2018-09-25 2018-09-25 Network traffic classification method and system and electronic equipment
PCT/CN2018/112401 WO2020062390A1 (en) 2018-09-25 2018-10-29 Network traffic classification method and system, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811113686.XA CN109309630B (en) 2018-09-25 2018-09-25 Network traffic classification method and system and electronic equipment

Publications (2)

Publication Number Publication Date
CN109309630A 2019-02-05
CN109309630B 2021-09-21

Family

ID=65225067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811113686.XA Active CN109309630B (en) 2018-09-25 2018-09-25 Network traffic classification method and system and electronic equipment

Country Status (2)

Country Link
CN (1) CN109309630B (en)
WO (1) WO2020062390A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097120B (en) * 2019-04-30 2022-08-26 南京邮电大学 Network flow data classification method, equipment and computer storage medium
CN110149280B (en) * 2019-05-27 2020-08-28 中国科学技术大学 Network traffic classification method and device
CN110365603A (en) * 2019-06-28 2019-10-22 西安交通大学 A kind of self adaptive network traffic classification method open based on 5G network capabilities
CN111698223B (en) * 2020-05-22 2022-02-22 哈尔滨工程大学 Encrypted WEB fingerprint identification method based on automatic feature engineering
CN113746686A (en) * 2020-05-27 2021-12-03 阿里巴巴集团控股有限公司 Network flow state determination method, computing device and storage medium
CN111817971B (en) * 2020-06-12 2023-03-24 华为技术有限公司 Data center network flow splicing method based on deep learning
CN111970305B (en) * 2020-08-31 2022-08-12 福州大学 Abnormal flow detection method based on semi-supervised descent and Tri-LightGBM
CN112448868B (en) * 2020-12-02 2022-09-30 新华三人工智能科技有限公司 Network traffic data identification method, device and equipment
CN112839055B (en) * 2021-02-04 2022-08-23 北京六方云信息技术有限公司 Network application identification method and device for TLS encrypted traffic and electronic equipment
CN112804253B (en) * 2021-02-04 2022-07-12 湖南大学 Network flow classification detection method, system and storage medium
CN113098735B (en) * 2021-03-31 2022-10-11 上海天旦网络科技发展有限公司 Inference-oriented application flow and index vectorization method and system
CN113114672B (en) * 2021-04-12 2023-02-28 常熟市国瑞科技股份有限公司 Video transmission data fine measurement method
CN113141357B (en) * 2021-04-19 2022-02-18 湖南大学 Feature selection method and system for optimizing network intrusion detection performance
CN112995063B (en) * 2021-04-19 2021-10-08 北京智源人工智能研究院 Flow monitoring method, device, equipment and medium
CN113315721B (en) * 2021-05-26 2023-01-17 恒安嘉新(北京)科技股份公司 Network data feature processing method, device, equipment and storage medium
CN113556317B (en) * 2021-06-07 2022-10-11 中国科学院信息工程研究所 Abnormal flow detection method and device based on network flow structural feature fusion
CN114928560B (en) * 2022-05-16 2023-01-31 珠海市鸿瑞信息技术股份有限公司 Big data based network flow and equipment log cooperative management system and method
WO2024065185A1 (en) * 2022-09-27 2024-04-04 西门子股份公司 Device classification method and apparatus, electronic device, and computer-readable storage medium
CN116647877B (en) * 2023-06-12 2024-03-15 广州爱浦路网络技术有限公司 Flow category verification method and system based on graph convolution model
CN116662817B (en) * 2023-07-31 2023-11-24 北京天防安全科技有限公司 Asset identification method and system of Internet of things equipment
CN117197591B (en) * 2023-11-06 2024-03-12 青岛创新奇智科技集团股份有限公司 Data classification method based on machine learning

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102394827A (en) * 2011-11-09 2012-03-28 浙江万里学院 Hierarchical classification method for internet flow
CN104052639A (en) * 2014-07-02 2014-09-17 山东大学 Real-time multi-application network flow identification method based on support vector machine
CN106874879A (en) * 2017-02-21 2017-06-20 华南师范大学 Handwritten Digit Recognition method based on multiple features fusion and deep learning network extraction
CN107967311A (en) * 2017-11-20 2018-04-27 阿里巴巴集团控股有限公司 A kind of method and apparatus classified to network data flow

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7827011B2 (en) * 2005-05-03 2010-11-02 Aware, Inc. Method and system for real-time signal classification
CN103973589B (en) * 2013-09-12 2017-04-12 哈尔滨理工大学 Network traffic classification method and device
CN104767692B (en) * 2015-04-15 2018-05-29 中国电力科学研究院 A kind of net flow assorted method
CN106487535B (en) * 2015-08-24 2020-04-28 中兴通讯股份有限公司 Method and device for classifying network traffic data
US10785247B2 (en) * 2017-01-24 2020-09-22 Cisco Technology, Inc. Service usage model for traffic analysis
WO2018160136A1 (en) * 2017-03-02 2018-09-07 Singapore University Of Technology And Design Method and apparatus for determining an identity of an unknown internet-of-things (iot) device in a communication network

Also Published As

Publication number Publication date
WO2020062390A1 (en) 2020-04-02
CN109309630A (en) 2019-02-05

Similar Documents

Publication Publication Date Title
CN109309630B (en) Network traffic classification method and system and electronic equipment
CN110213287B (en) Dual-mode intrusion detection device based on integrated machine learning algorithm
WO2017124990A1 (en) Method, system, device and readable storage medium for realizing insurance claim fraud prevention based on consistency between multiple images
CN109117634B (en) Malicious software detection method and system based on network traffic multi-view fusion
US8873840B2 (en) Reducing false detection rate using local pattern based post-filter
WO2017113691A1 (en) Method and device for identifying video characteristics
CN109525508B (en) Encrypted stream identification method and device based on flow similarity comparison and storage medium
CN113435546B (en) Migratable image recognition method and system based on differentiation confidence level
CN114492768B (en) Twin capsule network intrusion detection method based on small sample learning
CN109639734B (en) Abnormal flow detection method with computing resource adaptivity
CN113489685B (en) Secondary feature extraction and malicious attack identification method based on kernel principal component analysis
WO2022199185A1 (en) User operation inspection method and program product
CN111915437A (en) RNN-based anti-money laundering model training method, device, equipment and medium
WO2020155790A1 (en) Method and apparatus for extracting claim settlement information, and electronic device
WO2021068563A1 (en) Sample date processing method, device and computer equipment, and storage medium
CN114553591B (en) Training method of random forest model, abnormal flow detection method and device
US20230215125A1 (en) Data identification method and apparatus
CN115600128A (en) Semi-supervised encrypted traffic classification method and device and storage medium
CN111191720B (en) Service scene identification method and device and electronic equipment
CN110956123B (en) Method, device, server and storage medium for auditing rich media content
WO2019100348A1 (en) Image retrieval method and device, and image library generation method and device
CN116150688A (en) Lightweight Internet of things equipment identification method and device in smart home
CN116662817B (en) Asset identification method and system of Internet of things equipment
WO2024000822A1 (en) Text classification annotation sample anomaly detection method and apparatus, device, and medium
CN114692778A (en) Multi-modal sample set generation method, training method and device for intelligent inspection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant