CN111614514A

CN111614514A - Network traffic identification method and device

Info

Publication number: CN111614514A
Application number: CN202010366325.7A
Authority: CN
Inventors: 胡博; 陈山枝; 朱轶凡; 汪劲希
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2020-04-30
Filing date: 2020-04-30
Publication date: 2020-09-01
Anticipated expiration: 2040-04-30
Also published as: CN111614514B

Abstract

The invention provides a network traffic identification method and device. The method comprises the following steps: obtaining a continuous flow chart by applying multivariate correlation analysis to the acquired continuous features of the traffic to be identified; obtaining a discrete flow graph by applying one-hot encoding and entity embedding to the acquired discrete features of the traffic to be identified; inputting the continuous flow chart into a convolutional neural network to obtain a continuous flow chart value; inputting the discrete flow graph into a factorization machine to obtain a discrete flow graph value; and inputting the continuous flow chart value and the discrete flow graph value into a normalized exponential function to obtain the type of the traffic to be identified. The method is suitable not only for original network traffic in byte form, but also for non-byte network traffic obtained by feature extraction and combination on the basis of the original network traffic, which enlarges the application range of the network traffic identification method and improves the accuracy of network traffic identification.

Description

Network traffic identification method and device

Technical Field

The present invention relates to the field of communications network technologies, and in particular, to a network traffic identification method and apparatus.

Background

With the development of network technology, service types have become increasingly diverse, which poses a great challenge. Identifying the type of each service, i.e., identifying the type of each network traffic flow, has become a key concern for both academic research and network deployment and operation.

Network traffic is an important carrier for recording and reflecting the network and its users' activities, and network traffic identification can be used for evaluating the network situation, developing and analyzing applications, fine-grained operation, and the like. As service types diversify, dynamic ports and encrypted traffic limit identification techniques based on ports or payloads, so intelligent network traffic identification, such as identification by traditional machine learning methods, has become a key research direction. Traditional machine learning methods identify network traffic based on the extraction and combination of statistical features, which avoids the limitations imposed by dynamic ports and encrypted traffic. However, these methods require the selection and combination of features to be determined manually and subjectively, so the efficiency and accuracy of traffic identification are low.

Therefore, a new research direction is to introduce image classification from deep learning into traffic identification, using representation learning to convert traffic features into pixels. For example, an original traffic data packet in byte form is expressed as a grayscale image, and network traffic is classified by classifying the grayscale image. This method is simple to implement, but when the input data is non-byte network traffic obtained by feature extraction and combination on the basis of the original network traffic, it cannot accurately identify the type of the network traffic. Its application range is therefore narrow, and its identification accuracy is low.

Disclosure of Invention

The invention aims to provide a network traffic identification method and a network traffic identification device, which are used for enlarging the application range of the network traffic identification method and improving the accuracy of network traffic identification. The specific technical scheme is as follows:

in order to achieve the above object, an embodiment of the present invention provides a network traffic identification method, where the method includes:

based on the obtained continuous features of the flow to be identified, mapping the continuous features into pixel values by adopting multivariate correlation analysis to obtain a continuous flow chart;

mapping the discrete features into pixel values by adopting one-hot coding and entity embedding based on the acquired discrete features of the flow to be identified to obtain a discrete flow graph;

inputting the continuous flow chart into a convolutional neural network to obtain a continuous flow chart value, wherein the convolutional neural network is used for carrying out high-order combination on the continuous features;

inputting the discrete flow graph into a factorization machine to obtain a discrete flow graph value, wherein the factorization machine is used for carrying out low-order combination on the discrete features;

and inputting the continuous flow chart value and the discrete flow graph value into a normalized exponential function to obtain the type of the flow to be identified.
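As a rough illustration of the final step, the sketch below feeds per-class scores from the two branches into a normalized exponential (softmax) function. How the two values are combined before the softmax is not specified here; element-wise addition is an assumption, and the function names and class labels are hypothetical.

```python
import math

def softmax(scores):
    """Normalized exponential function over a list of per-class scores."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def identify_type(continuous_values, discrete_values, labels):
    """Combine both branch outputs (assumed: element-wise sum) and pick a class."""
    combined = [c + d for c, d in zip(continuous_values, discrete_values)]
    probs = softmax(combined)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical per-class scores from the CNN branch and the FM branch.
label, probs = identify_type([0.2, 1.5, -0.3], [0.1, 0.4, 0.0],
                             ["video", "chat", "file transfer"])
```

The class with the largest combined score receives the largest probability, so the softmax does not change which type is selected; it only turns the scores into a distribution.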

Optionally, the step of mapping the continuous features into pixel values by using multivariate correlation analysis based on the obtained continuous features of the flow to be identified to obtain a continuous flow chart includes:

inputting each continuous characteristic of the flow to be identified into a standard deviation standardization function to obtain a standardized continuous characteristic;

calculating the correlation between every two normalized continuous features;

determining a correlation matrix corresponding to the continuous features according to the correlation between every two normalized continuous features;

mapping the value of each element in the correlation matrix to the value range of a pixel point in a gray scale image to obtain a pixel value corresponding to each element in the correlation matrix;

and generating the continuous flow chart by using the pixel value corresponding to each element in the correlation matrix.

Optionally, the step of calculating the correlation between each two normalized continuous features includes:

the correlation between every two normalized continuous features is calculated using the following formula:

wherein r_{i,j} is the correlation between the normalized value of continuous feature x_i and the normalized value of continuous feature x_j, x_{normalization_i} is the normalized value of continuous feature x_i, x_{normalization_j} is the normalized value of continuous feature x_j, and p is the total number of continuous features of the flow to be identified.

Optionally, the step of mapping the discrete features into pixel values by using one-hot coding and entity embedding based on the obtained discrete features of the flow to be identified to obtain a discrete flow graph includes:

vectorizing a first class of features in the discrete features by adopting entity embedding to obtain a first processed feature value, wherein the first class of features are discrete features of which the value number is greater than a preset value number threshold;

coding a second type of characteristics in the discrete characteristics by adopting one-hot coding to obtain a second processed characteristic value, wherein the second type of characteristics is discrete characteristics with the value number less than or equal to the preset value number threshold;

mapping the first processed characteristic value and the second processed characteristic value to a value range of a pixel point in a gray scale image respectively to obtain a pixel value corresponding to each discrete characteristic;

and generating the discrete flow graph by using the pixel value corresponding to each discrete feature.
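The split between the two classes of discrete features can be sketched as follows. The threshold value, the feature names, and the mapping of one-hot bits to the pixel values 0 and 255 are assumptions made for illustration only.

```python
def one_hot(value, vocabulary):
    """Encode `value` as a 0/1 vector over a fixed, ordered vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

def split_by_cardinality(feature_vocabularies, threshold):
    """Return (first_class, second_class) feature names.

    first_class: more distinct values than `threshold` (entity embedding).
    second_class: at most `threshold` distinct values (one-hot encoding).
    """
    first, second = [], []
    for name, vocab in feature_vocabularies.items():
        (first if len(vocab) > threshold else second).append(name)
    return first, second

vocabularies = {
    "protocol": ["TCP", "UDP", "ICMP"],
    "dst_port": list(range(1024)),  # hypothetical high-cardinality feature
}
first_class, second_class = split_by_cardinality(vocabularies, threshold=16)

# Assumed pixel mapping for one-hot bits: 0 -> 0 (black), 1 -> 255 (white).
protocol_pixels = [255 * bit for bit in one_hot("UDP", vocabularies["protocol"])]
```

Keeping one-hot encoding for low-cardinality features avoids training an embedding where a short binary vector is already compact.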

Optionally, the step of performing vectorization processing on the first type of features in the discrete features by using entity embedding to obtain a first processed feature value includes:

calculating the vector dimension of each first-class feature according to the value quantity of each first-class feature;

obtaining a vector element value of the value of each first type feature on the vector dimension;

and normalizing the vector element value corresponding to each first-class feature to obtain a normalized vector element value, wherein the normalized vector element value lies within the value range of the pixel points in the grayscale image, and the normalized vector element value is the first processed feature value.

Optionally, the step of calculating the vector dimension of each first-class feature according to the number of values of each first-class feature includes:

for each first class feature, calculating the vector dimension of the first class feature by using the following formula:

wherein dimensions is the vector dimension of the first-class feature, passive values is the number of values of the first-class feature, and ⌈ ⌉ denotes rounding up;

the step of normalizing the vector element value corresponding to each first type feature to obtain a normalized vector element value includes:

for each first-class feature, the vector element value corresponding to the first-class feature is normalized by using the following formula to obtain a normalized vector element value corresponding to the first-class feature:

wherein v_{normalization_i_k} is the normalized vector element value corresponding to the k-th value of the first-class feature x_i, v_{i_k} is the vector element value corresponding to the k-th value of the first-class feature x_i, v_{i_min} is the minimum vector element value corresponding to the first-class feature x_i, and v_{i_max} is the maximum vector element value corresponding to the first-class feature x_i.
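The normalization formula itself is not reproduced in this text. The symbols v_{i_min} and v_{i_max} suggest min-max scaling, so the sketch below assumes min-max scaling of each first-class feature's embedding elements to the grayscale range 0 to 255. The embedding values and the function name are hypothetical, and the vector dimension is simply taken as given.

```python
def min_max_to_gray(vectors, gray_max=255):
    """Min-max scale the embedding element values of one feature to [0, gray_max].

    `vectors` maps each value of the feature to its embedding vector.
    """
    flat = [v for vec in vectors.values() for v in vec]
    v_min, v_max = min(flat), max(flat)
    span = v_max - v_min
    if span == 0:  # all elements identical: avoid division by zero
        return {key: [0.0] * len(vec) for key, vec in vectors.items()}
    return {
        key: [(v - v_min) / span * gray_max for v in vec]
        for key, vec in vectors.items()
    }

# Hypothetical learned embeddings for three values of one first-class feature.
embeddings = {
    "10.0.0.1": [-1.0, 0.5],
    "10.0.0.2": [0.0, 1.0],
    "10.0.0.3": [0.5, -0.5],
}
pixels = min_max_to_gray(embeddings)
```

The minimum and maximum are taken over all values of the same feature, so that pixel intensities remain comparable across the values of that feature.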

Optionally, the convolutional neural network includes a convolutional layer, a pooling layer, and a fully connected layer;

the step of inputting the continuous flow chart into a convolutional neural network to obtain a continuous flow chart value comprises:

inputting the continuous flow graph into the convolutional neural network;

the convolution layer extracts continuous features in the continuous flow chart through a convolution kernel and conducts convolution operation on the continuous features in the continuous flow chart to obtain a convolution result;

the pooling layer performs characteristic screening on the convolution result to obtain a pooling result;

the fully connected layer rectifies the pooling result and outputs a continuous flow chart value;

and acquiring the continuous flow chart value output by the fully connected layer.
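The convolution and pooling steps can be illustrated with a small pure-Python sketch. The kernel values, the 2x2 window, and the choice of max pooling are assumptions; the text only states that the pooling layer performs feature screening. In a trained network the kernel would be learned.

```python
def conv2d_valid(image, kernel):
    """Valid (no padding) 2-D cross-correlation, as used in CNN convolutional layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Toy 5x5 "flow chart" whose pixel at (r, c) equals r + c.
image = [[r + c for c in range(5)] for r in range(5)]
kernel = [[1, 0], [0, 1]]                  # hypothetical 2x2 kernel
conv_result = conv2d_valid(image, kernel)  # 4x4 feature map
pool_result = max_pool(conv_result)        # 2x2 pooled map
```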

Optionally, the step of rectifying the pooling result by the fully connected layer and outputting a continuous flow chart value includes:

calculating the continuous flow chart value by using the following formula:

y_out＝σ(y_in*ω+b)；

where σ is the linear rectification (ReLU) function, y_in is the pooling result, y_out is the continuous flow chart value, ω is the weight of the pooling result, and b is the bias term corresponding to the fully connected layer.
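A minimal sketch of the formula y_out = σ(y_in*ω + b), reading σ as the ReLU function and y_in as a flattened pooling result. The weights, bias, and input values below are made up for illustration.

```python
def relu(x):
    """Linear rectification: negative inputs become 0."""
    return max(0.0, x)

def fully_connected(y_in, weights, bias):
    """y_out[k] = ReLU(sum_m y_in[m] * weights[m][k] + bias[k])."""
    return [
        relu(sum(y_in[m] * weights[m][k] for m in range(len(y_in))) + bias[k])
        for k in range(len(bias))
    ]

y_in = [6.0, 10.0, 10.0, 14.0]  # hypothetical flattened pooling result
weights = [[0.1, -0.2], [0.0, 0.1], [0.05, 0.0], [-0.1, 0.05]]
bias = [0.5, -1.0]
y_out = fully_connected(y_in, weights, bias)
```

The second output has a negative pre-activation (-0.5), so the rectification clips it to zero.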

Optionally, the step of inputting the discrete flow graph into a factorization machine to obtain a discrete flow graph value includes:

calculating the discrete flow graph value by using the following formula:

wherein y_discrete is the discrete flow graph value, ω is a parameter of a discrete feature, d is the total number of discrete features of the network traffic to be processed, and V_i and V_j are the parameters of the product of the discrete feature x_i and the discrete feature x_j.
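The formula is not reproduced in this text. The sketch below therefore assumes the commonly used second-order factorization machine, y = w0 + Σ_i w_i x_i + Σ_{i<j} <V_i, V_j> x_i x_j. The bias term w0 and the exact form are assumptions and may differ from the formula intended here; all parameter values are illustrative.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine (assumed standard form).

    x: d feature values; w: d linear weights; V: d latent vectors of length k.
    The pairwise term uses the identity
    sum_{i<j} <V_i, V_j> x_i x_j
      = 0.5 * sum_f [(sum_i V[i][f] x_i)^2 - sum_i (V[i][f] x_i)^2].
    """
    d, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(d))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(d))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(d))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

x = [1.0, 0.0, 1.0]                        # e.g. flattened one-hot pixels
w0, w = 0.1, [0.2, -0.3, 0.4]
V = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]]  # hypothetical latent vectors
y_discrete = fm_predict(x, w0, w, V)
```

Only features 0 and 2 are active, so the pairwise term reduces to <V_0, V_2> = 1.0 and the result is 0.1 + 0.6 + 1.0.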

In order to achieve the above object, an embodiment of the present invention further provides a network traffic identification apparatus, where the apparatus includes:

the continuous flow chart generation module is used for mapping the continuous features into pixel values by adopting multivariate correlation analysis based on the obtained continuous features of the flow to be identified to obtain a continuous flow chart;

the discrete flow graph generation module is used for mapping the discrete features into pixel values by adopting one-hot coding and entity embedding based on the acquired discrete features of the flow to be identified to obtain a discrete flow graph;

the continuous flow chart value calculation module is used for inputting the continuous flow chart into a convolutional neural network to obtain a continuous flow chart value, and the convolutional neural network is used for performing high-order combination on the continuous features;

the discrete flow graph value calculation module is used for inputting the discrete flow graph into a factorization machine to obtain discrete flow graph values, and the factorization machine is used for carrying out low-order combination on the discrete features;

and the output module is used for inputting the continuous flow chart value and the discrete flow graph value into a normalized exponential function to obtain the type of the flow to be identified.

Optionally, the continuous flow chart generating module includes:

the standardization processing submodule is used for inputting each continuous characteristic of the flow to be identified into a standard deviation standardization function to obtain a standardized continuous characteristic;

the calculation submodule is used for calculating the correlation between every two normalized continuous features;

the correlation determination submodule is used for determining a correlation matrix corresponding to the continuous features according to the correlation between every two normalized continuous features;

the first mapping submodule is used for mapping the value of each element in the correlation matrix to the value range of a pixel point in a gray scale map to obtain a pixel value corresponding to each element in the correlation matrix;

and the continuous flow chart generation submodule is used for generating the continuous flow chart by utilizing the pixel value corresponding to each element in the correlation matrix.

Optionally, the calculation sub-module is specifically configured to:

Optionally, the discrete flow graph generation module includes:

the first-class feature processing sub-module is used for carrying out vectorization processing on first-class features in the discrete features by adopting entity embedding to obtain a first processed feature value, wherein the first-class features are discrete features of which the value number is greater than a preset value number threshold;

the second-class feature processing sub-module is used for encoding a second-class feature in the discrete features by using one-hot encoding to obtain a second processed feature value, wherein the second-class feature is the discrete feature of which the value number is less than or equal to the preset value number threshold;

the second mapping submodule is used for mapping the first processed characteristic value and the second processed characteristic value to a value range of a pixel point in a gray scale map respectively to obtain a pixel value corresponding to each discrete characteristic;

and the discrete flow graph generation submodule is used for generating the discrete flow graph by utilizing the pixel value corresponding to each discrete feature.

Optionally, the first-class feature processing sub-module includes:

the calculation unit is used for calculating the vector dimension of each first-class feature according to the value quantity of each first-class feature;

the acquisition unit is used for acquiring the vector element value, on the vector dimension, of each value of each first-class feature;

and the processing unit is used for normalizing the vector element value corresponding to each first-class feature to obtain a normalized vector element value, wherein the normalized vector element value lies within the value range of the pixel points in the grayscale image, and the normalized vector element value is the first processed feature value.

Optionally, the computing unit is specifically configured to:

wherein ⌈ ⌉ denotes rounding up;

the processing unit is specifically configured to:

Optionally, the continuous flow chart value calculating module includes:

an input submodule for inputting the continuous flow graph into the convolutional neural network if the convolutional neural network includes a convolutional layer, a pooling layer, and a fully-connected layer;

and the acquisition submodule is used for acquiring the continuous flow chart value output by the full connection layer.

Optionally, the input sub-module is specifically configured to:

calculating the continuous flow chart value by using the following formula:

y_out＝σ(y_in*ω+b)；

Optionally, the discrete flow graph value calculation module is specifically configured to:

wherein y_discrete is the discrete flow graph value, ω is a parameter of a discrete feature, d is the total number of discrete features of the network traffic to be processed, and V_i and V_j are the parameters of the product of the discrete feature x_i and the discrete feature x_j.

In order to achieve the above object, an embodiment of the present invention further provides an electronic device, where the electronic device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing any of the above method steps when executing a program stored in the memory.

To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the above method steps.

To achieve the above object, an embodiment of the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to perform any of the above method steps.

The technical scheme provided by the embodiment of the invention has the beneficial effects that:

The method acquires the continuous features and discrete features of the traffic to be identified. The continuous features are processed by multivariate correlation analysis to obtain a continuous flow chart, and the discrete features are processed by one-hot encoding and entity embedding to obtain a discrete flow graph. The continuous flow chart is input into a convolutional neural network to obtain a continuous flow chart value, and the discrete flow graph is input into a factorization machine to obtain a discrete flow graph value. The continuous flow chart value and the discrete flow graph value are then input into a normalized exponential function, which outputs the type of the traffic to be identified. In the technical scheme provided by the embodiment of the invention, the continuous features and the discrete features of the network traffic are each processed in a corresponding manner, and the type of the traffic to be identified is identified based on the processed continuous and discrete features.

The embodiment of the invention converts the continuous features and discrete features of network traffic into pixels and learns combinations of those pixels, turning the network traffic identification problem into an image classification problem and improving identification accuracy. In addition, the continuous features and the discrete features of network traffic do not change as the network traffic is processed. The technical scheme provided by the embodiment of the invention, which identifies the type of the traffic to be identified based on these features, is therefore suitable not only for original network traffic but also for non-byte network traffic obtained by feature extraction and combination on the basis of the original network traffic. This enlarges the application range of the network traffic identification method and further improves the accuracy of network traffic identification.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a network traffic identification method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method for generating a continuous flow chart according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a method for generating a discrete flow graph according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of a method for processing discrete features using entity embedding according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a default recognition model according to an embodiment of the present invention;

FIG. 6 is another schematic flow chart of a network traffic identification method according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a network traffic identification apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to increase the application range of the network traffic identification method and improve the accuracy of network traffic identification while implementing network traffic identification, embodiments of the present invention provide a network traffic identification method, an apparatus, an electronic device, and a storage medium, and a detailed description will be given below of the network traffic identification method, the apparatus, the electronic device, and the storage medium provided in embodiments of the present invention with reference to fig. 1 to 8.

Referring to fig. 1, fig. 1 is a schematic flow chart of a network traffic identification method according to an embodiment of the present invention, where the method includes the following steps.

Step 101, mapping the continuous features into pixel values by adopting multivariate correlation analysis based on the obtained continuous features of the flow to be identified to obtain a continuous flow chart.

In this embodiment of the present invention, the traffic to be identified may include one or more data packets to be identified, where the identification of the traffic to be identified is a process of identifying the type of the one or more data packets to be identified, and associating the one or more data packets to be identified with a service type that generates the one or more data packets to be identified, for example, the service type of the data packet sent by the communication software may be classified as a communication service type.

A continuous feature is a feature whose values are continuous and ordered, such as the size of the network traffic or the time at which the network traffic is generated. Each flow to be identified may include a plurality of continuous features.

In the embodiment of the invention, multivariate correlation analysis mines the correlation between continuous features by means of the Euclidean distance. For example, the absolute distance between two continuous features is calculated by the Euclidean distance formula, and this distance reflects the correlation between the two features.

In one embodiment, as shown in fig. 2, step 101 may be subdivided into the following steps to more accurately determine the correlation between the continuous features and further more accurately identify and classify the network traffic corresponding to the continuous features.

Step 1011, inputting each continuous characteristic of the flow to be identified into a standard deviation standardization function to obtain the standardized continuous characteristic.

In the embodiment of the present invention, the continuous features of network traffic belong to different categories, such as the size of the network traffic and the number of packets corresponding to the network traffic. In order to process the continuous features in the same feature space, all continuous features are standardized uniformly and converted into unitless values that conform to a normal distribution under a uniform standard, which makes it convenient to process the continuous features of the network traffic.

In one embodiment, the standard deviation normalization function may be:

wherein x_{normalization} is the normalized continuous feature, x is the continuous feature, μ is the mean of all values of the continuous feature, and σ is the standard deviation of all values of the continuous feature.
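Assuming the standard deviation standardization function is the usual z-score, (x − μ) / σ, which is an assumption since the formula itself is not reproduced in this text, step 1011 can be sketched as follows. The use of the population standard deviation and the sample values are also assumptions.

```python
import math

def standardize(values):
    """Z-score standardization: (x - mean) / standard deviation."""
    n = len(values)
    mu = sum(values) / n
    # Population standard deviation (divides by n); an assumption.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    if sigma == 0:  # constant feature: every value maps to 0
        return [0.0] * n
    return [(v - mu) / sigma for v in values]

# Hypothetical values of one continuous feature across several flows.
packet_sizes = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = standardize(packet_sizes)
```

Here the mean is 5 and the standard deviation is 2, so the value 9.0 maps to 2.0 and the value 2.0 maps to -1.5.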

Step 1012, calculating the correlation between every two normalized continuous features.

In the embodiment of the present invention, the correlation between every two continuous features may be represented by the distance between them, and this distance may be calculated by the Euclidean distance formula.

In one embodiment, the correlation between every two normalized continuous features may be calculated using the following formula:

wherein r_{i,j} is the correlation between the normalized value of continuous feature x_i and the normalized value of continuous feature x_j, x_{normalization_i} is the normalized value of continuous feature x_i, x_{normalization_j} is the normalized value of continuous feature x_j, and p is the total number of continuous features of the flow to be identified.

The above formula can also be understood as treating the distance between two normalized features as the length of the hypotenuse of a right triangle whose hypotenuse endpoints are the two normalized features, so that the distance between the two normalized features is calculated by determining the right triangle they form.

Taking feature x_i and feature x_j of an input sample as an example, to calculate the correlation between feature x_i and feature x_j, data conversion is first performed on feature x_i and feature x_j to obtain the right triangle formed by them. Feature x_i and feature x_j are projected into the (i, j)-th two-dimensional Euclidean subspace, and their coordinate point within this subspace is determined by the following formulas:

point_{i,j} = [ε_i ε_j]^T X = [x_i x_j]^T (1 ≤ i, j ≤ p, i ≠ j);

ε_i = [e_{i,1}, e_{i,2}, …, e_{i,p}]^T;

ε_j = [e_{j,1}, e_{j,2}, …, e_{j,p}]^T;

wherein point_{i,j} is the coordinate point of feature x_i and feature x_j within the above two-dimensional Euclidean subspace, ε_i is the vector of feature x_i within the two-dimensional Euclidean subspace, ε_j is the vector of feature x_j within the two-dimensional Euclidean subspace, p is the total number of continuous features of the sample, and X is the input set of samples.

In one example, assume that in the vector ε_i all elements are 0 except e_{i,i}, which is 1, and that in the vector ε_j all elements are 0 except e_{j,j}, which is 1. The coordinate point of feature i and feature j in the two-dimensional Euclidean subspace can then be expressed as (x_i, x_j). The projections of (x_i, x_j) onto the i axis and the j axis, together with the coordinate origin, form a triangle Δf_iOf_j, so the triangle formed by feature i and feature j is determined to be Δf_iOf_j, where O is the coordinate origin, f_i is the vertex corresponding to the coordinate x_i of feature i, and f_j is the vertex corresponding to the coordinate x_j of feature j.
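The triangle construction described above can be sketched as follows. The distance formula itself is not reproduced in this text, so treating the hypotenuse length between the two projected vertices as the correlation value is an assumption based on the geometric description; the function names are hypothetical.

```python
import math

def triangle_vertices(x_i, x_j):
    """Vertices O, f_i, f_j of the right triangle in the (i, j) subspace."""
    origin = (0.0, 0.0)
    f_i = (x_i, 0.0)  # projection of (x_i, x_j) onto the i axis
    f_j = (0.0, x_j)  # projection of (x_i, x_j) onto the j axis
    return origin, f_i, f_j

def hypotenuse_length(x_i, x_j):
    """Euclidean distance between the vertices f_i and f_j."""
    _, f_i, f_j = triangle_vertices(x_i, x_j)
    return math.hypot(f_i[0] - f_j[0], f_i[1] - f_j[1])

r_ij = hypotenuse_length(3.0, 4.0)
```

With coordinates 3 and 4 the legs of the triangle have lengths 3 and 4, so the hypotenuse has length 5.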

In one embodiment, the correlation between every two normalized continuous features can be calculated by using the following formula:

r_{mn} = x_{normalization_m} * x_{normalization_n} (1 ≤ m, n ≤ p, m ≠ n);

wherein r_{mn} is the correlation between the normalized value of continuous feature x_m and the normalized value of continuous feature x_n, x_{normalization_m} is the normalized value of continuous feature x_m, x_{normalization_n} is the normalized value of continuous feature x_n, and p is the total number of normalized continuous features of the flow to be identified.

In the embodiment of the present invention, the correlation between every two normalized continuous features may also be calculated in other manners, which are not specifically limited, as long as the calculated correlation is guaranteed to lie between 0 and 255.
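The product form r_mn = x_m * x_n can be sketched for all feature pairs at once. Leaving the diagonal at 0.0 is an assumption, since the formula is only defined for m ≠ n; the function name and the input values are hypothetical.

```python
def product_correlation_matrix(x_norm):
    """r[m][n] = x_norm[m] * x_norm[n] for m != n; diagonal set to 0.0 (assumption)."""
    p = len(x_norm)
    return [
        [x_norm[m] * x_norm[n] if m != n else 0.0 for n in range(p)]
        for m in range(p)
    ]

# Normalized values of three continuous features of one sample.
R = product_correlation_matrix([-1.5, 0.5, 2.0])
```

The raw products can be negative, so a further rescaling is still needed before the values can serve as pixel intensities.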

Step 1013, determining a correlation matrix corresponding to the continuous features according to the correlation between every two normalized continuous features.

In the embodiment of the invention, if the sample has a plurality of continuous features, a plurality of normalized continuous features are correspondingly obtained, and the correlation between every two normalized continuous features in the plurality of normalized continuous features is calculated, so that the correlation matrix can be obtained.

Taking p continuous features of the sample as an example, p normalized continuous features are correspondingly obtained, and according to the correlation between every two normalized continuous features in the p normalized continuous features, a correlation matrix R can be obtained as follows:

The correlation matrix makes it possible to observe the correlation between the continuous features of the sample visually, so that the continuous features of the sample can be converted into an image and the network traffic can be identified and classified by classifying its features.

Step 1014, mapping the value of each element in the correlation matrix to the value range of the pixel points in a grayscale image to obtain the pixel value corresponding to each element in the correlation matrix.

Step 1015, generating a continuous flow chart by using the pixel value corresponding to each element in the correlation matrix.

Because the correlation between the continuous features of different network flows is different, the correlation matrixes corresponding to different network flows are different, and therefore the identification and classification of the network flows can be realized according to the categories of the continuous flow graphs by processing and classifying the continuous flow graphs obtained by the correlation matrixes. Since the value of each element in the correlation matrix is mapped to the value range of the pixel point in the gray scale image, the continuous flow chart is also a gray scale image.
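Steps 1012 to 1015 can be illustrated with a minimal sketch. It assumes the continuous features have already been normalized, uses the pairwise product as the correlation, and maps the matrix to the range 0-255 by min-max scaling. The function name `continuous_flow_map`, the choice of min-max scaling, and the zero diagonal (the text only defines the correlation for m≠n) are illustrative assumptions rather than details stated in the embodiment.

```python
def continuous_flow_map(x_norm):
    """Build a grayscale pixel matrix from normalized continuous features."""
    p = len(x_norm)
    # r[m][n] = x_m * x_n for m != n; the diagonal is set to 0.0 (assumption,
    # since the correlation is only defined for m != n).
    r = [[x_norm[m] * x_norm[n] if m != n else 0.0 for n in range(p)]
         for m in range(p)]
    lo = min(min(row) for row in r)
    hi = max(max(row) for row in r)
    if hi == lo:
        # All correlations are equal: return a uniformly black image.
        return [[0] * p for _ in range(p)]
    scale = 255.0 / (hi - lo)
    # Min-max scaling into the grayscale range 0..255 (assumed mapping).
    return [[int(round((v - lo) * scale)) for v in row] for row in r]


image = continuous_flow_map([0.5, -1.0, 2.0])
for row in image:
    print(row)
```

Because the product is symmetric, the resulting image is symmetric as well, and its size grows as p × p with the number of continuous features.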

In the embodiment of the invention, the multivariate correlation analysis is adopted to process the continuous features, so that the correlation between every two continuous features can be obtained, meanwhile, the feature space is not enlarged too much, the complexity of network flow identification is reduced, and the identification efficiency of the network flow is improved.

Step 102, mapping the discrete features into pixel values by adopting one-hot encoding and entity embedding based on the acquired discrete features of the flow to be identified, to obtain a discrete flow graph.

In the embodiment of the invention, a discrete feature is a feature whose values are discrete and unordered, such as the IP address of the network traffic or the communication protocol of the network traffic. Because the values of the discrete features are discrete and unordered, they can be directly expanded to Euclidean space by adopting one-hot encoding and entity embedding, so that the distance between the discrete features can be conveniently calculated and the correlation between the discrete features can be obtained from that distance.

One-hot encoding, also called one-bit-effective encoding, encodes N states (that is, N discrete feature values) using an N-bit state register. Each value has its own register bit, and only one bit is active at any time. For example, assuming that the communication protocol version of the network traffic has three values, the three versions can be represented as 001, 010, and 100 after one-hot encoding. Encoding the values of a discrete feature with one-hot encoding makes the distance between every two values equal, so that the calculated distance between discrete features, and therefore the calculated correlation between them, is reasonable.
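The equal-distance property described above can be checked with a short sketch. The helper names `one_hot` and `hamming`, the placeholder protocol labels, and the bit ordering are illustrative assumptions; only the idea of one active bit per value comes from the text.

```python
def one_hot(values):
    """Map each distinct value to an N-bit code with exactly one active bit."""
    ordered = sorted(set(values))
    n = len(ordered)
    return {v: [1 if i == k else 0 for i in range(n)]
            for k, v in enumerate(ordered)}


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Three hypothetical protocol versions.
codes = one_hot(["v1", "v2", "v3"])
distances = {(a, b): hamming(codes[a], codes[b])
             for a in codes for b in codes if a < b}
print(codes)
print(distances)
```

Every pair of distinct values differs in exactly two bit positions, so no value is artificially closer to another.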

The entity embedding endows different vector values to different categories, namely endows different values of the discrete features with different vector element values, and the distance between the discrete features with various values is calculated through the entity embedding, so that the calculation efficiency can be improved.

In one embodiment, to obtain the correlation between discrete features more accurately, the discrete features with different values are processed differently, as shown in fig. 3, and step 102 can be subdivided into the following steps.

Step 1021, performing vectorization processing on the first-class features in the discrete features by adopting entity embedding to obtain a first processed feature value, wherein the first-class features are discrete features whose number of values is greater than a preset value number threshold.

There may be one or more first-class features. A first-class feature is a feature with a large number of values. Vectorization processing of the first-class features converts them into vector values. Entity embedding is adopted to process the discrete features with a large number of values, for example the IP addresses and ports corresponding to the network traffic, which reduces the burden of training the model. In addition, entity embedding reduces the feature dimension of the discrete features, which reduces the difficulty of calculating the distance between them. The preset value number threshold may be set according to the actual situation, which is not limited in the embodiment of the present invention.

In one embodiment, as shown in fig. 4, step 1021 can be further subdivided into the following steps.

Step 10211, calculating the vector dimension of each first-class feature according to the value number of each first-class feature.

In the embodiment of the invention, the vector dimension of the first type of characteristics is related to the value number of the characteristics. And determining the vector dimension of each discrete feature with a large number of values, and assigning the values of the discrete features on the vector dimension, wherein a plurality of values of each discrete feature are positioned on the same vector dimension, so that the feature dimension of the discrete features is reduced, and the overlarge feature space of the discrete features is prevented.

In one embodiment, for each first class feature, the vector dimension of the first class feature may be calculated using the following formula:

where ⌈·⌉ denotes rounding up.

In one embodiment, for each first class feature, the vector dimension of the first class feature may be further calculated by using the following formula:

where dimensions is the vector dimension of the first-class feature, passive values is the number of values of the first-class feature, λ is a preset parameter, and ⌈·⌉ denotes rounding up.

In the embodiment of the present invention, the way of calculating the vector dimension of the first-class feature is not particularly limited.

Step 10212, obtain the vector element value of each first type feature in the vector dimension.

In entity embedding, the vector element value of each first-class feature value can be specified manually, and when a certain discrete feature has n values, the n values are different, and the value range of the n values is 1-n-1. Each first-type feature can have a plurality of values, and vector element values of all values of each first-type feature are in the same vector dimension.

Step 10213, standardizing the vector element value corresponding to each first-class feature to obtain a standardized vector element value, wherein the standardized vector element value is between the value ranges of the pixel points in the gray level image, and the standardized vector element value is the first-processed feature value.

Because the first type of features have a large number of values, when each value is given with a vector element value, the vector element value may exceed the value range (0-255) of a pixel point in the gray-scale image, and therefore, the vector element values corresponding to all the values of the features need to be uniformly compressed, so that the correlation between the vector element values corresponding to all the values is not changed, and the vector element values corresponding to all the values are within the value range of the pixel point in the gray-scale image, that is, between 0 and 255.

In an embodiment, for each first-class feature, the following formula may be used to normalize a vector element value corresponding to the first-class feature to obtain a normalized vector element value corresponding to the first-class feature:

where v_{normalization_i_k} is the normalized vector element value corresponding to the k-th value of the first-class feature x_i, v_{i_k} is the vector element value corresponding to the k-th value of the first-class feature x_i, v_{i_min} is the minimum vector element value corresponding to the first-class feature x_i, and v_{i_max} is the maximum vector element value corresponding to the first-class feature x_i.

In an embodiment, for each first-class feature, the following formula may be further used to normalize a vector element value corresponding to the first-class feature to obtain a normalized vector element value corresponding to the first-class feature:

where v_{normalization_i_k} is the normalized vector element value corresponding to the k-th value of the first-class feature x_i, v_{i_k} is the vector element value corresponding to the k-th value of the first-class feature x_i, v_{i_min} is the minimum vector element value corresponding to the first-class feature x_i, v_{i_max} is the maximum vector element value corresponding to the first-class feature x_i, and λ is a preset parameter.

In the embodiment of the present invention, a manner of normalizing the vector element value corresponding to the first-type feature is not particularly limited.
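One way to compress vector element values into the grayscale range without changing their relative order is min-max scaling, sketched below. This is only an assumed form that is consistent with the symbols v_{i_k}, v_{i_min}, and v_{i_max}; the function name `normalize_elements` and the parameter `top` are hypothetical, and the embodiment's own formula may differ.

```python
def normalize_elements(values, top=255.0):
    """Min-max scale vector element values into the range [0, top]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A single distinct value carries no ordering information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) * top for v in values]


# Hypothetical vector element values that exceed the grayscale range.
scaled = normalize_elements([3, 500, 1000])
print(scaled)
```

The smallest element maps to 0 and the largest to 255, and the ordering of the intermediate elements is preserved.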

Step 1022, encoding the second-class features in the discrete features by adopting one-hot encoding to obtain a second processed feature value, wherein the second-class features are discrete features whose number of values is less than or equal to the preset value number threshold.

There may be one or more second-class features. A second-class feature is a feature with a small number of values. One-hot encoding is adopted to process the discrete features with a small number of values, for example the communication protocol version of the network traffic. The values of such a discrete feature can be encoded directly with one-hot encoding to obtain encoded values, and the discrete flow graph is generated from the encoded values.
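The division of discrete features into first-class and second-class features in steps 1021 and 1022 can be sketched as follows. The function name `split_discrete_features`, the sample records, and the threshold value 3 are hypothetical and chosen only for illustration.

```python
def split_discrete_features(samples, threshold):
    """Split feature names by how many distinct values each feature takes.

    samples: list of dicts mapping feature name -> discrete value.
    Returns (first_class, second_class): features with more than `threshold`
    distinct values, and features with at most `threshold` distinct values.
    """
    distinct = {}
    for sample in samples:
        for name, value in sample.items():
            distinct.setdefault(name, set()).add(value)
    first_class = sorted(n for n, vals in distinct.items()
                         if len(vals) > threshold)
    second_class = sorted(n for n, vals in distinct.items()
                          if len(vals) <= threshold)
    return first_class, second_class


samples = [
    {"src_ip": "10.0.0.1", "protocol": "TCP"},
    {"src_ip": "10.0.0.2", "protocol": "UDP"},
    {"src_ip": "10.0.0.3", "protocol": "TCP"},
    {"src_ip": "10.0.0.4", "protocol": "UDP"},
]
first_class, second_class = split_discrete_features(samples, threshold=3)
print(first_class, second_class)
```

In this sketch the address feature would be handled by entity embedding and the protocol feature by one-hot encoding.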

Step 1023, mapping the first processed feature value and the second processed feature value respectively to the value range of the pixel points in a grayscale image to obtain the pixel value corresponding to each discrete feature.

Step 1024, generating a discrete flow graph by using the pixel value corresponding to each discrete feature.

Because the values of the discrete features of different network flows are different, the values of the discrete features of different network flows after being coded are different, so that the discrete flow graphs corresponding to different network flows are different, and the network flows can be identified and classified according to the types of the discrete flow graphs.

Step 103, inputting the continuous flow chart into a convolutional neural network to obtain a continuous flow chart value, wherein the convolutional neural network is used for performing high-order combination on the continuous features.

In one embodiment, a convolutional neural network may include a convolutional layer, a pooling layer, and a fully-connected layer.

Inputting the continuous flow chart into the convolutional neural network to obtain the continuous flow chart value may proceed as follows: the continuous flow chart is input into the convolutional neural network; the convolutional layer extracts the continuous features in the continuous flow chart through a convolution kernel and performs a convolution operation on them to obtain a convolution result; the pooling layer performs feature screening on the convolution result to obtain a pooling result; and the fully connected layer rectifies the pooling result and outputs the continuous flow chart value. The continuous flow chart value output by the fully connected layer is then acquired.

The pooling layer follows the convolutional layer and is used for extracting and screening the convolution result output by the convolutional layer. The convolutional neural network may include one convolutional layer and one pooling layer, or a plurality of convolutional layers and a plurality of pooling layers.

In the embodiment of the invention, the convolutional layer extracts the continuous features in the continuous flow chart through the convolution kernel. When the continuous flow chart is too large, or contains too many continuous features, the extraction may proceed by dividing the continuous flow chart into n regions; the convolution kernel extracts and convolves each of the n regions, and the processed n regions are sent to the pooling layer for further processing.

The pooling layer may extract and screen the convolution result output by the convolutional layer as follows: after the convolution result is obtained, a representative value in the convolution result replaces the value region that contains it, thereby compressing the convolution result. The representative value may be the maximum value in that region, or a value with special meaning in that region.
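Taking the maximum value of each region as the representative value corresponds to max pooling, which can be sketched as follows. The function name `max_pool`, the 2×2 window, and the sample matrix are illustrative assumptions.

```python
def max_pool(matrix, size=2):
    """Replace each non-overlapping size x size region by its maximum value."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled.append([
            max(matrix[r + dr][c + dc]
                for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ])
    return pooled


# Hypothetical 4x4 convolution result.
conv_result = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
pooled = max_pool(conv_result)
print(pooled)  # [[4, 5], [6, 7]]
```

The 4×4 convolution result is compressed to 2×2 while the strongest response in each region is kept.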

In one embodiment, the continuous flow map values may be calculated using the following equations.

y_out＝σ(y_in*ω+b)；

where σ is the ReLU (Rectified Linear Unit) function, y_in is the pooling result, y_out is the continuous flow chart value, ω is the weight of the pooling result, and b is the bias term of the fully connected layer.
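The computation y_out＝σ(y_in*ω+b) can be sketched for a small fully connected layer as follows. The layer sizes, the weights, and the names `relu` and `fully_connected` are illustrative assumptions; in practice ω and b would be learned during training.

```python
def relu(z):
    """Rectified Linear Unit: negative inputs become 0."""
    return max(0.0, z)


def fully_connected(y_in, weights, bias):
    """Compute y_out[k] = relu(sum_i y_in[i] * weights[i][k] + bias[k])."""
    return [
        relu(sum(y * w_row[k] for y, w_row in zip(y_in, weights)) + bias[k])
        for k in range(len(bias))
    ]


# Hypothetical flattened pooling result with two entries and two output units.
y_out = fully_connected(
    y_in=[1.0, 2.0],
    weights=[[0.5, -1.0], [0.25, -0.5]],
    bias=[0.0, 0.5],
)
print(y_out)  # [1.0, 0.0]
```

The second output unit receives a negative pre-activation (-1.5) and is therefore rectified to 0.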

In one embodiment, the continuous flow chart value may also be calculated using the following formula.

y_out＝σ(y_in*ω+b)+λ；

where σ is the ReLU (Rectified Linear Unit) function, y_in is the pooling result, that is, the processed continuous flow chart, y_out is the continuous flow chart value, ω is the weight of the pooling result, b is the bias term of the fully connected layer, and λ is a preset parameter.

Step 104, inputting the discrete flow graph into a factorization machine to obtain a discrete flow graph value, wherein the factorization machine is used for performing low-order combination on the discrete features.

In the embodiment of the present invention, the low-order combination may be to output the processed discrete flow graph after performing first-order (linear interaction) and second-order (pairwise feature interaction) feature combination processing on the discrete flow graph and the continuous flow chart.

In one embodiment, the discrete flow map values may be calculated using the following equations.

where y_discrete is the discrete flow graph value, ω is the parameter of a discrete feature, d is the total number of discrete features of the network traffic to be processed, and V_i and V_j are the parameters of the product of discrete feature x_i and discrete feature x_j.
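The symbols above match the usual second-order factorization machine, in which a first-order term weights each feature and a second-order term weights each feature pair by the inner product of two latent vectors. The sketch below assumes that standard form; the function name `fm_value`, the optional global bias `w0`, and all numeric values are illustrative assumptions, and the embodiment's own formula may differ in detail.

```python
def fm_value(x, w, v, w0=0.0):
    """Standard second-order factorization machine output.

    x: feature values, w: first-order weights, v: one latent vector per
    feature, w0: global bias (assumed; defaults to 0).
    """
    d = len(x)
    # First-order (linear) term.
    linear = sum(w[i] * x[i] for i in range(d))
    # Second-order (pairwise) term over all pairs i < j.
    pairwise = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


y_discrete = fm_value(
    x=[1.0, 0.0, 1.0],
    w=[0.5, 0.2, -0.5],
    v=[[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]],
)
print(y_discrete)  # 2.0
```

Only the pair of active features (the first and the third) contributes to the second-order term, which is why one-hot encoded inputs keep this computation sparse.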

In one embodiment, the discrete flow map values may also be calculated using the following equations.

where y_discrete is the discrete flow graph value, ω is the parameter of a discrete feature, d is the total number of discrete features of the network traffic to be processed, V_i and V_j are the parameters of the product of discrete feature x_i and discrete feature x_j, and λ is a preset parameter.

Step 105, inputting the continuous flow chart value and the discrete flow graph value into a normalized exponential function (softmax) to obtain the type of the flow to be identified.

In the embodiment of the invention, the continuous flow chart value and the discrete flow graph value are processed through the normalized exponential function, and the type of the flow to be identified is output. The normalized exponential function thereby realizes multi-class classification of the network traffic.
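The normalized exponential function can be sketched as follows. The sketch assumes that the two branches each produce one score per traffic type and that the scores are added before normalization; this way of combining the branches, the names `softmax` and `classify`, and the sample scores are illustrative assumptions that the text does not specify.

```python
import math


def softmax(scores):
    """Normalized exponential function, stabilized by subtracting the maximum."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(continuous_scores, discrete_scores):
    """Add the per-class scores of both branches (assumption), then normalize."""
    combined = [c + d for c, d in zip(continuous_scores, discrete_scores)]
    probs = softmax(combined)
    return probs.index(max(probs)), probs


# Hypothetical per-class scores for three traffic types.
label, probs = classify([1.0, 0.2, 0.1], [0.5, 0.3, 2.0])
print(label, probs)
```

The output probabilities sum to 1, and the index of the largest probability is taken as the predicted traffic type.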

In one embodiment, steps 103-105 can be understood as inputting the continuous flow chart and the discrete flow graph into a preset identification model, so that the preset identification model performs high-order combination on the continuous flow chart to obtain the continuous flow chart value, performs low-order combination on the discrete flow graph to obtain the discrete flow graph value, processes the continuous flow chart value and the discrete flow graph value through the normalized exponential function, and outputs the type of the flow to be identified.

The preset identification model may be a combined model obtained by combining a convolutional neural network model and a factorization machine model, that is, a combined model obtained by combining a Deep (convolutional neural network) model with an FM (Factorization Machine) model. The Deep model performs the high-order combination, and the FM model performs the low-order combination.

The preset identification model may adopt the structure shown in fig. 5. The circles in the dashed box of discrete features represent the discrete features extracted from a sample, and the circles in the dashed box of continuous features represent the continuous features extracted from the sample. The circles in the dashed box of one-hot encoding/entity embedding represent the discrete features processed by one-hot encoding or entity embedding, and the circles in the dashed box of multivariate correlation analysis represent the continuous features processed by multivariate correlation analysis. The circles in the dashed box of the FM layer represent the discrete features after low-order combination, and the circles in the dashed box of the Deep layer represent the continuous features after high-order and low-order combination. The circle in the dashed box of the output layer represents the output processing function, such as a normalized exponential function.

The network traffic identification method provided by the embodiment of the invention respectively carries out corresponding processing on the continuous characteristic and the discrete characteristic of the network traffic, and identifies the type of the traffic to be identified based on the processed continuous characteristic and the discrete characteristic. The network flow identification method provided by the embodiment of the invention realizes the conversion from continuous characteristics and discrete characteristics of the network flow to the pixel points and the combination learning of the pixel points, converts the identification problem of the network flow into the image classification problem, and improves the identification accuracy of the network flow. In addition, the continuous features and the discrete features of the network traffic do not change with the processing of the network traffic, so the technical scheme provided by the embodiment of the invention for identifying the type of the traffic to be identified based on the continuous features and the discrete features is not only suitable for the original network traffic, but also suitable for the non-byte network traffic after feature extraction and combination based on the original network traffic, and the application range of the network traffic identification method is enlarged while the network traffic identification is realized, and the accuracy of the network traffic identification is further improved.

In addition, the network traffic identification method provided by the embodiment of the invention divides the characteristics of the traffic to be identified into the continuous characteristics and the discrete characteristics, and respectively processes the continuous characteristics and the discrete characteristics by adopting different technologies, thereby reducing the identification difficulty of the network traffic and improving the identification efficiency of the network traffic.

As shown in fig. 6, fig. 6 is another schematic flow chart of a network traffic identification method according to an embodiment of the present invention, where the method includes:

Step 601, processing the flow to be identified to obtain the continuous features and the discrete features of the flow to be identified.

Step 602, processing the continuous features by adopting multivariate correlation analysis to obtain a continuous flow chart.

Step 603, processing the discrete features by adopting one-hot encoding and entity embedding to obtain a discrete flow graph.

Step 604, performing high-order combination on the continuous flow chart to obtain a continuous flow chart value.

Step 605, performing low-order combination on the discrete flow graph to obtain a discrete flow graph value.

Step 606, processing the continuous flow chart value and the discrete flow graph value through a normalized exponential function, and outputting the type of the flow to be identified.

The description of steps 601-606 is relatively simple, and reference may be made to steps 101-105.

To achieve the above object, as shown in fig. 7, an embodiment of the present invention further provides a network traffic identification apparatus, where the apparatus includes:

The continuous flow chart generating module 701 is configured to map the continuous features into pixel values by using multivariate correlation analysis based on the obtained continuous features of the flow to be identified, so as to obtain a continuous flow chart.

The discrete flow graph generating module 702 is configured to map the discrete features into pixel values by using one-hot encoding and entity embedding based on the obtained discrete features of the flow to be identified, so as to obtain a discrete flow graph.

The continuous flow chart value calculation module 703 is configured to input the continuous flow chart into a convolutional neural network to obtain a continuous flow chart value, where the convolutional neural network is configured to perform high-order combination on the continuous features.

The discrete flow graph value calculation module 704 is configured to input the discrete flow graph into a factorization machine to obtain a discrete flow graph value, where the factorization machine is configured to perform low-order combination on the discrete features.

The output module 705 is configured to input the continuous flow chart value and the discrete flow graph value into the normalized exponential function to obtain the type of the flow to be identified.

In one embodiment, the continuous flow chart generating module 701 may include:

The standardization processing submodule is used for inputting each continuous feature of the flow to be identified into a standard deviation standardization function to obtain the normalized continuous features.

The calculation submodule is used for calculating the correlation between every two normalized continuous features.

The correlation determination submodule is used for determining a correlation matrix corresponding to the continuous features according to the correlation between every two normalized continuous features.

The first mapping submodule is used for mapping the value of each element in the correlation matrix to the value range of the pixel points in a grayscale image to obtain the pixel value corresponding to each element in the correlation matrix.

The continuous flow chart generation submodule is used for generating a continuous flow chart by using the pixel value corresponding to each element in the correlation matrix.

In an embodiment, the computation submodule may be specifically configured to:

The correlation between every two normalized continuous features is calculated using the following formula:

r_mn＝x_{normalization_m}*x_{normalization_n}(1≤m，n≤p，m≠n)；

In one embodiment, the discrete flow graph generation module 702 may include:

The first-class feature processing submodule is used for performing vectorization processing on the first-class features in the discrete features by adopting entity embedding to obtain a first processed feature value, wherein the first-class features are discrete features whose number of values is greater than a preset value number threshold.

The second-class feature processing submodule is used for encoding the second-class features in the discrete features by adopting one-hot encoding to obtain a second processed feature value, wherein the second-class features are discrete features whose number of values is less than or equal to the preset value number threshold.

The second mapping submodule is used for mapping the first processed feature value and the second processed feature value respectively to the value range of the pixel points in a grayscale image to obtain the pixel value corresponding to each discrete feature.

The discrete flow graph generation submodule is used for generating a discrete flow graph by using the pixel value corresponding to each discrete feature.

In one embodiment, the first-class feature processing sub-module may include:

The calculating unit is used for calculating the vector dimension of each first-class feature according to the number of values of each first-class feature.

The acquisition unit is used for acquiring the vector element value of each first-class feature in the vector dimension.

In an embodiment, the computing unit may be specifically configured to:

For each first-class feature, the vector dimension of the first-class feature is calculated using the following formula.

where ⌈·⌉ denotes rounding up.

The processing unit may be specifically configured to:

For each first-class feature, the vector element value corresponding to the first-class feature is normalized using the following formula to obtain the normalized vector element value corresponding to the first-class feature.

In one embodiment, the continuous flow map value calculation module 703 may include:

The input submodule is used for inputting the continuous flow chart into the convolutional neural network in the case that the convolutional neural network includes a convolutional layer, a pooling layer, and a fully connected layer. The convolutional layer extracts the continuous features in the continuous flow chart through a convolution kernel and performs a convolution operation on them to obtain a convolution result. The pooling layer performs feature screening on the convolution result to obtain a pooling result. The fully connected layer rectifies the pooling result and outputs the continuous flow chart value.

In one embodiment, the input submodule may be specifically configured to:

The continuous flow chart value is calculated using the following formula.

y_out＝σ(y_in*ω+b)；

where σ is the ReLU (Rectified Linear Unit) function, y_in is the pooling result, y_out is the continuous flow chart value, ω is the weight of the pooling result, and b is the bias term of the fully connected layer.

In one embodiment, the discrete flow graph value calculation module 704 may be specifically configured to:

The discrete flow graph value is calculated using the following formula.

where y_discrete is the discrete flow graph value, ω is the parameter of a discrete feature, d is the total number of discrete features of the network traffic to be processed, and V_i and V_j are the parameters of the product of discrete feature x_i and discrete feature x_j.

In order to achieve the above object, an embodiment of the present invention further provides a terminal device, as shown in fig. 8, including a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 complete communication with each other through the communication bus 804.

A memory 803 for storing a computer program;

the processor 801 is configured to implement the network traffic identification method according to the embodiment of the present invention when executing the program stored in the memory 803.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.

In another embodiment provided by the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements a network traffic identification method provided by an embodiment of the present invention.

In another embodiment, the present invention further provides a computer program product containing instructions, which when executed on a computer, causes the computer to implement a network traffic identification method provided by the embodiment of the present invention.

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the apparatus, the electronic device, and the computer-readable storage medium, since they are substantially similar to the embodiments of the method, the description is simple, and for the relevant points, reference may be made to the partial description of the embodiments of the method.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for identifying network traffic, the method comprising:

2. The method according to claim 1, wherein the step of mapping the continuous features into pixel values by using multivariate correlation analysis based on the obtained continuous features of the flow to be identified to obtain a continuous flow chart comprises:

calculating the correlation between every two normalized continuous features;

3. The method of claim 2, wherein the step of calculating the correlation between every two normalized continuous features comprises:

wherein r_{i,j} is the correlation between the normalized value of continuous feature x_i and the normalized value of continuous feature x_j, x_{normalization_i} is the normalized value of the continuous feature x_i, x_{normalization_j} is the normalized value of the continuous feature x_j, and p is the total number of continuous features of the flow to be identified.
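For illustration only, the pairwise-correlation step of claims 2 and 3 can be sketched as follows. The correlation formula itself is not reproduced in the text above, so this sketch substitutes the Pearson coefficient computed over a batch of flows; that choice, the z-score normalization, the linear mapping onto 8-bit pixel values, and the function names `zscore`, `correlation_matrix`, and `to_pixels` are all assumptions rather than the claimed method.

```python
import math


def zscore(column):
    """Standardize one feature column to zero mean and unit variance.

    Z-score normalization is an assumption; the claim only says "normalized".
    """
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0.0:
        return [0.0] * n  # constant feature: no variation to correlate
    return [(v - mean) / std for v in column]


def correlation_matrix(samples):
    """Return the p x p matrix r[i][j] of pairwise correlations.

    samples: list of flows, each a list of p continuous feature values.
    Pearson correlation is an assumed stand-in for the claimed formula.
    """
    n = len(samples)
    p = len(samples[0])
    columns = [zscore([row[i] for row in samples]) for i in range(p)]
    return [
        [sum(a * b for a, b in zip(columns[i], columns[j])) / n for j in range(p)]
        for i in range(p)
    ]


def to_pixels(matrix):
    """Map correlations in [-1, 1] linearly onto pixel values in [0, 255]."""
    return [
        [round((min(1.0, max(-1.0, r)) + 1.0) / 2.0 * 255) for r in row]
        for row in matrix
    ]
```

Under these assumptions, p continuous features yield a p x p grid of pixel values, which is the kind of image-like input a convolutional neural network expects.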

4. The method according to claim 1, wherein the step of mapping the discrete features into pixel values by using one-hot coding and entity embedding based on the obtained discrete features of the traffic to be identified to obtain a discrete traffic map comprises:

5. The method according to claim 4, wherein the step of vectorizing the first class of features in the discrete features by using entity embedding to obtain a first processed feature value comprises:

6. The method according to claim 5, wherein the step of calculating the vector dimension of each first class feature according to the value number of each first class feature comprises:

wherein dimensions is the vector dimension of the first-class feature, and passive values is the number of values that the first-class feature can take;

7. The method of claim 1, wherein the convolutional neural network comprises a convolutional layer, a pooling layer, and a fully-connected layer;

inputting the continuous flow graph into the convolutional neural network;

8. The method of claim 7, wherein the fully connected layer is configured to rectify the pooled results, and the step of outputting continuous flow map values comprises:

calculating the continuous flow chart value by using the following formula:

y_out = σ(y_in * ω + b);
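As a minimal sketch of the formula above, the fully connected layer can be written as a weighted sum followed by an activation. The text shown here does not define σ, ω, or b, so the sketch assumes that σ is the logistic sigmoid, ω is a weight matrix, and b is a bias vector; the function names `sigmoid` and `fully_connected` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|.

    Using the sigmoid for σ is an assumption; the claim does not define σ.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fully_connected(y_in, weights, bias, activation=sigmoid):
    """Compute y_out = activation(y_in * weights + bias) for one input vector.

    y_in:    list of n pooled values
    weights: n x m list of lists (assumed shape of ω)
    bias:    list of m values (assumed shape of b)
    """
    y_out = []
    for k in range(len(bias)):
        z = sum(y_in[i] * weights[i][k] for i in range(len(y_in))) + bias[k]
        y_out.append(activation(z))
    return y_out
```

Passing a different `activation` callable lets the same sketch cover other choices of σ without changing the weighted-sum step.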

9. The method of claim 1, wherein the step of inputting the discrete flow map into a factorization machine to obtain discrete flow map values comprises:

10. A network traffic identification apparatus, the apparatus comprising: