CN109873774B - Network traffic identification method and device - Google Patents

Network traffic identification method and device

Info

Publication number
CN109873774B
Authority
CN
China
Prior art keywords
sample
model
data stream
cluster
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910036196.2A
Other languages
Chinese (zh)
Other versions
CN109873774A (en)
Inventor
廖青
赵晶玲
李天琦
刘月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201910036196.2A
Publication of CN109873774A
Application granted
Publication of CN109873774B

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a network traffic identification method and device. The method comprises the following steps: once reception of the current data stream is complete, extracting the packet header data of the data packets in the current data stream as a first sample; inputting the first sample into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample is located within the boundary distance of a cluster; when the first sample is located within the boundary distance of the cluster and is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model and taking the machine recognition model with the added output node as an online recognition model; and then identifying the category of the next data stream after the current data stream. Compared with the prior art, the method and device change the structure of the machine recognition model and use the restructured model to identify the category of the next data stream after the current data stream, which can improve the real-time performance of data stream category identification.

Description

Network traffic identification method and device
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a network traffic identification method and apparatus.
Background
Traffic is an important carrier for transmitting data in a network, and traffic identification is a key link in network monitoring: only after traffic has been identified can different monitoring strategies be applied to different traffic, for example rejection, optimization, marking, and priority classification. Identification of network traffic is therefore critical. Generally, network traffic is transmitted in the form of data streams. Each data stream includes a plurality of data packets, and each data packet includes header data of a fixed number of bytes. Features can be derived from the header data, including the time interval, the stream duration, and the mean and variance of the packet size, among others.
In the prior art, network traffic is identified with machine-learning-based methods. Such a method mines the features of network packet header data with machine learning techniques, trains a machine learning model on them, and then inputs a data stream into the trained model, which outputs the category of the online network traffic. The machine learning model is trained as follows: first, the features of the packet header data of the data packets in the whole data stream are computed; all or part of these features are selected as samples; and the samples are used to train the machine learning model. The resulting model is an offline model whose internal structure is fixed.
Because the network environment changes in real time, the features of data streams change as well. A machine learning model with a fixed internal structure cannot follow these changes, so the prior art identifies the category of online network traffic with poor real-time performance.
Disclosure of Invention
The embodiments of the present invention aim to provide a network traffic identification method and device that improve the real-time performance of identifying data stream categories. The specific technical solutions are as follows:
In a first aspect, an embodiment of the present invention provides a network traffic identification method applied to a server, the method including:
once reception of the current data stream is complete, extracting the packet header data of the data packets in the current data stream as a first sample;
inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result indicating whether the first sample is located within the boundary distance of a cluster; the semi-supervised model is obtained by training with a first training sample set and includes the distribution relation between the category of the obtained packet header data and the other samples in the first training sample set; the first training sample set includes samples with at least one class label; whether the samples with class labels are located within the boundary distance of the clusters is determined according to the distribution relation;
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as an online recognition model;
identifying the category of the next data stream after the current data stream using the online recognition model.
Optionally, before the step of extracting the packet header data of the data packets in the current data stream as the first sample once reception of the current data stream is complete, the method further includes:
sequentially receiving data packets of a current data stream and acquiring quintuple information of the data packets;
judging whether the database stores quintuple information or not, if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
and if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet into the storage area of the path corresponding to the quintuple information.
Optionally, under the condition that receiving the current data stream is completed, extracting header data of a data packet in the current data stream as a first sample, including:
determining whether each data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, reception of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
Optionally, under the condition that receiving the current data stream is completed, extracting header data of a data packet in the current data stream as a first sample, including:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream;
encoding packet header data of a data packet in a current data stream to obtain a fixed-dimension vector, and taking the fixed-dimension vector as a first sample.
Optionally, inputting the first sample into a semi-supervised model, and outputting a result of whether the first sample is located within the boundary distance of the cluster by using the semi-supervised model, including:
inputting the first sample into a semi-supervised model, and outputting the category of the first sample by using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used for training the semi-supervised model;
adding each third sample to a cluster and determining it as the cluster center point of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determining that the first sample is not located within the boundary distance of the cluster;
if the distance between the first sample and the third sample does not exceed the boundary distance of the cluster, determining that the first sample is located within the boundary distance of the cluster.
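As a non-limiting illustration, the center selection and boundary check described above can be sketched as follows. The sketch assumes the usual density-peaks definitions (local density as the number of neighbours inside a cutoff radius, minimum distance as the distance to the nearest sample of higher density), Euclidean distance, and a comparison against the nearest cluster center; the function names, the cutoff parameter, and these choices are assumptions for illustration and are not taken from the embodiment.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def local_density(points: List[Point], cutoff: float) -> List[int]:
    """Count, for each sample, the other samples closer than the cutoff radius."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) < cutoff)
        for i, p in enumerate(points)
    ]


def min_distance_to_denser(points: List[Point], density: List[int]) -> List[float]:
    """Distance from each sample to the nearest sample of strictly higher density."""
    deltas = []
    for i, p in enumerate(points):
        denser = [math.dist(p, q) for j, q in enumerate(points) if density[j] > density[i]]
        if denser:
            deltas.append(min(denser))
        else:
            # The densest sample has no denser neighbour; use its largest distance.
            deltas.append(max(math.dist(p, q) for q in points))
    return deltas


def cluster_centers(points: List[Point], cutoff: float,
                    density_threshold: float, distance_threshold: float) -> List[Point]:
    """Select the 'third samples': high local density and large minimum distance."""
    rho = local_density(points, cutoff)
    delta = min_distance_to_denser(points, rho)
    return [p for p, r, d in zip(points, rho, delta)
            if r > density_threshold and d > distance_threshold]


def within_boundary(sample: Point, centers: List[Point], boundary_distance: float) -> bool:
    """True if the sample lies within the boundary distance of some cluster center."""
    return any(math.dist(sample, c) <= boundary_distance for c in centers)
```

A first sample for which `within_boundary` returns a false value would be treated as lying outside every cluster.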
Optionally, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, including:
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as the first sample exists in the second training sample set, judging that the first sample is not a sample of a new category, and updating the parameters of a preset machine recognition model; the second training sample set is a set formed by data streams used for training the machine recognition model;
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample which is the same as the first sample in category does not exist in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output nodes is used as an online recognition model.
Optionally, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, including:
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, increasing the parameter dimension of the preset machine recognition model by one, and taking the machine recognition model with the increased parameter dimension as the online recognition model.
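As a non-limiting illustration, the new-category check and the structural change can be sketched as follows. The sketch assumes that the output layer is stored as one weight row and one bias per output node, so that adding a node increases the parameter dimension by one; the class `OutputLayer`, the function `adapt`, and the small random initialisation are hypothetical and are not specified by the embodiment.

```python
import random
from typing import List, Optional


class OutputLayer:
    """Hypothetical output layer: one weight row and one bias per known category."""

    def __init__(self, weights: List[List[float]], biases: List[float],
                 categories: List[str]) -> None:
        self.weights = weights
        self.biases = biases
        self.categories = categories

    def add_output_node(self, category: str, rng: Optional[random.Random] = None) -> None:
        """Append one output node, growing the weights and biases by one dimension."""
        rng = rng or random.Random(0)
        n_inputs = len(self.weights[0])
        # Small random initial weights are an assumption of this sketch.
        self.weights.append([rng.uniform(-0.01, 0.01) for _ in range(n_inputs)])
        self.biases.append(0.0)
        self.categories.append(category)


def adapt(layer: OutputLayer, category: str, inside_boundary: bool) -> bool:
    """Return True if the model structure was changed for a new category."""
    if not inside_boundary:
        return False
    if category in layer.categories:
        # Known category: only the parameters would be updated, not the structure.
        return False
    layer.add_output_node(category)
    return True
```

In this sketch a category counts as new when it is absent from the categories already represented by the output nodes, which stands in for the lookup in the second training sample set.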
Optionally, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, including:
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model, and taking the machine recognition model with the added output node as a basic recognition model;
inputting the first sample into the basic recognition model, and calculating the partial derivatives of the loss function of the basic recognition model with respect to the weights and biases of the output layer of the basic recognition model;
updating the weights and biases of the output layer of the basic recognition model in the direction of gradient descent using a parameter update formula; the parameter update formula includes the products of a proficiency increment with the partial derivatives of the loss function of the basic recognition model with respect to the weights and the biases of the output layer, respectively;
determining the basic recognition model with the updated weights and biases as the online recognition model.
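As a non-limiting illustration, one update of the output layer can be sketched as follows. The sketch assumes a softmax output with a cross-entropy loss and treats the proficiency increment as a plain scalar that multiplies each partial derivative; the loss function, the scalar treatment, and all function names are assumptions for illustration, since the update formula itself is not given at this point in the description.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(W: List[List[float]], b: List[float], x: List[float]) -> List[float]:
    """Class probabilities of the output layer for input vector x."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)])


def loss(W: List[List[float]], b: List[float], x: List[float], target: int) -> float:
    """Cross-entropy loss for the true class index `target` (assumed loss)."""
    return -math.log(forward(W, b, x)[target])


def update_output_layer(W: List[List[float]], b: List[float], x: List[float],
                        target: int, increment: float) -> None:
    """One gradient-descent step on the output-layer weights and biases, in place."""
    p = forward(W, b, x)
    for k in range(len(W)):
        # For softmax with cross-entropy, dL/dz_k = p_k - y_k.
        grad_z = p[k] - (1.0 if k == target else 0.0)
        for j in range(len(x)):
            W[k][j] -= increment * grad_z * x[j]  # dL/dW_kj = grad_z * x_j
        b[k] -= increment * grad_z                # dL/db_k = grad_z
```

Because the step moves against the gradient, a sufficiently small increment reduces the loss on the first sample.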
Optionally, identifying the category of the next data stream after the current data stream by using an online identification model includes:
when the number of data packets in the next data stream after the current data stream reaches a preset number, extracting the packet header data of the data packets in the next data stream as a second sample;
inputting the second sample into the online recognition model, and using the online recognition model to output the category of the second sample.
In a second aspect, an embodiment of the present invention provides a network traffic identification apparatus, which is applied to a server, and includes:
a sample module, configured to extract the packet header data of the data packets in the current data stream as a first sample once reception of the current data stream is complete;
a monitoring module, configured to input the first sample into a semi-supervised model and use the semi-supervised model to output the category of the first sample and a result indicating whether the first sample is located within the boundary distance of a cluster; the semi-supervised model is obtained by training with a first training sample set and includes the distribution relation between the category of the obtained packet header data and the other samples in the first training sample set; the first training sample set includes samples with at least one class label; whether the samples with class labels are located within the boundary distance of the clusters is determined according to the distribution relation;
a modification module, configured to, when the first sample is located within the boundary distance of the cluster and the first sample is a sample of a new category, add an output node to the output nodes of a preset machine recognition model and take the machine recognition model with the added output node as an online recognition model;
an identification module, configured to identify the category of the next data stream after the current data stream using the online recognition model.
Optionally, the network traffic identification apparatus provided in the embodiment of the present invention further includes:
the storage unit is used for sequentially receiving the data packets of the current data stream and acquiring quintuple information of the data packets;
judging whether the database stores quintuple information or not, if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
and if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet into the storage area of the path corresponding to the quintuple information.
Optionally, the sample module is specifically configured to:
determining whether each data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, reception of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
Optionally, the sample module is specifically configured to:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream;
encoding packet header data of a data packet in a current data stream to obtain a fixed-dimension vector, and taking the fixed-dimension vector as a first sample.
Optionally, the monitoring module is specifically configured to:
inputting the first sample into a semi-supervised model, and outputting the category of the first sample by using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking the sample of which the local density exceeds a density threshold value and the minimum distance exceeds a distance threshold value as a third sample; the first training sample set is a set formed by samples used for training the semi-supervised model;
adding a third sample to the cluster and determining the third sample as a cluster center point of the cluster; the number of the clusters is the same as that of the third samples, and each cluster is provided with only one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, judging that the first sample is not positioned in the boundary distance of the cluster;
if the distance of the first sample from the third sample does not exceed the cluster boundary distance, the first sample is determined to be within the cluster boundary distance.
Optionally, the modification module is specifically configured to:
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as the first sample exists in the second training sample set, judging that the first sample is not a sample of a new category, and updating the parameters of a preset machine recognition model; the second training sample set is a set formed by data streams used for training the machine recognition model;
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample which is the same as the first sample in category does not exist in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output nodes is used as an online recognition model.
Optionally, the modification module is specifically configured to:
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, increasing the parameter dimension of the preset machine recognition model by one, and taking the machine recognition model with the increased parameter dimension as the online recognition model.
Optionally, the modification module is specifically configured to:
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model, and taking the machine recognition model with the added output node as a basic recognition model;
inputting the first sample into the basic recognition model, and calculating the partial derivatives of the loss function of the basic recognition model with respect to the weights and biases of the output layer of the basic recognition model;
updating the weights and biases of the output layer of the basic recognition model in the direction of gradient descent using a parameter update formula; the parameter update formula includes the products of a proficiency increment with the partial derivatives of the loss function of the basic recognition model with respect to the weights and the biases of the output layer, respectively;
determining the basic recognition model with the updated weights and biases as the online recognition model.
Optionally, the identification module is specifically configured to:
when the number of data packets in the next data stream after the current data stream reaches a preset number, extracting the packet header data of the data packets in the next data stream as a second sample;
inputting the second sample into the online recognition model, and using the online recognition model to output the category of the second sample.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute a network traffic identification method as described in any one of the above.
In yet another aspect of the present invention, the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any one of the above described network traffic identification methods.
In the network traffic identification method and device provided by the embodiments of the invention, once reception of the current data stream is complete, the packet header data of the data packets in the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample is located within the boundary distance of a cluster; when the first sample is located within the boundary distance of the cluster and is a sample of a new category, an output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is taken as an online recognition model; the category of the next data stream after the current data stream is then identified using the online recognition model. Compared with the prior art, the embodiments identify the category of the current data stream with the semi-supervised model, determine from that category whether the current data stream is a sample of a new category, change the structure of the machine recognition model accordingly, and use the restructured model as the online recognition model to identify the category of the next data stream, which can improve the real-time performance of data stream category identification.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flowchart of a network traffic identification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of storing a current data stream according to an embodiment of the present invention;
FIG. 3 is an architecture diagram of a preset semi-supervised model according to an embodiment of the present invention;
FIG. 4 is a diagram of the recurrent network structure of an LSTM encoder according to an embodiment of the present invention;
FIG. 5 is a diagram of the internal structure of an LSTM encoder according to an embodiment of the present invention;
FIG. 6 is a structural diagram of a preset machine recognition model according to an embodiment of the present invention;
FIG. 7 is a structural diagram of an online recognition model according to an embodiment of the present invention;
FIG. 8 is a graph of the proficiency function under different parameters according to an embodiment of the present invention;
FIG. 9 is a graph of the effect of different parameters on the proficiency function along the horizontal axis and the vertical axis, respectively, according to an embodiment of the present invention;
FIG. 10 is a graph of the proficiency function at different β values according to an embodiment of the present invention;
FIG. 11 is a structural diagram of a network traffic identification apparatus according to an embodiment of the present invention;
FIG. 12 is a structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In the method and the device for identifying network traffic provided by the embodiment of the invention, under the condition that the current data stream is received, the data of the packet header of a data packet in the current data stream is extracted as a first sample; inputting the first sample into a semi-supervised model, and outputting the category of the first sample and a result of whether the first sample is located within the boundary distance of the cluster by using the semi-supervised model; under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as an online recognition model; the category of the next data stream after the current data stream is identified using an online identification model.
First, a method for identifying network traffic according to an embodiment of the present invention is described below.
As shown in fig. 1, a network traffic identification method provided in an embodiment of the present invention is applied to a server, and the method includes:
S101, once reception of the current data stream is complete, extracting the packet header data of the data packets in the current data stream as a first sample;
Before step S101, the network traffic identification method provided in the embodiment of the present invention further includes storing the current data stream.
As shown in FIG. 2, storing the current data stream includes:
S201, receiving the data packets of the current data stream in sequence, and acquiring the quintuple information of each data packet;
The quintuple information consists of the source IP address, the destination IP address, the source port number, the destination port number, and the transport layer protocol.
S202, determining whether the database stores the quintuple information, and if so, storing the packet header data of the data packet in the storage area of the path corresponding to the quintuple information;
S203, if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet in that storage area.
The embodiment of the invention can improve the efficiency of searching the packet header data of the data packet in the same data stream by storing the packet header data of the data packet into the storage area corresponding to the quintuple information.
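As a non-limiting illustration, steps S201 to S203 can be sketched as follows. The sketch replaces the database and the per-path storage areas with an in-memory dictionary keyed by the quintuple; the names `FiveTuple` and `FlowStore` are hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List, NamedTuple


class FiveTuple(NamedTuple):
    """Quintuple information that identifies one data stream."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


class FlowStore:
    """In-memory stand-in for the database and the per-stream storage areas."""

    def __init__(self) -> None:
        self._areas: Dict[FiveTuple, List[bytes]] = {}

    def store(self, key: FiveTuple, header: bytes) -> bool:
        """Store one packet header; return True if a new storage area was created."""
        created = key not in self._areas
        if created:
            # S203: the quintuple is not stored yet, so create its storage area.
            self._areas[key] = []
        # S202 / S203: save the header in the area that belongs to this quintuple.
        self._areas[key].append(header)
        return created

    def headers(self, key: FiveTuple) -> List[bytes]:
        """Return all stored headers of one stream, in arrival order."""
        return list(self._areas.get(key, []))
```

Keying the storage by quintuple means that all headers of one stream can be fetched with a single lookup, which is the search-efficiency benefit noted above.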
To improve the real-time performance of identifying the category of the data stream, S101 may obtain the first sample using at least one of the following embodiments:
In one possible embodiment, whether each data packet of the current data stream contains an end identifier is determined; if a data packet contains the end identifier, reception of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
It can be understood that each data packet contains an end identifier, and if transmission of a data stream is ended, the end identifier of the last data packet in the data stream is changed.
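As a non-limiting illustration, the end-identifier check can be sketched as follows. The embodiment does not state which field serves as the end identifier; the sketch assumes, purely for illustration, that it is the TCP FIN flag and that each header is an Ethernet + IPv4 + TCP header without options, so that the TCP flags byte sits at offset 47. The function names are hypothetical.

```python
from typing import Iterable, List, Optional

# Assumed layout: 14 bytes Ethernet + 20 bytes IPv4 (no options),
# then the flags byte at offset 13 of the TCP header.
TCP_FLAGS_OFFSET = 14 + 20 + 13
FIN_FLAG = 0x01


def has_end_identifier(header: bytes) -> bool:
    """True if the assumed end identifier (TCP FIN) is set in this header."""
    return len(header) > TCP_FLAGS_OFFSET and bool(header[TCP_FLAGS_OFFSET] & FIN_FLAG)


def collect_until_end(packets: Iterable[bytes]) -> Optional[List[bytes]]:
    """Collect headers until one carries the end identifier.

    Returns the headers of the completed stream, or None if the stream
    has not finished within the packets seen so far.
    """
    received: List[bytes] = []
    for header in packets:
        received.append(header)
        if has_end_identifier(header):
            return received
    return None
```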
In one possible embodiment, the first sample is obtained by:
Step one: once reception of the current data stream is complete, extracting the packet header data of the data packets in the current data stream;
Step two: encoding the packet header data of the data packets in the current data stream to obtain a fixed-dimension vector, and taking the fixed-dimension vector as the first sample.
It can be understood that a network stream is composed of a series of data packets. Each packet header has a very regular format: it contains a fixed number of bytes, and only the field values differ. For example, excluding optional fields, the header data of a TCP packet has 54 bytes: a 14-byte Ethernet header, a 20-byte IP header, and a 20-byte TCP header. If a stream has p data packets, the header data of each packet contains q bytes, each byte is converted into an unsigned integer, and the header data of one packet is taken as one row, then a fixed-dimension vector X ∈ R^(p×q) is obtained whose elements are integers in [0, 255]. Therefore, in this embodiment, the header data is encoded into a fixed-dimension vector of dimension p × q, and this vector is used as the first sample, which improves the efficiency of identifying the category of the first sample.
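As a non-limiting illustration, the byte-to-integer encoding can be sketched as follows. The function name `encode_headers` is hypothetical, and truncating long headers to q bytes and zero-padding short ones are assumptions of the sketch; the embodiment only states that each header contains q bytes.

```python
from typing import List, Sequence


def encode_headers(headers: Sequence[bytes], q: int = 54) -> List[List[int]]:
    """Return a p x q matrix: one row per packet, one unsigned integer per byte."""
    matrix: List[List[int]] = []
    for header in headers:
        # Iterating over bytes yields integers in [0, 255].
        row = list(header[:q])
        # Zero-pad headers shorter than q bytes (assumption of this sketch).
        row.extend([0] * (q - len(row)))
        matrix.append(row)
    return matrix
```

For p headers the result has p rows of exactly q integers each, matching the p × q dimension described above.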
S102, inputting the first sample into a semi-supervised model, and outputting the type of the first sample and a result of whether the first sample is positioned in the boundary distance of the cluster by using the semi-supervised model;
the semi-supervised model is obtained by utilizing a first training sample set for training and comprises the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set comprises samples with at least one class label; the distribution relationship determines the result of whether the class-labeled sample lies within the cluster boundary distance.
It can be understood that part of the first training sample set required for training the semi-supervised model consists of samples with class labels, which are packet header data of obtained data streams whose category has been determined; the remainder are samples without class labels. The samples of the first training sample set are divided into a plurality of clusters. Once the distribution of the samples in each cluster is determined, the distribution of the samples with class labels and the samples without class labels in each cluster is determined as well. The semi-supervised model is therefore obtained by training with the samples with class labels, and it includes the result of whether each sample with a class label is located within the boundary distance of a cluster.
In one possible embodiment, the semi-supervised model may be obtained by:
Step one: train a preset semi-supervised model with the samples of the first training sample set to obtain a trained semi-supervised model;
As shown in FIG. 3, the preset semi-supervised model is composed of an LSTM (Long Short-Term Memory) encoder, a softmax layer, and a CFSFDP (Clustering by Fast Search and Find of Density Peaks) clustering layer. Samples carrying a class label are input into the LSTM encoder, which encodes a data stream of variable length into a fixed-dimension vector. Because this vector is the output for the whole data stream sequence, it can represent the characteristics of all data packets of the stream. The softmax layer maps the fixed-dimension vector to a fixed set of classes and outputs the class of the packet header data. The softmax layer is then removed from the preset semi-supervised model, the samples with and without class labels are input into the LSTM encoder, and the output of the LSTM encoder is input into the CFSFDP clustering layer, where the CFSFDP algorithm is mainly used to determine whether a fixed-dimension vector is the cluster center point of a cluster and to determine the class of the samples.
As shown in fig. 4 and 5, the process of encoding the data stream into a fixed-dimension vector by the LSTM encoder is as follows:
For example, for a data stream $x_0, x_1, \ldots, x_{t-1}, x_t$, the elements are sequentially input into the recurrent network structure shown in FIG. 4. Each input $x_t$ produces an output $h_t$ while the current state is passed on to the next input, so the output $h_t$ contains information not only from $x_t$ but also from $x_0$ to $x_{t-1}$.
The internal structure of the LSTM encoder is shown in FIG. 5. At each step the encoder receives the input $x_t$, the previous output $h_{t-1}$, and the encoder state $C_{t-1}$ left after the previous input $x_{t-1}$. Assume that the output $h_t$ of each step has 128 dimensions and that a data stream contains n data packets in total. All the data packets of the whole stream are taken as one sequence, the header data of each data packet is $x_t$, and the output for the last input $x_n$ of the data stream is $h_n$.
Here, $x_t$ represents the header data of one packet; t represents the serial number of the packet header data; n represents the total number of data packets in a data stream; $h_t$ represents the output of the LSTM encoder when its input is $x_t$; and $h_n$ represents the last output of the LSTM encoder after a data stream has been input to it, i.e., the encoded fixed-dimension vector.
Step two: input the test samples in the test sample set into the trained semi-supervised model, and output the categories of the test samples in the test sample set by using the trained semi-supervised model;
it can be understood that the class labels of the header data in one data stream are the same, the test sample is the header data of the whole data stream, and the class label identifies the class of the test sample.
Step three: determine whether the trained semi-supervised model meets the test index, based on the categories that the trained semi-supervised model outputs for the test samples in the test sample set and the label categories of those test samples;
wherein the test indexes are as follows: the accuracy reaches an accuracy threshold, the precision reaches a precision threshold, the recall reaches a recall threshold, the F1 score reaches an F1 score threshold, and/or the $F_\beta$ score reaches an $F_\beta$ score threshold.
The accuracy threshold, the precision threshold, the recall threshold, the F1 score threshold, and the $F_\beta$ score threshold are preset values. The accuracy, precision, recall, F1 score, and $F_\beta$ score are each calculated with their respective formulas.
Step four: if the trained semi-supervised model does not meet the test index, update the parameters of the LSTM encoder in the trained semi-supervised model until the trained semi-supervised model meets the test index;
wherein the output formulas of the encoder are as follows:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \cdot \tanh(C_t)$$
wherein the parameters of the LSTM encoder are $W_*$ and $b_*$, where $W_*$ represents a weight parameter and $b_*$ represents a bias parameter; $f_t$ represents the activation value of the forget gate at the current step; $\sigma$ is the Sigmoid function; $W_f$ represents the weight of the forget gate; $h_{t-1}$ represents the output of the previous step; $x_t$ represents the input of the current step; $b_f$ represents the bias of the forget gate; $i_t$ represents the activation value of the input gate at the current step; $W_i$ represents the weight of the input gate; $b_i$ represents the bias of the input gate; $\tilde{C}_t$ represents the intermediate state of the current step; $W_C$ represents the state weight; $b_C$ represents the state bias; $C_{t-1}$ represents the state of the previous step; $C_t$ represents the state of the current step; $o_t$ represents the activation value of the output gate at the current step; $W_o$ represents the output gate weight; $b_o$ represents the output gate bias; i represents the activation value of the input gate, f represents the activation value of the forget gate, t represents the packet serial number, o represents the activation value of the output gate, and C represents the state.
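A single encoder step and the encoding of a whole stream can be sketched in plain Python as below. This is an illustrative sketch only: the candidate-state and state-update lines follow the standard LSTM form, the zero initial state and the helper names (`lstm_step`, `encode_stream`, `random_params`) are assumptions, and a real encoder would use a numerical library and trained weights.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W·v + b for a row-major matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(params: Dict[str, list], h_prev: Vector, c_prev: Vector,
              x_t: Vector) -> Tuple[Vector, Vector]:
    hx = h_prev + x_t                                   # concatenation [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(params["Wf"], params["bf"], hx)]       # forget gate
    i = [sigmoid(v) for v in affine(params["Wi"], params["bi"], hx)]       # input gate
    c_tilde = [math.tanh(v) for v in affine(params["WC"], params["bC"], hx)]
    o = [sigmoid(v) for v in affine(params["Wo"], params["bo"], hx)]       # output gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def encode_stream(params: Dict[str, list], packets: List[Vector], hidden: int) -> Vector:
    """Return h_n, the last output, as the fixed-dimension code of the stream."""
    h, c = [0.0] * hidden, [0.0] * hidden               # assumption: zero initial state
    for x_t in packets:
        h, c = lstm_step(params, h, c, x_t)
    return h


def random_params(hidden: int, inputs: int, seed: int = 0) -> Dict[str, list]:
    """Hypothetical untrained parameters, only to make the sketch runnable."""
    rng = random.Random(seed)
    params: Dict[str, list] = {}
    for gate in "fiCo":
        params["W" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden + inputs)]
                              for _ in range(hidden)]
        params["b" + gate] = [0.0] * hidden
    return params


params = random_params(hidden=4, inputs=3)
short_code = encode_stream(params, [[0.1, 0.2, 0.3]], hidden=4)
long_code = encode_stream(params, [[0.1, 0.2, 0.3]] * 5, hidden=4)
```

Streams of one packet and of five packets both yield a code of the same length, which is the property the clustering layer relies on.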
If the trained semi-supervised model does not meet the test index, the parameters $W_f$, $b_f$, $W_i$, $b_i$, $W_C$, $b_C$, $W_o$, and $b_o$ of the LSTM encoder in the trained semi-supervised model are updated.
Step five: determine the trained semi-supervised model that meets the test index as the semi-supervised model.
According to the embodiment, the accuracy of determining the first sample category can be improved by determining the trained preset semi-supervised model meeting the test index as the semi-supervised model.
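The test-index check used in the training procedure above can be sketched as follows. The text does not give the metric formulas, so the standard one-vs-rest definitions are assumed here; the function names and the rule that every supplied threshold must be reached are also assumptions.

```python
from typing import Dict, Sequence


def class_metrics(y_true: Sequence[str], y_pred: Sequence[str], cls: str,
                  beta: float = 1.0) -> Dict[str, float]:
    """Accuracy over all samples plus one-vs-rest precision, recall, F1, F-beta for cls."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "f_beta": f_beta}


def meets_test_index(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Assumption: every threshold that is supplied must be reached."""
    return all(metrics[name] >= limit for name, limit in thresholds.items())


metrics = class_metrics(["web", "web", "ssh", "ssh"],
                        ["web", "ssh", "ssh", "ssh"], cls="ssh", beta=2.0)
```

Passing only some thresholds to `meets_test_index` covers the "and/or" wording of the test index.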
S103, under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as an online recognition model;
and S104, identifying the category of the next data stream after the current data stream by using the online identification model.
Compared with the prior art, the embodiment of the present invention identifies the category of the current data stream with the semi-supervised model and judges, based on the identified category, whether the current data stream is a new-class sample. If it is, the structure of the machine recognition model is changed, and the machine recognition model with the changed structure is used as the online recognition model to identify the category of the next data stream after the current data stream. This improves the ability of the online recognition model to adapt to the network environment and the real-time performance of identifying data stream categories.
In order to improve the real-time property of identifying the category of the data stream, at least one embodiment may be adopted in the above S102 to obtain the result of the category of the first sample and whether the first sample is located within the boundary distance of the cluster:
in one possible embodiment, the result of the class of the first sample and whether the first sample is located within the boundary distance of the cluster is obtained by:
Step one: input the first sample into the semi-supervised model, and output the category of the first sample by using the semi-supervised model;
step two: calculating the local density and the minimum distance of each sample in the first training sample set, and taking the sample of which the local density exceeds a density threshold value and the minimum distance exceeds a distance threshold value as a third sample; the first training sample set is a set formed by samples used for training the semi-supervised model;
step three: adding a third sample to the cluster and determining the third sample as a cluster center point of the cluster; the number of the clusters is the same as that of the third samples, and each cluster is provided with only one third sample;
step four: if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, judging that the first sample is not positioned in the boundary distance of the cluster;
The cluster boundary distance defines the range enclosed by a sphere that is centered at the cluster center point of the cluster and whose radius is the boundary distance.
Step five: if the distance of the first sample from the third sample does not exceed the cluster boundary distance, the first sample is determined to be within the cluster boundary distance.
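Steps four and five above amount to a distance test against the cluster center points, which can be sketched as below. The sketch assumes Euclidean distance, takes the center points as given, and reads the test as a comparison with the nearest center; the boundary distance is m times the truncation distance, and the function names are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_cluster_boundary(sample: Point, centers: Sequence[Point],
                            d_c: float, m: float) -> bool:
    """True if the sample lies inside the sphere of radius m * d_c around its nearest center."""
    boundary = m * d_c
    nearest = min(euclidean(sample, center) for center in centers)
    return nearest <= boundary


centers = [(0.0, 0.0), (10.0, 10.0)]
inside = within_cluster_boundary((1.0, 1.0), centers, d_c=1.0, m=2.0)
outside = within_cluster_boundary((5.0, 5.0), centers, d_c=1.0, m=2.0)
```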
First, assume the first training sample set is $S = \{x_j \mid j \in I_S\}$, where $I_S = \{1, 2, \ldots, n\}$, and $d_{ij}$ denotes the distance between sample $x_i$ and sample $x_j$. The local density $\phi_i$ and the minimum distance $\theta_i$ of each sample $x_i$ are calculated, giving $\Phi = \{\phi_i \mid i \in I_S\}$ and $\Theta = \{\theta_i \mid i \in I_S\}$. Here $I_S$ represents an index set of integers, i and j are positive integers, $\Phi$ represents the local density set, and $\Theta$ represents the minimum distance set.
The local density $\phi_i$ is calculated as:
$$\phi_i = \sum_{j \in I_S \setminus \{i\}} \chi(d_{ij} - d_c)$$
or
$$\phi_i = \sum_{j \in I_S \setminus \{i\}} e^{-(d_{ij}/d_c)^2}$$
where $d_c$ represents the truncation distance, which is a preset value; the boundary distance is m times the truncation distance, and m is a preset value; $\chi(\cdot)$ is a step function,
$$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}$$
and x is the input to the step function.
From the calculation formula of the local density $\phi_i$:
$$\phi_i = \sum_{j \in I_S \setminus \{i\}} e^{-(d_{ij}/d_c)^2}$$
it can be seen that the size of $\phi_i$ for $x_i$ is related to the number of samples whose distance to $x_i$ is less than $d_c$: the more such samples there are, the larger the value of $\phi_i$. This formula turns $\phi_i$ from a discrete value into a continuous value, which improves the accuracy of the local density calculation.
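The minimum distance $\theta_i$ is calculated as:
$$\theta_i = \begin{cases} \min\limits_{j:\ \phi_j > \phi_i} d_{ij}, & \text{if some } \phi_j > \phi_i \\ \max\limits_{j \in I_S} d_{ij}, & \text{otherwise} \end{cases}$$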
A sample whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold is taken as a third sample.
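The selection of third samples (cluster center points) can be sketched as follows. The sketch assumes the Gaussian-kernel form of the local density and the usual density-peaks definition of the minimum distance (distance to the nearest sample of higher density, or the largest distance for the densest sample); the thresholds, the toy points, and the function names are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_peaks(points: Sequence[Point], d_c: float, density_threshold: float,
                  distance_threshold: float) -> Tuple[List[float], List[float], List[int]]:
    n = len(points)
    d = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Local density: Gaussian kernel over all other samples.
    phi = [sum(math.exp(-(d[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    theta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if phi[j] > phi[i]]
        # Densest sample: no higher-density neighbour, so use its largest distance.
        theta.append(min(higher) if higher else max(d[i]))
    centers = [i for i in range(n)
               if phi[i] > density_threshold and theta[i] > distance_threshold]
    return phi, theta, centers


points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (10, 10), (10, 11), (11, 10)]
phi, theta, centers = density_peaks(points, d_c=1.0,
                                    density_threshold=0.6, distance_threshold=2.0)
```

In this toy set the densest point of each of the two groups is selected, so the number of clusters equals the number of third samples.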
The distance between the third sample and the first sample, as well as $d_{ij}$, may be calculated using formulas such as the Euclidean distance, the Manhattan distance, the Chebyshev distance, the Minkowski distance, the standardized Euclidean distance, or the cosine distance.
In the embodiment, by calculating the local density and the minimum distance of each sample in the first training sample set, if the distance between the first sample and the third sample does not exceed the boundary distance of the cluster, it is determined that the first sample is located within the boundary distance of the cluster, and the efficiency of determining that the first sample is located within the boundary distance of the cluster can be improved.
After the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster are obtained by the foregoing embodiment, the network traffic identification method provided by the embodiment of the present invention further includes: updating the cluster center point of the cluster.
In one possible embodiment, the cluster center point of a cluster is updated by:
Step one: based on the local density and the minimum distance of each sample in the first training sample set, take the sample of which the local density exceeds a density threshold value and the minimum distance does not exceed a distance threshold value as a fourth sample;
step two: adding the fourth sample to the cluster where the cluster center point closest to the fourth sample is located;
step three: if the first sample is located within the boundary distance of the cluster, adding the first sample to the cluster where the cluster center point closest to the first sample is located;
step four: calculating the local density and the minimum distance of the samples in each cluster, determining the cluster center point of the cluster according to the samples of which the local density exceeds the density threshold and the minimum distance exceeds the distance threshold aiming at one cluster, and taking the cluster after the center point is updated as the updated cluster.
The cluster center point of the cluster is updated, so that the accuracy of determining whether the first sample is located within the boundary distance of the cluster can be improved.
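The cluster update can be sketched as follows, under simplifying assumptions: Euclidean distance is used, the fourth samples and center points are taken as given, and the new center of a cluster is picked as its densest member (Gaussian-kernel density), which stands in for the threshold-based selection of step four. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(sample: Point, centers: Sequence[Point]) -> int:
    return min(range(len(centers)), key=lambda k: euclidean(sample, centers[k]))


def update_clusters(centers: Sequence[Point], fourth_samples: Sequence[Point],
                    first_sample: Point, boundary: float) -> List[List[Point]]:
    clusters: List[List[Point]] = [[c] for c in centers]
    for s in fourth_samples:                       # step two: join the nearest center
        clusters[nearest_center(s, centers)].append(s)
    k = nearest_center(first_sample, centers)      # step three: only if inside the boundary
    if euclidean(first_sample, centers[k]) <= boundary:
        clusters[k].append(first_sample)
    return clusters


def densest_member(cluster: Sequence[Point], d_c: float) -> Point:
    """Simplified stand-in for step four: the member with the highest local density."""
    def phi(p: Point) -> float:
        return sum(math.exp(-(euclidean(p, q) / d_c) ** 2) for q in cluster if q is not p)
    return max(cluster, key=phi)


clusters = update_clusters(
    centers=[(0.5, 0.5), (10.0, 10.0)],
    fourth_samples=[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    first_sample=(9.5, 10.0),
    boundary=2.0,
)
new_center = densest_member(clusters[0], d_c=1.0)
```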
In another possible embodiment, the result of the class of the first sample and whether the first sample is located within the boundary distance of the cluster is obtained by:
Step one: input the first sample into the semi-supervised model, and use the output nodes of the semi-supervised model to output the probabilities of the different classes to which the first sample may belong and the result of whether the first sample is located within the boundary distance of the cluster; each output node corresponds to one category to which the first sample may belong.
For example: the g output node outputs the probability that the first sample belongs to the g category.
Step two: from the probabilities output at the output nodes for the different classes, select the class with the highest probability as the class of the first sample.
In the present embodiment, the category to which the first sample with the highest probability belongs is selected as the category of the first sample, so that the accuracy of determining the category of the first sample can be improved.
In order to improve the real-time property of identifying the category of the data stream, the online identification model may be obtained in S103 by using at least one embodiment:
in one possible embodiment, the online identification model is obtained by:
Step one: under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as the first sample exists in the second training sample set, judge that the first sample is not a sample of a new category, and update the parameters of a preset machine recognition model; the second training sample set is a set formed by data streams used for training the machine recognition model;
step two: under the condition that the first sample is located within the boundary distance of the cluster, if a second sample which is the same as the first sample in category does not exist in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output nodes is used as an online recognition model.
Referring to FIG. 6 and FIG. 7, the preset machine recognition model $M_o$ uses a CNN (Convolutional Neural Network). The output layer of $M_o$ comprises K nodes, the layer above the output layer comprises J nodes, and the connecting lines between the output layer and the layer above it represent the parameters of the output layer. Under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as the first sample exists in the second training sample set, the first sample is judged not to be a sample of a new category, and the parameters between the output layer of the preset machine recognition model and the layer above it are updated. Under the condition that the first sample is located within the boundary distance of the cluster, if no second sample with the same category as the first sample exists in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output node is used as the online recognition model.
In another possible implementation, in a case that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, the parameter dimension in the preset machine recognition model is increased by one dimension, and the machine recognition model with the increased parameter dimension is used as the online recognition model.
When the first sample belongs to a new class, i.e., $X \in C_{K+1}$, for the parameterized preset machine recognition model, adding an output node means increasing the dimension of the output layer parameters of the preset machine recognition model by one: $W \in \mathbb{R}^{J \times K} \rightarrow W \in \mathbb{R}^{J \times (K+1)}$, $b \in \mathbb{R}^{K} \rightarrow b \in \mathbb{R}^{K+1}$, $\rho \in \mathbb{R}^{K} \rightarrow \rho \in \mathbb{R}^{K+1}$, where W represents the weight set, $\mathbb{R}$ represents the set of real numbers, b represents the bias set, ρ represents the proficiency set, → represents assignment, and K represents the total number of output nodes.
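Growing the output layer by one node can be sketched as below. The weight matrix is stored as J rows of K columns; initializing the new weights and bias to zero is an assumption, since the text does not state how they are initialized, and the function name is hypothetical. The new proficiency entry starts at 0.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def add_output_node(W: Matrix, b: Vector, rho: Vector,
                    init_weight: float = 0.0) -> Tuple[Matrix, Vector, Vector]:
    """Extend W from J x K to J x (K+1) and b, rho from K to K+1 entries."""
    new_W = [row + [init_weight] for row in W]   # one extra column per previous-layer node
    new_b = b + [0.0]                            # assumption: new bias starts at zero
    new_rho = rho + [0.0]                        # proficiency of the new class starts at 0
    return new_W, new_b, new_rho


W = [[0.2, -0.1], [0.4, 0.3], [0.0, 0.5]]        # J = 3, K = 2
b = [0.1, -0.2]
rho = [0.7, 0.4]
W2, b2, rho2 = add_output_node(W, b, rho)
```

The existing columns are copied unchanged, so the outputs for the K old classes are computed from the same parameters as before the extension.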
In yet another possible embodiment, the online identification model is obtained by:
Step one: under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, add an output node in the output nodes of a preset machine recognition model, and take the machine recognition model with the added output node as a basic recognition model;
step two: inputting the first sample into a basic recognition model, and calculating partial derivatives of a loss function of the basic recognition model to the weight and the bias of an output layer in the basic recognition model;
step three: in the gradient descending direction, updating the weight and the bias of the output layer of the basic recognition model by using a parameter updating formula; the parameter updating formula comprises the result of multiplying the increment of proficiency and the partial derivative of the loss function of the basic recognition model on the weight and the bias of an output layer in the basic recognition model;
step four: and determining the basic recognition model after updating the weight and the bias as an online recognition model.
Assume that the output of the layer above the output layer of the basic recognition model is $F = \{f_j\} \in \mathbb{R}^J$, the output of the output layer is $Y = \{y_k\} \in \mathbb{R}^K$, the weight is $W = \{w_{jk}\} \in \mathbb{R}^{J \times K}$, and the bias is $b = \{b_k\} \in \mathbb{R}^K$. If the output layer of the basic recognition model is a softmax layer, the cross-entropy loss function of the basic recognition model is:
$$L = -\sum_{k=1}^{K} t_k \log y_k$$
where $T = \{t_k\} \in \mathbb{R}^K$ is the one-hot encoding of the data stream class, $y_k = g(z_k)$ is the activation value of the k-th output node, and $g(\cdot)$ is the softmax function:
$$g(z_k) = \frac{e^{z_k}}{\sum_{i=1}^{K} e^{z_i}}$$
$$z_k = \sum_{j=1}^{J} w_{jk} f_j + b_k$$
Here $f_j$ represents the feature activation value of the j-th node, $b_k$ represents the bias of the k-th node, $\mathbb{R}^K$ denotes a K-dimensional space, $t_k$ represents the k-th bit of the one-hot code, $z_k$ represents the weighted input of the k-th node, and $w_{jk}$ represents the weight between node j of the layer above the output layer and output node k. i and k are serial numbers of output nodes and take positive integer values; k is also the serial number of a bit in the one-hot code; j is the serial number of a node in the layer above the output layer; K is the total number of output nodes; and J is the number of nodes in the layer above the output layer. The increments of the weight and the bias can be obtained by taking the partial derivatives of the loss function:
Figure BDA0001945995470000174
wherein $I\{\cdot\}$ is an indicator function:
$$I\{A\} = \begin{cases} 1, & \text{if } A \text{ is true} \\ 0, & \text{otherwise} \end{cases}$$
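The loss and its partial derivatives for a softmax output layer can be sketched numerically as below. The sketch uses the well-known result that, for softmax with cross-entropy, the derivative of the loss with respect to $z_k$ is $y_k - t_k$, so the derivative with respect to $w_{jk}$ is $(y_k - t_k) f_j$ and with respect to $b_k$ is $y_k - t_k$. The natural logarithm and the function names are assumptions; the source formula image for the increments is not reproduced here.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(z: Vector) -> Vector:
    m = max(z)                                   # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(F: Vector, W: Matrix, b: Vector) -> Vector:
    """y_k = softmax(z)_k with z_k = sum_j w_jk * f_j + b_k; W has J rows and K columns."""
    z = [sum(F[j] * W[j][k] for j in range(len(F))) + b[k] for k in range(len(b))]
    return softmax(z)


def cross_entropy(T: Vector, Y: Vector) -> float:
    return -sum(t * math.log(y) for t, y in zip(T, Y) if t)


def output_gradients(F: Vector, T: Vector, Y: Vector) -> Tuple[Matrix, Vector]:
    dz = [y - t for y, t in zip(Y, T)]           # dL/dz_k, which also equals dL/db_k
    dW = [[f * d for d in dz] for f in F]        # dL/dw_jk = f_j * (y_k - t_k)
    return dW, dz


F = [1.0, 2.0]
W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
b = [0.0, 0.1, -0.1]
T = [0.0, 1.0, 0.0]
Y = forward(F, W, b)
dW, db = output_gradients(F, T, Y)
example_loss = cross_entropy([1, 0, 0, 0, 0, 0], [0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
```

With a one-hot target only the term of the true class contributes to the loss, so `example_loss` equals the negative logarithm of 0.5.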
it can be understood that, each time the basic recognition model is trained by using the samples of the new category to obtain the online recognition model, the parameters of the basic recognition model can continuously adapt to the samples of the new category, the characteristics of the samples of the new category are learned, and the samples of the old category do not participate in the training. The training mode which enables the online learning model to adapt to the new environment without limitation has a serious problem, namely when the internal parameters of the basic recognition model change, the learned characteristics are affected, and even the capability of the previous basic recognition model is possibly completely damaged, so that the recognition of samples which are not of a new category is seriously wrong, and the problem of 'catastrophic forgetting' is caused.
To solve the catastrophic forgetting problem, the underlying recognition model needs to make a trade-off between learning samples of the new class and retaining samples of the old class. When the stability of the basic recognition model is higher, the basic recognition model is more prone to retain the characteristics of the samples of the old category, and the characteristic ability for learning the samples of the new category is weakened; on the contrary, when the plasticity of the basic recognition model is higher, the basic recognition model has stronger ability of learning samples of new classes, and is easier to forget the characteristics of samples of old classes, and the key point of the trained online recognition model for adapting to the network environment lies in obtaining different tradeoffs between stability and plasticity.
To make the stability-plasticity of the online recognition model controllable, the embodiment of the present invention proposes a proficiency mechanism that introduces an additional set of parameters $\rho = \{\rho_k\} \in \mathbb{R}^K$, where ρ represents the proficiency set and $\rho_k$ represents the proficiency of the basic recognition model on the class output by the k-th node, which is used for measuring the recognition capability of the online recognition model on samples of each class.
Here $\rho_k \in [0, 1)$ and its initial value is 0, which indicates that the classification proficiency of the basic recognition model for all classes is 0 in the initial case. In order to use proficiency to influence the stability-plasticity of the model, the proficiency ρ should have the following properties:
1) Proficiency is influenced by the result of identifying the sample class. The more times the class of a sample is correctly identified, the higher the corresponding proficiency; the more times the class of a sample is misidentified, the lower the corresponding proficiency.
2) Proficiency affects variations in itself. When the proficiency is low, the difficulty of further improving or reducing the proficiency is small, and the proficiency is increased or reduced quickly; the higher the proficiency, the more difficult it is to further increase or decrease the proficiency itself, i.e., the slower the proficiency is increased or decreased.
3) Proficiency affects the difficulty of learning or forgetting knowledge. When the proficiency is low, it is relatively easy to learn more features of samples in a new class or forget the features of samples in an old class, namely the model parameters are updated more quickly; conversely, when proficiency is high, the difficulty of learning or forgetting is also greater, i.e., the model parameters are updated more slowly.
For example, if $X \in C_k$ and $Y \in C_k$, the k-th class to which sample X belongs is classified correctly, and the corresponding proficiency $\rho_k$ increases; if $X \in C_i$ but $Y \in C_j$, class i is misidentified as class j, and the corresponding $\rho_i$ and $\rho_j$ decrease.
Referring to FIG. 8, to implement property 2 and property 3, the embodiment of the present invention proposes a proficiency function for calculating the increment of proficiency. The proficiency function is:
Figure BDA0001945995470000191
where α and β are two parameters used to control the overall trend of the proficiency function. FIG. 8 shows the course of the proficiency function for different α and β: when $\rho_k$ is small, the proficiency increment $\mathrm{prof}(\rho_k)$ is large; as $\rho_k$ increases, $\mathrm{prof}(\rho_k)$ and its derivative both gradually decrease; when $\rho_k$ increases to its limit value of 1, the proficiency increment $\mathrm{prof}(\rho_k)$ is 0 and the proficiency $\rho_k$ is no longer updated.
The update formula of the proficiency $\rho_k$ is: $\rho_k \leftarrow \rho_k \pm \mathrm{prof}(\rho_k)$.
Referring to FIG. 9, as proficiency increases, the proficiency increment $\mathrm{prof}(\rho_k)$ gradually decreases, i.e., the update amplitude of the parameters of the basic recognition model gradually decreases. FIG. 9 shows the proficiency increment $\mathrm{prof}(\rho_k)$ under different parameters; it can be seen that the update rate of the basic recognition model can be controlled by adjusting the parameters α and β of the proficiency increment.
Therefore, when the weight and the bias of the output layer of the basic recognition model are updated by using the parameter update formula, the proficiency increment is included, and the parameter update formula is:
Figure BDA0001945995470000192
where
Figure BDA0001945995470000193
represents the increment of the weight $w_{jk}$, and
Figure BDA0001945995470000194
represents the increment of the bias $b_k$. When the proficiency function $\mathrm{prof}(\rho_k)$ is used for updating the weight W and the bias b, it is required to set
Figure BDA0001945995470000195
to ensure that $\mathrm{prof}(0) = 1$, i.e., when the proficiency $\rho_k$ is 0, the proficiency function does not influence the updating of the model. As proficiency increases, the coefficient $\mathrm{prof}(\rho_k)$ gradually decreases, i.e., the update amplitude of the parameters of the basic recognition model gradually decreases; when $\rho_k \to 1$, $\mathrm{prof}(\rho_k) \to 0$, i.e., the update amplitude of the basic recognition model tends to 0.
Referring to FIG. 10, which shows $\mathrm{prof}(\rho_k)$ for different values of the parameter β: the larger β is, the faster $\mathrm{prof}(\rho_k)$ decreases.
From the above analysis, it can be seen that the proficiency set ρ and the proficiency function $\mathrm{prof}(\rho_k)$ can control the capability and speed of updating the parameters of the basic recognition model by means of the two parameters α and β of the proficiency function, thereby balancing the stability and plasticity of the online recognition model and solving the problem of "catastrophic forgetting".
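The interaction between proficiency and the output-layer update can be illustrated with the sketch below. It is NOT the patent's proficiency function: `prof_stand_in` is a hypothetical stand-in, α(1 − ρ)^β, chosen only because it equals α at ρ = 0, decreases as ρ grows, and tends to 0 as ρ approaches 1. The learning rate `lr`, the clamping of ρ to [0, 1), and the use of α = 1 for the update coefficient are likewise assumptions.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def prof_stand_in(rho_k: float, alpha: float, beta: float) -> float:
    """Hypothetical stand-in, not the patent's formula: alpha * (1 - rho_k) ** beta."""
    return alpha * (1.0 - rho_k) ** beta


def update_proficiency(rho: Vector, k: int, correct: bool,
                       alpha: float = 0.1, beta: float = 2.0) -> None:
    """rho_k <- rho_k +/- prof(rho_k), kept inside [0, 1)."""
    step = prof_stand_in(rho[k], alpha, beta)
    value = rho[k] + step if correct else rho[k] - step
    rho[k] = min(max(value, 0.0), 1.0 - 1e-9)


def scaled_update(W: Matrix, b: Vector, dW: Matrix, db: Vector, rho: Vector,
                  lr: float = 0.1, beta: float = 2.0) -> None:
    """Gradient step on the output layer, scaled per class by the proficiency coefficient."""
    for k in range(len(b)):
        coeff = prof_stand_in(rho[k], 1.0, beta)   # alpha = 1 so the coefficient is 1 at rho = 0
        b[k] -= lr * coeff * db[k]
        for j in range(len(W)):
            W[j][k] -= lr * coeff * dW[j][k]


rho = [0.0, 0.9]                  # class 0 is new, class 1 is well learned
W, b = [[0.0, 0.0]], [0.0, 0.0]
scaled_update(W, b, dW=[[1.0, 1.0]], db=[1.0, 1.0], rho=rho)
```

With equal gradients, the parameters of the low-proficiency class move a full step while those of the high-proficiency class barely change, which is the stability-plasticity trade-off described above.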
As an example of one-hot encoding of data stream classes, assume that the samples of the first training set are divided into 6 classes and that the output layer of the basic recognition model has 6 nodes. A number is assigned to each class of sample. The sample classes of the first training set are "RDP (Remote Desktop Protocol)", "BitTorrent", "Web (World Wide Web)", "SSH (Secure Shell)", "eDonkey (eDonkey Network)", and "NTP (Network Time Protocol)"; the corresponding numbers are 0, 1, 2, 3, 4, 5, and the corresponding one-hot codes are 100000, 010000, 001000, 000100, 000010, 000001. Assume that the label of a sample in the first training set is numbered 0, so the class of the sample is "RDP". The outputs of the 1st to 6th nodes of the basic recognition model for this sample are 0.5, 0.1, 0.1, 0.1, 0.1, and 0.1. The probability that the basic recognition model recognizes the sample as number 0 is the highest, and the loss function of the basic recognition model is:
L=-1·log0.5+(-0·log0.1)+(-0·log0.1)+(-0·log0.1)+(-0·log0.1)+(-0·log0.1)。
in order to improve the real-time property of identifying the class of the data stream, at least one implementation manner may be adopted in S104 to identify the class of the data in the packet header of the next data stream after the current data stream:
in one possible embodiment, the class of data of the packet header in the next data stream after the current data stream is identified by the following steps:
Step one: under the condition that the number of data packets in the next data stream after the current data stream reaches the preset number, extract the data of the data packet headers in the next data stream as a second sample;
step two: and inputting the second sample into the online identification model, and outputting the category of the second sample by using the online identification model.
The following provides a description of a network traffic identification apparatus according to an embodiment of the present invention.
As shown in fig. 11, a network traffic identification apparatus provided in an embodiment of the present invention is applied to a server, and the apparatus includes:
a sample module 1101, configured to extract packet header data of a data packet in a current data stream as a first sample when receiving the current data stream is completed;
a supervision module 1102, configured to input the first sample into a semi-supervised model, and output the category of the first sample and a result of whether the first sample is located within the boundary distance of the cluster by using the semi-supervised model; the semi-supervised model is obtained by training with a first training sample set and comprises the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set comprises samples with at least one class label; the distribution relation determines the result of whether a sample with a class label is located within the boundary distance of the cluster;
a modification module 1103, configured to, when the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, add an output node to the output nodes of a preset machine recognition model, and use the machine recognition model after the output node is added as an online recognition model;
an identifying module 1104 is configured to identify a category of a next data stream after the current data stream using an online identification model.
Optionally, the network traffic identification apparatus provided in the embodiment of the present invention further includes:
the storage unit is used for sequentially receiving the data packets of the current data stream and acquiring quintuple information of the data packets;
judging whether the database stores quintuple information or not, if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
and if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet into the storage area of the path corresponding to the quintuple information.
The sample module is specifically configured to:
judging whether each data packet of the current data stream contains an end identifier, if one data packet contains the end identifier, receiving the data stream is finished, and extracting packet header data of the data packet in the data stream as a first sample.
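The storage unit and the end-identifier check can be sketched together as below. This is an in-memory stand-in under stated assumptions: a dictionary keyed by the five-tuple replaces the database and its per-path storage areas, the five-tuple is taken in its conventional form (source and destination address, source and destination port, protocol), the `is_end` flag is a hypothetical stand-in for the end identifier, and discarding a flow once it is complete is also an assumption.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    header: bytes
    is_end: bool          # hypothetical stand-in for the "end identifier"


class FlowStore:
    """Groups packet headers by five-tuple and releases a flow when it ends."""

    def __init__(self) -> None:
        self._flows: Dict[Tuple, List[bytes]] = {}

    def add(self, pkt: Packet) -> Optional[List[bytes]]:
        key = pkt[:5]                                   # the five-tuple
        # Create the storage area on first sight of the five-tuple, then append.
        self._flows.setdefault(key, []).append(pkt.header)
        if pkt.is_end:
            return self._flows.pop(key)                 # headers of the completed flow
        return None


store = FlowStore()
first = store.add(Packet("10.0.0.1", "10.0.0.2", 4000, 80, "TCP", b"\x01", False))
other = store.add(Packet("10.0.0.3", "10.0.0.2", 4001, 22, "TCP", b"\x09", False))
done = store.add(Packet("10.0.0.1", "10.0.0.2", 4000, 80, "TCP", b"\x02", True))
```

The list returned for a completed flow is the header data from which the first sample is then extracted.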
The sample module is specifically configured to:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream;
encoding packet header data of a data packet in a current data stream to obtain a fixed-dimension vector, and taking the fixed-dimension vector as a first sample.
The supervision module is specifically configured to:
inputting the first sample into a semi-supervised model, and outputting the category of the first sample by using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking the sample of which the local density exceeds a density threshold value and the minimum distance exceeds a distance threshold value as a third sample; the first training sample set is a set formed by samples used for training the semi-supervised model;
adding a third sample to the cluster and determining the third sample as a cluster center point of the cluster; the number of the clusters is the same as that of the third samples, and each cluster is provided with only one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, judging that the first sample is not positioned in the boundary distance of the cluster;
if the distance of the first sample from the third sample does not exceed the cluster boundary distance, the first sample is determined to be within the cluster boundary distance.
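The selection of cluster center points and the boundary check may be sketched as follows. The sketch assumes the usual density-peak definitions, which the passage does not spell out: local density is the number of other samples closer than a cutoff distance, and the minimum distance of a sample is its distance to the nearest sample of higher density (for a sample of highest density, its largest distance to any other sample). It also assumes that the first sample is compared against every cluster center point. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_centers(samples: List[Point], cutoff: float,
                 density_threshold: float,
                 distance_threshold: float) -> List[int]:
    """Return indices of samples selected as cluster center points."""
    n = len(samples)
    # Local density: number of other samples within the cutoff distance.
    rho = [sum(1 for j in range(n)
               if j != i and dist(samples[i], samples[j]) < cutoff)
           for i in range(n)]
    centers: List[int] = []
    for i in range(n):
        denser = [dist(samples[i], samples[j])
                  for j in range(n) if rho[j] > rho[i]]
        if denser:
            delta = min(denser)  # distance to nearest denser sample
        else:
            delta = max((dist(samples[i], samples[j])
                         for j in range(n) if j != i), default=0.0)
        if rho[i] > density_threshold and delta > distance_threshold:
            centers.append(i)
    return centers


def within_boundary(sample: Point, centers: List[Point],
                    boundary: float) -> bool:
    """True if the sample lies within the boundary distance of a center."""
    return any(dist(sample, c) <= boundary for c in centers)
```

A sample that combines high local density with a large minimum distance is dense in its own neighborhood and far from any denser region, which is why it is taken as the single center point of its cluster.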
The changing module is specifically configured to:
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as the first sample exists in the second training sample set, judging that the first sample is not a sample of a new category, and updating the parameters of a preset machine recognition model; the second training sample set is a set formed by data streams used for training the machine recognition model;
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample which is the same as the first sample in category does not exist in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output nodes is used as an online recognition model.
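The two branches above may be summarized by the following sketch. The function name `choose_action` and the returned strings are hypothetical labels; the `"no_change"` result for a sample outside the boundary distance is a placeholder, since this passage does not describe that case.

```python
from typing import Iterable


def choose_action(in_boundary: bool, category: str,
                  trained_categories: Iterable[str]) -> str:
    """Decide how the preset machine recognition model is changed.

    trained_categories holds the categories present in the second
    training sample set.
    """
    if not in_boundary:
        # Placeholder: this case is not covered by the passage above.
        return "no_change"
    if category in set(trained_categories):
        # Known category: only the model parameters are updated.
        return "update_parameters"
    # New category: the output layer gains one output node.
    return "add_output_node"
```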
The changing module is specifically configured to:
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, increasing the parameter dimension in a preset machine identification model by one dimension, and taking the machine identification model with the increased parameter dimension as an online identification model.
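Increasing the parameter dimension by one may be illustrated as follows, assuming that the output layer is a fully connected layer stored as one weight row and one bias per output node. The function name, the small random initialization of the new row, and the `scale` and `seed` parameters are assumptions made for the example only.

```python
import random
from typing import List, Tuple


def add_output_node(weights: List[List[float]], biases: List[float],
                    scale: float = 0.01,
                    seed: int = 0) -> Tuple[List[List[float]], List[float]]:
    """Return copies of the output-layer parameters with one extra node."""
    if not weights:
        raise ValueError("output layer must have at least one node")
    rng = random.Random(seed)
    n_inputs = len(weights[0])
    # New node: small random weights and a zero bias (assumed scheme).
    new_row = [rng.uniform(-scale, scale) for _ in range(n_inputs)]
    new_weights = [row[:] for row in weights] + [new_row]
    new_biases = biases[:] + [0.0]
    return new_weights, new_biases
```

The existing rows are copied unchanged, so the categories that the model already recognizes keep their learned parameters.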
The changing module is specifically configured to:
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as a basic recognition model;
inputting the first sample into a basic recognition model, and calculating partial derivatives of a loss function of the basic recognition model to the weight and the bias of an output layer in the basic recognition model;
updating the weight and the bias of the output layer of the basic recognition model in the direction of gradient descent by using a parameter updating formula; the parameter updating formula comprises the results of multiplying the learning rate by the partial derivatives of the loss function of the basic recognition model with respect to the weight and the bias of the output layer of the basic recognition model;
and determining the basic recognition model after updating the weight and the bias as an online recognition model.
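A single output-layer update of this kind may be sketched as follows. The loss function is assumed to be softmax cross-entropy, which the passage does not specify, and the multiplier applied to the partial derivatives is treated as a step size `lr`. Under these assumptions the partial derivative of the loss with respect to the weight of output node k and input j is (p_k - y_k) * x_j, and with respect to the bias of node k it is p_k - y_k. All names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(z: List[float]) -> List[float]:
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(W: List[List[float]], b: List[float],
            x: List[float]) -> List[float]:
    """Class probabilities produced by the output layer."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk
              for row, bk in zip(W, b)]
    return softmax(logits)


def loss(W: List[List[float]], b: List[float],
         x: List[float], label: int) -> float:
    """Cross-entropy loss for one sample (assumed loss function)."""
    return -math.log(forward(W, b, x)[label])


def sgd_step(W: List[List[float]], b: List[float], x: List[float],
             label: int, lr: float) -> Tuple[List[List[float]], List[float]]:
    """One gradient-descent update of the output-layer weight and bias."""
    p = forward(W, b, x)
    # Partial derivative of the loss with respect to each logit: p_k - y_k.
    grad_z = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    new_W = [[w - lr * g * xi for w, xi in zip(row, x)]
             for row, g in zip(W, grad_z)]
    new_b = [bk - lr * g for bk, g in zip(b, grad_z)]
    return new_W, new_b
```

Only the output layer is touched in this sketch, which matches the description of updating the output layer weight and bias after the new output node has been added.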
The identification module is specifically configured to:
under the condition that the number of data packets in the next data stream after the current data stream reaches the preset number, extracting data of a data packet header in the next data stream as a second sample;
and inputting the second sample into the online identification model, and outputting the category of the second sample by using the online identification model.
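The identification step may be sketched as follows. The encoder and the online identification model are passed in as callables so that the example stays self-contained; the name `identify_early` and the choice to return `None` while fewer than the preset number of packets have arrived are assumptions.

```python
from typing import Callable, List, Optional, Sequence


def identify_early(headers: Sequence[bytes], preset_count: int,
                   encode: Callable[[Sequence[bytes]], List[float]],
                   model: Callable[[List[float]], str]) -> Optional[str]:
    """Classify a flow once preset_count packets have been received."""
    if len(headers) < preset_count:
        return None  # not enough packets yet, keep receiving
    # The headers of the first preset_count packets form the second sample.
    second_sample = encode(headers[:preset_count])
    return model(second_sample)
```

Classifying after a preset number of packets, rather than at the end of the flow, is what allows the category of the next data stream to be output while that stream is still in progress.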
An embodiment of the present invention further provides an electronic device, as shown in fig. 12, including a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204, where the processor 1201, the communication interface 1202, and the memory 1203 communicate with one another through the communication bus 1204;
a memory 1203 for storing a computer program;
the processor 1201 is configured to implement the following steps when executing the program stored in the memory 1203:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream as a first sample;
inputting the first sample into a semi-supervised model, and outputting the category of the first sample and a result of whether the first sample is located within the boundary distance of the cluster by using the semi-supervised model;
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as an online recognition model;
the category of the next data stream after the current data stream is identified using an online identification model.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to perform a network traffic identification method as described in any of the above embodiments.
In yet another embodiment, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform a method of network traffic identification as described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus/electronic device/computer-readable storage medium/computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant points, reference may be made to some descriptions of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A network traffic identification method is applied to a server, and the method comprises the following steps:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream to serve as a first sample;
inputting the first sample into a semi-supervised model, and outputting the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster by using the semi-supervised model; the semi-supervised model is obtained by utilizing a first training sample set for training and comprises the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set comprises samples with at least one class label; the distribution relation determines whether the samples with the class labels are positioned in the boundary distance of the clusters;
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as an online recognition model;
identifying a category of a next data stream after the current data stream using the online identification model;
wherein the inputting the first sample into a semi-supervised model, and outputting, by using the semi-supervised model, the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster, comprises:
inputting the first sample into a semi-supervised model, and outputting the category of the first sample by using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking the sample of which the local density exceeds a density threshold value and the minimum distance exceeds a distance threshold value as a third sample; the first training sample set is a set consisting of samples used for training the semi-supervised model;
adding the third sample to a cluster and determining the third sample as a cluster center point of the cluster; the number of the clusters is the same as that of the third samples, and each cluster is provided with only one third sample;
determining that the first sample is not within the cluster boundary distance if the distance between the first sample and the third sample exceeds the cluster boundary distance;
determining that the first sample is within the cluster boundary distance if the distance between the first sample and the third sample does not exceed the cluster boundary distance.
2. The method of claim 1, wherein before the step of extracting data of a packet header in the current data stream as the first sample in case of completion of receiving the current data stream, the method further comprises:
sequentially receiving data packets of a current data stream and acquiring quintuple information of the data packets;
judging whether a database stores the quintuple information or not, if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
and if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet to the storage area of the path corresponding to the quintuple information.
3. The method according to claim 1, wherein in a case that receiving the current data stream is completed, extracting header data of a data packet in the current data stream as a first sample includes:
judging whether each data packet of the current data stream contains an end identifier, if one data packet contains the end identifier, receiving the data stream is finished, and extracting packet header data of the data packet in the data stream as a first sample.
4. The method according to claim 1, wherein in a case that receiving the current data stream is completed, extracting header data of a data packet in the current data stream as a first sample includes:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream;
and encoding packet header data of a data packet in the current data stream to obtain a vector with a fixed dimension, and taking the vector with the fixed dimension as a first sample.
5. The method according to claim 1, wherein if the first sample is a sample of a new class in a case where the first sample is located within a boundary distance of a cluster, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, the method includes:
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as that of the first sample exists in a second training sample set, judging that the first sample is not a sample of a new category, and updating the parameters of a preset machine recognition model; the second training sample set is a set of data streams used for training the machine recognition model;
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample which is the same as the first sample in category does not exist in a second training sample set, the first sample is judged to be a sample of a new category, an output node is added to an output node of a preset machine recognition model, and the machine recognition model with the added output node is used as an online recognition model.
6. The method according to claim 1, wherein if the first sample is a sample of a new class in a case where the first sample is located within a boundary distance of a cluster, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, the method includes:
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding one dimension to a parameter dimension in a preset machine identification model, and taking the machine identification model with the parameter dimension added as an online identification model.
7. The method according to claim 1, wherein if the first sample is a sample of a new class in a case where the first sample is located within a boundary distance of a cluster, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, the method includes:
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node in output nodes of a preset machine identification model, and taking the machine identification model with the added output node as a basic identification model;
inputting the first sample into the basic recognition model, and calculating partial derivatives of loss functions of the basic recognition model to output layer weights and biases in the basic recognition model;
updating the weight and the bias of the output layer of the basic recognition model in the direction of gradient descent by using a parameter updating formula; the parameter updating formula comprises the results of multiplying the learning rate by the partial derivatives of the loss function of the basic recognition model with respect to the output layer weight and the output layer bias of the basic recognition model, respectively;
and determining the basic recognition model after updating the weight and the bias as an online recognition model.
8. The method of claim 1, wherein identifying the category of the next data stream after the current data stream using the online identification model comprises:
under the condition that the number of data packets in the next data stream after the current data stream reaches the preset number, extracting data of a data packet header in the next data stream as a second sample;
inputting the second sample into the online identification model, and outputting the category of the second sample by using the online identification model.
9. A network traffic identification device, applied to a server, the device comprising:
a sample module, configured to extract packet header data of a data packet in a current data stream as a first sample under the condition that the current data stream is received;
a supervision module, configured to input the first sample into a semi-supervised model and output, by using the semi-supervised model, the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster; the semi-supervised model is obtained by utilizing a first training sample set for training and comprises the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set comprises samples with at least one class label; the distribution relation determines whether the samples with the class labels are positioned in the boundary distance of the clusters;
a changing module, configured to, when the first sample is located within a boundary distance of a cluster, if the first sample is a new-class sample, add an output node to output nodes of a preset machine identification model, and use the machine identification model after the output node is added as an online identification model;
the identification module is used for identifying the category of the next data stream after the current data stream by using the online identification model;
the supervision module is specifically configured to:
inputting the first sample into a semi-supervised model, and outputting the category of the first sample by using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking the sample of which the local density exceeds a density threshold value and the minimum distance exceeds a distance threshold value as a third sample; the first training sample set is a set consisting of samples used for training the semi-supervised model;
adding the third sample to a cluster and determining the third sample as a cluster center point of the cluster; the number of the clusters is the same as that of the third samples, and each cluster is provided with only one third sample;
determining that the first sample is not within the cluster boundary distance if the distance between the first sample and the third sample exceeds the cluster boundary distance;
determining that the first sample is within the cluster boundary distance if the distance between the first sample and the third sample does not exceed the cluster boundary distance.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910036196.2A CN109873774B (en) 2019-01-15 2019-01-15 Network traffic identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910036196.2A CN109873774B (en) 2019-01-15 2019-01-15 Network traffic identification method and device

Publications (2)

Publication Number Publication Date
CN109873774A CN109873774A (en) 2019-06-11
CN109873774B true CN109873774B (en) 2021-01-01

Family

ID=66917604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910036196.2A Active CN109873774B (en) 2019-01-15 2019-01-15 Network traffic identification method and device

Country Status (1)

Country Link
CN (1) CN109873774B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111447151A (en) * 2019-10-30 2020-07-24 长沙理工大学 Attention mechanism-based time-space characteristic flow classification research method
CN113326946A (en) * 2020-02-29 2021-08-31 华为技术有限公司 Method, device and storage medium for updating application recognition model
CN111614514B (en) * 2020-04-30 2021-09-24 北京邮电大学 Network traffic identification method and device
WO2022083509A1 (en) * 2020-10-19 2022-04-28 华为技术有限公司 Data stream identification method and device
CN112367334A (en) * 2020-11-23 2021-02-12 中国科学院信息工程研究所 Network traffic identification method and device, electronic equipment and storage medium
CN113472654B (en) * 2021-05-31 2022-11-15 济南浪潮数据技术有限公司 Network traffic data forwarding method, device, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150578A (en) * 2013-04-09 2013-06-12 山东师范大学 Training method of SVM (Support Vector Machine) classifier based on semi-supervised learning
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering
CN107729952A (en) * 2017-11-29 2018-02-23 新华三信息安全技术有限公司 A kind of traffic flow classification method and device
CN107846326A (en) * 2017-11-10 2018-03-27 北京邮电大学 A kind of adaptive semi-supervised net flow assorted method, system and equipment
CN108900432A (en) * 2018-07-05 2018-11-27 中山大学 A kind of perception of content method based on network Flow Behavior
CN109067612A (en) * 2018-07-13 2018-12-21 哈尔滨工程大学 A kind of online method for recognizing flux based on incremental clustering algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11122058B2 (en) * 2014-07-23 2021-09-14 Seclytics, Inc. System and method for the automated detection and prediction of online threats


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于机器学习的网络流量分类系统设计与实现;梅国薇;《中国优秀硕士学位论文全文数据库》;20180630;39-59 *

Also Published As

Publication number Publication date
CN109873774A (en) 2019-06-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant