CN109873774A

CN109873774A - Network traffic identification method and device

Info

Publication number: CN109873774A
Application number: CN201910036196.2A
Authority: CN
Inventors: 廖青; 赵晶玲; 李天琦; 刘月
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-01-15
Filing date: 2019-01-15
Publication date: 2019-06-11
Anticipated expiration: 2039-01-15
Also published as: CN109873774B

Abstract

Embodiments of the present invention provide a network traffic identification method and device. The method includes: when reception of the current data stream is complete, extracting the header data of the data packets in the current data stream as a first sample; inputting the first sample into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster and is a sample of a new category, adding one output node to the output nodes of a preset machine recognition model and using the model with the added output node as the online recognition model; and then identifying the category of the next data stream after the current data stream. Compared with the prior art, the embodiments change the structure of the machine recognition model and use the restructured model to identify the category of the next data stream, which can improve the real-time performance of data stream category identification.

Description

Network traffic identification method and device

Technical field

The present invention relates to the field of communication technology, and in particular to a network traffic identification method and device.

Background art

Traffic is the main carrier of data transmitted over a network, and traffic identification is a key step in network monitoring: only once traffic has been identified can different monitoring policies be applied to different traffic, for example rejection, optimization, marking, or priority classification. Network traffic identification is therefore essential. Network traffic is generally transmitted as data streams. Each data stream contains multiple data packets, and each data packet contains header data of a fixed number of bytes. Features can be derived from the header data, including the time interval between packets, the stream duration, and the mean and variance of the packet size.

In the prior art, network traffic is identified with machine-learning-based methods. Such a method uses machine learning to mine the features of packet header data, trains a machine learning model on them, and then inputs a data stream into the trained model to output the category of the online network traffic. The model is trained as follows: the features of the packet header data across the whole data stream are computed, the features of all or part of the header data in the whole stream are selected as samples, and the model is trained on those samples. The resulting model is an offline model whose internal structure is fixed.

Because the network environment changes in real time, the features of data streams change as well. A machine learning model with a fixed internal structure cannot identify the category of online network traffic in a timely manner, so the real-time performance of prior-art online traffic identification is poor.

Summary of the invention

The embodiments of the present invention aim to provide a network traffic identification method and device that improve the real-time performance of data stream category identification. The specific technical solutions are as follows:

In a first aspect, an embodiment of the present invention provides a network traffic identification method, applied to a server, the method including:

When reception of the current data stream is complete, extracting the header data of the data packets in the current data stream as a first sample;

Inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; the semi-supervised model is trained on a first training sample set and contains the categories of the header data already obtained and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines whether a sample with a category label lies within the boundary distance of a cluster;

When the first sample lies within the boundary distance of a cluster and is a sample of a new category, adding one output node to the output nodes of a preset machine recognition model, and using the machine recognition model with the added output node as an online recognition model;

Using the online recognition model to identify the category of the next data stream after the current data stream.

Optionally, before the step of extracting the header data of the data packets in the current data stream as the first sample when reception of the current data stream is complete, the method further includes:

Receiving the data packets of the current data stream in sequence, and obtaining the five-tuple information of each data packet;

Determining whether a database already stores the five-tuple information, and if so, saving the header data of the data packet to the storage area at the path corresponding to the five-tuple information;

If the database does not store the five-tuple information, creating a storage area at the path corresponding to the five-tuple information, and saving the header data of the data packet to that storage area.

Optionally, extracting the header data of the data packets in the current data stream as the first sample when reception of the current data stream is complete includes:

Determining whether each data packet of the current data stream contains an end flag; if any data packet contains the end flag, reception of the data stream is complete, and the header data of the data packets in the data stream is extracted as the first sample.

When reception of the current data stream is complete, extracting the header data of the data packets in the current data stream;

Encoding the header data of the data packets in the current data stream to obtain a vector of fixed dimension, and using the vector of fixed dimension as the first sample.

Optionally, inputting the first sample into the semi-supervised model and using the semi-supervised model to output the category of the first sample and the result indicating whether the first sample lies within the boundary distance of a cluster includes:

Inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the category of the first sample;

Computing the local density and minimum distance of each sample in the first training sample set, and taking each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used to train the semi-supervised model;

Adding each third sample to a cluster and designating it the cluster center of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;

If the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determining that the first sample is not within the boundary distance of the cluster;

If the distance between the first sample and the third sample is less than the boundary distance of the cluster, determining that the first sample is within the boundary distance of the cluster.

Optionally, when the first sample lies within the boundary distance of a cluster and is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model includes:

When the first sample lies within the boundary distance of a cluster, if the second training sample set contains a second sample of the same category as the first sample, determining that the first sample is not a sample of a new category, and updating the parameters of the preset machine recognition model; the second training sample set is the set of data streams used to train the machine recognition model;

When the first sample lies within the boundary distance of a cluster, if the second training sample set contains no second sample of the same category as the first sample, determining that the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model.

When the first sample lies within the boundary distance of a cluster and is a sample of a new category, increasing the parameter dimension of the preset machine recognition model by one, and using the machine recognition model with the increased parameter dimension as the online recognition model.

When the first sample lies within the boundary distance of a cluster and is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as a base recognition model;

Inputting the first sample into the base recognition model, and computing the partial derivatives of the loss function of the base recognition model with respect to the weights and biases of its output layer;

In the direction of gradient descent, updating the weights and biases of the output layer of the base recognition model using a parameter update formula; the parameter update formula includes the products of a proficiency increment with the partial derivatives of the loss function of the base recognition model with respect to the output-layer weights and biases, respectively;

Determining the base recognition model with the updated weights and biases as the online recognition model.
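The output-layer update described above can be illustrated with a minimal Python sketch. It assumes a softmax output layer with cross-entropy loss, which the text does not specify, and treats the proficiency increment as a given scalar `increment`. All function and parameter names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def update_output_layer(weights, biases, features, label, increment):
    """One gradient-descent step on the output layer only.

    weights:   one row of input weights per output node
    biases:    one bias per output node
    features:  vector fed into the output layer
    label:     index of the true category
    increment: proficiency increment, used here as the step size (assumption)
    """
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    # Assumed loss: softmax + cross-entropy, for which dL/dz_k = p_k - y_k.
    grad_z = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    new_weights = [[w - increment * g * x for w, x in zip(row, features)]
                   for row, g in zip(weights, grad_z)]
    new_biases = [b - increment * g for b, g in zip(biases, grad_z)]
    return new_weights, new_biases
```

Only the output layer is touched, so a step of this kind is cheap enough to run each time a new sample arrives.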

Optionally, using the online recognition model to identify the category of the next data stream after the current data stream includes:

When the number of data packets in the next data stream after the current data stream reaches a preset number, extracting the header data of the data packets in the next data stream as a second sample;

Inputting the second sample into the online recognition model, and using the online recognition model to output the category of the second sample.

In a second aspect, an embodiment of the present invention provides a network traffic identification device, applied to a server, the device including:

A sample module, configured to extract the header data of the data packets in the current data stream as a first sample when reception of the current data stream is complete;

A supervision module, configured to input the first sample into a semi-supervised model, and use the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; the semi-supervised model is trained on a first training sample set and contains the categories of the header data already obtained and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines whether a sample with a category label lies within the boundary distance of a cluster;

A change module, configured to, when the first sample lies within the boundary distance of a cluster and is a sample of a new category, add one output node to the output nodes of a preset machine recognition model and use the machine recognition model with the added output node as an online recognition model;

An identification module, configured to use the online recognition model to identify the category of the next data stream after the current data stream.

Optionally, the network traffic identification device provided by the embodiment of the present invention further includes:

A storage unit, configured to receive the data packets of the current data stream in sequence and obtain the five-tuple information of each data packet;

Optionally, the sample module is specifically configured to:

Optionally, the supervision module is specifically configured to:

Optionally, the change module is specifically configured to:

Optionally, the identification module is specifically configured to:

In another aspect of the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute any of the network traffic identification methods described above.

In yet another aspect of the present invention, an embodiment further provides a computer program product containing instructions which, when run on a computer, cause the computer to execute any of the network traffic identification methods described above.

In the network traffic identification method and device provided by the embodiments of the present invention, when reception of the current data stream is complete, the header data of the data packets in the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster and is a sample of a new category, one output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is used as the online recognition model; the online recognition model is then used to identify the category of the next data stream after the current data stream. Compared with the prior art, the embodiments identify the category of the current data stream with the semi-supervised model, determine from the identified category whether the current data stream is a sample of a new category, change the structure of the machine recognition model, and use the restructured model as the online recognition model to identify the category of the next data stream, which can improve the real-time performance of data stream category identification.

Of course, implementing any product or method of the present invention does not necessarily require achieving all of the above advantages at the same time.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below.

Fig. 1 is a flow chart of a network traffic identification method provided by an embodiment of the present invention;

Fig. 2 is a flow chart of storing the current data stream, provided by an embodiment of the present invention;

Fig. 3 is a structure diagram of the preset semi-supervised model provided by an embodiment of the present invention;

Fig. 4 is a diagram of the recurrent network structure of the LSTM encoder provided by an embodiment of the present invention;

Fig. 5 is an internal structure diagram of the LSTM encoder provided by an embodiment of the present invention;

Fig. 6 is a structure diagram of the preset machine recognition model provided by an embodiment of the present invention;

Fig. 7 is a structure diagram of the online recognition model provided by an embodiment of the present invention;

Fig. 8 is an effect diagram of the proficiency function under different parameters, provided by an embodiment of the present invention;

Fig. 9 is an effect diagram of the proficiency function under different parameters, provided by an embodiment of the present invention, with the number of correct identifications on the horizontal axis and the proficiency function on the vertical axis;

Fig. 10 is an effect diagram of the proficiency function under different β, provided by an embodiment of the present invention;

Fig. 11 is a structure diagram of a network traffic identification device provided by an embodiment of the present invention;

Fig. 12 is a structure diagram of an electronic device provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described below with reference to the drawings in the embodiments of the present invention.

In the network traffic identification method and device provided by the embodiments of the present invention, when reception of the current data stream is complete, the header data of the data packets in the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster and is a sample of a new category, one output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is used as the online recognition model; the online recognition model is then used to identify the category of the next data stream after the current data stream.

A network traffic identification method provided by an embodiment of the present invention is described first.

As shown in Fig. 1, a network traffic identification method provided by an embodiment of the present invention is applied to a server, and the method includes:

S101: when reception of the current data stream is complete, extracting the header data of the data packets in the current data stream as a first sample;

Before step S101, the network traffic identification method provided by the embodiment of the present invention further includes storing the current data stream.

As shown in Fig. 2, storing the current data stream includes the following steps:

S201: receiving the data packets of the current data stream in sequence, and obtaining the five-tuple information of each data packet;

The five-tuple information is: source IP address, destination IP address, source port number, destination port number, and transport layer protocol.

S202: determining whether a database already stores the five-tuple information, and if so, saving the header data of the data packet to the storage area at the path corresponding to the five-tuple information;

S203: if the database does not store the five-tuple information, creating a storage area at the path corresponding to the five-tuple information, and saving the header data of the data packet to that storage area.

By storing the header data of each data packet in the storage area corresponding to its five-tuple information, the embodiment of the present invention can improve the efficiency of looking up the header data of data packets that belong to the same data stream.
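Steps S201 to S203 can be sketched in Python as follows. This is a minimal illustration that uses an in-memory dictionary in place of the database and storage paths mentioned in the text; the names `FiveTuple` and `FlowStore` are hypothetical.

```python
from typing import Dict, List, NamedTuple

class FiveTuple(NamedTuple):
    """Key identifying one data stream (fields as listed under S201)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str

class FlowStore:
    """In-memory stand-in for the database and per-stream storage areas."""

    def __init__(self) -> None:
        self._areas: Dict[FiveTuple, List[bytes]] = {}

    def add_packet(self, key: FiveTuple, header: bytes) -> None:
        if key not in self._areas:
            # S203: five-tuple not stored yet, so create its storage area.
            self._areas[key] = []
        # S202 / S203: save the header data in the area for this five-tuple.
        self._areas[key].append(header)

    def headers(self, key: FiveTuple) -> List[bytes]:
        """All stored header data of one data stream, in arrival order."""
        return list(self._areas.get(key, []))
```

Keying the storage by five-tuple means that all headers of a stream are retrieved with a single lookup, which is the efficiency gain the paragraph above refers to.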

To improve the real-time performance of data stream category identification, the first sample in S101 can be obtained using at least one of the following embodiments:

In one possible embodiment, it is determined whether each data packet of the current data stream contains an end flag; if any data packet contains the end flag, reception of the data stream is complete, and the header data of the data packets in the data stream is extracted as the first sample.

It should be understood that each data packet carries an end flag; when transmission of a data stream ends, the end flag of the last data packet in that stream changes. By checking whether a data packet in the stream carries a set end flag, this embodiment can quickly determine whether a data stream has been fully received.
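A minimal Python sketch of this check follows. The text does not say which field serves as the end flag; the sketch assumes TCP over Ethernet II and IPv4 without IP options, and treats the TCP FIN bit as the end flag. The function names and the fixed offset are illustrative assumptions.

```python
# Assumed layout: 14-byte frame header + 20-byte IP header, then the TCP
# header, whose flags byte sits at offset 13 within the TCP header.
TCP_FLAGS_OFFSET = 14 + 20 + 13
FIN_BIT = 0x01  # FIN is the least significant bit of the TCP flags byte

def has_end_flag(header: bytes) -> bool:
    """True if the header is long enough and its FIN bit is set."""
    if len(header) <= TCP_FLAGS_OFFSET:
        return False
    return bool(header[TCP_FLAGS_OFFSET] & FIN_BIT)

def stream_complete(headers) -> bool:
    """A stream counts as fully received once any packet has the end flag."""
    return any(has_end_flag(h) for h in headers)
```

For other transport protocols a different completion signal, such as an idle timeout, would be needed; that case is outside this sketch.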

In one possible embodiment, the first sample is obtained by the following steps:

Step 1: when reception of the current data stream is complete, extracting the header data of the data packets in the current data stream;

Step 2: encoding the header data of the data packets in the current data stream to obtain a vector of fixed dimension, and using the vector of fixed dimension as the first sample.

It should be understood that a network flow consists of a series of data packets, and the header format of each packet is highly regular: it contains a fixed number of bytes, with each field having its own value. For example, the header data of a TCP packet, excluding optional fields, totals 54 bytes: a 14-byte frame header, a 20-byte IP header, and a 20-byte TCP header. If a stream has $p$ data packets and each header contains $q$ bytes, converting every byte to an unsigned integer and taking each header as one row yields a fixed-dimension matrix $X \in \mathbb{R}^{p \times q}$ whose elements are integers in [0, 255]. This embodiment therefore encodes the header data into a vector of fixed dimension $p \times q$ and uses that vector as the first sample, which improves the efficiency of identifying the category of the first sample.
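The byte-to-integer conversion can be sketched in a few lines of Python. The function name `headers_to_matrix` is hypothetical, and the handling of headers shorter or longer than `q` bytes (zero padding and truncation) is an assumption, since the text only covers headers of exactly `q` bytes.

```python
def headers_to_matrix(headers, q=54):
    """Turn p packet headers into a p x q matrix of integers in [0, 255].

    Each header becomes one row; every byte is read as an unsigned integer.
    Headers longer than q bytes are truncated and shorter ones are padded
    with zeros (an assumption made for this sketch).
    """
    matrix = []
    for header in headers:
        row = list(header[:q])            # iterating bytes yields ints 0..255
        row.extend([0] * (q - len(row)))  # pad short headers
        matrix.append(row)
    return matrix
```

With `q=54` this matches the 14 + 20 + 20 byte TCP example given above.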

S102: inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster;

The semi-supervised model is trained on a first training sample set and contains the categories of the header data already obtained and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines whether a sample with a category label lies within the boundary distance of a cluster.

It should be understood that the first training sample set used to train the semi-supervised model contains a portion of samples with category labels; a labeled sample is the header data of a data stream whose category has already been determined. The remaining samples have no category label. The samples of the first training sample set are divided into several clusters; once the distribution of the samples in each cluster is determined, the distribution of the labeled and unlabeled samples in each cluster is determined as well. The semi-supervised model obtained by training on the labeled samples therefore contains the result of whether each labeled sample lies within the boundary distance of a cluster.

In one possible embodiment, the semi-supervised model can be obtained as follows:

Step 1: training a preset semi-supervised model on the samples in the first training sample set to obtain a trained semi-supervised model;

As shown in Fig. 3, the preset semi-supervised model consists of an LSTM (Long Short-Term Memory) encoder, a softmax layer, and a CFSFDP (Clustering by Fast Search and Find of Density Peaks) layer. Samples carrying category labels are input into the LSTM encoder, which encodes a data stream of variable length into a vector of fixed dimension. Because this vector is the output for the entire data stream sequence, it can represent the features of all data packets in the whole stream. The softmax layer maps the fixed-dimension vector to a fixed set of categories and can output the category of the header data. The softmax layer is then removed from the preset semi-supervised model; both the samples carrying category labels and the samples not carrying them are input into the LSTM encoder, and the encoder output is fed into the CFSFDP clustering layer, which uses the CFSFDP algorithm to determine whether a fixed-dimension vector is a cluster center and to determine the category of each sample.

As shown in Figs. 4 and 5, the LSTM encoder encodes a data stream into a vector of fixed dimension as follows:

For example, for a data stream $\{x_0, x_1, \ldots, x_{t-1}, x_t\}$, the packets are input in order into the recurrent network structure of Fig. 4. Each input $x_t$ produces an output $h_t$ and passes the current state on to the next input, so the output $h_t$ contains information from $x_t$ as well as from $x_0$ through $x_{t-1}$.

The internal structure of the LSTM encoder is shown in Fig. 5. At each step the encoder receives the input $x_t$, the previous output $h_{t-1}$, and the encoder state $C_{t-1}$ left after the previous input $x_{t-1}$. Assume the output $h_t$ of each step has 128 dimensions and that a data stream contains $n$ data packets in total. All data packets of the stream are treated as one sequence, the header data of each packet being $x_t$; the output for the last input $x_n$ of the stream is $h_n$.

Here, $x_t$ is one item of header data; $t$ is the index of the header data; $n$ is the total number of data packets in a data stream; $h_t$ is the output of the LSTM encoder when its input is $x_t$; and $h_n$ is the last output of the LSTM encoder after the data stream has been input, that is, the encoded vector of fixed dimension.

Step 2: inputting the test samples of a test sample set into the trained semi-supervised model, and using the trained semi-supervised model to output the category of each test sample;

It should be understood that all header data in one data stream share the same category label; a test sample is the header data of a whole data stream, and the category label identifies the category of the test sample.

Step 3: based on the categories output by the trained semi-supervised model for the test samples and the labeled categories of those test samples, determining whether the trained semi-supervised model meets the test criteria;

The test criteria are: the accuracy reaches an accuracy threshold, the precision reaches a precision threshold, the recall reaches a recall threshold, the F1 score reaches an F1 score threshold, and/or the $F_\beta$ score reaches an $F_\beta$ score threshold.

The accuracy threshold, precision threshold, recall threshold, F1 score threshold, and $F_\beta$ score threshold are preset values; the accuracy, precision, recall, F1 score, and $F_\beta$ score are calculated with their respective formulas.
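The text does not spell out the metric formulas. The Python sketch below uses the standard definitions, computed for one category treated as the positive class; the function name and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
def classification_metrics(y_true, y_pred, positive, beta=1.0):
    """Accuracy, precision, recall and F-beta for one positive category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1.
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_beta": f_beta}
```

Each returned value would then be compared against its preset threshold to decide whether the trained model meets the test criteria.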

Step 4: if the trained semi-supervised model does not meet the test criteria, updating the parameters of the LSTM encoder in the trained semi-supervised model until the trained semi-supervised model meets the test criteria;

The encoder output formulas are as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \cdot \tanh(C_t)$$

The internal parameters of the LSTM encoder are $W_*$ and $b_*$, where $W_*$ denotes a weight parameter and $b_*$ a bias parameter. $f_t$ is the activation value of the forget gate at the current step; $\sigma$ is the sigmoid function; $W_f$ is the weight of the forget gate; $h_{t-1}$ is the output of the previous step; $x_t$ is the input of the current step; $b_f$ is the bias of the forget gate; $i_t$ is the activation value of the input gate at the current step; $W_i$ is the weight of the input gate; $b_i$ is the bias of the input gate; $\tilde{C}_t$ is the intermediate state of the current step; $W_C$ is the state weight; $b_C$ is the state bias; $C_{t-1}$ is the state of the previous step; $C_t$ is the state of the current step; $o_t$ is the activation value of the output gate at the current step; $W_o$ is the weight of the output gate; and $b_o$ is the bias of the output gate. The subscripts $i$, $f$, $o$, and $C$ refer to the input gate, the forget gate, the output gate, and the state, respectively, and $t$ is the index of the data packet.

If the trained semi-supervised model does not meet the test criteria, the parameters $W_f$, $b_f$, $W_i$, $b_i$, $W_C$, $b_C$, $W_o$, $b_o$ of the LSTM encoder in the trained semi-supervised model are updated.
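The gate computations can be made concrete with a small pure-Python sketch of one LSTM step and of encoding a whole stream. It uses the standard LSTM cell equations, with the intermediate state and state update written in their usual form; the zero initial state, the dictionary layout of `params`, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, z):
    """Compute W . z + b for a weight matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, z)) + bi for row, bi in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x):
    """One encoder step: returns (h_t, C_t) from h_{t-1}, C_{t-1} and x_t."""
    z = h_prev + x  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]
    c_tilde = [math.tanh(a) for a in affine(params["W_C"], params["b_C"], z)]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def encode_stream(params, packets, hidden):
    """Feed all packets in order; the last output h_n is the encoding."""
    h = [0.0] * hidden  # zero initial output (assumption)
    c = [0.0] * hidden  # zero initial state (assumption)
    for x in packets:
        h, c = lstm_step(params, h, c, x)
    return h
```

The length of the returned vector equals `hidden` regardless of how many packets the stream contains, which is why the encoder output has a fixed dimension.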

Step 5: determining the trained semi-supervised model that meets the test criteria as the semi-supervised model.

By determining the trained preset semi-supervised model that meets the test criteria as the semi-supervised model, this embodiment can improve the accuracy of determining the category of the first sample.

S103: when the first sample lies within the boundary distance of a cluster and is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model;

S104: using the online recognition model to identify the category of the next data stream after the current data stream.

Compared with the prior art, the embodiment of the present invention identifies the category of the current data stream with the semi-supervised model, determines from the identified category whether the current data stream is a sample of a new category, changes the structure of the machine recognition model, and uses the restructured model as the online recognition model to identify the category of the next data stream. This improves the ability of the online recognition model to adapt to the network environment and can improve the real-time performance of data stream category identification.
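Adding an output node amounts to giving the output layer one more row of weights and one more bias. The Python sketch below shows that structural change under the assumption of a fully connected output layer; the initial value of the new parameters (`init`, zero by default) and the function name are illustrative choices that the text does not specify.

```python
def add_output_node(weights, biases, init=0.0):
    """Return copies of the output-layer parameters with one extra node.

    weights: one row of input weights per existing output node
    biases:  one bias per existing output node
    init:    initial value for the new node's parameters (assumption)
    """
    n_inputs = len(weights[0])
    new_weights = [row[:] for row in weights] + [[init] * n_inputs]
    new_biases = biases[:] + [init]
    return new_weights, new_biases
```

The existing rows are copied unchanged, so what the model has learned about the known categories is preserved while the new category gains its own output.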

To improve the real-time performance of data stream category identification, S102 can obtain the category of the first sample and the result indicating whether the first sample lies within the boundary distance of a cluster using at least one of the following embodiments:

In one possible embodiment, the category of the first sample and the result indicating whether the first sample lies within the boundary distance of a cluster are obtained as follows:

Step 1: inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the category of the first sample;

Step 2: computing the local density and minimum distance of each sample in the first training sample set, and taking each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used to train the semi-supervised model;

Step 3: adding each third sample to a cluster and designating it the cluster center of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;

Step 4: if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determining that the first sample is not within the boundary distance of the cluster;

Here, the region within the boundary distance of a cluster is the spherical region centered on the cluster center with a radius equal to the boundary distance.

Step 5: if the distance between the first sample and the third sample is less than the boundary distance of the cluster, determining that the first sample is within the boundary distance of the cluster.

First, assume that the first training sample set is S = {x_j | j ∈ I_S}, where I_S = {1, 2, ..., n} is the index set and d_ij denotes the distance between sample x_i and sample x_j. For each sample x_i, calculate the local density φ_i and the minimum distance θ_i, giving Φ = {φ_i | i ∈ I_S} and Θ = {θ_i | i ∈ I_S}, where i and j are positive integers, Φ is the set of local densities, and Θ is the set of minimum distances.

The local density φ_i is calculated as $\varphi_i = \sum_{j \in I_S \setminus \{i\}} \chi(d_{ij} - d_c)$ or $\varphi_i = \sum_{j \in I_S \setminus \{i\}} e^{-(d_{ij}/d_c)^2}$, where d_c is the truncation distance, a preset value; the frontier distance is m times the truncation distance, m being a preset value; χ(·) is a step function, $\chi(x) = 1$ for $x < 0$ and $\chi(x) = 0$ for $x \ge 0$, where x is the input of the step function.

From the local density formula $\varphi_i = \sum_{j \in I_S \setminus \{i\}} e^{-(d_{ij}/d_c)^2}$ it can be seen that the relative size of φ_i is related to the number of samples whose distance to x_i is less than d_c: the more samples lie closer than d_c, the larger φ_i is. This formula turns φ_i from a discrete value into a continuous value, which improves the accuracy of the local density calculation.

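The minimum distance θ_i is calculated as $\theta_i = \min_{j:\, \varphi_j > \varphi_i} d_{ij}$ when some sample has a higher local density than x_i, and $\theta_i = \max_{j} d_{ij}$ when x_i has the highest local density. Each sample whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold is taken as a third sample.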

The distance between the third sample and the first sample, and the distance d_ij, may be calculated using the Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, or cosine similarity distance.

In this embodiment, the local density and the minimum distance of each sample in the first training sample set are calculated, and the first sample is determined to be located within the frontier distance of the cluster if its distance to the third sample is less than the frontier distance of the cluster. This can improve the efficiency of determining whether the first sample is located within the frontier distance of a cluster.
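The selection of third samples and the frontier check described above can be sketched as follows. This is a minimal illustration only, assuming Euclidean distance and the step-function (cutoff) form of the local density; the function names, the toy data, and the threshold values (`density_threshold`, `distance_threshold`, `m`) are hypothetical and are not taken from the embodiment.

```python
import numpy as np

def density_and_min_distance(samples, d_c):
    """Local density phi_i (cutoff kernel) and minimum distance theta_i per sample."""
    diff = samples[:, None, :] - samples[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise Euclidean distances d_ij
    phi = (d < d_c).sum(axis=1) - 1             # neighbours closer than d_c, excluding self
    theta = np.empty(len(samples))
    for i in range(len(samples)):
        denser = phi > phi[i]
        # Nearest sample of higher density; the densest sample takes its largest distance.
        theta[i] = d[i, denser].min() if denser.any() else d[i].max()
    return phi, theta

def within_frontier(first_sample, centres, d_c, m):
    """True if first_sample is closer than m * d_c to some cluster central point."""
    dist = np.linalg.norm(centres - first_sample, axis=1)
    return bool((dist < m * d_c).any())

# Hypothetical training set: one group of five samples and one group of four.
S = np.array([[0.0, 0.0], [0.3, 0.0], [-0.3, 0.0], [0.0, 0.3], [0.0, -0.3],
              [5.0, 5.0], [5.3, 5.0], [4.7, 5.0], [5.0, 5.3]])
d_c = 0.35
density_threshold, distance_threshold = 2, 1.0  # assumed values

phi, theta = density_and_min_distance(S, d_c)
centres = S[(phi > density_threshold) & (theta > distance_threshold)]  # third samples

inside = within_frontier(np.array([0.2, 0.2]), centres, d_c, m=2)
outside = within_frontier(np.array([2.5, 2.5]), centres, d_c, m=2)
```

In this toy data the two group centres are the only samples that are both dense and far from any denser sample, so they become the cluster central points; a first sample near one of them falls within the frontier distance, while a sample midway between the groups does not.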

After the step of obtaining, by the above embodiment, the category of the first sample and the result of whether the first sample is located within the frontier distance of a cluster, the network flow identification method provided by the embodiment of the present invention further includes: updating the cluster central point of the cluster.

In one possible embodiment, the cluster central point of a cluster is updated as follows:

Step 1: based on the local density and the minimum distance of each sample in the first training sample set, take each sample whose local density exceeds the density threshold and whose minimum distance is less than the distance threshold as a fourth sample;

Step 2: add each fourth sample to the cluster containing the cluster central point nearest to that fourth sample;

Step 3: if the first sample is located within the frontier distance of a cluster, add the first sample to the cluster containing the cluster central point nearest to the first sample;

Step 4: calculate the local density and the minimum distance of the samples in each cluster; for each cluster, determine the sample whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold as the cluster central point of that cluster, and take the cluster with the updated central point as the updated cluster.

By updating the cluster central point of the cluster, this embodiment can improve the accuracy of determining whether the first sample is located within the frontier distance of a cluster.

In another possible embodiment, the category of the first sample and the result of whether the first sample is located within the frontier distance of a cluster are obtained as follows:

Step 1: input the first sample into the semi-supervised model, and use the output nodes of the semi-supervised model to output the probabilities that the first sample belongs to the different categories and the result of whether the first sample is located within the frontier distance of a cluster; each output node corresponds to one category to which the first sample may belong.

For example, the g-th output node outputs the probability that the first sample belongs to the g-th category.

Step 2: among the probabilities output by the output nodes, select the category with the highest probability as the category of the first sample.

By selecting the category with the highest probability as the category of the first sample, this embodiment can improve the accuracy of determining the category of the first sample.
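Step 2 above amounts to taking the index of the largest output probability. A minimal sketch, in which the function name `pick_category` and the probability values are hypothetical:

```python
def pick_category(probabilities):
    """Return the index g of the output node with the highest probability."""
    return max(range(len(probabilities)), key=lambda g: probabilities[g])

# Hypothetical outputs of three output nodes for one first sample.
probs = [0.05, 0.70, 0.25]
category = pick_category(probs)  # index of the most probable category
```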

To improve the real-time performance of identifying data stream categories, S103 above may use at least one of the following embodiments to obtain the online recognition model:

In one possible embodiment, the online recognition model is obtained as follows:

Step 1: when the first sample is located within the frontier distance of a cluster, if the second training sample set contains a second sample with the same category as the first sample, determine that the first sample is not a sample of a new category, and update the parameters of the preset machine recognition model; the second training sample set is the set of data streams used to train the machine recognition model;

Step 2: when the first sample is located within the frontier distance of a cluster, if the second training sample set contains no second sample with the same category as the first sample, determine that the first sample is a sample of a new category, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the online recognition model.

Referring to Fig. 6 and Fig. 7, the preset machine recognition model M_o uses a CNN (Convolutional Neural Network). The output layer of M_o contains K nodes, the layer preceding the output layer contains J nodes, and the connections between the output layer and the preceding layer represent the parameters of the output layer. When the first sample is located within the frontier distance of a cluster, if the second training sample set contains a second sample with the same category as the first sample, the first sample is determined not to be a sample of a new category, and the parameters between the output layer of the preset machine recognition model and the preceding layer are updated. When the first sample is located within the frontier distance of a cluster, if the second training sample set contains no second sample with the same category as the first sample, the first sample is determined to be a sample of a new category, one output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output node is used as the online recognition model.

In another possible embodiment, when the first sample is located within the frontier distance of a cluster, if the first sample is a sample of a new category, the dimension of the parameters of the preset machine recognition model is increased by one, and the machine recognition model with the increased parameter dimension is used as the online recognition model.

When the first sample belongs to a new category, i.e. X ∈ C_{K+1}, then for the parameterized preset machine recognition model, adding one output node means increasing the dimension of the output layer parameters of the preset machine recognition model by one: W ∈ R^{J×K} → W ∈ R^{J×(K+1)}, b ∈ R^K → b ∈ R^{K+1}, ρ ∈ R^K → ρ ∈ R^{K+1}, where W is the set of weights, R is the set of real numbers, b is the set of biases, ρ is the set of proficiencies, → denotes assignment, and K is the total number of output nodes.
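The dimension change W ∈ R^{J×K} → W ∈ R^{J×(K+1)}, b ∈ R^K → b ∈ R^{K+1}, ρ ∈ R^K → ρ ∈ R^{K+1} can be sketched with NumPy arrays as below. This is an illustrative sketch only: the function name `add_output_node`, the small random initialisation of the new weight column, and the zero initial bias are assumptions; the zero initial proficiency follows the statement in this description that the initial value of the proficiency is 0.

```python
import numpy as np

def add_output_node(W, b, rho, rng):
    """Extend output-layer parameters from K to K + 1 output nodes."""
    J = W.shape[0]
    new_column = rng.normal(scale=0.01, size=(J, 1))  # assumed initialisation
    W_new = np.hstack([W, new_column])                # J x (K + 1)
    b_new = np.append(b, 0.0)                         # assumed zero bias for the new node
    rho_new = np.append(rho, 0.0)                     # proficiency of a new category starts at 0
    return W_new, b_new, rho_new

rng = np.random.default_rng(0)
J, K = 4, 3
W = rng.normal(size=(J, K))
b = np.zeros(K)
rho = np.array([0.2, 0.5, 0.1])

W2, b2, rho2 = add_output_node(W, b, rho, rng)
```

The parameters of the existing K output nodes are copied unchanged, so the outputs for old categories are initially affected only through the softmax normalisation.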

In yet another possible embodiment, the online recognition model is obtained as follows:

Step 1: when the first sample is located within the frontier distance of a cluster, if the first sample is a sample of a new category, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the basic recognition model;

Step 2: input the first sample into the basic recognition model, and calculate the partial derivatives of the loss function of the basic recognition model with respect to the output layer weights and biases of the basic recognition model;

Step 3: in the direction of gradient descent, update the weights and biases of the output layer of the basic recognition model using the parameter update formula; the parameter update formula includes the products of the increment of the proficiency with the partial derivatives of the loss function of the basic recognition model with respect to the output layer weights and with respect to the output layer biases, respectively;

Step 4: determine the basic recognition model with the updated weights and biases as the online recognition model.

Assume that the output of the layer preceding the output layer of the basic recognition model is F = {f_j} ∈ R^J, the output of the output layer is Y = {y_k} ∈ R^K, the weights are W = {w_jk} ∈ R^{J×K}, and the biases are b = {b_k} ∈ R^K. The output layer of the basic recognition model is a softmax layer, so the cross-entropy loss function of the basic recognition model can be derived as $L = -\sum_{k=1}^{K} t_k \log y_k$, where T = {t_k} ∈ R^K is the one-hot encoding of the data stream category, y_k = g(z_k) is the activation value of the k-th output node, and g(·) is the softmax function: $g(z_k) = e^{z_k} / \sum_{i=1}^{K} e^{z_i}$, with $z_k = \sum_{j=1}^{J} w_{jk} f_j + b_k$. Here f_j is the feature activation value of the j-th node, b_k is the bias of the k-th node, R^K denotes dimension K, t_k is the k-th position of the one-hot encoding, z_k is the input of the k-th output node, and w_jk is the weight between node j of the layer preceding the output layer and output node k; i and k are serial numbers of output nodes and take positive integer values, k also being the serial number of a position in the one-hot encoding; j is the serial number of a node in the layer preceding the output layer; K is the total number of output nodes and J is the number of nodes in the layer preceding the output layer. Taking partial derivatives of the above loss function gives the gradients from which the increments of the weights and biases are obtained: $\partial L / \partial w_{jk} = (y_k - t_k) f_j$ and $\partial L / \partial b_k = y_k - t_k$, where $t_k = I\{X \in C_k\}$ and I{·} is the indicator function, equal to 1 when its condition holds and 0 otherwise.
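For a softmax output layer with cross-entropy loss, the partial derivatives with respect to the output layer parameters are ∂L/∂w_jk = (y_k − t_k)·f_j and ∂L/∂b_k = y_k − t_k. The sketch below computes them with NumPy and compares one entry against a finite-difference estimate. The function names, the random test values, and the use of the natural logarithm are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the maximum for numerical stability
    return e / e.sum()

def loss_and_gradients(F, W, b, T):
    """Cross-entropy loss of a softmax output layer and its gradients."""
    z = F @ W + b                  # z_k = sum_j w_jk * f_j + b_k
    y = softmax(z)                 # y_k = g(z_k)
    loss = -np.sum(T * np.log(y))  # L = -sum_k t_k * log(y_k)
    grad_W = np.outer(F, y - T)    # dL/dw_jk = f_j * (y_k - t_k)
    grad_b = y - T                 # dL/db_k  = y_k - t_k
    return loss, grad_W, grad_b

rng = np.random.default_rng(0)
J, K = 4, 3
F = rng.normal(size=J)            # hypothetical activations of the preceding layer
W = rng.normal(size=(J, K))
b = rng.normal(size=K)
T = np.array([0.0, 1.0, 0.0])     # one-hot label: category 1

loss, grad_W, grad_b = loss_and_gradients(F, W, b, T)

# Finite-difference estimate of dL/dw_21 for comparison with grad_W[2, 1].
eps = 1e-6
W_shift = W.copy()
W_shift[2, 1] += eps
numeric = (loss_and_gradients(F, W_shift, b, T)[0] - loss) / eps
```

A gradient descent step then moves each parameter against its partial derivative, which is the sense in which the increments of the weights and biases follow from these expressions.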

It can be understood that each time the basic recognition model is trained with samples of a new category to obtain the online recognition model, the parameters of the basic recognition model keep adapting to the samples of the new category and learning their features, while samples of the old categories no longer take part in training. This unrestricted training, in which the online learning model adapts freely, has a serious problem: when the internal parameters of the basic recognition model change, the features already learned are affected, and the previous ability of the basic recognition model may even be destroyed completely, causing serious errors when identifying samples that do not belong to the new category. This is the "catastrophic forgetting" problem.

To solve the catastrophic forgetting problem, the basic recognition model must strike a balance between learning samples of new categories and retaining old categories. When the stability of the basic recognition model is high, the model tends to retain the features of samples of old categories, and its ability to learn the features of samples of new categories is weakened. Conversely, when the plasticity of the basic recognition model is high, the model has a stronger ability to learn samples of new categories but also forgets the features of samples of old categories more easily. The key for the trained online recognition model to adapt to the network environment is therefore to achieve a suitable balance between stability and plasticity.

To make the stability-plasticity of the online recognition model controllable, the embodiment of the present invention proposes a proficiency mechanism. The mechanism introduces a set of additional parameters ρ = {ρ_k} ∈ R^K, where ρ is the proficiency set and ρ_k denotes the proficiency of the basic recognition model for the category output by the k-th node. The proficiency measures the ability of the online recognition model to recognize samples of each category.

Here ρ_k ∈ [0, 1), with initial value 0, which indicates that in the initial situation the classification proficiency of the basic recognition model for every category is 0. To use the proficiency to influence the stability-plasticity of the model, the proficiency ρ should have the following properties:

1) The proficiency is affected by the result of identifying sample categories. The more often the category of a sample is identified correctly, the higher the corresponding proficiency; the more often the category of a sample is identified incorrectly, the lower the corresponding proficiency.

2) The proficiency affects its own change. When the proficiency is low, it is easy to raise or lower it further, that is, the proficiency itself increases or decreases quickly; the higher the proficiency, the harder it is to raise or lower it further, that is, the proficiency increases or decreases more slowly.

3) The proficiency affects the difficulty of learning or forgetting knowledge. When the proficiency is low, it is relatively easy to learn the features of samples of new categories or to forget the features of samples of old categories, that is, the model parameters are updated faster; conversely, when the proficiency is high, learning or forgetting is more difficult, that is, the model parameters are updated more slowly.

For example, if X ∈ C_k and Y ∈ C_k, the sample X of the k-th category has been classified correctly, and the corresponding proficiency ρ_k increases; if X ∈ C_i but Y ∈ C_j, the i-th category has been misidentified as the j-th category, and the corresponding ρ_i and ρ_j decrease.

Referring to Fig. 8, to realize properties 2) and 3), the embodiment of the present invention proposes a proficiency function for calculating the increment of the proficiency:

The proficiency function is: [formula not reproduced in the source text], where α and β are two parameters that control the overall trend of the proficiency function. Fig. 8 shows how the proficiency function changes under different α and β. When ρ_k is small, the increment prof(ρ_k) of the proficiency is large; as ρ_k increases, prof(ρ_k) and its derivative both gradually decrease; when ρ_k increases to the limit value 1, the increment prof(ρ_k) is 0 and the proficiency ρ_k is no longer updated.

The proficiency ρ_k is updated by: ρ_k ← ρ_k ± prof(ρ_k), with the increment added when the category is identified correctly and subtracted when it is identified incorrectly.

Referring to Fig. 9, as the proficiency increases, the increment prof(ρ_k) of the proficiency gradually decreases, that is, the amplitude of the parameter updates of the basic recognition model gradually decreases. Fig. 9 shows how the increment prof(ρ_k) changes under different parameters; it can be seen that adjusting the parameters α and β of the proficiency increment controls the update speed of the basic recognition model.

Therefore, when the weights and biases of the output layer of the basic recognition model are updated using the parameter update formula, the increment of the proficiency is included, and the parameter update formula is: [formula not reproduced in the source text], in which the increment of the weight w_jk and the increment of the bias b_k are each multiplied by prof(ρ_k). When the proficiency function prof(ρ_k) is used to update the weights W and biases b, a parameter setting [not reproduced in the source text] is required to ensure prof(0) = 1, that is, when the proficiency ρ_k = 0, the proficiency function does not affect the update of the model. As the proficiency increases, the increment coefficient prof(ρ_k) gradually decreases, that is, the amplitude of the parameter updates of the basic recognition model gradually decreases; as ρ_k → 1, prof(ρ_k) → 0, that is, the update amplitude of the basic recognition model tends to 0.

Fig. 10 shows how prof(ρ_k) changes for different values of the parameter β: the larger β is, the faster prof(ρ_k) declines.

From the above analysis, by introducing the proficiency set ρ and the proficiency function prof(ρ_k), the two parameters α and β of the proficiency function can be used to control the ability and speed with which the basic recognition model updates its parameters, thereby achieving a trade-off between the stability and plasticity of the online recognition model and solving the "catastrophic forgetting" problem.
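The effect of the proficiency mechanism can be illustrated with a small sketch. The exact proficiency function of the embodiment is not available in this text, so the function `prof` below is a hypothetical stand-in chosen only because it has the properties stated above: it decreases as ρ_k grows, it reaches 0 at ρ_k = 1, and with α = 1 it satisfies prof(0) = 1. The learning rate `lr`, the default parameter values, and all function names are likewise assumptions.

```python
def prof(rho, alpha=1.0, beta=2.0):
    """Hypothetical stand-in for the proficiency function, not the patented formula."""
    return alpha * (1.0 - rho) ** beta

def update_proficiency(rho, correct, alpha=0.1, beta=2.0):
    """rho <- rho + prof(rho) if identified correctly, else rho - prof(rho)."""
    step = prof(rho, alpha, beta)
    rho = rho + step if correct else rho - step
    return min(max(rho, 0.0), 1.0 - 1e-9)   # keep rho inside [0, 1)

def scaled_update(w, grad, rho, lr=0.1, beta=2.0):
    """Gradient descent step damped by prof(rho); alpha = 1 gives prof(0) = 1."""
    return w - lr * prof(rho, 1.0, beta) * grad

# Proficiency rises quickly at first and more slowly later.
rho = 0.0
history = []
for _ in range(5):
    rho = update_proficiency(rho, correct=True)
    history.append(rho)

# The same gradient moves a weight less when the proficiency is high.
step_low = abs(scaled_update(1.0, 2.0, rho=0.0) - 1.0)
step_high = abs(scaled_update(1.0, 2.0, rho=0.9) - 1.0)
```

With this stand-in, a category that has been identified correctly many times has a high ρ_k, so its output layer parameters change little when samples of a new category are learned, which is the stability side of the trade-off.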

The one-hot encoding T of data stream categories is illustrated as follows. Assume that the samples of the first training set are divided into 6 categories and that the output layer of the basic recognition model has 6 nodes. A number is assigned to each category of samples. The sample categories of the first training set are "RDP (Remote Desktop Protocol)", "BitTorrent", "Web (World Wide Web)", "SSH (Secure Shell)", "eDonkey (eDonkey Network)", and "NTP (Network Time Protocol)", with corresponding numbers 0, 1, 2, 3, 4, 5 and corresponding one-hot encodings 100000, 010000, 001000, 000100, 000010, 000001. Assume that the label of one sample in the first training set is 0, so that the category of the sample is "RDP", and that the outputs of the 1st to 6th nodes of the basic recognition model for this sample are 0.5, 0.1, 0.1, 0.1, 0.1, 0.1. The basic recognition model thus assigns the highest probability to number 0, and the loss function of the basic recognition model is:

L = −1·log 0.5 + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1).
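The worked example above can be reproduced in a few lines. This is a minimal sketch that assumes the natural logarithm, since the text does not state the base; the helper names `one_hot` and `cross_entropy` are illustrative.

```python
import math

categories = ["RDP", "BitTorrent", "Web", "SSH", "eDonkey", "NTP"]

def one_hot(index, size):
    """One-hot encoding with a 1 at position `index`."""
    return [1 if k == index else 0 for k in range(size)]

def cross_entropy(t, y):
    """L = -sum_k t_k * log(y_k); positions with t_k = 0 contribute nothing."""
    return -sum(t_k * math.log(y_k) for t_k, y_k in zip(t, y))

t = one_hot(categories.index("RDP"), len(categories))  # encoding 100000
y = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]                     # node outputs from the example
loss = cross_entropy(t, y)                             # reduces to -log(0.5)
```

Only the term for the true category survives, so the loss depends solely on the probability that the model assigns to "RDP".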

To improve the real-time performance of identifying data stream categories, S104 above may use at least one of the following embodiments to identify the category of the header data of the data packets in the next data stream after the current data stream:

In one possible embodiment, the category of the header data of the data packets in the next data stream after the current data stream is identified as follows:

Step 1: when the number of data packets in the next data stream after the current data stream reaches a predetermined number, extract the header data of the data packets in that data stream as the second sample;

Step 2: input the second sample into the online recognition model, and output the category of the second sample using the online recognition model.

A network flow identification device provided by the embodiment of the present invention is described below.

As shown in Fig. 11, a network flow identification device provided by the embodiment of the present invention is applied to a server, and the device includes:

a sample module 1101, configured to extract, when reception of the current data stream is complete, the header data of the data packets in the current data stream as the first sample;

a supervision module 1102, configured to input the first sample into the semi-supervised model and to use the semi-supervised model to output the category of the first sample and the result of whether the first sample is located within the frontier distance of a cluster; the semi-supervised model is obtained by training with the first training sample set and contains the categories of the obtained header data and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines the result of whether the sample with the category label is located within the frontier distance of a cluster;

a change module 1103, configured to, when the first sample is located within the frontier distance of a cluster and the first sample is a sample of a new category, add one output node to the output nodes of the preset machine recognition model and use the machine recognition model with the added output node as the online recognition model;

an identification module 1104, configured to use the online recognition model to identify the category of the next data stream after the current data stream.

Sample module is specifically used for:

Supervision module is specifically used for:

Change module is specifically used for:

In the direction of gradient descent, update the weights and biases of the output layer of the basic recognition model using the parameter update formula; the parameter update formula includes the products of the increment of the proficiency with the partial derivatives of the loss function of the basic recognition model with respect to the output layer weights and with respect to the output layer biases, respectively;

Identification module is specifically used for:

The embodiment of the present invention also provides an electronic device which, as shown in Fig. 12, includes a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204, where the processor 1201, the communication interface 1202, and the memory 1203 communicate with one another through the communication bus 1204,

the memory 1203 is configured to store a computer program;

the processor 1201 is configured to implement the following steps when executing the program stored in the memory 1203:

inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the category of the first sample and the result of whether the first sample is located within the frontier distance of a cluster;

The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the above electronic device and other devices.

The memory may include random access memory (RAM), and may also include non-volatile memory, for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In another embodiment provided by the present invention, a computer-readable storage medium is also provided. Instructions are stored in the computer-readable storage medium, and when they are run on a computer, they cause the computer to execute the network flow identification method of any one of the above embodiments.

In another embodiment provided by the present invention, a computer program product containing instructions is also provided. When it is run on a computer, it causes the computer to execute the network flow identification method of any one of the above embodiments.

The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the embodiments of the device, the electronic device, the computer-readable storage medium, and the computer program product are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments.

The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A network flow identification method, characterized in that it is applied to a server, the method comprising:

when reception of a current data stream is complete, extracting the header data of the data packets in the current data stream as a first sample;

inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result of whether the first sample is located within the frontier distance of a cluster; the semi-supervised model is obtained by training with a first training sample set and contains the categories of the obtained header data and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines the result of whether the sample with the category label is located within the frontier distance of a cluster;

when the result is that the first sample is located within the frontier distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of a preset machine recognition model, and using the machine recognition model with the added output node as an online recognition model;

using the online recognition model to identify the category of the next data stream after the current data stream.

2. The method according to claim 1, characterized in that, before the step of extracting, when reception of the current data stream is complete, the header data of the data packets in the current data stream as the first sample, the method further comprises:

receiving the data packets of the current data stream in sequence, and obtaining the five-tuple information of each data packet;

judging whether a database stores the five-tuple information, and if the database stores the five-tuple information, saving the header data of the data packet to the storage region of the path corresponding to the five-tuple information;

if the database does not store the five-tuple information, creating a storage region of the path corresponding to the five-tuple information, and saving the header data of the data packet to the storage region of the path corresponding to the five-tuple information.

3. The method according to claim 1, characterized in that extracting, when reception of the current data stream is complete, the header data of the data packets in the current data stream as the first sample comprises:

judging whether each data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, reception of the data stream is complete, and the header data of the data packets in the data stream is extracted as the first sample.

4. The method according to claim 1, characterized in that extracting, when reception of the current data stream is complete, the header data of the data packets in the current data stream as the first sample comprises:

when reception of the current data stream is complete, extracting the header data of the data packets in the current data stream;

encoding the header data of the data packets in the current data stream to obtain a vector of fixed dimension, and using the vector of fixed dimension as the first sample.

5. the method according to claim 1, wherein described input semi-supervised model, benefit for the first sample Export the classification of the first sample with the semi-supervised model and first sample whether be located at it is in the frontier distance of cluster as a result, Include:

The first sample is inputted into semi-supervised model, the classification of the first sample is exported using the semi-supervised model；

Local density and minimum range that first training sample concentrates each sample are calculated, is more than density threshold by local density Value and minimum range are more than the sample of distance threshold as third sample；First training sample set is that training is described semi-supervised Sample group used in model at set；

The third sample is added in cluster, and the third sample is determined as to the cluster central point of cluster；Of the cluster Number, in each cluster only one third sample identical as third number of samples；

If the first sample is more than the frontier distance of cluster at a distance from third sample, determine that the first sample is not located at In the frontier distance of cluster；

if the distance between the first sample and the third sample is less than the boundary distance of the cluster, determining that the first sample is located within the boundary distance of the cluster.
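The center selection and boundary test of claim 5 can be sketched as below. The claim does not define how local density and minimum distance are computed; this sketch assumes a density-peaks style definition (density is the number of samples within a cutoff radius, minimum distance is the distance to the nearest sample of higher density). The names `find_centers`, `within_boundary`, and `cutoff` are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def find_centers(samples: List[Point], cutoff: float,
                 density_threshold: float,
                 distance_threshold: float) -> List[Point]:
    """Select cluster centers ("third samples") from the training samples."""
    n = len(samples)
    # Assumed local density: number of other samples closer than `cutoff`.
    density = [
        sum(1 for j in range(n)
            if j != i and math.dist(samples[i], samples[j]) < cutoff)
        for i in range(n)
    ]
    centers: List[Point] = []
    for i in range(n):
        # Assumed minimum distance: distance to the nearest denser sample.
        higher = [math.dist(samples[i], samples[j])
                  for j in range(n) if density[j] > density[i]]
        # A sample with no denser sample uses its largest distance instead.
        min_dist = min(higher) if higher else max(
            math.dist(samples[i], samples[j]) for j in range(n))
        if density[i] > density_threshold and min_dist > distance_threshold:
            centers.append(samples[i])
    return centers


def within_boundary(sample: Point, centers: List[Point],
                    boundary_distance: float) -> bool:
    """True if the sample is closer than boundary_distance to some center."""
    return any(math.dist(sample, c) < boundary_distance for c in centers)
```

With several clusters, this sketch treats a sample as lying within the boundary distance if it is close enough to any one center; the claim itself only states the comparison against a cluster's boundary distance.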

6. The method according to claim 1, wherein, in the case where the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model comprises:

in the case where the first sample is located within the boundary distance of the cluster, if the second training sample set contains a second sample whose category is the same as that of the first sample, determining that the first sample is not a sample of a new category, and updating the parameters of the preset machine recognition model; the second training sample set being the set of data streams used to train the machine recognition model;

in the case where the first sample is located within the boundary distance of the cluster, if the second training sample set does not contain a second sample whose category is the same as that of the first sample, determining that the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model.

7. The method according to claim 1, wherein, in the case where the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model comprises:

in the case where the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, increasing the parameter dimension of the preset machine recognition model by one, and using the machine recognition model with the increased parameter dimension as the online recognition model.

8. The method according to claim 1, wherein, in the case where the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model comprises:

in the case where the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as a basic recognition model;

inputting the first sample into the basic recognition model, and calculating the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and the bias of the basic recognition model;

updating the weights and the bias of the output layer of the basic recognition model in the direction of gradient descent using a parameter update formula; the parameter update formula comprising the products of a proficiency increment with the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and the bias of the basic recognition model, respectively.
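A minimal sketch of adding one output node and then updating only the output layer by gradient descent is given below. It assumes a softmax output layer with cross-entropy loss, zero initialization of the new node, and that the "proficiency increment" acts as a step size `lr`; none of these choices is stated in the claims, and all function names are hypothetical.

```python
import math
from typing import List


def add_output_node(weights: List[List[float]], biases: List[float]) -> None:
    """Append one output node: a new weight row and bias (zero-initialized)."""
    n_features = len(weights[0])
    weights.append([0.0] * n_features)
    biases.append(0.0)


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer_step(weights: List[List[float]], biases: List[float],
                      x: List[float], target: int, lr: float) -> List[float]:
    """One gradient-descent step on the output layer; returns the
    class probabilities computed before the update."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    for k, row in enumerate(weights):
        # For softmax with cross-entropy, dL/dz_k = p_k - 1[k == target].
        grad = probs[k] - (1.0 if k == target else 0.0)
        for i in range(len(row)):
            row[i] -= lr * grad * x[i]    # dL/dw_ki = grad * x_i
        biases[k] -= lr * grad            # dL/db_k  = grad
    return probs
```

Calling `add_output_node` grows the parameter dimension by one, and a subsequent call to `output_layer_step` with the new class index as `target` moves the output-layer parameters so that the new category becomes more probable for the first sample.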

9. The method according to claim 1, wherein using the online recognition model to identify the category of the next data stream after the current data stream comprises:

in the case where the number of data packets in the next data stream after the current data stream reaches a predetermined number, extracting the header data of the data packets in the next data stream as a second sample;

inputting the second sample into the online recognition model, and outputting the category of the second sample using the online recognition model.
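The early identification described in claim 9 might look like the following sketch, in which the stream is classified as soon as the predetermined number of packets has arrived rather than after the stream ends. The `model` callable is a hypothetical placeholder for the online recognition model.

```python
from typing import Callable, Iterable, List, Optional


def classify_early(packet_headers: Iterable[bytes],
                   predetermined_number: int,
                   model: Callable[[List[bytes]], str]) -> Optional[str]:
    """Classify a stream once predetermined_number packets have arrived.

    Returns None if the stream ends before that many packets are seen.
    """
    buffered: List[bytes] = []
    for header in packet_headers:
        buffered.append(header)
        if len(buffered) == predetermined_number:
            # The buffered headers form the second sample.
            return model(buffered)
    return None
```

Because the function returns at the predetermined packet count, later packets of the same stream are never read, which reflects the real-time motivation stated in the abstract.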

10. A network flow identification device, characterized in that the device is applied to a server and comprises:

a sample module, configured to extract, upon completion of receiving the current data stream, the header data of the data packets in the current data stream as a first sample;

a supervision module, configured to input the first sample into a semi-supervised model, and to use the semi-supervised model to output the category of the first sample and the result of whether the first sample is located within the boundary distance of a cluster; the semi-supervised model being obtained by training with a first training sample set and including the categories of header data already obtained and the distribution relationship of the remaining samples in the first training sample set; the first training sample set containing at least one sample having a category label; the distribution relationship being used to determine whether the sample having the category label is located within the boundary distance of a cluster;

a change module, configured to, in the case where the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, add one output node to the output nodes of a preset machine recognition model, and use the machine recognition model with the added output node as an online recognition model;

an identification module, configured to use the online recognition model to identify the category of the next data stream after the current data stream.