CN109873774A - A network traffic identification method and device

Publication number
CN109873774A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910036196.2A
Other languages
Chinese (zh)
Other versions
CN109873774B (en)
Inventor
廖青
赵晶玲
李天琦
刘月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications

Abstract

An embodiment of the present invention provides a network traffic identification method and device. The method includes: when reception of a current data stream is complete, extracting the packet header data of the data packets in the current data stream as a first sample; inputting the first sample into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster and is a sample of a new category, adding one output node to the output nodes of a preset machine recognition model and using the machine recognition model with the added output node as an online recognition model; and then identifying the category of the next data stream after the current data stream. Compared with the prior art, the embodiment changes the structure of the machine recognition model and uses the restructured model to identify the category of the next data stream, which can improve the real-time performance of data stream classification.

Description

A network traffic identification method and device
Technical field
The present invention relates to the field of communication technology, and in particular to a network traffic identification method and device.
Background art
Traffic is an important carrier of the data transmitted over a network, and traffic identification is a key part of network monitoring. Only once traffic has been identified can different monitoring policies be applied to different kinds of traffic, for example rejection, optimization, marking, or priority classification, so network traffic identification is essential. Network traffic is generally transmitted in the form of data streams. Each data stream contains multiple data packets, and each data packet includes header data of a fixed number of bytes. Features can be derived from the header data, including the time interval, the stream duration, and the mean and variance of the data packet size.
In the prior art, network traffic is identified with methods based on machine learning. Such a method mines the features of packet header data using machine learning techniques and trains a machine learning model; a data stream is then input into the trained model, which outputs the category of the online network traffic. The machine learning model is trained as follows: the features of the packet header data across an entire data stream are collected, the features of all or part of the header data in the stream are selected as samples, and the model is trained on these samples. The resulting machine learning model is an offline model whose internal structure is fixed.
Because the network environment changes in real time, the features of data streams change as well. A machine learning model with a fixed internal structure identifies the category of online network traffic with poor real-time performance, so the real-time performance of the prior art in identifying online network traffic categories is poor.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a network traffic identification method and device that improve the real-time performance of data stream classification. The specific technical solution is as follows:
In a first aspect, an embodiment of the present invention provides a network traffic identification method, applied to a server, the method including:
when reception of a current data stream is complete, extracting the packet header data of the data packets in the current data stream as a first sample;
inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; the semi-supervised model is trained on a first training sample set and contains the categories of the header data already obtained and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines the result of whether a sample with a category label lies within the boundary distance of a cluster;
when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of a preset machine recognition model, and using the machine recognition model with the added output node as an online recognition model;
using the online recognition model to identify the category of the next data stream after the current data stream.
Optionally, before the step of extracting the packet header data of the data packets in the current data stream as the first sample when reception of the current data stream is complete, the method further includes:
receiving the data packets of the current data stream in sequence, and obtaining the five-tuple information of each data packet;
determining whether a database stores the five-tuple information, and if the database stores the five-tuple information, saving the header data of the data packet to the storage region at the path corresponding to the five-tuple information;
if the database does not store the five-tuple information, creating a storage region at the path corresponding to the five-tuple information, and saving the header data of the data packet to that storage region.
Optionally, extracting the packet header data of the data packets in the current data stream as the first sample when reception of the current data stream is complete includes:
determining whether each data packet of the current data stream contains an end identifier; if any data packet contains the end identifier, reception of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
Optionally, extracting the packet header data of the data packets in the current data stream as the first sample when reception of the current data stream is complete includes:
when reception of the current data stream is complete, extracting the packet header data of the data packets in the current data stream;
encoding the packet header data of the data packets in the current data stream to obtain a vector of fixed dimension, and using the vector of fixed dimension as the first sample.
Optionally, inputting the first sample into the semi-supervised model and using the semi-supervised model to output the category of the first sample and the result indicating whether the first sample lies within the boundary distance of a cluster includes:
inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the category of the first sample;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used to train the semi-supervised model;
adding each third sample to a cluster, and determining the third sample to be the cluster center point of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determining that the first sample does not lie within the boundary distance of the cluster;
if the distance between the first sample and the third sample does not exceed the boundary distance of the cluster, determining that the first sample lies within the boundary distance of the cluster.
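The selection of cluster center points and the boundary check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the usual density-peaks definitions (local density is the number of neighbours closer than a cutoff `d_c`; minimum distance is the distance to the nearest sample of higher density, or the largest distance for the densest sample), which this section does not spell out. All function names, `d_c`, and the thresholds are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclid(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_density(points: List[Point], d_c: float) -> List[int]:
    # Assumed definition: count of other samples closer than the cutoff d_c.
    return [sum(1 for j, q in enumerate(points) if j != i and euclid(p, q) < d_c)
            for i, p in enumerate(points)]


def min_distance(points: List[Point], rho: List[int]) -> List[float]:
    # Distance to the nearest sample with a higher local density;
    # the densest sample gets its largest distance to any sample.
    deltas = []
    for i, p in enumerate(points):
        higher = [euclid(p, q) for j, q in enumerate(points) if rho[j] > rho[i]]
        deltas.append(min(higher) if higher else max(euclid(p, q) for q in points))
    return deltas


def cluster_centers(points: List[Point], d_c: float,
                    rho_min: float, delta_min: float) -> List[Point]:
    # "Third samples": density above the density threshold AND
    # minimum distance above the distance threshold.
    rho = local_density(points, d_c)
    delta = min_distance(points, rho)
    return [p for p, r, d in zip(points, rho, delta) if r > rho_min and d > delta_min]


def within_boundary(sample: Point, centers: List[Point], boundary: float) -> bool:
    # The sample lies within a cluster if its distance to some cluster
    # center point does not exceed the boundary distance.
    return any(euclid(sample, c) <= boundary for c in centers)


points = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1),
          (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1)]
centers = cluster_centers(points, d_c=0.12, rho_min=2, delta_min=1.0)
```

In this toy data the two dense groups each yield one center point, so the number of clusters equals the number of third samples, as the text requires. Samples with tied densities would need an explicit tie-breaking rule, which this sketch omits.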
Optionally, when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model includes:
when the first sample lies within the boundary distance of a cluster, if a second training sample set contains a second sample of the same category as the first sample, determining that the first sample is not a sample of a new category, and updating the parameters of the preset machine recognition model; the second training sample set is the set of data streams used to train the machine recognition model;
when the first sample lies within the boundary distance of a cluster, if the second training sample set contains no second sample of the same category as the first sample, determining that the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model.
Optionally, when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model includes:
when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, increasing the parameter dimension of the preset machine recognition model by one, and using the machine recognition model with the increased parameter dimension as the online recognition model.
Optionally, when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model includes:
when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as a basic recognition model;
inputting the first sample into the basic recognition model, and calculating the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and biases of the basic recognition model;
updating the weights and biases of the output layer of the basic recognition model in the direction of gradient descent using a parameter update formula; the parameter update formula includes the products of a proficiency increment with the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and with respect to the output-layer biases, respectively;
determining the basic recognition model with the updated weights and biases to be the online recognition model.
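The steps above (add one output node, then update only the output-layer weights and biases by gradient descent) can be illustrated with a small sketch. It rests on several assumptions that this section does not fix: the output layer is taken to be a softmax layer trained with cross-entropy loss, the new node is initialised to zero, and the proficiency increment is replaced by a constant `step`. The class and method names are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    m = max(z)                                   # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class OutputLayer:
    """Hypothetical output layer: one weight row and one bias per category."""

    def __init__(self, weights: List[List[float]], biases: List[float]) -> None:
        self.weights = [list(row) for row in weights]
        self.biases = list(biases)

    def add_output_node(self) -> None:
        # New category: the parameter dimension grows by one (zero init is an assumption).
        self.weights.append([0.0] * len(self.weights[0]))
        self.biases.append(0.0)

    def predict(self, h: List[float]) -> List[float]:
        z = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(self.weights, self.biases)]
        return softmax(z)

    def update(self, h: List[float], label: int, step: float) -> None:
        # For softmax + cross-entropy, dL/dz_c = p_c - y_c, so
        # dL/dW[c][j] = (p_c - y_c) * h_j and dL/db[c] = p_c - y_c.
        p = self.predict(h)
        for c in range(len(self.biases)):
            grad_z = p[c] - (1.0 if c == label else 0.0)
            for j, x in enumerate(h):
                self.weights[c][j] -= step * grad_z * x   # step stands in for the proficiency increment
            self.biases[c] -= step * grad_z


layer = OutputLayer([[0.5, -0.2], [-0.3, 0.4]], [0.0, 0.0])
layer.add_output_node()                  # the first sample belongs to a new category (index 2)
features = [1.0, 2.0]                    # stand-in for the encoded first sample
before = layer.predict(features)[2]
for _ in range(20):
    layer.update(features, label=2, step=0.1)
after = layer.predict(features)[2]
```

Only the output layer changes, so the existing categories keep their learned parameters while the new node is fitted online.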
Optionally, using the online recognition model to identify the category of the next data stream after the current data stream includes:
when the number of data packets in the next data stream after the current data stream reaches a predetermined number, extracting the packet header data of the data packets in the next data stream as a second sample;
inputting the second sample into the online recognition model, and using the online recognition model to output the category of the second sample.
In a second aspect, an embodiment of the present invention provides a network traffic identification device, applied to a server, the device including:
a sample module, configured to extract, when reception of a current data stream is complete, the packet header data of the data packets in the current data stream as a first sample;
a supervision module, configured to input the first sample into a semi-supervised model, and use the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; the semi-supervised model is trained on a first training sample set and contains the categories of the header data already obtained and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines the result of whether a sample with a category label lies within the boundary distance of a cluster;
a change module, configured to, when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, add one output node to the output nodes of a preset machine recognition model, and use the machine recognition model with the added output node as an online recognition model;
an identification module, configured to use the online recognition model to identify the category of the next data stream after the current data stream.
Optionally, the network traffic identification device provided by the embodiment of the present invention further includes:
a storage unit, configured to receive the data packets of the current data stream in sequence, and obtain the five-tuple information of each data packet;
determine whether a database stores the five-tuple information, and if the database stores the five-tuple information, save the header data of the data packet to the storage region at the path corresponding to the five-tuple information;
if the database does not store the five-tuple information, create a storage region at the path corresponding to the five-tuple information, and save the header data of the data packet to that storage region.
Optionally, the sample module is specifically configured to:
determine whether each data packet of the current data stream contains an end identifier; if any data packet contains the end identifier, reception of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
Optionally, the sample module is specifically configured to:
when reception of the current data stream is complete, extract the packet header data of the data packets in the current data stream;
encode the packet header data of the data packets in the current data stream to obtain a vector of fixed dimension, and use the vector of fixed dimension as the first sample.
Optionally, the supervision module is specifically configured to:
input the first sample into the semi-supervised model, and use the semi-supervised model to output the category of the first sample;
calculate the local density and the minimum distance of each sample in the first training sample set, and take each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used to train the semi-supervised model;
add each third sample to a cluster, and determine the third sample to be the cluster center point of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determine that the first sample does not lie within the boundary distance of the cluster;
if the distance between the first sample and the third sample does not exceed the boundary distance of the cluster, determine that the first sample lies within the boundary distance of the cluster.
Optionally, the change module is specifically configured to:
when the first sample lies within the boundary distance of a cluster, if a second training sample set contains a second sample of the same category as the first sample, determine that the first sample is not a sample of a new category, and update the parameters of the preset machine recognition model; the second training sample set is the set of data streams used to train the machine recognition model;
when the first sample lies within the boundary distance of a cluster, if the second training sample set contains no second sample of the same category as the first sample, determine that the first sample is a sample of a new category, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the online recognition model.
Optionally, the change module is specifically configured to:
when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, increase the parameter dimension of the preset machine recognition model by one, and use the machine recognition model with the increased parameter dimension as the online recognition model.
Optionally, the change module is specifically configured to:
when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as a basic recognition model;
input the first sample into the basic recognition model, and calculate the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and biases of the basic recognition model;
update the weights and biases of the output layer of the basic recognition model in the direction of gradient descent using a parameter update formula; the parameter update formula includes the products of a proficiency increment with the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and with respect to the output-layer biases, respectively;
determine the basic recognition model with the updated weights and biases to be the online recognition model.
Optionally, the identification module is specifically configured to:
when the number of data packets in the next data stream after the current data stream reaches a predetermined number, extract the packet header data of the data packets in the next data stream as a second sample;
input the second sample into the online recognition model, and use the online recognition model to output the category of the second sample.
In another aspect of the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute any one of the network traffic identification methods described above.
In another aspect of the present invention, an embodiment of the present invention further provides a computer program product containing instructions that, when run on a computer, cause the computer to execute any one of the network traffic identification methods described above.
With the network traffic identification method and device provided by the embodiments of the present invention, when reception of a current data stream is complete, the packet header data of the data packets in the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster and is a sample of a new category, one output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is used as an online recognition model; the online recognition model is then used to identify the category of the next data stream after the current data stream. Compared with the prior art, the embodiments of the present invention identify the category of the current data stream with the semi-supervised model, determine from the identified category whether the current data stream is a sample of a new category, change the structure of the machine recognition model accordingly, and use the restructured model as the online recognition model to identify the category of the next data stream, which can improve the real-time performance of data stream classification.
Of course, implementing any product or method of the present invention does not necessarily require achieving all of the above advantages at the same time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below.
Fig. 1 is a flow chart of a network traffic identification method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of storing the current data stream provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of the preset semi-supervised model provided by an embodiment of the present invention;
Fig. 4 is a diagram of the recurrent network structure of the LSTM encoder provided by an embodiment of the present invention;
Fig. 5 is an internal structure diagram of the LSTM encoder provided by an embodiment of the present invention;
Fig. 6 is a structural diagram of the preset machine recognition model provided by an embodiment of the present invention;
Fig. 7 is a structural diagram of the online recognition model provided by an embodiment of the present invention;
Fig. 8 is an effect diagram of the proficiency function under different parameters provided by an embodiment of the present invention;
Fig. 9 is an effect diagram under different parameters provided by an embodiment of the present invention, in which the horizontal axis is the number of correct identifications and the vertical axis is the proficiency function;
Fig. 10 is an effect diagram of the proficiency function under different β provided by an embodiment of the present invention;
Fig. 11 is a structural diagram of a network traffic identification device provided by an embodiment of the present invention;
Fig. 12 is a structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below with reference to the drawings in the embodiments of the present invention.
With the network traffic identification method and device provided by the embodiments of the present invention, when reception of a current data stream is complete, the packet header data of the data packets in the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster and is a sample of a new category, one output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is used as an online recognition model; the online recognition model is then used to identify the category of the next data stream after the current data stream.
The network traffic identification method provided by the embodiment of the present invention is described first.
As shown in Fig. 1, a network traffic identification method provided by an embodiment of the present invention is applied to a server, and the method includes:
S101: when reception of the current data stream is complete, extract the packet header data of the data packets in the current data stream as a first sample;
Before step S101, the network traffic identification method provided by the embodiment of the present invention further includes storing the current data stream.
As shown in Fig. 2, storing the current data stream includes the following steps:
S201: receive the data packets of the current data stream in sequence, and obtain the five-tuple information of each data packet;
The five-tuple information consists of the source IP address, destination IP address, source port number, destination port number, and transport-layer protocol.
S202: determine whether the database stores the five-tuple information, and if the database stores the five-tuple information, save the header data of the data packet to the storage region at the path corresponding to the five-tuple information;
S203: if the database does not store the five-tuple information, create a storage region at the path corresponding to the five-tuple information, and save the header data of the data packet to that storage region.
By storing the header data of each data packet in the storage region corresponding to its five-tuple information, the embodiment of the present invention makes it more efficient to look up the header data of the data packets that belong to the same data stream.
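Steps S201 to S203 can be sketched as below. This is only an illustration under stated assumptions: an in-memory dictionary stands in for the database and its storage regions, and the names `FiveTuple`, `FlowStore`, and `save_header` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


class FlowStore:
    """In-memory stand-in for the database of S201-S203 (an assumption)."""

    def __init__(self) -> None:
        self._regions: Dict[FiveTuple, List[bytes]] = {}

    def save_header(self, key: FiveTuple, header: bytes) -> bool:
        """Save one packet header; return True if a new region was created."""
        created = key not in self._regions    # S202: is the five-tuple already stored?
        if created:
            self._regions[key] = []           # S203: create the storage region
        self._regions[key].append(header)     # S202/S203: save the header data
        return created

    def headers(self, key: FiveTuple) -> List[bytes]:
        """Return all stored headers of one data stream, in arrival order."""
        return list(self._regions.get(key, []))


store = FlowStore()
flow = FiveTuple("10.0.0.1", "10.0.0.2", 51000, 443, "TCP")
first = store.save_header(flow, b"\x45\x00")
second = store.save_header(flow, b"\x45\x01")
```

Keying the store by five-tuple means that all headers of one stream are retrieved with a single lookup, which is the efficiency gain the paragraph above describes.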
To improve the real-time performance of data stream classification, the first sample in S101 can be obtained using at least one of the following embodiments:
In one possible embodiment, it is determined whether each data packet of the current data stream contains an end identifier; if any data packet contains the end identifier, reception of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
It can be understood that each data packet carries an end identifier. When transmission of a data stream ends, the end identifier of the last data packet in the stream changes. By checking whether a data packet in the stream contains the end identifier, this embodiment can quickly determine whether reception of the data stream is complete.
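A minimal sketch of the end-identifier check follows. The text does not say which field serves as the end identifier; the sketch assumes a TCP stream whose headers follow the 14 + 20 + 20 byte layout mentioned in this description and treats the TCP FIN flag as the end identifier. The helper names are hypothetical.

```python
from typing import Iterable

ETH_LEN, IP_LEN = 14, 20                    # frame header and IP header sizes (no IP options)
TCP_FLAGS_OFFSET = ETH_LEN + IP_LEN + 13    # the flags byte is byte 13 of the TCP header
FIN = 0x01                                  # assumed end identifier: the TCP FIN flag


def has_end_identifier(header: bytes) -> bool:
    """Return True if the header carries the assumed end identifier."""
    if len(header) <= TCP_FLAGS_OFFSET:
        return False
    return bool(header[TCP_FLAGS_OFFSET] & FIN)


def flow_complete(headers: Iterable[bytes]) -> bool:
    """Reception is complete once any packet carries the end identifier."""
    return any(has_end_identifier(h) for h in headers)


def make_header(flags: int) -> bytes:
    """Build a dummy 54-byte header with the given TCP flags (for the example only)."""
    header = bytearray(54)
    header[TCP_FLAGS_OFFSET] = flags
    return bytes(header)


ongoing = [make_header(0x10), make_header(0x18)]     # ACK, PSH+ACK
finished = ongoing + [make_header(0x11)]             # FIN+ACK closes the stream
```

A production check would also need to handle IP options, non-TCP protocols, and streams that end with a reset or a timeout, none of which are covered here.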
In one possible embodiment, the first sample is obtained by the following steps:
Step 1: when reception of the current data stream is complete, extract the packet header data of the data packets in the current data stream;
Step 2: encode the packet header data of the data packets in the current data stream to obtain a vector of fixed dimension, and use the vector of fixed dimension as the first sample.
It can be understood that a network stream consists of a series of data packets. The header format of each data packet is highly regular: it contains a fixed number of bytes, each with its own field value. For example, excluding optional fields, the header data of a TCP data packet totals 54 bytes: a 14-byte frame header, a 20-byte IP header, and a 20-byte TCP header. Suppose a stream has p data packets and each piece of header data contains q bytes. Converting each byte to an unsigned integer and treating each piece of header data as one row gives a fixed-dimension vector X ∈ R^(p×q) whose elements are integers in [0, 255]. This embodiment therefore encodes the header data to obtain a vector of fixed dimension p × q and uses it as the first sample, which improves the efficiency of identifying the category of the first sample.
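The byte-to-integer encoding described above can be sketched in a few lines. The function name `encode_headers` is hypothetical, and zero-padding headers shorter than q bytes (and truncating longer ones) is an assumption made here so that every row has the same length.

```python
from typing import List


def encode_headers(headers: List[bytes], q: int = 54) -> List[List[int]]:
    """Turn p packet headers into a p x q matrix of integers in [0, 255]."""
    matrix = []
    for header in headers:
        row = list(header[:q])               # each byte becomes an unsigned integer
        row.extend([0] * (q - len(row)))     # assumption: zero-pad short headers
        matrix.append(row)
    return matrix


sample = encode_headers([bytes([0x45, 0x00, 0xFF]), bytes(range(60))], q=54)
```

Here p = 2 and q = 54, so `sample` has two rows of 54 integers each, matching the X ∈ R^(p×q) layout in the text.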
First sample is inputted semi-supervised model by S102, utilizes the classification and first of semi-supervised model output first sample Whether sample is located at the result in the frontier distance of cluster;
Wherein, semi-supervised model is obtained using the training of the first training sample set, and includes the class for having obtained header data The distribution relation of remaining sample is not concentrated with the first training sample;First training sample is concentrated comprising having classification mark at least one The sample of label;Distribution relation decision has whether the sample of class label is located at the result in the frontier distance of cluster.
It should be understood that training obtains the concentration of the first training sample needed for semi-supervised model, there is classification comprising a part The sample of label, the sample for having class label are the header data for having obtained data flow after a classification determines.Remainder For the sample of no class label.The sample of first training sample set is divided into several clusters, the first training sample set in each cluster Sample distribution it has been determined that then thering is class label sample and the distribution of the sample without class label to have determined in each cluster. Therefore class label sample training will obtains semi-supervised model, whether the sample for containing class label is located at the boundary of cluster Result in distance.
In a kind of possible embodiment, semi-supervised model can be obtained as follows:
Step 1, the first training sample concentrate sample training to preset semi-supervised model, the semi-supervised model after being trained;
As shown in figure 3, presetting semi-supervised model by LSTM (Long Short-Term Memory, shot and long term memory network) Encoder, softmax layers and CFSFDP (Clustering by Fast Search and Find of Density Peaks, Cluster based on density peaks) layer composition, the sample input LSTM encoder of class label will be carried, it will by LSTM encoder The unfixed data stream encoding of length is the vector of fixed dimension, since the vector of fixed dimension includes that entire data stream sequences are defeated Out, therefore the vector of fixed dimension can represent the features of all data packets of whole data flow, and softmax layers for that will fix dimension The vector of degree, is mapped to fixed classification, the softmax layers of classification that can export header data;Then by softmax layers from pre- If removing in semi-supervised model, the sample for carrying class label and the sample for not carrying class label are inputted into LSTM encoder, The output input CFSFDP of LSTM encoder is clustered into layer, mainly uses CFSFDP (Clustering by Fast Search And Find of Density Peaks, based on the cluster of density peaks) algorithm determines whether the vector of fixed dimension is cluster The classification of cluster central point and sample.
As shown in Figs. 4 and 5, the LSTM encoder encodes a data stream into a fixed-dimension vector as follows:
Take a data stream $\{x_0, x_1, \ldots, x_{t-1}, x_t\}$ as an example. The data stream is fed in order into the recurrent network structure of Fig. 4. Each input $x_t$ produces an output $h_t$, and the current state is passed on to the next input, so that the output $h_t$ contains information about $x_t$ as well as about $x_0$ to $x_{t-1}$.
The internal structure of the LSTM encoder is shown in Fig. 5. At each step the encoder receives the input $x_t$, the previous output $h_{t-1}$, and the encoder state $C_{t-1}$ left after the previous input $x_{t-1}$. Assume that the output $h_t$ of every step has 128 dimensions and that a data stream contains $n$ data packets in total. All data packets of the stream are treated as one sequence in which the packet header data of each data packet is one $x_t$, and the encoding of the stream is the output $h_n$ for its last input $x_n$.
Here, $x_t$ is one piece of packet header data; $t$ is the index of the packet header data; $n$ is the total number of data packets in a data stream; $h_t$ is the output of the LSTM encoder when its input is $x_t$; and $h_n$ is the last output of the LSTM encoder after the whole data stream has been input, that is, the encoded fixed-dimension vector.
Step 2: input the test samples of the test sample set into the trained semi-supervised model, and use it to output the class of each test sample;
It should be understood that all packet header data within one data stream share the same class label. A test sample is the packet header data of a whole data stream, and the class label identifies the class of the test sample.
Step 3: based on the classes that the trained semi-supervised model outputs for the test samples and the labeled classes of those test samples, determine whether the trained semi-supervised model meets the test index;
The test index is: the accuracy reaches an accuracy threshold, the precision reaches a precision threshold, the recall reaches a recall threshold, the F1 score reaches an F1 score threshold, and/or the $F_\beta$ score reaches an $F_\beta$ score threshold.
The accuracy threshold, precision threshold, recall threshold, F1 score threshold and $F_\beta$ score threshold are preset values, and the accuracy, precision, recall, F1 score and $F_\beta$ score are each calculated with their respective formulas.
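As a minimal sketch of the test index check in Step 3, the following Python computes the five metrics for one class treated as the positive class and compares them against preset thresholds. The one-vs-rest treatment, the function names `classification_metrics` and `meets_test_index`, and the example labels are assumptions made for illustration; the text does not state how the metrics are aggregated over classes.

```python
def classification_metrics(y_true, y_pred, positive, beta=2.0):
    """Accuracy, precision, recall, F1 and F-beta with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_score(b):
        # F_b = (1 + b^2) * P * R / (b^2 * P + R); F1 is the case b = 1.
        denom = b * b * precision + recall
        return (1 + b * b) * precision * recall / denom if denom else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f_score(1.0), "f_beta": f_score(beta)}


def meets_test_index(metrics, thresholds):
    """True when every metric named in `thresholds` reaches its preset value."""
    return all(metrics[name] >= value for name, value in thresholds.items())
```

Passing only some metric names in `thresholds` corresponds to the "and/or" wording of the test index: only the listed metrics are required to reach their thresholds.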
Step 4: if the trained semi-supervised model does not meet the test index, update the parameters of the LSTM encoder in the trained semi-supervised model until the model meets the test index;
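The encoder output is computed with the standard LSTM equations:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \cdot \tanh(C_t)$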
Here, the internal parameters of the LSTM encoder are $W_*$ and $b_*$, where $W_*$ denotes a weight parameter and $b_*$ a bias parameter. $f_t$ is the activation value of the forget gate at the current step, $\sigma$ is the sigmoid function, $W_f$ is the weight of the forget gate, $h_{t-1}$ is the output of the previous step, $x_t$ is the input of the current step, $b_f$ is the bias of the forget gate, $i_t$ is the activation value of the input gate at the current step, $W_i$ is the weight of the input gate, $b_i$ is the bias of the input gate, $\tilde{C}_t$ is the intermediate state of the current step, $W_C$ is the state weight, $b_C$ is the state bias, $C_{t-1}$ is the state of the previous step, $C_t$ is the state of the current step, $o_t$ is the activation value of the output gate at the current step, $W_o$ is the weight of the output gate, and $b_o$ is the bias of the output gate. In the symbols and subscripts, $i$ denotes the input gate, $f$ the forget gate, $o$ the output gate, $C$ the state, and $t$ the index of the data packet.
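The encoding of a variable-length data stream into a fixed-dimension vector can be sketched with a plain NumPy LSTM cell. This is an illustrative sketch only: it uses the standard LSTM state update (candidate state through tanh, then a gated sum), random untrained weights, and hypothetical names (`lstm_step`, `encode_stream`, `init_params`) and a hypothetical header width of 40 values per packet, none of which come from the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_dim, hidden=128, seed=0):
    """Random (untrained) weights; each gate g has W_g of shape (hidden, hidden + input_dim)."""
    rng = np.random.default_rng(seed)
    return {g: (rng.normal(0.0, 0.1, (hidden, hidden + input_dim)), np.zeros(hidden))
            for g in "fioC"}


def lstm_step(x_t, h_prev, c_prev, params):
    """One encoder step: returns the output h_t and the state C_t."""
    v = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["f"][0] @ v + params["f"][1])     # forget gate
    i_t = sigmoid(params["i"][0] @ v + params["i"][1])     # input gate
    o_t = sigmoid(params["o"][0] @ v + params["o"][1])     # output gate
    c_tilde = np.tanh(params["C"][0] @ v + params["C"][1]) # intermediate state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def encode_stream(packet_headers, params):
    """Feed the packet headers in order and return the last output h_n."""
    hidden = params["f"][1].shape[0]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in packet_headers:
        h, c = lstm_step(x_t, h, c, params)
    return h
```

Streams with different numbers of packets yield vectors of the same dimension, which is what allows the later softmax and clustering layers to operate on whole streams.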
If the trained semi-supervised model does not meet the test index, the parameters $W_f$, $b_f$, $W_i$, $b_i$, $W_C$, $b_C$, $W_o$ and $b_o$ of the LSTM encoder in the trained semi-supervised model are updated.
Step 5: determine the trained semi-supervised model that meets the test index as the semi-supervised model.
In this implementation, taking the trained preset semi-supervised model that meets the test index as the semi-supervised model improves the accuracy with which the class of the first sample is determined.
S103: when the result is that the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new class, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the online recognition model;
S104: use the online recognition model to identify the class of the next data stream after the current data stream.
Compared with the prior art, the embodiment of the present invention identifies the class of the current data stream with the semi-supervised model, judges from the identified class whether the current data stream is a sample of a new class, changes the structure of the machine recognition model accordingly, and uses the restructured model as the online recognition model to identify the class of the next data stream after the current data stream. This improves the ability of the online recognition model to adapt to the network environment and improves the real-time performance of data stream classification.
To improve the real-time performance of data stream classification, S102 above may use at least one of the following implementations to obtain the class of the first sample and the result of whether the first sample lies within the boundary distance of a cluster:
In one possible implementation, the class of the first sample and the result of whether the first sample lies within the boundary distance of a cluster are obtained as follows:
Step 1: input the first sample into the semi-supervised model, and use the semi-supervised model to output the class of the first sample;
Step 2: calculate the local density and minimum distance of each sample in the first training sample set, and take each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used to train the semi-supervised model;
Step 3: add each third sample to a cluster and determine it as the cluster center of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;
Step 4: if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determine that the first sample does not lie within the boundary distance of the cluster;
Here, the region within the boundary distance of a cluster is the spherical region centered on the cluster center with the boundary distance as its radius.
Step 5: if the distance between the first sample and the third sample is less than the boundary distance of the cluster, determine that the first sample lies within the boundary distance of the cluster.
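Steps 4 and 5 amount to a radius test around a cluster center. The sketch below assumes Euclidean distance, compares the first sample against its nearest cluster center, and treats a distance exactly equal to the boundary distance as inside; these three choices and the name `within_cluster_boundary` are assumptions, since the text leaves them open.

```python
import numpy as np


def within_cluster_boundary(sample, centers, boundary_distance):
    """Return (inside, index of nearest center) for one encoded sample.

    `centers` holds one row per cluster center (the "third samples").
    """
    dists = np.linalg.norm(np.asarray(centers) - np.asarray(sample), axis=1)
    nearest = int(np.argmin(dists))
    return bool(dists[nearest] <= boundary_distance), nearest
```

For example, with centers at (0, 0) and (10, 0) and a boundary distance of 2, the point (1, 1) falls inside the first cluster, while (6, 0) is nearest to the second center but lies outside its boundary.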
First, assume that the first training sample set is $S = \{x_j \mid j \in I_S\}$, where $I_S = \{1, 2, \ldots, n\}$ is the index set, $i$ and $j$ are positive integers, and $d_{ij}$ denotes the distance between sample $x_i$ and sample $x_j$. For each sample $x_i$, calculate the local density $\phi_i$ and the minimum distance $\theta_i$, giving the local density set $\Phi = \{\phi_i \mid i \in I_S\}$ and the minimum distance set $\Theta = \{\theta_i \mid i \in I_S\}$.
The local density $\phi_i$ is calculated, in the standard CFSFDP form, as $\phi_i = \sum_{j \in I_S \setminus \{i\}} \chi(d_{ij} - d_c)$ or as $\phi_i = \sum_{j \in I_S \setminus \{i\}} e^{-(d_{ij}/d_c)^2}$, where $d_c$ is the cutoff distance, a preset value; the boundary distance is $m$ times the cutoff distance, with $m$ a preset value; and $\chi(\cdot)$ is the step function, $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise, with $x$ the input of the step function.
From the formula $\phi_i = \sum_{j \in I_S \setminus \{i\}} e^{-(d_{ij}/d_c)^2}$ it can be seen that the relative size of $\phi_i$ depends on the number of samples whose distance to $x_i$ is less than $d_c$: the more such samples there are, the larger $\phi_i$. This formula turns $\phi_i$ from a discrete value into a continuous one, which improves the accuracy of the local density calculation.
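The minimum distance $\theta_i$ is calculated, in the standard CFSFDP form, as $\theta_i = \min_{j:\, \phi_j > \phi_i} d_{ij}$ if some sample has a higher local density than $x_i$, and as $\theta_i = \max_{j \in I_S} d_{ij}$ otherwise. A sample whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold is taken as a third sample.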
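The selection of cluster centers from local density and minimum distance can be sketched as follows. The sketch assumes the cutoff-kernel density (a count of neighbours closer than $d_c$), Euclidean distance, and the usual CFSFDP convention that a sample with no denser neighbour takes its largest distance as $\theta_i$; the function name `density_peaks` and the thresholds in the usage note are hypothetical.

```python
import numpy as np


def density_peaks(X, d_c, density_threshold, distance_threshold):
    """Return (phi, theta, center_indices) for the samples in X (one row each)."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise d_ij
    phi = (d < d_c).sum(axis=1) - 1        # neighbours within d_c, excluding self
    theta = np.empty(len(X))
    for i in range(len(X)):
        denser = phi > phi[i]              # samples with higher local density
        theta[i] = d[i, denser].min() if denser.any() else d[i].max()
    centers = np.where((phi > density_threshold) & (theta > distance_threshold))[0]
    return phi, theta, centers
```

With two well-separated groups of four points each, where one point per group sits in the middle, calling `density_peaks(X, d_c=0.9, density_threshold=2, distance_threshold=5)` selects exactly the two middle points as cluster centers: they are dense, and they are far from any denser point.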
The distance between the third sample and the first sample, like $d_{ij}$, may be calculated with the Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, or cosine distance.
In this implementation, the local density and minimum distance of each sample in the first training sample set are calculated, and the first sample is determined to lie within the boundary distance of the cluster if its distance to the third sample is less than that boundary distance, which improves the efficiency of determining whether the first sample lies within the boundary distance of a cluster.
After the step of obtaining the class of the first sample and the result of whether the first sample lies within the boundary distance of a cluster through the above implementation, the network flow identification method provided by the embodiment of the present invention further includes: updating the cluster centers of the clusters.
In one possible implementation, the cluster centers are updated as follows:
Step 1: based on the local density and minimum distance of each sample in the first training sample set, take each sample whose local density exceeds the density threshold and whose minimum distance is less than the distance threshold as a fourth sample;
Step 2: add each fourth sample to the cluster of the cluster center nearest to it;
Step 3: if the first sample lies within the boundary distance of a cluster, add the first sample to the cluster of the cluster center nearest to it;
Step 4: calculate the local density and minimum distance of the samples in each cluster; for each cluster, determine the sample whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold as the cluster center of that cluster, and take the cluster with the updated cluster center as the updated cluster.
In this implementation, updating the cluster centers improves the accuracy of determining whether the first sample lies within the boundary distance of a cluster.
In another possible implementation, the class of the first sample and the result of whether the first sample lies within the boundary distance of a cluster are obtained as follows:
Step 1: input the first sample into the semi-supervised model, and use the output nodes of the semi-supervised model to output the probabilities that the first sample belongs to the different classes, together with the result of whether the first sample lies within the boundary distance of a cluster; each output node corresponds to one class to which the first sample may belong.
For example, the g-th output node outputs the probability that the first sample belongs to the g-th class.
Step 2: among the probabilities output by the output nodes, select the class with the highest probability as the class of the first sample.
In this implementation, selecting the class with the highest probability as the class of the first sample improves the accuracy of determining the class of the first sample.
To improve the real-time performance of data stream classification, S103 above may obtain the online recognition model using at least one of the following implementations:
In one possible implementation, the online recognition model is obtained as follows:
Step 1: in the case where the first sample lies within the boundary distance of a cluster, if the second training sample set contains a second sample of the same class as the first sample, determine that the first sample is not a sample of a new class, and update the parameters of the preset machine recognition model; the second training sample set is the set of data streams used to train the machine recognition model;
Step 2: in the case where the first sample lies within the boundary distance of a cluster, if the second training sample set contains no second sample of the same class as the first sample, determine that the first sample is a sample of a new class, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the online recognition model.
Referring to Figs. 6 and 7, the preset machine recognition model $M_o$ is a CNN (Convolutional Neural Network). The output layer of $M_o$ contains K nodes, the layer preceding the output layer contains J nodes, and the connections between the output layer and the preceding layer represent the parameters of the output layer. When the first sample lies within the boundary distance of a cluster and the second training sample set contains a second sample of the same class as the first sample, the first sample is determined not to be a sample of a new class, and the parameters between the output layer and the preceding layer of the preset machine recognition model are updated. When the first sample lies within the boundary distance of a cluster and the second training sample set contains no such second sample, the first sample is determined to be a sample of a new class, one output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output node is used as the online recognition model.
In another possible implementation, in the case where the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new class, the dimension of the parameters of the preset machine recognition model is increased by one, and the machine recognition model with the increased parameter dimension is used as the online recognition model.
When the first sample belongs to a new class, that is, $X \in C_{K+1}$, adding one output node to the parameterized preset machine recognition model means increasing the dimension of its output layer parameters by one: $W \in \mathbb{R}^{J \times K} \rightarrow W \in \mathbb{R}^{J \times (K+1)}$, $b \in \mathbb{R}^{K} \rightarrow b \in \mathbb{R}^{K+1}$, and $\rho \in \mathbb{R}^{K} \rightarrow \rho \in \mathbb{R}^{K+1}$, where $W$ is the set of weights, $\mathbb{R}$ is the set of real numbers, $b$ is the set of biases, $\rho$ is the set of proficiencies, $\rightarrow$ denotes assignment, and $K$ is the total number of output nodes.
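Growing the output layer by one node can be sketched as appending one column to the weight matrix and one entry to the bias and proficiency vectors. The small random initialisation of the new weights, the zero bias, and the name `add_output_node` are assumptions; the zero initial proficiency follows the stated initial value of proficiency.

```python
import numpy as np


def add_output_node(W, b, rho, rng=None):
    """Extend W from (J, K) to (J, K+1) and b, rho from (K,) to (K+1,).

    Existing parameters are kept unchanged; only the new node is initialised.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    J = W.shape[0]
    new_column = rng.normal(0.0, 0.01, size=(J, 1))  # assumed initialisation
    W_new = np.hstack([W, new_column])
    b_new = np.append(b, 0.0)
    rho_new = np.append(rho, 0.0)                    # new class starts at proficiency 0
    return W_new, b_new, rho_new
```

Because the first K columns are copied as they are, the outputs for the existing classes before the softmax are unchanged until the next parameter update.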
In yet another possible implementation, the online recognition model is obtained as follows:
Step 1: in the case where the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new class, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the basic recognition model;
Step 2: input the first sample into the basic recognition model, and calculate the partial derivatives of the loss function of the basic recognition model with respect to the output layer weights and biases of the basic recognition model;
Step 3: update the weights and biases of the output layer of the basic recognition model in the direction of gradient descent using the parameter update formula; the parameter update formula contains the products of the proficiency increment with the partial derivatives of the loss function with respect to the output layer weights and with respect to the output layer biases, respectively;
Step 4: determine the basic recognition model with the updated weights and biases as the online recognition model.
Assume that the output of the layer preceding the output layer of the basic recognition model is $F = \{f_j\} \in \mathbb{R}^J$, the output of the output layer is $Y = \{y_k\} \in \mathbb{R}^K$, the weights are $W = \{w_{jk}\} \in \mathbb{R}^{J \times K}$, and the biases are $b = \{b_k\} \in \mathbb{R}^K$. The output layer of the basic recognition model is a softmax layer, so the cross-entropy loss function of the basic recognition model can be derived as $L = -\sum_{k=1}^{K} t_k \log y_k$, where $T = \{t_k\} \in \mathbb{R}^K$ is the one-hot encoding of the data stream class, $y_k = g(z_k)$ is the activation value of the k-th output node, and $g(\cdot)$ is the softmax function, $g(z_k) = e^{z_k} / \sum_{i=1}^{K} e^{z_i}$ with $z_k = \sum_{j=1}^{J} w_{jk} f_j + b_k$. Here $f_j$ is the feature activation value of the j-th node, $b_k$ is the bias of the k-th node, $\mathbb{R}^K$ denotes K dimensions, $t_k$ is the k-th bit of the one-hot encoding, $z_k$ is the activation value of the k-th node before the softmax, $w_{jk}$ is the weight between node $j$ of the layer preceding the output layer and output node $k$, $i$ and $k$ are indices of output nodes and take positive integer values, $k$ also indexes the bits of the one-hot encoding, $j$ is the index of a node in the layer preceding the output layer, $K$ is the total number of output nodes, and $J$ is the number of nodes in the layer preceding the output layer. Taking partial derivatives of the above loss function gives the increments of the weights and biases in the direction of gradient descent: $\Delta w_{jk} = -\partial L / \partial w_{jk} = (t_k - y_k) f_j$ and $\Delta b_k = -\partial L / \partial b_k = t_k - y_k$, where $t_k = I\{X \in C_k\}$ and $I\{\cdot\}$ is the indicator function, equal to 1 when its condition holds and 0 otherwise.
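The partial derivatives of a softmax output layer under cross-entropy loss have the well-known closed form $\partial L / \partial w_{jk} = (y_k - t_k) f_j$ and $\partial L / \partial b_k = y_k - t_k$. The sketch below computes the loss and these gradients with NumPy; the function names are hypothetical, and the increments applied in the direction of gradient descent are the negatives of the returned gradients.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()


def output_layer_grads(F, W, b, T):
    """Cross-entropy loss and its gradients for one sample.

    F: (J,) activations of the preceding layer, W: (J, K), b: (K,),
    T: (K,) one-hot encoding of the true class.
    """
    z = F @ W + b                      # z_k = sum_j w_jk * f_j + b_k
    y = softmax(z)                     # y_k = g(z_k)
    loss = -np.sum(T * np.log(y))      # L = -sum_k t_k log y_k
    dW = np.outer(F, y - T)            # dL/dw_jk = (y_k - t_k) * f_j
    db = y - T                         # dL/db_k  = y_k - t_k
    return loss, dW, db
```

A quick way to gain confidence in the closed form is to perturb a single weight by a small amount and compare the change in loss with the corresponding entry of `dW`.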
It will be understood that each time the basic recognition model is trained with samples of a new class to obtain the online recognition model, its parameters keep adapting to the new-class samples and learning their features, while samples of the old classes no longer take part in training. This unconstrained way of letting the online model adapt has a serious problem: when the internal parameters of the basic recognition model change, the features already learned are affected, and the previous ability of the basic recognition model may even be destroyed completely, causing serious errors when identifying samples that are not of the new class. This is the "catastrophic forgetting" problem.
To solve catastrophic forgetting, the basic recognition model must strike a balance between learning samples of the new class and retaining the old classes. When the stability of the basic recognition model is high, it tends to retain the features of old-class samples, and its ability to learn the features of new-class samples is weakened. Conversely, when its plasticity is high, it learns new-class samples more readily but also forgets the features of old-class samples more easily. The key to training an online recognition model that adapts to the network environment is therefore to strike a suitable balance between stability and plasticity.
To make the stability-plasticity trade-off of the online recognition model controllable, the embodiment of the present invention proposes a proficiency mechanism. The mechanism introduces an additional set of parameters $\rho = \{\rho_k\} \in \mathbb{R}^K$, where $\rho$ is the proficiency set and $\rho_k$ is the proficiency of the basic recognition model for the class output by the k-th node. Proficiency measures the ability of the online recognition model to recognize samples of each class.
Here $\rho_k \in [0, 1)$ with an initial value of 0, meaning that initially the basic recognition model has a classification proficiency of 0 for every class. To use proficiency to influence the stability and plasticity of the model, the proficiency $\rho$ should have the following properties:
1) Proficiency is affected by the results of identifying sample classes. The more often samples of a class are identified correctly, the higher the corresponding proficiency; the more often they are identified incorrectly, the lower the corresponding proficiency.
2) Proficiency affects its own change. When proficiency is low, raising or lowering it further is easy, that is, it increases or decreases quickly; the higher the proficiency, the harder it is to raise or lower it further, that is, it increases or decreases more slowly.
3) Proficiency affects how hard it is to learn or forget knowledge. When proficiency is low, learning more features of new-class samples or forgetting the features of old classes is relatively easy, that is, the model parameters are updated faster; conversely, when proficiency is high, learning or forgetting is harder, that is, the model parameters are updated more slowly.
For example, if $X \in C_k$ and $Y \in C_k$, the sample $X$ of the k-th class has been classified correctly, and the corresponding proficiency $\rho_k$ increases; if $X \in C_i$ but $Y \in C_j$, a sample of the i-th class has been misidentified as the j-th class, and the corresponding $\rho_i$ and $\rho_j$ decrease.
Referring to Fig. 8, to realize properties 2 and 3, the embodiment of the present invention proposes a proficiency function for calculating the proficiency increment:
The proficiency function is denoted $\mathrm{prof}(\rho_k)$ and has two parameters, $\alpha$ and $\beta$, which control its overall trend. Fig. 8 shows how the proficiency function changes under different $\alpha$ and $\beta$. When $\rho_k$ is small, the proficiency increment $\mathrm{prof}(\rho_k)$ is large; as $\rho_k$ increases, $\mathrm{prof}(\rho_k)$ and its derivative both gradually decrease; when $\rho_k$ reaches the limit value 1, $\mathrm{prof}(\rho_k)$ is 0 and the proficiency $\rho_k$ is no longer updated.
The update formula for the proficiency $\rho_k$ is $\rho_k \leftarrow \rho_k \pm \mathrm{prof}(\rho_k)$, where the plus sign applies when the class is identified correctly and the minus sign when it is misidentified.
Referring to Fig. 9, as proficiency increases, the proficiency increment $\mathrm{prof}(\rho_k)$ gradually decreases, that is, the magnitude of the parameter updates of the basic recognition model gradually decreases. Fig. 9 shows how $\mathrm{prof}(\rho_k)$ changes under different parameters; it can be seen that by adjusting the parameters $\alpha$ and $\beta$ of the proficiency increment, the update speed of the basic recognition model can be controlled.
Therefore, when the weights and biases of the output layer of the basic recognition model are updated with the parameter update formula, the proficiency increment is applied: the increment $\Delta w_{jk}$ of the weight $w_{jk}$ and the increment $\Delta b_k$ of the bias $b_k$ are each multiplied by $\mathrm{prof}(\rho_k)$. When the proficiency function $\mathrm{prof}(\rho_k)$ is used to update the weights $W$ and biases $b$, its parameters must be set so that $\mathrm{prof}(0) = 1$, that is, when the proficiency $\rho_k = 0$ the proficiency function does not affect the model update. As proficiency increases, the increment coefficient $\mathrm{prof}(\rho_k)$ gradually decreases, that is, the magnitude of the parameter updates of the basic recognition model gradually decreases; as $\rho_k \rightarrow 1$, $\mathrm{prof}(\rho_k) \rightarrow 0$, that is, the update magnitude of the basic recognition model tends to 0.
Referring to Fig. 10, which shows how $\mathrm{prof}(\rho_k)$ changes for different values of the parameter $\beta$: the larger $\beta$ is, the faster $\mathrm{prof}(\rho_k)$ decreases.
From the above analysis, by introducing the proficiency set $\rho$ and the proficiency function $\mathrm{prof}(\rho_k)$, the two parameters $\alpha$ and $\beta$ of the proficiency function can control the ability and speed with which the basic recognition model updates its parameters, thereby achieving a trade-off between the stability and plasticity of the online recognition model and solving the "catastrophic forgetting" problem.
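The proficiency mechanism can be illustrated with a small sketch. The closed form of the proficiency function is not available here, so `prof` below is a hypothetical stand-in, $\alpha (1 - \rho)^{\beta}$, chosen only because it has the described properties: it equals 1 at $\rho = 0$ when $\alpha = 1$, it falls to 0 as $\rho \rightarrow 1$, and it falls faster for larger $\beta$. The learning rate `lr`, the clipping of proficiency to $[0, 1)$, and all function names are likewise assumptions.

```python
import numpy as np


def prof(rho, alpha=1.0, beta=2.0):
    """HYPOTHETICAL proficiency function; not the formula of the original text."""
    return alpha * (1.0 - rho) ** beta


def update_proficiency(rho_k, correct, alpha=0.1, beta=2.0):
    """rho_k <- rho_k + prof(rho_k) on a correct result, minus on an error."""
    step = prof(rho_k, alpha, beta)
    new_rho = rho_k + step if correct else rho_k - step
    return float(min(max(new_rho, 0.0), 1.0 - 1e-9))   # assumed clipping to [0, 1)


def scaled_update(W, b, rho, delta_W, delta_b, beta=2.0, lr=0.1):
    """Apply the increments, each scaled by prof(rho_k) of its output node k."""
    scale = prof(np.asarray(rho, dtype=float), 1.0, beta)  # alpha = 1 gives prof(0) = 1
    return W + lr * scale * delta_W, b + lr * scale * delta_b
```

In this sketch a class with proficiency 0 receives the full update, while a class with proficiency 0.5 receives only a quarter of it when $\beta = 2$, so well-learned classes are changed less by new samples.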
To illustrate the one-hot encoding $T$ of the data stream class, assume that the samples of the first training set are divided into 6 classes and that the output layer of the basic recognition model has 6 nodes. Each class of samples is assigned a number. The sample classes of the first training set are "RDP (Remote Desktop Protocol)", "BitTorrent", "Web (World Wide Web)", "SSH (Secure Shell)", "eDonkey (eDonkey Network)" and "NTP (Network Time Protocol)", numbered 0, 1, 2, 3, 4 and 5, with the corresponding one-hot encodings 100000, 010000, 001000, 000100, 000010 and 000001. Assume that a sample in the first training set has the label 0, so that its class is "RDP", and that the 1st to 6th output nodes of the basic recognition model output 0.5, 0.1, 0.1, 0.1, 0.1 and 0.1 for this sample. The basic recognition model thus assigns the highest probability to number 0, and its loss function is:
$L = -1 \cdot \log 0.5 + (-0 \cdot \log 0.1) + (-0 \cdot \log 0.1) + (-0 \cdot \log 0.1) + (-0 \cdot \log 0.1) + (-0 \cdot \log 0.1)$.
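The worked example can be reproduced in a few lines of Python. The sketch assumes the natural logarithm, which the text does not specify, and the helper names `one_hot` and `cross_entropy` are chosen for illustration.

```python
import math

CLASSES = ["RDP", "BitTorrent", "Web", "SSH", "eDonkey", "NTP"]


def one_hot(label, num_classes):
    """One-hot encoding with a 1 at position `label` (numbered from 0)."""
    return [1 if k == label else 0 for k in range(num_classes)]


def cross_entropy(t, y):
    """L = -sum_k t_k * log(y_k), using the natural logarithm."""
    return -sum(tk * math.log(yk) for tk, yk in zip(t, y))


t = one_hot(CLASSES.index("RDP"), len(CLASSES))   # label 0
y = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]                # outputs of the 6 nodes
loss = cross_entropy(t, y)                        # only the true-class term is non-zero
```

Only the term of the true class contributes, so the loss reduces to $-\log 0.5$; a more confident correct prediction would give a smaller loss.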
To improve the real-time performance of data stream classification, S104 above may identify the class of the packet header data of the next data stream after the current data stream using at least one of the following implementations:
In one possible implementation, the class of the packet header data of the next data stream after the current data stream is identified as follows:
Step 1: when the number of data packets in the next data stream after the current data stream reaches a predetermined number, extract the packet header data of the data packets in the next data stream as a second sample;
Step 2: input the second sample into the online recognition model, and use the online recognition model to output the class of the second sample.
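The two steps above can be sketched as a small buffer that hands a stream to the model as soon as the predetermined number of packets has arrived, without waiting for the stream to end. The class name `EarlyFlowClassifier`, the `model` callable, and the flow key are hypothetical; any callable that maps a list of packet headers to a class could stand in for the online recognition model.

```python
from collections import defaultdict


class EarlyFlowClassifier:
    """Classify a stream once `packet_threshold` of its packets have been seen."""

    def __init__(self, model, packet_threshold):
        self.model = model                        # stand-in for the online recognition model
        self.packet_threshold = packet_threshold  # the predetermined number of packets
        self.buffers = defaultdict(list)          # flow key -> packet headers so far

    def on_packet(self, flow_key, header):
        """Buffer one packet header; return a class when the threshold is reached."""
        buf = self.buffers[flow_key]
        buf.append(header)
        if len(buf) == self.packet_threshold:
            return self.model(list(buf))          # the buffered headers form the second sample
        return None
```

Classifying after a fixed number of packets is what lets the identification run while the stream is still in progress.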
The network flow identification device provided by the embodiment of the present invention is described next.
As shown in Fig. 11, a network flow identification device provided by the embodiment of the present invention is applied to a server, and the device includes:
a sample module 1101, configured to extract the packet header data of the data packets in the current data stream as a first sample when reception of the current data stream is complete;
a supervision module 1102, configured to input the first sample into the semi-supervised model and use the semi-supervised model to output the class of the first sample and the result of whether the first sample lies within the boundary distance of a cluster; the semi-supervised model is obtained by training on the first training sample set and contains the distribution relationship between the packet header data whose class has been obtained and the remaining samples of the first training sample set; the first training sample set contains at least one sample with a class label; the distribution relationship determines the result of whether a labeled sample lies within the boundary distance of a cluster;
a change module 1103, configured to, in the case where the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new class, add one output node to the output nodes of the preset machine recognition model and use the machine recognition model with the added output node as the online recognition model;
an identification module 1104, configured to use the online recognition model to identify the class of the next data stream after the current data stream.
Optionally, the network flow identification device provided by the embodiment of the present invention further includes:
a storage unit, configured to receive the data packets of the current data stream in sequence and obtain the five-tuple information of each data packet;
to judge whether the database already stores the five-tuple information and, if it does, to save the packet header data of the data packet to the storage area at the path corresponding to the five-tuple information;
and, if the database does not store the five-tuple information, to create a storage area at the path corresponding to the five-tuple information and save the packet header data of the data packet to that storage area.
The sample module is specifically configured to:
judge whether any data packet of the current data stream contains an end flag; if a data packet contains the end flag, reception of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
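The storage unit and the end-flag check can be sketched with an in-memory dictionary keyed by the five-tuple. This is only an illustration: the text describes a database with per-flow storage paths, whereas the sketch uses a Python dict, and the names `FiveTuple`, `FlowStore` and the `is_last` argument (standing for a packet that carries the end flag) are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


class FlowStore:
    """Group packet headers by five-tuple and release a stream when it ends."""

    def __init__(self) -> None:
        self._flows: Dict[FiveTuple, List[bytes]] = {}

    def add_packet(self, key: FiveTuple, header: bytes,
                   is_last: bool = False) -> Optional[List[bytes]]:
        # Create the storage area on first sight of the five-tuple, then append.
        self._flows.setdefault(key, []).append(header)
        if is_last:
            # End flag seen: reception is complete, hand back the first sample.
            return self._flows.pop(key)
        return None
```

Removing the stream from the store once it has been released is a design choice of the sketch, made so that a later stream reusing the same five-tuple starts from an empty storage area.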
The sample module is specifically configured to:
extract the packet header data of the data packets in the current data stream when reception of the current data stream is complete;
encode the packet header data of the data packets in the current data stream to obtain a fixed-dimension vector, and use the fixed-dimension vector as the first sample.
The supervision module is specifically configured to:
input the first sample into the semi-supervised model, and use the semi-supervised model to output the class of the first sample;
calculate the local density and minimum distance of each sample in the first training sample set, and take each sample whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold as a third sample; the first training sample set is the set of samples used to train the semi-supervised model;
add each third sample to a cluster and determine it as the cluster center of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determine that the first sample does not lie within the boundary distance of the cluster;
if the distance between the first sample and the third sample is less than the boundary distance of the cluster, determine that the first sample lies within the boundary distance of the cluster.
Change module is specifically used for:
When the first sample is located within the boundary distance of a cluster, if the second training sample set contains a second sample of the same category as the first sample, determine that the first sample is not a sample of a new category, and update the parameters of the preset machine recognition model; the second training sample set is the set of data flows used to train the machine recognition model;
When the first sample is located within the boundary distance of a cluster, if the second training sample set contains no second sample of the same category as the first sample, determine that the first sample is a sample of a new category, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the online recognition model.
The change module is specifically configured to:
When the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, increase the parameter dimension of the preset machine recognition model by one, and use the machine recognition model with the increased parameter dimension as the online recognition model.
The change module is specifically configured to:
When the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the basic recognition model;
Input the first sample into the basic recognition model, and calculate the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and biases of the basic recognition model;
Update the output-layer weights and biases of the basic recognition model in the direction of gradient descent using a parameter update formula; the parameter update formula includes the products of the learning rate with the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and biases, respectively;
Determine the basic recognition model with the updated weights and biases as the online recognition model.
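The change module's steps can be illustrated with a minimal sketch. It assumes a softmax output layer trained with cross-entropy loss and a zero-initialised new node; the text above fixes none of these, and the names add_output_node, output_layer_step, and lr are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def add_output_node(W, b):
    """Return copies of W and b with one extra output node (zero-initialised, an assumption)."""
    n_in = len(W[0])
    return [row[:] for row in W] + [[0.0] * n_in], b[:] + [0.0]

def output_layer_step(W, b, h, label, lr=0.1):
    """One gradient-descent step on the output layer for a single sample.

    h is the input to the output layer; label is the index of the true category.
    Returns the updated weights, updated biases, and the probabilities before the update.
    """
    logits = [sum(w * x for w, x in zip(row, h)) + bk for row, bk in zip(W, b)]
    p = softmax(logits)
    # For softmax with cross-entropy, d(loss)/d(logit_k) = p_k - y_k.
    grad_z = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    W_new = [[w - lr * g * x for w, x in zip(row, h)] for row, g in zip(W, grad_z)]
    b_new = [bk - lr * g for bk, g in zip(b, grad_z)]
    return W_new, b_new, p

# Two known categories, then a sample of a new third category (index 2) arrives.
W = [[0.2, -0.1], [0.0, 0.3]]
b = [0.0, 0.0]
h = [1.0, 2.0]
W2, b2 = add_output_node(W, b)
W3, b3, p_before = output_layer_step(W2, b2, h, label=2)
```

Only the output layer is touched in this sketch, which matches the stated goal of adapting the model without retraining it from scratch.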
The identification module is specifically configured to:
When the number of data packets in the next data flow after the current data flow reaches a predetermined number, extract the header data of the data packets in the next data flow as a second sample;
Input the second sample into the online recognition model, and use the online recognition model to output the category of the second sample.
An embodiment of the present invention also provides an electronic device, as shown in Figure 12, including a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204, where the processor 1201, the communication interface 1202, and the memory 1203 communicate with one another through the communication bus 1204,
The memory 1203 is configured to store a computer program;
The processor 1201 is configured to implement the following steps when executing the program stored in the memory 1203:
When reception of the current data flow is complete, extract the header data of the data packets in the current data flow as a first sample;
Input the first sample into the semi-supervised model, and use the semi-supervised model to output the category of the first sample and a result indicating whether the first sample is located within the boundary distance of a cluster;
When the result is that the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, add one output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the online recognition model;
Use the online recognition model to identify the category of the next data flow after the current data flow.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment provided by the present invention, a computer-readable storage medium is also provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the network traffic identification method of any of the above embodiments.
In another embodiment provided by the present invention, a computer program product containing instructions is also provided. When run on a computer, it causes the computer to execute the network traffic identification method of any of the above embodiments.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device, electronic device, computer-readable storage medium, and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A network traffic identification method, characterized in that it is applied to a server, the method comprising:
When reception of a current data flow is complete, extracting the header data of the data packets in the current data flow as a first sample;
Inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result indicating whether the first sample is located within the boundary distance of a cluster; the semi-supervised model is obtained by training with a first training sample set and includes the categories of the obtained header data and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines the result of whether the sample with a category label is located within the boundary distance of a cluster;
When the result is that the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of a preset machine recognition model, and using the machine recognition model with the added output node as an online recognition model;
Using the online recognition model to identify the category of the next data flow after the current data flow.
2. The method according to claim 1, characterized in that, before the step of extracting, when reception of the current data flow is complete, the header data of the data packets in the current data flow as the first sample, the method further comprises:
Receiving the data packets of the current data flow in sequence, and obtaining the five-tuple information of each data packet;
Determining whether a database stores the five-tuple information; if the database stores the five-tuple information, saving the header data of the data packet to the storage area at the path corresponding to the five-tuple information;
If the database does not store the five-tuple information, creating a storage area at the path corresponding to the five-tuple information, and saving the header data of the data packet to that storage area.
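The five-tuple bookkeeping in claim 2 can be sketched as follows. This is an illustration only: an in-memory dictionary stands in for the database and its storage areas, and the names FiveTuple, store_header, and flow_table are hypothetical.

```python
from collections import namedtuple

# Source IP, destination IP, source port, destination port, transport protocol.
FiveTuple = namedtuple("FiveTuple", "src_ip dst_ip src_port dst_port protocol")

def store_header(flow_table, five_tuple, header):
    """Save a packet header in the storage area that corresponds to its five-tuple."""
    if five_tuple not in flow_table:
        # Five-tuple not stored yet: create its storage area first.
        flow_table[five_tuple] = []
    flow_table[five_tuple].append(header)
    return flow_table

flow_table = {}
key = FiveTuple("10.0.0.1", "10.0.0.2", 50000, 443, "TCP")
store_header(flow_table, key, b"\x45\x00")
store_header(flow_table, key, b"\x45\x01")
```

Packets that share a five-tuple land in the same list, so the headers of one data flow stay together in arrival order.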
3. The method according to claim 1, characterized in that extracting, when reception of the current data flow is complete, the header data of the data packets in the current data flow as the first sample comprises:
Determining whether each data packet of the current data flow contains an end identifier; if any data packet contains the end identifier, reception of the data flow is complete, and the header data of the data packets in the data flow are extracted as the first sample.
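A minimal sketch of the completion check in claim 3 follows. The claim does not say what the end identifier is; modelling it as a TCP FIN flag is an assumption made here, as are the packet dictionary layout and the function names.

```python
def flow_complete(packets, end_flag="FIN"):
    """A flow counts as fully received once any packet carries the end identifier."""
    return any(end_flag in pkt["flags"] for pkt in packets)

def extract_first_sample(packets):
    """Return the packet headers of a completed flow, or None while it is still open."""
    if not flow_complete(packets):
        return None
    return [pkt["header"] for pkt in packets]

open_flow = [
    {"flags": {"SYN"}, "header": b"\x01"},
    {"flags": {"ACK"}, "header": b"\x02"},
]
closed_flow = open_flow + [{"flags": {"ACK", "FIN"}, "header": b"\x03"}]
```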
4. The method according to claim 1, characterized in that extracting, when reception of the current data flow is complete, the header data of the data packets in the current data flow as the first sample comprises:
When reception of the current data flow is complete, extracting the header data of the data packets in the current data flow;
Encoding the header data of the data packets in the current data flow to obtain a vector of fixed dimension, and using the vector of fixed dimension as the first sample.
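One simple way to obtain a fixed-dimension vector is sketched below. The claim does not specify the encoding; concatenating the header bytes, truncating or zero-padding them, and scaling to [0, 1] is an assumption, and encode_headers and dim are hypothetical names.

```python
def encode_headers(headers, dim=16):
    """Encode a list of header byte strings as a vector of exactly `dim` floats."""
    raw = b"".join(headers)[:dim]          # truncate if the flow has too many bytes
    raw = raw + bytes(dim - len(raw))      # zero-pad if it has too few
    return [byte / 255.0 for byte in raw]  # scale each byte to the range [0, 1]

short_vector = encode_headers([b"\xff\x00"], dim=4)
long_vector = encode_headers([b"\x00" * 10, b"\xff" * 10], dim=16)
```

Every flow then maps to a vector of the same length regardless of how many packets it contained, which is what a fixed-size model input requires.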
5. The method according to claim 1, characterized in that inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the category of the first sample and the result indicating whether the first sample is located within the boundary distance of a cluster, comprises:
Inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the category of the first sample;
Calculating the local density and the minimum distance of each sample in the first training sample set, and taking each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used to train the semi-supervised model;
Adding each third sample to a cluster, and determining the third sample as the center point of that cluster; the number of clusters is the same as the number of third samples, and each cluster has only one third sample;
If the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determining that the first sample is not located within the boundary distance of the cluster;
If the distance between the first sample and the third sample does not exceed the boundary distance of the cluster, determining that the first sample is located within the boundary distance of the cluster.
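The center selection and boundary test in claim 5 can be sketched as below. Several details are assumptions not stated in the claim: local density is counted as the number of neighbours within a hypothetical cutoff distance, the minimum distance is taken to the nearest denser sample, and a sample with no denser sample is assigned its largest distance to any sample.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def density_peaks(samples, cutoff, density_threshold, distance_threshold):
    """Return the indices of the samples selected as cluster centers (third samples)."""
    n = len(samples)
    # Local density: number of other samples closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist(samples[i], samples[j]) < cutoff)
           for i in range(n)]
    centers = []
    for i in range(n):
        to_denser = [dist(samples[i], samples[j]) for j in range(n) if rho[j] > rho[i]]
        # Minimum distance to a denser sample; fall back to the largest distance.
        delta = min(to_denser) if to_denser else max(dist(samples[i], s) for s in samples)
        if rho[i] > density_threshold and delta > distance_threshold:
            centers.append(i)
    return centers

def within_boundary(sample, samples, centers, boundary):
    """True if the sample lies within the boundary distance of at least one center."""
    return any(dist(sample, samples[c]) <= boundary for c in centers)

cluster_a = [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0)]
cluster_b = [(10, 10), (10, 11), (11, 10), (10, 9), (9, 10)]
samples = cluster_a + cluster_b
centers = density_peaks(samples, cutoff=1.2, density_threshold=2, distance_threshold=5.0)
```

In this toy data each group has one point surrounded by four neighbours, so exactly one center per group is selected and the number of clusters equals the number of third samples.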
6. The method according to claim 1, characterized in that, when the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model, comprises:
When the first sample is located within the boundary distance of a cluster, if the second training sample set contains a second sample of the same category as the first sample, determining that the first sample is not a sample of a new category, and updating the parameters of the preset machine recognition model; the second training sample set is the set of data flows used to train the machine recognition model;
When the first sample is located within the boundary distance of a cluster, if the second training sample set contains no second sample of the same category as the first sample, determining that the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model.
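The decision in claim 6 reduces to a membership test on the categories already seen in training. The sketch below is illustrative; the function name and return values are hypothetical, and the case where the sample lies outside every cluster boundary is not covered by this claim, so the sketch returns None for it.

```python
def choose_model_change(first_category, second_training_categories, in_boundary):
    """Decide how the preset machine recognition model should change for one sample."""
    if not in_boundary:
        return None                    # case not addressed by this claim
    if first_category in second_training_categories:
        return "update_parameters"     # known category: only update parameters
    return "add_output_node"           # new category: grow the output layer

known = {"web", "video", "mail"}
decision_known = choose_model_change("video", known, in_boundary=True)
decision_new = choose_model_change("game", known, in_boundary=True)
```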
7. The method according to claim 1, characterized in that, when the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model, comprises:
When the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, increasing the parameter dimension of the preset machine recognition model by one, and using the machine recognition model with the increased parameter dimension as the online recognition model.
8. The method according to claim 1, characterized in that, when the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model, comprises:
When the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, adding one output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as a basic recognition model;
Inputting the first sample into the basic recognition model, and calculating the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and biases of the basic recognition model;
Updating the output-layer weights and biases of the basic recognition model in the direction of gradient descent using a parameter update formula; the parameter update formula includes the products of the learning rate with the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and biases of the basic recognition model, respectively;
Determining the basic recognition model with the updated weights and biases as the online recognition model.
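Written out, the update described in claim 8 takes the usual gradient-descent form. The symbols are assumptions introduced here for illustration: W and b are the output-layer weights and biases, L is the loss function of the basic recognition model, and \eta is the learning rate.

```latex
W \leftarrow W - \eta \, \frac{\partial L}{\partial W},
\qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
```

Each parameter moves against its partial derivative, scaled by \eta, which is the product the claim refers to.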
9. The method according to claim 1, characterized in that using the online recognition model to identify the category of the next data flow after the current data flow comprises:
When the number of data packets in the next data flow after the current data flow reaches a predetermined number, extracting the header data of the data packets in the next data flow as a second sample;
Inputting the second sample into the online recognition model, and using the online recognition model to output the category of the second sample.
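Claim 9 classifies the next flow as soon as a predetermined number of packets has arrived, rather than waiting for the flow to end. A minimal sketch follows; the encoder and the classifier are toy stand-ins, and every name in it is hypothetical.

```python
def classify_when_ready(headers, predetermined_number, encode, predict):
    """Return a category once enough packets have arrived, otherwise None."""
    if len(headers) < predetermined_number:
        return None  # keep waiting for more packets
    second_sample = encode(headers[:predetermined_number])
    return predict(second_sample)

def toy_encode(headers):
    # Stand-in encoder: one feature per packet, the header length.
    return [len(h) for h in headers]

def toy_predict(sample):
    # Stand-in for the online recognition model.
    return "large" if sum(sample) > 100 else "small"

result_early = classify_when_ready([b"a" * 60], 3, toy_encode, toy_predict)
result_ready = classify_when_ready(
    [b"a" * 60, b"b" * 60, b"c" * 10], 3, toy_encode, toy_predict
)
```

Because only the first few packets are needed, the category is available while the flow is still in progress.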
10. A network traffic identification device, characterized in that it is applied to a server, the device comprising:
A sample module, configured to extract, when reception of a current data flow is complete, the header data of the data packets in the current data flow as a first sample;
A supervision module, configured to input the first sample into a semi-supervised model, and to use the semi-supervised model to output the category of the first sample and a result indicating whether the first sample is located within the boundary distance of a cluster; the semi-supervised model is obtained by training with a first training sample set and includes the categories of the obtained header data and the distribution relationship of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a category label; the distribution relationship determines the result of whether the sample with a category label is located within the boundary distance of a cluster;
A change module, configured to, when the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new category, add one output node to the output nodes of a preset machine recognition model, and use the machine recognition model with the added output node as an online recognition model;
An identification module, configured to use the online recognition model to identify the category of the next data flow after the current data flow.
CN201910036196.2A 2019-01-15 2019-01-15 Network traffic identification method and device Active CN109873774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910036196.2A CN109873774B (en) 2019-01-15 2019-01-15 Network traffic identification method and device

Publications (2)

Publication Number Publication Date
CN109873774A true CN109873774A (en) 2019-06-11
CN109873774B CN109873774B (en) 2021-01-01

Family

ID=66917604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910036196.2A Active CN109873774B (en) 2019-01-15 2019-01-15 Network traffic identification method and device

Country Status (1)

Country Link
CN (1) CN109873774B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111447151A (en) * 2019-10-30 2020-07-24 长沙理工大学 Attention mechanism-based time-space characteristic flow classification research method
CN111614514A (en) * 2020-04-30 2020-09-01 北京邮电大学 Network traffic identification method and device
CN112367334A (en) * 2020-11-23 2021-02-12 中国科学院信息工程研究所 Network traffic identification method and device, electronic equipment and storage medium
CN113326946A (en) * 2020-02-29 2021-08-31 华为技术有限公司 Method, device and storage medium for updating application recognition model
CN113472654A (en) * 2021-05-31 2021-10-01 济南浪潮数据技术有限公司 Network traffic data forwarding method, device, equipment and medium
WO2022083509A1 (en) * 2020-10-19 2022-04-28 华为技术有限公司 Data stream identification method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150578A (en) * 2013-04-09 2013-06-12 山东师范大学 Training method of SVM (Support Vector Machine) classifier based on semi-supervised learning
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering
US20170026391A1 (en) * 2014-07-23 2017-01-26 Saeed Abu-Nimeh System and method for the automated detection and prediction of online threats
CN107729952A (en) * 2017-11-29 2018-02-23 新华三信息安全技术有限公司 A kind of traffic flow classification method and device
CN107846326A (en) * 2017-11-10 2018-03-27 北京邮电大学 A kind of adaptive semi-supervised net flow assorted method, system and equipment
CN108900432A (en) * 2018-07-05 2018-11-27 中山大学 A kind of perception of content method based on network Flow Behavior
CN109067612A (en) * 2018-07-13 2018-12-21 哈尔滨工程大学 A kind of online method for recognizing flux based on incremental clustering algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梅国薇: "Design and Implementation of a Network Traffic Classification System Based on Machine Learning", China Master's Theses Full-text Database *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111447151A (en) * 2019-10-30 2020-07-24 长沙理工大学 Attention mechanism-based time-space characteristic flow classification research method
CN113326946A (en) * 2020-02-29 2021-08-31 华为技术有限公司 Method, device and storage medium for updating application recognition model
WO2021169294A1 (en) * 2020-02-29 2021-09-02 华为技术有限公司 Application recognition model updating method and apparatus, and storage medium
CN111614514A (en) * 2020-04-30 2020-09-01 北京邮电大学 Network traffic identification method and device
CN111614514B (en) * 2020-04-30 2021-09-24 北京邮电大学 Network traffic identification method and device
WO2022083509A1 (en) * 2020-10-19 2022-04-28 华为技术有限公司 Data stream identification method and device
CN112367334A (en) * 2020-11-23 2021-02-12 中国科学院信息工程研究所 Network traffic identification method and device, electronic equipment and storage medium
CN113472654A (en) * 2021-05-31 2021-10-01 济南浪潮数据技术有限公司 Network traffic data forwarding method, device, equipment and medium

Also Published As

Publication number Publication date
CN109873774B (en) 2021-01-01

Similar Documents

Publication Publication Date Title
CN109873774A (en) A kind of network flow identification method and device
CN107846326B (en) Self-adaptive semi-supervised network traffic classification method, system and equipment
Wang et al. App-net: A hybrid neural network for encrypted mobile traffic classification
CN109639739A (en) A kind of anomalous traffic detection method based on autocoder network
CN111475680A (en) Method, device, equipment and storage medium for detecting abnormal high-density subgraph
CN108986907A (en) A kind of tele-medicine based on KNN algorithm divides the method for examining automatically
CN108846097A (en) The interest tags representation method of user, article recommended method and device, equipment
CN109298225B (en) Automatic identification model system and method for abnormal state of voltage measurement data
CN109003091A (en) A kind of risk prevention system processing method, device and equipment
CN114386538A (en) Method for marking wave band characteristics of KPI (Key performance indicator) curve of monitoring index
CN111581445A (en) Graph embedding learning method based on graph elements
Cui et al. Feature extraction and classification method for switchgear faults based on sample entropy and cloud model
Hu et al. A novel SDN-based application-awareness mechanism by using deep learning
Qi et al. Patent analytic citation-based vsm: Challenges and applications
CN114398891B (en) Method for generating KPI curve and marking wave band characteristics based on log keywords
Ullah et al. Adaptive data balancing method using stacking ensemble model and its application to non-technical loss detection in smart grids
Yan et al. TL-CNN-IDS: transfer learning-based intrusion detection system using convolutional neural network
CN117041017B (en) Intelligent operation and maintenance management method and system for data center
CN116842459B (en) Electric energy metering fault diagnosis method and diagnosis terminal based on small sample learning
Qi et al. Incorporating adaptability-related knowledge into support vector machine for case-based design adaptation
Yang Uncertainty prediction method for traffic flow based on K-nearest neighbor algorithm
Xu et al. HTtext: A TextCNN-based pre-silicon detection for hardware Trojans
CN114124437B (en) Encrypted flow identification method based on prototype convolutional network
CN109063735A (en) A kind of classification of insect Design Method based on insect biology parameter
CN105740329B (en) A kind of contents semantic method for digging of unstructured high amount of traffic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant