CN109873774A - Network traffic identification method and device - Google Patents
- Publication number
- CN109873774A (application number CN201910036196.2A)
- Authority
- CN
- China
- Prior art keywords
- sample
- model
- cluster
- recognition model
- output node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
Embodiments of the present invention provide a network traffic identification method and device. The method includes: when reception of a current data stream is complete, extracting the packet header data of the current data stream as a first sample; inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model and using the machine recognition model with the added output node as an online recognition model; and then identifying the category of the next data stream after the current data stream. Compared with the prior art, the embodiments of the present invention change the structure of the machine recognition model and use the restructured machine learning model to identify the category of the next data stream after the current data stream, which improves the real-time performance of identifying the category of a data stream.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to a network traffic identification method and device.
Background art
Traffic is the main carrier of data transmitted in a network, and traffic identification is a key link in network monitoring: only by identifying traffic can different monitoring strategies, such as rejection, optimization, marking, or priority classification, be applied to different flows. Network traffic identification is therefore essential. Network traffic is generally transmitted in the form of data streams, each of which contains multiple data packets. Each data packet includes header data of a fixed number of bytes, from which features of the header data can be obtained, such as the packet inter-arrival time, the duration of the flow, and the mean and variance of packet sizes.
In the prior art, network traffic is identified using machine-learning-based methods. These methods mine features of the packet header data by machine learning, train a machine learning model on those features, and then feed data streams into the trained model, which outputs the category of the online network traffic. The machine learning model is trained as follows: the features of the packet header data of complete data streams are first computed, all or part of those header-data features are selected as samples, and the samples are used to train the model. The resulting machine learning model is an offline model with a fixed internal structure.

Because the network environment changes in real time, the features of data streams also change. A machine learning model with a fixed internal structure identifies the category of online network traffic with poor timeliness, so the prior art cannot identify the category of online network traffic in real time.
Summary of the invention
Embodiments of the present invention aim to provide a network traffic identification method and device that improve the real-time performance of identifying the category of a data stream. The specific technical solutions are as follows:
In a first aspect, an embodiment of the present invention provides a network traffic identification method applied to a server. The method includes:

when reception of a current data stream is complete, extracting the header data of the data packets in the current data stream as a first sample;

inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster, where the semi-supervised model is trained on a first training sample set and encodes the categories of already-obtained header data and the distribution relation of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a class label; and the distribution relation determines whether a sample with a class label lies within the boundary distance of a cluster;

when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model and using the machine recognition model with the added output node as an online recognition model; and

using the online recognition model to identify the category of the next data stream after the current data stream.
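The four steps of the first aspect can be sketched in Python as follows; the `Recognizer` class and the `semi_supervised` callable are hypothetical stand-ins for the trained models described above, not implementations from the patent.

```python
# Sketch of the claimed method (hypothetical stub models, for illustration only).

class Recognizer:
    """Stand-in for the preset machine recognition model."""
    def __init__(self, n_classes):
        self.n_classes = n_classes          # one output node per known class

    def add_output_node(self):
        self.n_classes += 1                 # grow the output layer by one node

    def classify(self, sample):
        return hash(tuple(sample)) % self.n_classes  # placeholder prediction


def handle_stream(header_sample, semi_supervised, recognizer, known_classes):
    """One pass of the claimed method for a completed data stream."""
    label, within_boundary = semi_supervised(header_sample)
    if within_boundary and label not in known_classes:
        # New category: add an output node, yielding the online recognition model.
        recognizer.add_output_node()
        known_classes.add(label)
    return recognizer                        # used to classify the next stream
```

The returned (possibly restructured) recognizer is then applied to the next data stream after the current one.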
Optionally, before the step of extracting the header data of the data packets in the current data stream as the first sample when reception of the current data stream is complete, the method further includes:

receiving the data packets of the current data stream in sequence, and obtaining the five-tuple information of each data packet;

determining whether a database stores the five-tuple information; if the database stores the five-tuple information, saving the header data of the data packet to the storage region at the path corresponding to the five-tuple information; and

if the database does not store the five-tuple information, creating a storage region at the path corresponding to the five-tuple information and saving the header data of the data packet to that storage region.
Optionally, extracting the header data of the data packets in the current data stream as the first sample when reception of the current data stream is complete includes:

determining whether any data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, reception of the data stream is complete, and the header data of the data packets in the data stream is extracted as the first sample.
Optionally, extracting the header data of the data packets in the current data stream as the first sample when reception of the current data stream is complete includes:

when reception of the current data stream is complete, extracting the header data of the data packets in the current data stream; and

encoding the header data of the data packets in the current data stream to obtain a vector of fixed dimension, and using the fixed-dimension vector as the first sample.
Optionally, inputting the first sample into the semi-supervised model and using the semi-supervised model to output the category of the first sample and the result indicating whether the first sample lies within the boundary distance of a cluster includes:

inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the category of the first sample;

computing the local density and minimum distance of each sample in the first training sample set, and taking samples whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as third samples, where the first training sample set is the set of samples used to train the semi-supervised model;

adding the third samples to clusters and taking each third sample as the cluster center of its cluster, where the number of clusters equals the number of third samples and each cluster contains exactly one third sample;

if the distance between the first sample and a third sample exceeds the boundary distance of the cluster, determining that the first sample does not lie within the boundary distance of the cluster; and

if the distance between the first sample and a third sample is less than the boundary distance of the cluster, determining that the first sample lies within the boundary distance of the cluster.
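The center-selection steps above follow the CFSFDP idea: a sample becomes a cluster center when its local density and its minimum distance to any denser sample both exceed thresholds. A minimal sketch, assuming a cutoff-distance density and an index tie-break that the patent does not specify:

```python
import math

def cfsfdp_centers(points, d_c, rho_min, delta_min):
    """Select cluster centers as in CFSFDP: samples whose local density rho
    exceeds rho_min and whose minimum distance delta to any denser sample
    exceeds delta_min."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    # Local density: number of other samples within the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    centers = []
    for i in range(n):
        # "Denser" samples; equal densities are tie-broken by index so that
        # at most one sample per density plateau becomes a center.
        denser = [dist[i][j] for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        delta = min(denser) if denser else max(dist[i])  # minimum distance
        if rho[i] > rho_min and delta > delta_min:
            centers.append(i)
    return centers
```

Each returned index is a "third sample" in the claim's terminology; comparing a new sample's distance to these centers against the boundary distance yields the in/out result.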
Optionally, when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model includes:

when the first sample lies within the boundary distance of a cluster, if a second training sample set contains a second sample of the same category as the first sample, determining that the first sample is not a sample of a new category and updating the parameters of the preset machine recognition model, where the second training sample set is the set of data streams used to train the machine recognition model; and

when the first sample lies within the boundary distance of a cluster, if the second training sample set contains no second sample of the same category as the first sample, determining that the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model.
Optionally, when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model includes:

when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, increasing the parameter dimension of the preset machine recognition model by one, and using the machine recognition model with the increased parameter dimension as the online recognition model.
Optionally, when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model and using the machine recognition model with the added output node as the online recognition model includes:

when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as a basic recognition model;

inputting the first sample into the basic recognition model, and computing the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and biases of the basic recognition model;

updating the output-layer weights and biases of the basic recognition model along the direction of gradient descent using a parameter update formula, where the parameter update formula includes the learning rate multiplied by the partial derivatives of the loss function with respect to the output-layer weights and biases, respectively; and

taking the basic recognition model with the updated weights and biases as the online recognition model.
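The output-node addition and the gradient-descent update of the output-layer weights and biases can be sketched as follows; the softmax output layer, cross-entropy loss, learning rate, and initialization are illustrative assumptions, not the patent's concrete model.

```python
import math
import random

class OutputLayer:
    """Softmax output layer whose node count can grow when a new class appears."""
    def __init__(self, n_in, n_out):
        self.W = [[random.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def add_output_node(self):
        # One extra weight row and one extra bias: one new output node.
        self.W.append([0.0] * len(self.W[0]))
        self.b.append(0.0)

    def forward(self, x):
        z = [sum(w * xi for w, xi in zip(row, x)) + bi
             for row, bi in zip(self.W, self.b)]
        m = max(z)                            # stabilized softmax
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        return [v / s for v in e]

    def sgd_step(self, x, target, lr=0.1):
        """One gradient-descent step on the output-layer weights and biases,
        using the cross-entropy gradient dL/dz_k = p_k - 1[k == target]."""
        p = self.forward(x)
        for k in range(len(self.b)):
            g = p[k] - (1.0 if k == target else 0.0)
            self.b[k] -= lr * g               # bias update: lr times partial derivative
            self.W[k] = [w - lr * g * xi for w, xi in zip(self.W[k], x)]
```

After `add_output_node()`, a few `sgd_step` calls on the first sample adapt the grown layer, yielding the online recognition model.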
Optionally, using the online recognition model to identify the category of the next data stream after the current data stream includes:

when the number of data packets in the next data stream after the current data stream reaches a predetermined number, extracting the packet header data of the next data stream as a second sample; and

inputting the second sample into the online recognition model, and using the online recognition model to output the category of the second sample.
In a second aspect, an embodiment of the present invention provides a network traffic identification device applied to a server. The device includes:

a sample module, configured to extract the header data of the data packets in a current data stream as a first sample when reception of the current data stream is complete;

a supervision module, configured to input the first sample into a semi-supervised model and use the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster, where the semi-supervised model is trained on a first training sample set and encodes the categories of already-obtained header data and the distribution relation of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a class label; and the distribution relation determines whether a sample with a class label lies within the boundary distance of a cluster;

a change module, configured to, when the first sample lies within the boundary distance of a cluster and is a sample of a new category, add an output node to the output nodes of a preset machine recognition model and use the machine recognition model with the added output node as an online recognition model; and

an identification module, configured to use the online recognition model to identify the category of the next data stream after the current data stream.
Optionally, the network traffic identification device provided in an embodiment of the present invention further includes:

a storage unit, configured to receive the data packets of the current data stream in sequence and obtain the five-tuple information of each data packet; determine whether a database stores the five-tuple information; if the database stores the five-tuple information, save the header data of the data packet to the storage region at the path corresponding to the five-tuple information; and if the database does not store the five-tuple information, create a storage region at the path corresponding to the five-tuple information and save the header data of the data packet to that storage region.
Optionally, the sample module is specifically configured to:

determine whether any data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, reception of the data stream is complete, and the header data of the data packets in the data stream is extracted as the first sample.
Optionally, the sample module is specifically configured to:

when reception of the current data stream is complete, extract the header data of the data packets in the current data stream; and

encode the header data of the data packets in the current data stream to obtain a vector of fixed dimension, and use the fixed-dimension vector as the first sample.
Optionally, the supervision module is specifically configured to:

input the first sample into the semi-supervised model, and use the semi-supervised model to output the category of the first sample;

compute the local density and minimum distance of each sample in the first training sample set, and take samples whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as third samples, where the first training sample set is the set of samples used to train the semi-supervised model;

add the third samples to clusters and take each third sample as the cluster center of its cluster, where the number of clusters equals the number of third samples and each cluster contains exactly one third sample;

if the distance between the first sample and a third sample exceeds the boundary distance of the cluster, determine that the first sample does not lie within the boundary distance of the cluster; and

if the distance between the first sample and a third sample is less than the boundary distance of the cluster, determine that the first sample lies within the boundary distance of the cluster.
Optionally, the change module is specifically configured to:

when the first sample lies within the boundary distance of a cluster, if a second training sample set contains a second sample of the same category as the first sample, determine that the first sample is not a sample of a new category and update the parameters of the preset machine recognition model, where the second training sample set is the set of data streams used to train the machine recognition model; and

when the first sample lies within the boundary distance of a cluster, if the second training sample set contains no second sample of the same category as the first sample, determine that the first sample is a sample of a new category, add an output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the online recognition model.
Optionally, the change module is specifically configured to:

when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, increase the parameter dimension of the preset machine recognition model by one, and use the machine recognition model with the increased parameter dimension as the online recognition model.
Optionally, the change module is specifically configured to:

when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, add an output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as a basic recognition model;

input the first sample into the basic recognition model, and compute the partial derivatives of the loss function of the basic recognition model with respect to the output-layer weights and biases of the basic recognition model;

update the output-layer weights and biases of the basic recognition model along the direction of gradient descent using a parameter update formula, where the parameter update formula includes the learning rate multiplied by the partial derivatives of the loss function with respect to the output-layer weights and biases, respectively; and

take the basic recognition model with the updated weights and biases as the online recognition model.
Optionally, the identification module is specifically configured to:

when the number of data packets in the next data stream after the current data stream reaches a predetermined number, extract the packet header data of the next data stream as a second sample; and

input the second sample into the online recognition model, and use the online recognition model to output the category of the second sample.
In another aspect of the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute any of the network traffic identification methods described above.

In another aspect of the present invention, an embodiment of the present invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to execute any of the network traffic identification methods described above.
In the network traffic identification method and device provided by the embodiments of the present invention, when reception of a current data stream is complete, the packet header data of the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster and is a sample of a new category, an output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is used as an online recognition model; and the online recognition model is used to identify the category of the next data stream after the current data stream. Compared with the prior art, the embodiments of the present invention identify the category of the current data stream with the semi-supervised model, judge on that basis whether the current data stream is a sample of a new category, change the structure of the machine recognition model accordingly, use the restructured machine learning model as the online recognition model, and identify the category of the next data stream after the current data stream, thereby improving the real-time performance of identifying the category of a data stream.

Of course, implementing any product or method of the present invention does not necessarily require achieving all of the above advantages at the same time.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below.
Fig. 1 is a flow chart of a network traffic identification method provided in an embodiment of the present invention;
Fig. 2 is a flow chart of storing the current data stream provided in an embodiment of the present invention;
Fig. 3 is a structural diagram of the preset semi-supervised model provided in an embodiment of the present invention;
Fig. 4 is the recurrent network structure of the LSTM encoder provided in an embodiment of the present invention;
Fig. 5 is an internal structure diagram of the LSTM encoder provided in an embodiment of the present invention;
Fig. 6 is a structural diagram of the preset machine recognition model provided in an embodiment of the present invention;
Fig. 7 is a structural diagram of the online recognition model provided in an embodiment of the present invention;
Fig. 8 is a plot of the learning-rate function under different parameters provided in an embodiment of the present invention;
Fig. 9 is a plot provided in an embodiment of the present invention in which the horizontal axis is the number of correct identifications and the vertical axis is the learning-rate function under different parameters;
Fig. 10 is a plot of the learning-rate function under different values of β provided in an embodiment of the present invention;
Fig. 11 is a structural diagram of a network traffic identification device provided in an embodiment of the present invention;
Fig. 12 is a structural diagram of an electronic device provided in an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below with reference to the drawings in the embodiments of the present invention.
In the network traffic identification method and device provided by the embodiments of the present invention, when reception of a current data stream is complete, the packet header data of the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster; when the first sample lies within the boundary distance of a cluster and is a sample of a new category, an output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is used as an online recognition model; and the online recognition model is used to identify the category of the next data stream after the current data stream.
A network traffic identification method provided in an embodiment of the present invention is described first.

As shown in Fig. 1, a network traffic identification method provided in an embodiment of the present invention is applied to a server and includes:

S101: when reception of a current data stream is complete, extracting the header data of the data packets in the current data stream as a first sample.

Before step S101, the network traffic identification method provided in an embodiment of the present invention further includes storing the current data stream:
As shown in Fig. 2, storing the current data stream includes:

S201: receiving the data packets of the current data stream in sequence, and obtaining the five-tuple information of each data packet, where the five-tuple information is: source IP address, destination IP address, source port number, destination port number, and transport layer protocol.

S202: determining whether a database stores the five-tuple information; if the database stores the five-tuple information, saving the header data of the data packet to the storage region at the path corresponding to the five-tuple information.

S203: if the database does not store the five-tuple information, creating a storage region at the path corresponding to the five-tuple information, and saving the header data of the data packet to that storage region.

By storing the header data of each data packet in the storage region corresponding to its five-tuple information, the embodiment of the present invention can improve the efficiency of looking up the header data of the data packets of the same data stream.
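Steps S201 to S203 can be sketched as follows, assuming a hypothetical directory-per-five-tuple layout; the patent does not fix the path format or file naming.

```python
import os

def store_header(root, five_tuple, header_bytes):
    """Save a packet's header under a storage region derived from its five-tuple
    (source IP, destination IP, source port, destination port, protocol)."""
    src_ip, dst_ip, sport, dport, proto = five_tuple
    region = os.path.join(root, f"{src_ip}_{dst_ip}_{sport}_{dport}_{proto}")
    os.makedirs(region, exist_ok=True)   # create the region if not yet stored
    with open(os.path.join(region, "headers.bin"), "ab") as f:
        f.write(header_bytes)            # append this packet's header data
    return region
```

Packets of the same stream share a five-tuple and therefore land in the same region, which is what makes the later per-stream lookup cheap.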
To improve the real-time performance of identifying the category of a data stream, the first sample in S101 can be obtained using at least one of the following embodiments.

In one possible embodiment, whether each data packet of the current data stream contains an end identifier is determined; if a data packet contains the end identifier, reception of the data stream is complete, and the header data of the data packets in the data stream is extracted as the first sample.

It should be understood that each data packet contains an end identifier; when the transmission of a data stream ends, the end identifier of the last data packet of that stream changes. By checking whether a data packet in the data stream contains the end identifier, this embodiment can quickly determine whether reception of a data stream is complete.
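For TCP traffic, one plausible end identifier is the FIN flag in the TCP header. The sketch below makes that assumption, along with an Ethernet/IPv4 framing with no IP options; the patent itself does not specify how the end identifier is encoded.

```python
def tcp_fin_set(frame):
    """Return True if the TCP FIN flag is set in a raw Ethernet/IPv4/TCP frame.
    Assumes a 14-byte Ethernet header and an IPv4 header without extra options."""
    ihl = (frame[14] & 0x0F) * 4          # IPv4 header length in bytes
    flags = frame[14 + ihl + 13]          # TCP flags byte (offset 13 in TCP header)
    return bool(flags & 0x01)             # FIN is the least-significant flag bit
```

A receiver could treat the first packet for which this returns True as the end of the stream and then hand the accumulated headers to S101.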
In one possible embodiment, the first sample is obtained by the following steps:

Step 1: when reception of the current data stream is complete, extracting the header data of the data packets in the current data stream.

Step 2: encoding the header data of the data packets in the current data stream to obtain a vector of fixed dimension, and using the fixed-dimension vector as the first sample.

It should be understood that a network flow consists of a series of data packets, and the header format of each data packet is highly regular: each contains header data of a fixed number of bytes, with different field values. For example, a data packet of the TCP protocol has 54 bytes of header data excluding optional fields, comprising a 14-byte frame header, a 20-byte IP header, and a 20-byte TCP header. If a flow contains p data packets and each header contains q bytes, then converting each byte into an unsigned integer and taking each header as one row yields a fixed-dimension vector X ∈ R^(p×q) whose elements are integers in [0, 255]. This embodiment therefore encodes the header data to obtain a vector of fixed dimension p × q and uses it as the first sample, which can improve the efficiency of identifying the category of the first sample.
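The p × q encoding described above can be sketched as follows; truncating or zero-padding each header to q bytes is an illustrative assumption for handling headers of unequal length.

```python
def encode_stream(headers, q=54):
    """Encode a stream of p packet headers as a p-by-q matrix of unsigned
    integers in [0, 255], truncating or zero-padding each header to q bytes."""
    matrix = []
    for h in headers:
        row = list(h[:q])                 # bytes -> unsigned ints in [0, 255]
        row += [0] * (q - len(row))       # pad short headers with zeros
        matrix.append(row)
    return matrix
```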
S102: inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster.

Here, the semi-supervised model is trained on a first training sample set and encodes the categories of already-obtained header data and the distribution relation of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a class label; and the distribution relation determines whether a sample with a class label lies within the boundary distance of a cluster.

It should be understood that the first training sample set used to train the semi-supervised model contains some samples with class labels, where a sample with a class label is the header data of a data stream whose category has been determined; the remaining samples have no class labels. The samples of the first training sample set are divided into several clusters; once the distribution of the samples of the first training sample set within each cluster is determined, the distribution of both the labeled and unlabeled samples in each cluster is determined. The semi-supervised model trained on the labeled samples therefore yields, for a sample carrying a class label, the result of whether it lies within the boundary distance of a cluster.
In one possible embodiment, the semi-supervised model can be obtained as follows.

Step 1: training a preset semi-supervised model on the samples of the first training sample set to obtain the trained semi-supervised model.

As shown in Fig. 3, the preset semi-supervised model consists of an LSTM (Long Short-Term Memory) encoder, a softmax layer, and a CFSFDP (Clustering by Fast Search and Find of Density Peaks) layer. The samples carrying class labels are input into the LSTM encoder, which encodes data streams of variable length into vectors of fixed dimension. Because the fixed-dimension vector is the output over the entire data stream sequence, it can represent the features of all the data packets of the whole data stream. The softmax layer maps the fixed-dimension vector to a fixed category and can output the category of the header data. The softmax layer is then removed from the preset semi-supervised model, the samples with and without class labels are input into the LSTM encoder, and the output of the LSTM encoder is input into the CFSFDP clustering layer, which mainly uses the CFSFDP algorithm to determine whether a fixed-dimension vector is the cluster center of a cluster, as well as the category of the sample.
As shown in Figs. 4 and 5, the LSTM encoder encodes a data stream into a fixed-dimension vector as follows.

Given a data stream {x_0, x_1, …, x_{t-1}, x_t}, the stream is input in sequence into the recurrent network structure of Fig. 4. Each input x_t produces an output h_t, and the current state is passed on to the next input, so the output h_t contains both the information of x_t and the information of x_0 through x_{t-1}.

The internal structure of the LSTM encoder is shown in Fig. 5. The encoder input receives the input x_t, the previous output h_{t-1}, and the encoder state C_{t-1} produced after the previous input x_{t-1}. Suppose the output h_t of each step has dimension 128 and a data stream contains n data packets in total. Taking all the data packets of the whole stream as one sequence, with the header data of each data packet as x_t, the output for the last element x_n of the stream is h_n.

Here, x_t denotes one piece of header data; t denotes the sequence number of the header data; n denotes the total number of data packets in a data stream; h_t denotes the output of the LSTM encoder when its input is x_t; and h_n denotes the last output of the LSTM encoder after the data stream has been input into it, that is, the encoded fixed-dimension vector.
Step 2: input the test samples in the test sample set into the trained semi-supervised model, and use the trained semi-supervised model to output the class of each test sample.

It should be understood that the class labels of the packet header data within one data flow are identical; a test sample is the header data of a whole data flow, and its class label identifies the class of that test sample.

Step 3: based on the class output by the trained semi-supervised model for each test sample in the test sample set and the labelled class of that test sample, determine whether the trained semi-supervised model meets the test index.

The test index is: the accuracy reaches an accuracy threshold, the precision reaches a precision threshold, the recall reaches a recall threshold, the F1 score reaches an F1 score threshold, and/or the F_β score reaches an F_β score threshold.

The accuracy threshold, precision threshold, recall threshold, F1 score threshold and F_β score threshold are preset values; the accuracy, precision, recall, F1 score and F_β score are calculated with the standard accuracy, precision, recall, F1 score and F_β score formulas respectively.
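The test-index metrics above can be checked with simple counting formulas. A minimal sketch, computed per class in a one-vs-rest fashion (the helper names and the example labels are illustrative assumptions; the thresholds themselves are preset values not given in the text):

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, precision, recall, f1

def f_beta(precision, recall, beta):
    """F_beta score; beta > 1 weights recall more heavily, beta < 1 precision."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

# Toy check: 6 test samples, 2 true positives, 1 false positive, 1 false negative.
acc, p, r, f1 = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0], positive=1)
assert abs(p - 2 / 3) < 1e-12 and abs(r - 2 / 3) < 1e-12
```

With β = 1 the F_β score reduces to the F1 score, which is a quick sanity check on the formula.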
Step 4: if the trained semi-supervised model does not meet the test index, update the parameters of the LSTM encoder in the trained semi-supervised model until the trained semi-supervised model meets the test index.
The encoder output formulas are as follows:

f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
C_t = f_t·C_{t-1} + i_t·C̃_t
o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
h_t = o_t·tanh(C_t)

The internal parameters of the LSTM encoder are W_* and b_*, where W_* denotes a weight parameter and b_* an offset parameter. f_t is the activation value of the forget gate at the current step; σ(·) is the Sigmoid function; W_f is the weight of the forget gate; h_{t-1} is the output of the previous step; x_t is the input of the current step; b_f is the bias of the forget gate; i_t is the activation value of the input gate at the current step; W_i is the weight of the input gate; b_i is the bias of the input gate; C̃_t is the intermediate (candidate) state of the current step; W_C is the state weight; b_C is the state bias; C_{t-1} is the state of the previous step; C_t is the state of the current step; o_t is the activation value of the output gate at the current step; W_o is the output gate weight; b_o is the output gate bias; t is the serial number of the data packet; and C denotes the state.

If the trained semi-supervised model does not meet the test index, the parameters W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o of the LSTM encoder in the trained semi-supervised model are updated.
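The gate equations above can be run directly. Below is a minimal NumPy sketch of the encoder loop that maps a variable-length packet sequence to the fixed-dimension vector h_n; the random parameter initialisation, the 8-dimensional toy header vectors, and the helper names are illustrative assumptions, not from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(packets, params):
    """Run the LSTM recurrence over one flow and return the final
    hidden state h_n, i.e. the fixed-dimension encoding of the flow."""
    W, b = params["W"], params["b"]
    hidden = b["f"].shape[0]
    h = np.zeros(hidden)            # h_0
    C = np.zeros(hidden)            # C_0
    for x in packets:               # x_t: one packet's header vector
        z = np.concatenate([h, x])  # [h_{t-1}, x_t]
        f = sigmoid(W["f"] @ z + b["f"])        # forget gate f_t
        i = sigmoid(W["i"] @ z + b["i"])        # input gate i_t
        C_tilde = np.tanh(W["C"] @ z + b["C"])  # candidate state
        C = f * C + i * C_tilde                 # new cell state C_t
        o = sigmoid(W["o"] @ z + b["o"])        # output gate o_t
        h = o * np.tanh(C)                      # h_t
    return h                        # h_n

def init_params(input_dim, hidden, seed=0):
    rng = np.random.default_rng(seed)
    W = {k: rng.normal(0.0, 0.1, (hidden, hidden + input_dim)) for k in "fiCo"}
    b = {k: np.zeros(hidden) for k in "fiCo"}
    return {"W": W, "b": b}

# Flows of different lengths are encoded into vectors of the same dimension.
params = init_params(input_dim=8, hidden=128)
flow_a = [np.ones(8) for _ in range(5)]
flow_b = [np.ones(8) for _ in range(20)]
assert lstm_encode(flow_a, params).shape == (128,)
assert lstm_encode(flow_b, params).shape == (128,)
```

This is the property the embodiment relies on: however many packets a flow contains, h_n always has the preset dimension (128 in the example above).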
Step 5: determine the trained semi-supervised model that meets the test index as the semi-supervised model.

In this embodiment, determining the trained semi-supervised model that meets the test index as the preset semi-supervised model improves the accuracy with which the class of the first sample is determined.
S103: when the result indicates that the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new class, add an output node to the output nodes of the preset machine recognition model, and take the machine recognition model with the added output node as the online recognition model;

S104: use the online recognition model to identify the class of the next data flow after the current data flow.

Compared with the prior art, the embodiment of the present invention identifies the class of the current data flow through the semi-supervised model, judges on the basis of the identified class whether the current data flow is a sample of a new class, changes the structure of the machine recognition model accordingly, and uses the restructured machine learning model as the online recognition model to identify the class of the next data flow after the current data flow. This improves the ability of the online recognition model to adapt to the network environment and improves the real-time performance of identifying the class of data flows.
To improve the real-time performance of identifying the class of data flows, the above S102 may obtain the class of the first sample and the result of whether the first sample is located within the boundary distance of a cluster through at least one of the following embodiments.

In one possible embodiment, the class of the first sample and the result of whether the first sample is located within the boundary distance of a cluster are obtained as follows:

Step 1: input the first sample into the semi-supervised model and use the semi-supervised model to output the class of the first sample;

Step 2: calculate the local density and minimum distance of each sample in the first training sample set, and take the samples whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as third samples; the first training sample set is the set of samples used to train the semi-supervised model;

Step 3: add each third sample to a cluster and determine the third sample as the cluster center of that cluster; the number of clusters is the same as the number of third samples, and each cluster contains exactly one third sample;

Step 4: if the distance between the first sample and a third sample exceeds the boundary distance of the cluster, determine that the first sample is not located within the boundary distance of the cluster.

Here, the region within the boundary distance of a cluster is the spherical region whose center is the cluster center of the cluster and whose radius is the boundary distance.

Step 5: if the distance between the first sample and a third sample is less than the boundary distance of the cluster, determine that the first sample is located within the boundary distance of the cluster.
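Steps 4 and 5 amount to a nearest-cluster-center distance test against the spherical boundary region. A minimal sketch, with the function name, array layout and toy coordinates as illustrative assumptions:

```python
import numpy as np

def within_boundary(sample, centers, boundary):
    """Return whether the sample falls inside the sphere of radius
    `boundary` around its nearest cluster center, plus that center's index."""
    d = np.linalg.norm(centers - sample, axis=1)  # distance to each center
    k = int(d.argmin())                           # nearest cluster center
    return bool(d[k] <= boundary), k

# Two cluster centers; a point near the first is inside its boundary,
# a point midway between the clusters is not inside any boundary.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
inside, k = within_boundary(np.array([0.5, 0.5]), centers, boundary=1.0)
assert inside and k == 0
outside, _ = within_boundary(np.array([5.0, 5.0]), centers, boundary=1.0)
assert not outside
```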
First, assume the first training sample set is S = {x_j | j ∈ I_S}, where I_S = {1, 2, …, n} is an integer set and i and j take positive integers. Let d_ij denote the distance between sample x_i and sample x_j. For each sample x_i, calculate its local density φ_i and minimum distance θ_i, giving the local density set Φ = {φ_i | i ∈ I_S} and the minimum distance set Θ = {θ_i | i ∈ I_S}.

The local density φ_i is calculated as:

φ_i = Σ_{j∈I_S, j≠i} χ(d_ij − d_c)   or   φ_i = Σ_{j∈I_S, j≠i} exp(−(d_ij/d_c)²)

where d_c is the truncation distance, a preset value (the boundary distance is m times the truncation distance, m being a preset value), and χ(·) is the jump function: χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, x being the input of the jump function.

From the first formula for φ_i it can be seen that the relative size of φ_i is determined by the number of samples whose distance to x_i is less than d_c: the more samples within d_c, the larger φ_i. The second (Gaussian-kernel) formula turns the discrete values of φ_i into continuous values, which improves the accuracy of the local density calculation.

The minimum distance θ_i is calculated as:

θ_i = min_{j: φ_j > φ_i} d_ij, and for the sample of highest local density, θ_i = max_j d_ij.

The samples whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold are taken as third samples.

The distance d_ij between a third sample and the first sample may be calculated with formulas such as the Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, or cosine similarity distance.
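The local-density and minimum-distance computation can be sketched as follows, using the Gaussian-kernel variant of φ_i and Euclidean d_ij. The helper name `cfsfdp_stats` and the toy two-cluster data are assumptions for illustration:

```python
import numpy as np

def cfsfdp_stats(X, d_c):
    """Local density phi_i (Gaussian kernel) and minimum distance theta_i
    for each sample, as in the CFSFDP algorithm."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise d_ij
    phi = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0  # subtract the self term
    theta = np.empty(n)
    for i in range(n):
        denser = np.where(phi > phi[i])[0]
        # theta_i: distance to the nearest sample of strictly higher density;
        # for a density peak with no denser sample, the maximum distance.
        theta[i] = d[i, denser].min() if denser.size else d[i].max()
    return phi, theta

# Two tight clusters far apart: exactly one density peak per cluster
# should end up with a large theta (a cluster-center candidate).
A = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.12], [0.15, 0.05]])
X = np.vstack([A, A + 10.0])
phi, theta = cfsfdp_stats(X, d_c=1.0)
assert (theta > 5).sum() == 2
```

Samples with both large φ_i and large θ_i are then selected as third samples (cluster centers) by thresholding, as described in Step 2 above.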
In this embodiment, by calculating the local density and minimum distance of each sample in the first training sample set, and determining that the first sample is located within the boundary distance of a cluster when its distance to a third sample is less than the boundary distance, the efficiency of determining whether the first sample is located within the boundary distance of a cluster is improved.

After the step of obtaining, through the above embodiment, the class of the first sample and the result of whether the first sample is located within the boundary distance of a cluster, the network flow identification method provided in the embodiment of the present invention further includes: updating the cluster centers of the clusters.
In one possible embodiment, the cluster centers are updated as follows:

Step 1: based on the local density and minimum distance of each sample in the first training sample set, take the samples whose local density exceeds the density threshold and whose minimum distance is less than the distance threshold as fourth samples;

Step 2: add each fourth sample to the cluster of the cluster center nearest to it;

Step 3: if the first sample is located within the boundary distance of a cluster, add the first sample to the cluster of the cluster center nearest to it;

Step 4: calculate the local density and minimum distance of the samples in each cluster; for each cluster, determine the sample whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold as the cluster center of that cluster, and take the cluster with the updated center as the updated cluster.

In this embodiment, updating the cluster centers improves the accuracy of determining whether the first sample is located within the boundary distance of a cluster.
In another possible embodiment, the class of the first sample and the result of whether the first sample is located within the boundary distance of a cluster are obtained as follows:

Step 1: input the first sample into the semi-supervised model, and use the output nodes of the semi-supervised model to output the probabilities that the first sample belongs to the different classes, together with the result of whether the first sample is located within the boundary distance of a cluster; each output node corresponds to one class to which the first sample may belong.

For example, the g-th output node outputs the probability that the first sample belongs to the g-th class.

Step 2: among the probabilities of the different classes output by the output nodes, select the class with the highest probability as the class of the first sample.

In this embodiment, selecting the class with the highest probability as the class of the first sample improves the accuracy of determining the class of the first sample.
To improve the real-time performance of identifying the class of data flows, the online recognition model in the above S103 may be obtained through at least one of the following embodiments.

In one possible embodiment, the online recognition model is obtained as follows:

Step 1: in the case where the first sample is located within the boundary distance of a cluster, if the second training sample set contains a second sample of the same class as the first sample, determine that the first sample is not a sample of a new class, and update the parameters of the preset machine recognition model; the second training sample set is the set of data flows used to train the machine recognition model;

Step 2: in the case where the first sample is located within the boundary distance of a cluster, if the second training sample set contains no second sample of the same class as the first sample, determine that the first sample is a sample of a new class, add an output node to the output nodes of the preset machine recognition model, and take the machine recognition model with the added output node as the online recognition model.

With reference to Fig. 6 and Fig. 7, the preset machine recognition model M_o uses a CNN (Convolutional Neural Networks). The output layer of M_o contains K nodes and the layer above the output layer contains J nodes; the lines between the output layer and the layer above it represent the parameters of the output layer. In the case where the first sample is located within the boundary distance of a cluster: if the second training sample set contains a second sample of the same class as the first sample, the first sample is determined not to be a sample of a new class and the parameters between the output layer of the preset machine recognition model and the layer above it are updated; if the second training sample set contains no second sample of the same class as the first sample, the first sample is determined to be a sample of a new class, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output node is taken as the online recognition model.

In another possible embodiment, in the case where the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new class, the parameter dimension of the preset machine recognition model is increased by one, and the machine recognition model with the increased parameter dimension is taken as the online recognition model.
When the first sample is a sample of a new class, that is, X ∈ C_{K+1}, then for the parameterized preset machine recognition model, adding an output node means increasing the parameter dimension of the output layer by one: W ∈ R^{J×K} → W ∈ R^{J×(K+1)}, b ∈ R^K → b ∈ R^{K+1}, ρ ∈ R^K → ρ ∈ R^{K+1}, where W is the set of weights, R is the set of real numbers, b is the set of biases, ρ is the set of proficiencies, → denotes assignment, and K is the total number of output nodes.
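Growing W, b and ρ by one output node can be sketched directly. The small random initialisation of the new weight column is an assumption for illustration; the patent does not specify how the new parameters are initialised, only that the new class's proficiency starts at 0:

```python
import numpy as np

def add_output_node(W, b, rho, rng=None):
    """Grow the output layer by one node for a new traffic class:
    W in R^{J x K} -> R^{J x (K+1)}, b and rho in R^K -> R^{K+1}."""
    rng = rng or np.random.default_rng(0)
    J, K = W.shape
    new_col = rng.normal(0.0, 0.01, (J, 1))  # assumed small random init
    W = np.hstack([W, new_col])
    b = np.append(b, 0.0)
    rho = np.append(rho, 0.0)                # new class's proficiency starts at 0
    return W, b, rho

J, K = 4, 6
W, b, rho = add_output_node(np.zeros((J, K)), np.zeros(K), np.zeros(K))
assert W.shape == (J, K + 1) and b.shape == (K + 1,) and rho.shape == (K + 1,)
```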
In yet another possible embodiment, the online recognition model is obtained as follows:

Step 1: in the case where the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new class, add an output node to the output nodes of the preset machine recognition model, and take the machine recognition model with the added output node as the basic recognition model;

Step 2: input the first sample into the basic recognition model, and calculate the partial derivatives of the loss function of the basic recognition model with respect to the weights and biases of the output layer of the basic recognition model;

Step 3: in the direction of gradient descent, update the weights and biases of the output layer of the basic recognition model with a parameter update formula; the parameter update formula multiplies the proficiency increment by the partial derivatives of the loss function with respect to the output layer weights and biases of the basic recognition model;

Step 4: determine the basic recognition model with the updated weights and biases as the online recognition model.
Assume the output of the layer above the output layer of the basic recognition model is F = {f_j} ∈ R^J, the output of the output layer is Y = {y_k} ∈ R^K, the weights are W = {w_jk} ∈ R^{J×K}, and the biases are b = {b_k} ∈ R^K. The output layer of the basic recognition model is a softmax layer, so the cross-entropy loss function of the basic recognition model can be derived as:

L = −Σ_k t_k·log y_k

where T = {t_k} ∈ R^K is the one-hot encoding of the data flow class, y_k = g(z_k) is the activation value of the k-th output node, and g(·) is the softmax function:

y_k = g(z_k) = e^{z_k} / Σ_{k'} e^{z_{k'}},  with z_k = Σ_j w_jk·f_j + b_k

Here f_j is the feature activation value of the j-th node of the layer above, b_k is the bias of the k-th node, R^K denotes dimension K, t_k is the k-th bit of the one-hot encoding, z_k is the activation value of the k-th node, w_jk is the weight between node j of the layer above the output layer and output node k, j and k are serial numbers taking positive integers (k also indexes the bits of the one-hot encoding), K is the total number of output nodes, and J is the number of nodes of the layer above the output layer. Taking partial derivatives of the above loss function gives the increments of the weights and biases:

∂L/∂w_jk = (y_k − t_k)·f_j,  ∂L/∂b_k = y_k − t_k = y_k − I{k = label}

where I{·} is the indicator function: I{condition} = 1 if the condition holds and 0 otherwise.
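The softmax cross-entropy loss and its partial derivatives can be checked numerically. A minimal sketch (helper names and the toy feature vector are assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def output_layer_grads(F, W, b, label):
    """Loss L = -sum_k t_k log y_k and its gradients for a softmax
    output layer: dL/dw_jk = (y_k - t_k) f_j, dL/db_k = y_k - t_k."""
    z = W.T @ F + b              # z_k = sum_j w_jk f_j + b_k
    y = softmax(z)
    t = np.zeros_like(y)
    t[label] = 1.0               # one-hot target, t_k = I{k == label}
    dz = y - t
    dW = np.outer(F, dz)         # shape J x K
    db = dz
    loss = -np.log(y[label])
    return loss, dW, db

# With zero weights the softmax is uniform over K = 4 classes,
# so the loss is -log(1/4) = log 4.
F = np.array([1.0, 2.0, -1.0])
loss, dW, db = output_layer_grads(F, np.zeros((3, 4)), np.zeros(4), label=2)
assert abs(loss - np.log(4.0)) < 1e-12
```

A useful invariant of the derivation: since Σ_k y_k = Σ_k t_k = 1, the bias gradients always sum to zero, which the test below checks.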
It can be understood that, since the basic recognition model is trained with samples of the new class each time the online recognition model is obtained, the parameters of the basic recognition model continually adapt to the samples of the new class and learn their features, while the samples of the old classes no longer participate in training. This unconstrained training method leaves the online learning model with a serious problem: when the internal parameters of the basic recognition model change, the features already learnt are affected, and the ability of the earlier basic recognition model may even be destroyed completely, so that serious errors occur in the identification of samples that are not of the new class — the "catastrophic forgetting" problem.

To solve the catastrophic forgetting problem, the basic recognition model must strike a balance between learning the samples of the new class and retaining those of the old classes. When the stability of the basic recognition model is high, the model is more inclined to retain the features of the samples of the old classes, which weakens its ability to learn the features of the new class samples; conversely, when the plasticity of the basic recognition model is high, it has a stronger ability to learn the new class samples but more easily forgets the features of the samples of the old classes. The key to making the trained online recognition model adapt to the network environment is therefore to obtain an appropriate trade-off between stability and plasticity.
To make the stability-plasticity of the online recognition model controllable, the embodiment of the present invention proposes a proficiency mechanism. The mechanism introduces an additional group of parameters ρ = {ρ_k} ∈ R^K, where ρ is the proficiency set and ρ_k indicates the proficiency of the basic recognition model for the class output by the k-th node. Proficiency measures the recognition capability of the online recognition model for the samples of each class.

Here ρ_k ∈ [0, 1), with initial value 0, meaning that in the initial situation the classification proficiency of the basic recognition model for every class is 0. For proficiency to influence the stability-controllability of the model, the proficiency ρ should have the following properties:

1) Proficiency is influenced by the result of identifying the sample class. The more often a sample class is identified correctly, the higher the corresponding proficiency; the more often it is identified incorrectly, the lower the corresponding proficiency.

2) Proficiency influences its own change. When proficiency is low, further increasing or decreasing it is easy, i.e. proficiency itself increases or decreases quickly; when proficiency is high, further increasing or decreasing it is harder, i.e. proficiency increases or decreases more slowly.

3) Proficiency influences the difficulty of learning or forgetting knowledge. When proficiency is low, acquiring the features of new class samples or forgetting the features of old sample classes is relatively easy, i.e. the model parameters update faster; conversely, when proficiency is high, learning or forgetting is harder, i.e. the model parameters update more slowly.

For example, if X ∈ C_k and Y ∈ C_k, the classification of the k-th class corresponding to sample X is correct, so the corresponding proficiency ρ_k increases; if X ∈ C_i but Y ∈ C_j, the i-th class has been misidentified as the j-th class, so the corresponding ρ_i and ρ_j decrease.
With reference to Fig. 8, to realize property 2 and property 3, the embodiment of the present invention proposes a proficiency function prof(ρ_k) for calculating the increment of proficiency. The function is controlled by two parameters, α and β, which set its overall trend; Fig. 8 shows how the proficiency function changes under different α and β. When ρ_k is small, the proficiency increment prof(ρ_k) is large; as ρ_k increases, prof(ρ_k) and its derivative both decrease gradually; when ρ_k increases to its limit value 1, the proficiency increment prof(ρ_k) is 0 and the proficiency ρ_k is no longer updated.

The proficiency ρ_k is updated as: ρ_k ← ρ_k ± prof(ρ_k).

With reference to Fig. 9, as proficiency increases, the proficiency increment prof(ρ_k) decreases gradually, i.e. the amplitude of the parameter updates of the basic recognition model decreases gradually. Fig. 9 shows how the proficiency increment prof(ρ_k) changes under different parameters; it can be seen that by adjusting the parameters α and β of the proficiency increment, the update speed of the basic recognition model can be controlled.
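A function with these properties can be sketched as follows. The concrete form α·(1 − ρ_k)^β is an assumption chosen for illustration (the patent defines its own curve through α and β in Fig. 8): it gives prof(0) = α, prof(1) = 0, a decreasing derivative, and a faster decline for larger β:

```python
def prof(rho, alpha=1.0, beta=2.0):
    """Illustrative proficiency-increment function (ASSUMED form
    alpha * (1 - rho)**beta, not the patent's exact curve)."""
    return alpha * (1.0 - rho) ** beta

def update_proficiency(rho, correct, alpha=0.3, beta=2.0):
    """rho_k <- rho_k +/- prof(rho_k), kept inside [0, 1)."""
    step = prof(rho, alpha, beta)
    rho = rho + step if correct else rho - step
    return min(max(rho, 0.0), 1.0 - 1e-9)

# The increment shrinks as proficiency grows, and vanishes at rho = 1.
assert prof(0.2) > prof(0.8)
assert prof(1.0) == 0.0
```

With α = 1 this form also satisfies prof(0) = 1, the condition required below when prof(ρ_k) is used as the coefficient of the weight updates.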
Therefore, when the weights and biases of the output layer of the basic recognition model are updated with the parameter update formula, the proficiency increment is included:

Δw_jk = prof(ρ_k)·∂L/∂w_jk,  Δb_k = prof(ρ_k)·∂L/∂b_k

where Δw_jk is the increment of the weight w_jk and Δb_k is the increment of the bias b_k. When the proficiency function prof(ρ_k) is used to update the weights W and biases b, it must satisfy prof(0) = 1, so that when the proficiency ρ_k = 0 the proficiency function does not affect the model update. As proficiency increases, the increment coefficient prof(ρ_k) decreases gradually, i.e. the amplitude of the parameter updates of the basic recognition model decreases gradually; when ρ_k → 1, prof(ρ_k) → 0, i.e. the update amplitude of the basic recognition model tends to 0.

With reference to Figure 10, which shows the change of prof(ρ_k) for different values of the parameter β: the larger β is, the faster prof(ρ_k) declines.
From the above analysis it follows that, by introducing the proficiency set ρ and the proficiency function prof(ρ_k), the two parameters α and β of the proficiency function can control the ability and speed with which the basic recognition model updates its parameters, thereby realizing the trade-off between the stability and plasticity of the online recognition model and solving the "catastrophic forgetting" problem.

To illustrate the one-hot encoding of data flow classes, assume the samples of the first training set are divided into 6 classes, so the output layer of the basic recognition model has 6 nodes. Each sample class is assigned a number. The sample classes of the first training set are "RDP (Remote Desktop Protocol)", "BitTorrent", "Web (World Wide Web)", "SSH (Secure Shell)", "eDonkey (eDonkey Network)" and "NTP (Network Time Protocol)", numbered 0, 1, 2, 3, 4, 5 respectively, with corresponding one-hot encodings 100000, 010000, 001000, 000100, 000010, 000001. Assume one sample in the first training set has label 0, i.e. the number of the sample label is 0 and the class of the sample is "RDP". When the basic recognition model identifies the class of this sample, the outputs of the 1st to 6th nodes are 0.5, 0.1, 0.1, 0.1, 0.1, 0.1; the basic recognition model thus assigns the highest probability to encoding 0, and its loss function is:

L = −1·log 0.5 + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1) = −log 0.5.
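This example can be verified directly: because the label is one-hot, only the labelled class contributes to the cross entropy, giving L = −log 0.5 = log 2 ≈ 0.693:

```python
import math

# Node outputs for a sample whose one-hot label is 100000 ("RDP"):
y = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
t = [1, 0, 0, 0, 0, 0]

# Cross entropy L = -sum_k t_k * log y_k; only the labelled class contributes.
loss = -sum(tk * math.log(yk) for tk, yk in zip(t, y))
assert abs(loss - math.log(2)) < 1e-12
```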
To improve the real-time performance of identifying the class of data flows, the class of the packet header data in the next data flow after the current data flow may be identified in the above S104 through at least one of the following embodiments.

In one possible embodiment, the class of the packet header data in the next data flow after the current data flow is identified as follows:

Step 1: when the number of data packets in the next data flow after the current data flow reaches a predetermined number, extract the packet header data of that data flow as a second sample;

Step 2: input the second sample into the online recognition model, and use the online recognition model to output the class of the second sample.
A network flow identification device provided in an embodiment of the present invention is described next.

As shown in Fig. 11, a network flow identification device provided in an embodiment of the present invention is applied to a server, and the device includes:

a sample module 1101, configured to extract the packet header data of the data packets in the current data flow as the first sample when the current data flow has been received completely;

a supervision module 1102, configured to input the first sample into the semi-supervised model, and use the semi-supervised model to output the class of the first sample and the result of whether the first sample is located within the boundary distance of a cluster; the semi-supervised model is trained with the first training sample set and contains the classes of the obtained packet header data and the distribution relations of the remaining samples in the first training sample set; the first training sample set contains at least one sample with a class label; the distribution relations decide the result of whether a sample with a class label is located within the boundary distance of a cluster;

a change module 1103, configured to, in the case where the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new class, add an output node to the output nodes of the preset machine recognition model, and take the machine recognition model with the added output node as the online recognition model;

an identification module 1104, configured to use the online recognition model to identify the class of the next data flow after the current data flow.

Optionally, the network flow identification device provided in the embodiment of the present invention further includes:

a storage unit, configured to receive the data packets of the current data flow in sequence and obtain the five-tuple information of each data packet; to judge whether the database stores the five-tuple information, and, if the database stores the five-tuple information, save the packet header data of the data packet to the storage region of the path corresponding to the five-tuple information; and, if the database does not store the five-tuple information, create a storage region for the path corresponding to the five-tuple information and save the packet header data of the data packet to that storage region.

The sample module is specifically configured to:

judge whether each data packet of the current data flow contains an end identifier; if a data packet contains an end identifier, the data flow has been received completely, and the packet header data of the data packets in the data flow is extracted as the first sample.
The sample module is specifically configured to:

extract the packet header data of the data packets in the current data flow when the current data flow has been received completely; and

encode the packet header data of the data packets in the current data flow to obtain a fixed-dimension vector, and take the fixed-dimension vector as the first sample.

The supervision module is specifically configured to:

input the first sample into the semi-supervised model and use the semi-supervised model to output the class of the first sample;

calculate the local density and minimum distance of each sample in the first training sample set, and take the samples whose local density exceeds the density threshold and whose minimum distance exceeds the distance threshold as third samples, the first training sample set being the set of samples used to train the semi-supervised model;

add each third sample to a cluster and determine the third sample as the cluster center of that cluster, the number of clusters being the same as the number of third samples and each cluster containing exactly one third sample;

if the distance between the first sample and a third sample exceeds the boundary distance of the cluster, determine that the first sample is not located within the boundary distance of the cluster; and

if the distance between the first sample and a third sample is less than the boundary distance of the cluster, determine that the first sample is located within the boundary distance of the cluster.
The change module is specifically configured to:

in the case where the first sample is located within the boundary distance of a cluster, if the second training sample set contains a second sample of the same class as the first sample, determine that the first sample is not a sample of a new class and update the parameters of the preset machine recognition model, the second training sample set being the set of data flows used to train the machine recognition model; and

in the case where the first sample is located within the boundary distance of a cluster, if the second training sample set contains no second sample of the same class as the first sample, determine that the first sample is a sample of a new class, add an output node to the output nodes of the preset machine recognition model, and take the machine recognition model with the added output node as the online recognition model.

The change module is specifically configured to:

in the case where the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new class, increase the parameter dimension of the preset machine recognition model by one, and take the machine recognition model with the increased parameter dimension as the online recognition model.

The change module is specifically configured to:

in the case where the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new class, add an output node to the output nodes of the preset machine recognition model, and take the machine recognition model with the added output node as the basic recognition model;

input the first sample into the basic recognition model, and calculate the partial derivatives of the loss function of the basic recognition model with respect to the weights and biases of the output layer of the basic recognition model;

in the direction of gradient descent, update the weights and biases of the output layer of the basic recognition model with the parameter update formula, which multiplies the proficiency increment by the partial derivatives of the loss function with respect to the output layer weights and biases of the basic recognition model; and

determine the basic recognition model with the updated weights and biases as the online recognition model.

The identification module is specifically configured to:

when the number of data packets in the next data flow after the current data flow reaches a predetermined number, extract the packet header data of the next data flow as a second sample; and

input the second sample into the online recognition model, and use the online recognition model to output the class of the second sample.
An embodiment of the present invention also provides an electronic device, as shown in Fig. 12, including a processor 1201, a communication interface 1202, a memory 1203 and a communication bus 1204, where the processor 1201, the communication interface 1202 and the memory 1203 communicate with each other through the communication bus 1204;

the memory 1203 is configured to store a computer program;

the processor 1201, when executing the program stored on the memory 1203, implements the following steps:

extracting the packet header data of the data packets in the current data flow as the first sample when the current data flow has been received completely;

inputting the first sample into the semi-supervised model, and using the semi-supervised model to output the class of the first sample and the result of whether the first sample is located within the boundary distance of a cluster;

when the result indicates that the first sample is located within the boundary distance of a cluster, if the first sample is a sample of a new class, adding an output node to the output nodes of the preset machine recognition model, and taking the machine recognition model with the added output node as the online recognition model; and

using the online recognition model to identify the class of the next data flow after the current data flow.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard
Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on.
For ease of illustration, only one thick line is drawn in the figure, but this does not mean there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk storage device.
Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like;
it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA)
or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, a computer-readable storage medium is further provided, in which instructions are stored;
when run on a computer, the instructions cause the computer to execute the network traffic identification method of any of the above embodiments.
In another embodiment of the present invention, a computer program product containing instructions is further provided; when run on a computer,
it causes the computer to execute the network traffic identification method of any of the above embodiments.
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented in
software, they may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more
computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in
the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose
computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium,
or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website,
computer, server, or data center to another by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless
means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a
data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium
(e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)).
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation
from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the
terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article,
or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements
inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does
not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to
one another, and each embodiment focuses on its differences from the others. In particular, since the device, electronic device,
computer-readable storage medium, and computer program product embodiments are substantially similar to the method embodiments, their
description is relatively brief, and the relevant parts may refer to the description of the method embodiments.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit its scope of protection.
Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the
scope of protection of the present invention.
Claims (10)
1. A network traffic identification method, applied to a server, the method comprising:
when reception of a current data stream is complete, extracting the header data of the data packets in the current data stream as a first sample;
inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and a
result indicating whether the first sample lies within the boundary distance of a cluster, wherein the semi-supervised model is trained on a
first training sample set and includes the categories of the obtained header data and the distribution relation of the remaining samples in the
first training sample set; the first training sample set includes at least one sample with a class label; and the distribution relation
determines whether the sample with a class label lies within the boundary distance of a cluster;
when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding an output node
to the output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as an online
recognition model;
using the online recognition model to identify the category of the next data stream after the current data stream.
2. The method according to claim 1, wherein before the step of extracting the header data of the data packets in the current data stream as the
first sample when reception of the current data stream is complete, the method further comprises:
receiving the data packets of the current data stream in sequence, and obtaining the five-tuple information of each data packet;
judging whether a database stores the five-tuple information; if the database stores the five-tuple information, saving the header data of the
data packet to the storage region at the path corresponding to the five-tuple information;
if the database does not store the five-tuple information, creating a storage region at the path corresponding to the five-tuple information,
and saving the header data of the data packet to that storage region.
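The flow-table bookkeeping of claim 2 can be sketched with a dictionary keyed by the five-tuple; the in-memory dictionary and all names are illustrative assumptions standing in for the database and its path-addressed storage regions:

```python
flow_table = {}  # five-tuple -> list of saved packet headers

def save_header(src_ip, dst_ip, src_port, dst_port, proto, header):
    """Save a packet header under its flow's five-tuple, creating the
    storage region the first time the flow is seen (sketch of claim 2).
    """
    key = (src_ip, dst_ip, src_port, dst_port, proto)
    if key not in flow_table:   # five-tuple not yet stored
        flow_table[key] = []    # create the storage region
    flow_table[key].append(header)
```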
3. The method according to claim 1, wherein extracting the header data of the data packets in the current data stream as the first sample when
reception of the current data stream is complete comprises:
judging whether any data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, reception
of the data stream is complete, and the header data of the data packets in the data stream is extracted as the first sample.
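The end-identifier check of claim 3 might look like the following sketch; treating the TCP FIN flag as the end identifier is an assumption, since the claim does not name a specific flag:

```python
TCP_FIN = 0x01  # FIN bit in the TCP flags byte (RFC 793)

def stream_complete(tcp_flags_per_packet):
    """Return True once any received packet carries the end identifier.

    tcp_flags_per_packet: iterable of the flags byte of each packet.
    """
    return any(flags & TCP_FIN for flags in tcp_flags_per_packet)
```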
4. The method according to claim 1, wherein extracting the header data of the data packets in the current data stream as the first sample when
reception of the current data stream is complete comprises:
when reception of the current data stream is complete, extracting the header data of the data packets in the current data stream;
encoding the header data of the data packets in the current data stream to obtain a vector of fixed dimension, and taking the vector of fixed
dimension as the first sample.
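The fixed-dimension encoding of claim 4 can be sketched as truncating or zero-padding the raw header bytes and scaling each byte to [0, 1]; the dimension of 784 and the byte-level scaling are assumptions, since the claim only requires a fixed-dimension vector:

```python
def headers_to_vector(headers, dim=784):
    """Encode the concatenated packet headers of one flow into a
    fixed-dimension vector: truncate or zero-pad the raw bytes, then
    scale each byte to [0, 1].
    """
    raw = b"".join(headers)[:dim]          # truncate if too long
    raw = raw + b"\x00" * (dim - len(raw)) # zero-pad if too short
    return [byte / 255.0 for byte in raw]
```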
5. The method according to claim 1, wherein inputting the first sample into the semi-supervised model and using the semi-supervised model to
output the category of the first sample and the result indicating whether the first sample lies within the boundary distance of a cluster
comprises:
inputting the first sample into the semi-supervised model, and outputting the category of the first sample using the semi-supervised model;
calculating the local density and minimum distance of each sample in the first training sample set, and taking the samples whose local density
exceeds a density threshold and whose minimum distance exceeds a distance threshold as third samples, wherein the first training sample set is
the set of samples used to train the semi-supervised model;
adding the third samples to clusters, and determining each third sample as the cluster center of its cluster, wherein the number of clusters
equals the number of third samples, and each cluster contains exactly one third sample;
if the distance between the first sample and a third sample exceeds the boundary distance of the cluster, determining that the first sample
does not lie within the boundary distance of the cluster;
if the distance between the first sample and a third sample is less than the boundary distance of the cluster, determining that the first
sample lies within the boundary distance of the cluster.
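The center selection in claim 5 follows the density-peak pattern: a point becomes a cluster center when it has both high local density and a large minimum distance to any denser point. A minimal sketch, with an illustrative cutoff-count density and threshold values the patent does not fix:

```python
import math

def find_cluster_centers(samples, d_cut, density_thresh, distance_thresh):
    """Return indices of the 'third samples' (cluster centers): points
    whose local density rho exceeds the density threshold AND whose
    minimum distance delta to any denser point exceeds the distance
    threshold. Each selected point becomes the center of its own cluster.
    """
    n = len(samples)
    # local density: number of neighbours closer than the cutoff d_cut
    rho = [sum(1 for j in range(n)
               if j != i and math.dist(samples[i], samples[j]) < d_cut)
           for i in range(n)]
    centers = []
    for i in range(n):
        denser = [math.dist(samples[i], samples[j])
                  for j in range(n) if rho[j] > rho[i]]
        # delta: distance to the nearest denser point; for the densest
        # point, the distance to the farthest point overall
        delta = min(denser) if denser else max(
            math.dist(samples[i], samples[j]) for j in range(n) if j != i)
        if rho[i] > density_thresh and delta > distance_thresh:
            centers.append(i)
    return centers
```

Points that clear both thresholds are exactly the high-density, well-separated samples the claim promotes to cluster centers.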
6. The method according to claim 1, wherein, when the first sample lies within the boundary distance of a cluster, if the first sample is a
sample of a new category, adding an output node to the output nodes of the preset machine recognition model and taking the machine recognition
model with the added output node as the online recognition model comprises:
when the first sample lies within the boundary distance of a cluster, if a second training sample set contains a second sample of the same
category as the first sample, determining that the first sample is not a sample of a new category, and updating the parameters of the preset
machine recognition model, wherein the second training sample set is the set of data streams used to train the machine recognition model;
when the first sample lies within the boundary distance of a cluster, if the second training sample set contains no second sample of the same
category as the first sample, determining that the first sample is a sample of a new category, adding an output node to the output nodes of
the preset machine recognition model, and taking the machine recognition model with the added output node as the online recognition model.
7. The method according to claim 1, wherein, when the first sample lies within the boundary distance of a cluster, if the first sample is a
sample of a new category, adding an output node to the output nodes of the preset machine recognition model and taking the machine recognition
model with the added output node as the online recognition model comprises:
when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, increasing the
parameter dimension of the preset machine recognition model by one, and taking the machine recognition model with the increased parameter
dimension as the online recognition model.
8. The method according to claim 1, wherein, when the first sample lies within the boundary distance of a cluster, if the first sample is a
sample of a new category, adding an output node to the output nodes of the preset machine recognition model and taking the machine recognition
model with the added output node as the online recognition model comprises:
when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new category, adding an output node
to the output nodes of the preset machine recognition model, and taking the machine recognition model with the added output node as a basic
recognition model;
inputting the first sample into the basic recognition model, and calculating the partial derivatives of the loss function of the basic
recognition model with respect to the output-layer weights and biases of the basic recognition model;
along the direction of gradient descent, updating the weights and biases of the output layer of the basic recognition model using a parameter
update formula, wherein the parameter update formula includes the results of multiplying the learning rate by the partial derivatives of the
loss function of the basic recognition model with respect to the output-layer weights and biases of the basic recognition model;
determining the basic recognition model with the updated weights and biases as the online recognition model.
9. The method according to claim 1, wherein using the online recognition model to identify the category of the next data stream after the
current data stream comprises:
when the number of data packets in the next data stream after the current data stream reaches a predetermined number, extracting the
packet-header data from the next data stream as a second sample;
inputting the second sample into the online recognition model, and outputting the category of the second sample using the online recognition
model.
10. A network traffic identification device, applied to a server, the device comprising:
a sample module, configured to extract the header data of the data packets in a current data stream as a first sample when reception of the
current data stream is complete;
a supervision module, configured to input the first sample into a semi-supervised model and use the semi-supervised model to output the
category of the first sample and a result indicating whether the first sample lies within the boundary distance of a cluster, wherein the
semi-supervised model is trained on a first training sample set and includes the categories of the obtained header data and the distribution
relation of the remaining samples in the first training sample set; the first training sample set includes at least one sample with a class
label; and the distribution relation determines whether the sample with a class label lies within the boundary distance of a cluster;
a change module, configured to, when the first sample lies within the boundary distance of a cluster, if the first sample is a sample of a new
category, add an output node to the output nodes of a preset machine recognition model and take the machine recognition model with the added
output node as an online recognition model;
an identification module, configured to use the online recognition model to identify the category of the next data stream after the current
data stream.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910036196.2A CN109873774B (en) | 2019-01-15 | 2019-01-15 | Network traffic identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109873774A true CN109873774A (en) | 2019-06-11 |
CN109873774B CN109873774B (en) | 2021-01-01 |
Family
ID=66917604
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910036196.2A Active CN109873774B (en) | 2019-01-15 | 2019-01-15 | Network traffic identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109873774B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150578A (en) * | 2013-04-09 | 2013-06-12 | 山东师范大学 | Training method of SVM (Support Vector Machine) classifier based on semi-supervised learning |
CN104156438A (en) * | 2014-08-12 | 2014-11-19 | 德州学院 | Unlabeled sample selection method based on confidence coefficients and clustering |
US20170026391A1 (en) * | 2014-07-23 | 2017-01-26 | Saeed Abu-Nimeh | System and method for the automated detection and prediction of online threats |
CN107729952A (en) * | 2017-11-29 | 2018-02-23 | 新华三信息安全技术有限公司 | A kind of traffic flow classification method and device |
CN107846326A (en) * | 2017-11-10 | 2018-03-27 | 北京邮电大学 | A kind of adaptive semi-supervised net flow assorted method, system and equipment |
CN108900432A (en) * | 2018-07-05 | 2018-11-27 | 中山大学 | A kind of perception of content method based on network Flow Behavior |
CN109067612A (en) * | 2018-07-13 | 2018-12-21 | 哈尔滨工程大学 | A kind of online method for recognizing flux based on incremental clustering algorithm |
Non-Patent Citations (1)
Title |
---|
MEI Guowei: "Design and Implementation of a Machine-Learning-Based Network Traffic Classification System", China Master's Theses Full-text Database * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111447151A (en) * | 2019-10-30 | 2020-07-24 | 长沙理工大学 | Attention mechanism-based time-space characteristic flow classification research method |
CN113326946A (en) * | 2020-02-29 | 2021-08-31 | 华为技术有限公司 | Method, device and storage medium for updating application recognition model |
WO2021169294A1 (en) * | 2020-02-29 | 2021-09-02 | 华为技术有限公司 | Application recognition model updating method and apparatus, and storage medium |
CN111614514A (en) * | 2020-04-30 | 2020-09-01 | 北京邮电大学 | Network traffic identification method and device |
CN111614514B (en) * | 2020-04-30 | 2021-09-24 | 北京邮电大学 | Network traffic identification method and device |
WO2022083509A1 (en) * | 2020-10-19 | 2022-04-28 | 华为技术有限公司 | Data stream identification method and device |
CN112367334A (en) * | 2020-11-23 | 2021-02-12 | 中国科学院信息工程研究所 | Network traffic identification method and device, electronic equipment and storage medium |
CN113472654A (en) * | 2021-05-31 | 2021-10-01 | 济南浪潮数据技术有限公司 | Network traffic data forwarding method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN109873774B (en) | 2021-01-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109873774A (en) | A kind of network flow identification method and device | |
CN107846326B (en) | Self-adaptive semi-supervised network traffic classification method, system and equipment | |
Wang et al. | App-net: A hybrid neural network for encrypted mobile traffic classification | |
CN109639739A (en) | A kind of anomalous traffic detection method based on autocoder network | |
CN111475680A (en) | Method, device, equipment and storage medium for detecting abnormal high-density subgraph | |
CN108986907A (en) | A kind of tele-medicine based on KNN algorithm divides the method for examining automatically | |
CN108846097A (en) | The interest tags representation method of user, article recommended method and device, equipment | |
CN109298225B (en) | Automatic identification model system and method for abnormal state of voltage measurement data | |
CN109003091A (en) | A kind of risk prevention system processing method, device and equipment | |
CN114386538A (en) | Method for marking wave band characteristics of KPI (Key performance indicator) curve of monitoring index | |
CN111581445A (en) | Graph embedding learning method based on graph elements | |
Cui et al. | Feature extraction and classification method for switchgear faults based on sample entropy and cloud model | |
Hu et al. | A novel SDN-based application-awareness mechanism by using deep learning | |
Qi et al. | Patent analytic citation-based vsm: Challenges and applications | |
CN114398891B (en) | Method for generating KPI curve and marking wave band characteristics based on log keywords | |
Ullah et al. | Adaptive data balancing method using stacking ensemble model and its application to non-technical loss detection in smart grids | |
Yan et al. | TL-CNN-IDS: transfer learning-based intrusion detection system using convolutional neural network | |
CN117041017B (en) | Intelligent operation and maintenance management method and system for data center | |
CN116842459B (en) | Electric energy metering fault diagnosis method and diagnosis terminal based on small sample learning | |
Qi et al. | Incorporating adaptability-related knowledge into support vector machine for case-based design adaptation | |
Yang | Uncertainty prediction method for traffic flow based on K-nearest neighbor algorithm | |
Xu et al. | HTtext: A TextCNN-based pre-silicon detection for hardware Trojans | |
CN114124437B (en) | Encrypted flow identification method based on prototype convolutional network | |
CN109063735A (en) | A kind of classification of insect Design Method based on insect biology parameter | |
CN105740329B (en) | A kind of contents semantic method for digging of unstructured high amount of traffic |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||