A keyword-based network path classification method and device
Technical Field
The present invention relates to the field of information security technology, and in particular to a keyword-based network path classification method and device.
Background
Network traffic classification is the basis for ensuring cyberspace security. Traffic classification identifies different types of network protocol flows and is of great significance in fields such as communication security, network management, network attack and defense, intrusion detection, and protocol reverse engineering.
With the development of the Internet, the 5G Internet-of-Everything era is about to arrive. Terminals such as computers, mobile phones, and sensors generate large volumes of traffic, and classifying and managing this traffic challenges existing traffic classification schemes. Traffic classification is essential to network management, for example for monitoring network resources, discovering and handling network faults in time, and ensuring network quality of service and efficiency. On the one hand, for security purposes, traffic classification, filtering, and the detection of malicious activity all require knowledge of the types of application flows in the network, so that network operators can detect malicious traffic and react quickly to potential incidents. On the other hand, new classes of applications (for example, P2P, VoIP, and video streaming) have grown significantly on the Internet. These applications are particularly difficult to classify and usually have strict resource demands on bandwidth (for example, P2P) or QoS requirements (for example, the low latency and jitter of VoIP applications), which poses a challenge to network operators.
In the prior art, the widely used method of network traffic classification is deep packet inspection, which identifies the signatures and fingerprints contained in network protocol flows and classifies the flows by pattern matching.
In the course of implementing the present invention, the applicant found that the prior art has at least the following technical problems:
Because new network protocols keep emerging and protocol versions keep being replaced, deep packet inspection requires the signature database to be maintained manually. In addition, signatures cannot be obtained directly for proprietary protocol flows or zero-day application protocol flows, which causes the classification efficiency of deep packet inspection to drop sharply. As network traffic grows, traditional deep packet inspection is limited by its high computational complexity and struggles to meet the demands of high-bandwidth networks.
That is, deep packet inspection in the prior art requires manual maintenance of the signature database and therefore suffers from a heavy workload, long processing time, and low efficiency, and is difficult to apply to high-bandwidth networks. An efficient keyword-based network path classification method is therefore of great significance.
Summary of the Invention
In view of this, the present invention provides a keyword-based network path classification method and device, to solve or at least partly solve the technical problems of heavy workload and low efficiency in prior-art methods.
A first aspect of the present invention provides a keyword-based network path classification method, comprising:
Step S1: preliminarily classifying the input mixed-protocol tracks with a K-means method based on flow statistical features to obtain K first track clusters; within each cluster, arranging the track flows in descending order of length, comparing protocol tracks of similar length pairwise with the Needleman-Wunsch method, dividing each track into fixed fields IF and variable fields VF, and calculating the length IF_l and the position information IF_s of each fixed field;
Step S2: weighting the length IF_l of each fixed field with the IF_l weighting method to obtain the weight IF_w of the IF; then, with IF_w and IF_s as input, fitting a curve to the IFs with a curve-fitting method to obtain the IF position distribution curve;
Step S3: from the fitted IF position distribution curve, solving for the extreme values of the curve with a curve extreme-value solving method, extracting the IFs in each extreme-value interval of the curve, and classifying and counting the IFs based on the Levenshtein distance; then outputting the types of IF contained in each extreme-value interval and the number of each type;
Step S4: marking all tracks according to the most numerous IF in each extreme-value interval, then performing K-means clustering and outputting the second track clusters;
Step S5: selecting, from the second track clusters of step S4, the track cluster corresponding to the target security protocol; then executing steps S1 to S3 in sequence with that track cluster as input to obtain the types of IF contained in each extreme-value interval corresponding to the target security protocol and the number of each type; then using a separator inference method to infer the separator by comparing the heads and tails of adjacent IFs, segmenting keywords from the IFs according to the separator, and finally forming a signature database;
Step S6: marking the track flows to be processed with the signature database so formed, converting them into vectors, and classifying them with the converted vectors as the input of the K-means method.
In one implementation, step S1 specifically comprises:
Step S1.1: inputting the mixed network track flows <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
Step S1.2: using <Cluster, FlowID, Flow> as the input of the track descending-order arrangement method: for each track flow Flow in each track cluster Cluster, calculating the length Flow_length of Flow, and then using quicksort to sort the flows by Flow_length in descending order to form a queue Sequence<Cluster, FlowID, Flow>;
Step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then counting the length of each IF to obtain IF_l, and the distance from each IF to the start of Flow_i to obtain IF_s.
In one implementation, step S2 specifically comprises:
Step S2.1: weighting the length IF_l of each fixed field with the IF_l weighting method to obtain the weight IF_w of the IF, the weight assignment of the IF_l weighting method being as follows (IF_l in bytes):
$$IF\_w = \begin{cases} 0, & 1 \le IF\_l \le 8 \\ 1, & 9 \le IF\_l \le 16 \\ 2, & 17 \le IF\_l \le 24 \\ 3, & IF\_l \ge 25 \end{cases}$$
The greater the length IF_l of a fixed field, the greater the probability that a keyword appears in it, and the larger the weight assigned: when the length of the fixed field is between 1 byte and 8 bytes, the weight is 0; between 9 bytes and 16 bytes, the weight is 1; between 17 bytes and 24 bytes, the weight is 2; and at 25 bytes or more, the weight is 3;
Step S2.2: eliminating background noise from the weights calculated in step S2.1 with a noise cancellation method to obtain the corrected IF_l weights;
Step S2.3: fitting the corrected IF_l weights and the fixed-field lengths with a preset B-spline curve to obtain the IF position distribution curve, the curve equation of the B-spline being given by formula (1):
$$p(u) = \sum_{i=0}^{n} d_i N_{i,k}(u) \quad (1)$$
In formula (1), d_i (i = 0, 1, ..., n) denotes the control points, and N_{i,k}(u) (i = 0, 1, ..., n) denotes the normalized B-spline basis functions of degree k.
In one implementation, step S3 specifically comprises:
Step S3.1: finding the derivative f′(x) of the IF position distribution curve; taking L equally spaced points x_i in the domain, with L relatively small, and evaluating f′(x_i); then selecting the x_i that satisfy f′(x_i) × f′(x_{i+1}) ≤ 0 to form an array, and using the points of this array as starting points of the Newton-CG method to solve for the extreme points;
Step S3.2: using quicksort to sort the IFs in each IF distribution interval by IF_l from largest to smallest; then, starting from the first IF, selecting an unlabeled IF and computing the Levenshtein distance to each subsequent IF in turn; when the Levenshtein distance satisfies the threshold, labeling that IF with the known class, and otherwise adding it to a new class; repeating this Levenshtein-distance-based labeling step until all IFs are classified; and finally outputting the IF distribution intervals, the IF types, and the IF quantities.
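The bracketing rule of step S3.1 can be illustrated with a minimal Python sketch. It assumes the derivative f′ is available as a callable, and it swaps in plain bisection for the Newton-CG refinement named in the text, purely to keep the example self-contained; the function names `bracket_extrema` and `refine` and the sample curve sin(x) are hypothetical.

```python
import math
from typing import Callable, List


def bracket_extrema(df: Callable[[float], float], lo: float, hi: float, n: int) -> List[float]:
    """Return the sample points x_i for which df(x_i) * df(x_{i+1}) <= 0."""
    step = (hi - lo) / (n - 1)
    xs = [lo + i * step for i in range(n)]
    return [xs[i] for i in range(n - 1) if df(xs[i]) * df(xs[i + 1]) <= 0]


def refine(df: Callable[[float], float], a: float, b: float, tol: float = 1e-9) -> float:
    """Bisection on df over [a, b]; a simple stand-in for the Newton-CG step."""
    fa = df(a)
    while b - a > tol:
        mid = (a + b) / 2
        fm = df(mid)
        if fa * fm <= 0:
            b = mid          # the sign change lies in [a, mid]
        else:
            a, fa = mid, fm  # the sign change lies in [mid, b]
    return (a + b) / 2


# Hypothetical distribution curve f(x) = sin(x), so f'(x) = cos(x).
starts = bracket_extrema(math.cos, 0.0, 7.0, 15)
extrema = [refine(math.cos, x, x + 0.5) for x in starts]
```

Each returned starting point marks an interval in which the derivative changes sign, so the refinement only has to search a narrow range around every candidate extreme point.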
In one implementation, step S4 specifically comprises:
Step S4.1: with the IF distribution intervals, IF types, and IF quantities as input, selecting the most numerous IF in each IF distribution interval, marking all tracks according to the selected IFs and IF distribution intervals, and converting them into IF distribution vectors IFVector;
Step S4.2: inputting the IF distribution vectors IFVector and obtaining the second track clusters with the K-means method.
In one implementation, step S5 specifically comprises:
S5.1: with the second track cluster generated in step S4 as input, executing the track division method of step S1 to obtain the corresponding fixed fields IF, the length IF_l of each fixed field, and the position information IF_s of each fixed field;
S5.2: with the fixed fields IF, the fixed-field lengths IF_l, and the fixed-field position information IF_s obtained in step S5.1 as input, executing the IF distribution fitting method of step S2 to obtain the IF position distribution curve;
S5.3: with the position distribution curve of step S5.2 as input, executing the IF classification method of step S3 and outputting all the IFs in each IF distribution interval;
S5.4: using the separator inference method: inputting all the IFs in each distribution interval, marking the corresponding tracks with the IFs, counting how often candidate separators appear at the heads and tails of keywords to infer the separator, and extracting the keywords in the IFs with the help of the separator;
S5.5: storing the keywords extracted from the IFs and the inferred separators as the signature database for track flow classification.
Based on the same inventive concept, a second aspect of the present invention provides a keyword-based network path classification device, comprising:
a track division module, configured to preliminarily classify the input mixed-protocol tracks with a K-means method based on flow statistical features to obtain K first track clusters, and, within each cluster, to arrange the track flows in descending order of length, compare protocol tracks of similar length pairwise with the Needleman-Wunsch method, divide each track into fixed fields IF and variable fields VF, and calculate the length IF_l and the position information IF_s of each fixed field;
an IF distribution solving module, configured to weight the length IF_l of each fixed field with the IF_l weighting method to obtain the weight IF_w of the IF, and then, with IF_w and IF_s as input, to fit a curve to the IFs with a curve-fitting method to obtain the IF position distribution curve;
an IF classification module, configured to solve for the extreme values of the fitted IF position distribution curve with a curve extreme-value solving method, extract the IFs in each extreme-value interval of the curve, classify and count the IFs based on the Levenshtein distance, and output the types of IF contained in each extreme-value interval and the number of each type;
a track clustering module, configured to mark all tracks according to the most numerous IF in each extreme-value interval, then perform K-means clustering and output the second track clusters;
a keyword inference module, configured to select, from the second track clusters of the track clustering module, the track cluster corresponding to the target security protocol; then to feed that track cluster in sequence through the track division module, the IF distribution solving module, and the IF classification module to obtain the types of IF contained in each extreme-value interval corresponding to the target security protocol and the number of each type; and then to use a separator inference method to infer the separator by comparing the heads and tails of adjacent IFs, segment keywords from the IFs according to the separator, and finally form a signature database;
a track classification module, configured to mark the track flows to be processed with the signature database so formed, convert them into vectors, and classify them with the converted vectors as the input of the K-means method.
In one implementation, the track division module is specifically configured to execute the following steps:
Step S1.1: inputting the mixed network track flows <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
Step S1.2: using <Cluster, FlowID, Flow> as the input of the track descending-order arrangement method: for each track flow Flow in each track cluster Cluster, calculating the length Flow_length of Flow, and then using quicksort to sort the flows by Flow_length in descending order to form a queue Sequence<Cluster, FlowID, Flow>;
Step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then counting the length of each IF to obtain IF_l, and the distance from each IF to the start of Flow_i to obtain IF_s.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, the program implementing the method of the first aspect when executed.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
The one or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
The present invention proposes a keyword-based network path classification method that takes mixed network tracks as input and outputs the type of the target protocol tracks and a signature database. First, the track division method divides each track into fixed fields IF and variable fields VF and calculates the length IF_l and position information IF_s of each fixed field. Next, the IF (Invariable Field) distribution fitting method fits a curve to the IFs to obtain the IF position distribution curve. The IF classification method then obtains the types of IF contained in each extreme-value interval and the number of each type, and the track clustering method outputs the second track clusters. The keyword inference method then infers the separator, segments keywords from the IFs according to the separator, and finally forms the signature database. Finally, the track flows to be processed are marked with the signature database, converted into vectors, and classified with the converted vectors as the input of the K-means method.
Compared with deep packet inspection in the prior art, the present invention first classifies tracks by statistical features using the track division method, which yields a mixed track set whose classification is not yet very accurate. Then, through the IF (Invariable Field) distribution fitting method, the IF classification method, and the track clustering method, tracks are marked by the IFs in the peak intervals and converted into vectors. The IF distribution curve on which these peaks lie is obtained from the mixed track set output by the K-means method in step S1; the quantities and distances of each type of IF at the extreme values are taken as input, and all tracks are marked according to the most numerous IF in each extreme-value interval. K-means clustering is then applied and the clustering result is output; that is, an accurate classification result is obtained through step S4. The present invention further infers the separator through the keyword inference method of step S5. From the peak-interval IFs and the clusters output by the K-means method (a cluster is a class), a cluster of quite high purity and accuracy can be output, which improves classification accuracy. Furthermore, the present invention forms the signature database from the IFs in the high-purity clusters, after which only marking with the signature database and classification with K-means are needed. This improves classification accuracy while greatly improving classification efficiency, and solves the technical problems of heavy workload and low efficiency in prior-art methods.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a keyword-based network path classification method in an embodiment;
Fig. 2 is the overall flow of the keyword-based network path classification method in an embodiment;
Fig. 3 is a flow diagram of the track division method in step S1;
Fig. 4 is a flow diagram of the IF distribution fitting method in step S2;
Fig. 5 is a flow diagram of the IF classification method in step S3;
Fig. 6 is a flow diagram of the track clustering method in step S4;
Fig. 7 is a flow diagram of the keyword inference method in step S5;
Fig. 8 is a schematic diagram of the track division algorithm;
Fig. 9 is a flow diagram of the noise cancellation method;
Fig. 10 is a code diagram of the IF classification and counting algorithm;
Fig. 11 is a structural block diagram of a keyword-based network path classification device in an embodiment;
Fig. 12 is a structure diagram of the computer-readable storage medium in an embodiment of the present invention;
Fig. 13 is a structure diagram of the computer device in an embodiment of the present invention.
Detailed Description
The embodiments of the present invention provide a keyword-based network path classification method that classifies the network tracks to be processed by means of a track division method, an IF (Invariable Field) distribution fitting method, an IF classification method, a track clustering method, a keyword inference method, and the signature database formed by the keyword inference method. That is to say, the present invention forms a signature database from the IFs in the high-purity clusters, after which only marking with the signature database and classification with K-means are needed; this improves classification accuracy while greatly improving classification efficiency, and solves the technical problems of heavy workload and low efficiency in prior-art methods.
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment One
This embodiment provides a keyword-based network path classification method. Referring to Fig. 1, the method comprises:
Step S1: preliminarily classifying the input mixed-protocol tracks with a K-means method based on flow statistical features to obtain K first track clusters; within each cluster, arranging the track flows in descending order of length, comparing protocol tracks of similar length pairwise with the Needleman-Wunsch method, dividing each track into fixed fields IF and variable fields VF, and calculating the length IF_l and the position information IF_s of each fixed field.
Specifically, step S1 uses the track division method to obtain the preliminarily classified first track clusters. The input of the track division method is the mixed network track flows. First, the flows are preliminarily clustered by the K-means method based on flow statistical features. Then, within each cluster, the track flows are arranged in descending order of length, and protocol tracks of similar length are compared pairwise with the Needleman-Wunsch method, which marks the fixed fields in the tracks. Finally, each track is divided into fixed fields IF and variable fields VF, and the length IF_l and the position information IF_s of each IF are calculated, where the position information IF_s of a fixed field is the distance from the fixed field to the first character of the track. The flow of the track division method is shown in Fig. 3.
Step S2: weighting the length IF_l of each fixed field with the IF_l weighting method to obtain the weight IF_w of the IF; then, with IF_w and IF_s as input, fitting a curve to the IFs with a curve-fitting method to obtain the IF position distribution curve.
Specifically, step S2 uses the IF distribution fitting method, on the basis of step S1, to obtain the IF position distribution curve. The IF distribution fitting method is shown in Fig. 4. First, IF, IF_l, and IF_s are input. Then, IF_l is weighted with the IF_l weighting method. Optionally, a noise cancellation method can be used to eliminate background noise. Finally, a curve-fitting method fits a curve to the weighted IFs and outputs the IF distribution curve.
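The weighting stage of this pipeline can be sketched in Python as follows, using the byte thresholds that the IF_l weighting method assigns (1 to 8 bytes: 0; 9 to 16 bytes: 1; 17 to 24 bytes: 2; 25 bytes or more: 3). The function names `if_weight` and `weight_fields` and the (IF_s, IF_l) tuple layout are assumptions made for illustration only.

```python
from typing import List, Tuple


def if_weight(if_l: int) -> int:
    """Map a fixed-field length in bytes to its weight IF_w."""
    if if_l < 1:
        raise ValueError("IF_l must be at least 1 byte")
    if if_l <= 8:
        return 0
    if if_l <= 16:
        return 1
    if if_l <= 24:
        return 2
    return 3


def weight_fields(fields: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Turn (IF_s, IF_l) pairs into (IF_s, IF_w) points for curve fitting."""
    return [(if_s, if_weight(if_l)) for if_s, if_l in fields]


points = weight_fields([(0, 10), (12, 4), (40, 30)])
```

The resulting (IF_s, IF_w) points are what a later curve-fitting stage would consume; noise cancellation is omitted from this sketch.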
Step S3: from the fitted IF position distribution curve, solving for the extreme values of the curve with a curve extreme-value solving method, extracting the IFs in each extreme-value interval of the curve, and classifying and counting the IFs based on the Levenshtein distance; then outputting the types of IF contained in each extreme-value interval and the number of each type.
Specifically, referring to Fig. 5, step S3 uses the IF classification method, on the basis of step S2, to obtain the types of IF contained in each extreme-value interval and the number of each type. Here, the types of IF refer to the fact that each peak interval contains several different IFs: because the input tracks are of different types, different tracks contribute several IFs to the same peak interval. Step S3 therefore outputs all the IFs in each extreme-value interval together with their IF_l and IF_s.
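The Levenshtein-based classification and counting of the IFs in one interval can be sketched as follows. This is a minimal illustration, assuming that "satisfying the threshold" means a distance no greater than a chosen integer; the function names and the sample fields are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def classify_ifs(ifs: List[str], threshold: int) -> List[List[str]]:
    """Group IFs: longest first, each unlabeled IF seeds a class and absorbs
    every later unlabeled IF within `threshold` edits of it."""
    ordered = sorted(ifs, key=len, reverse=True)
    labeled = [False] * len(ordered)
    classes: List[List[str]] = []
    for i, seed in enumerate(ordered):
        if labeled[i]:
            continue
        labeled[i] = True
        members = [seed]
        for j in range(i + 1, len(ordered)):
            if not labeled[j] and levenshtein(seed, ordered[j]) <= threshold:
                labeled[j] = True
                members.append(ordered[j])
        classes.append(members)
    return classes


classes = classify_ifs(["report/dc0", "report/dc1", "GET /index", "report/da4"], 2)
```

The number of classes gives the IF types in the interval, and the size of each class gives the quantity of that type.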
Step S4: marking all tracks according to the most numerous IF in each extreme-value interval, then performing K-means clustering and outputting the second track clusters.
Specifically, the track clustering method takes as input the sets of each type of IF in each distribution interval and selects the most numerous IF in each maximum region; it then marks all tracks according to the selected IFs and converts them into vectors; finally, it clusters the vectors with the K-means method and outputs the clustering result. The track clustering method is shown schematically in Fig. 6.
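One possible reading of the marking step is sketched below in Python. The encoding is an assumption: each extreme-value interval is represented as a half-open byte range, and a track's vector holds a 1 for every interval whose most numerous IF starts inside that range. The names `dominant_ifs` and `if_vector` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # assumed [start, end) byte offsets of one interval


def dominant_ifs(interval_ifs: Dict[Interval, List[str]]) -> Dict[Interval, str]:
    """Pick the most numerous IF observed in each interval."""
    return {iv: Counter(ifs).most_common(1)[0][0]
            for iv, ifs in interval_ifs.items() if ifs}


def if_vector(track: str, dominant: Dict[Interval, str]) -> List[int]:
    """1 if the interval's dominant IF starts inside the interval, else 0."""
    vector = []
    for (start, end), field in sorted(dominant.items()):
        pos = track.find(field, start)
        vector.append(1 if 0 <= pos < end else 0)
    return vector


observed = {(0, 4): ["GET ", "GET ", "PUT "], (5, 12): ["HTTP/1.1", "HTTP/1.1"]}
dominant = dominant_ifs(observed)
vec_get = if_vector("GET /a HTTP/1.1", dominant)
vec_put = if_vector("PUT /a HTTP/1.1", dominant)
```

Vectors of this kind would then be passed to K-means to produce the second track clusters.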
A network track contains many IFs and VFs, and different network tracks have different IF distributions. Step S1 uses statistical features such as length, offset, and arrival time. These are features of the track as a whole; to identify which class a track actually belongs to, the features of the IFs that the track contains must be used. The IF features include the IF type, IF_l (the IF length), and IF_s (the distance from the IF to the first character of the track).
Step S5: selecting, from the second track clusters of step S4, the track cluster corresponding to the target security protocol; then executing steps S1 to S3 in sequence with that track cluster as input to obtain the types of IF contained in each extreme-value interval corresponding to the target security protocol and the number of each type; then using a separator inference method to infer the separator by comparing the heads and tails of adjacent IFs, segmenting keywords from the IFs according to the separator, and finally forming a signature database.
Specifically, executing steps S1 to S3 in sequence with the track cluster of the target security protocol as input means applying the track division method of step S1, the IF distribution fitting method, and the IF classification method. The keyword inference method is shown schematically in Fig. 7. First, the track clusters are input and the track cluster of the target security protocol is selected. Then, the selected track cluster is input to the track division method and passed through the IF distribution fitting method and the IF classification method to obtain the quantity statistics of each type of IF at the peaks. Finally, the separator inference method infers the separator by comparing the heads and tails of adjacent IFs and segments the keywords from the IFs; the keywords and separators form the new signature database.
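The separator inference step is not specified in detail here, so the following Python sketch is only one plausible reading. It assumes that a separator is a non-alphanumeric character and that the character most often found at the head or tail of the IFs is the separator; both assumptions, and the function names, are illustrative.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def infer_separator(ifs: Iterable[str]) -> Optional[str]:
    """Return the non-alphanumeric character seen most often at IF heads and tails."""
    counts: Counter = Counter()
    for field in ifs:
        for ch in (field[:1], field[-1:]):
            if ch and not ch.isalnum():
                counts[ch] += 1
    return counts.most_common(1)[0][0] if counts else None


def extract_keywords(ifs: Iterable[str], separator: Optional[str]) -> Set[str]:
    """Split every IF on the separator and keep the non-empty pieces."""
    keywords: Set[str] = set()
    for field in ifs:
        parts = field.split(separator) if separator else [field]
        keywords.update(part for part in parts if part)
    return keywords


fields = ["report/", "/status", "login/", "user:"]
separator = infer_separator(fields)
signature_db = {"separator": separator, "keywords": extract_keywords(fields, separator)}
```

Here the dictionary `signature_db` stands in for the signature database, which stores the extracted keywords together with the inferred separator.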
Step S6: marking the track flows to be processed with the signature database so formed, converting them into vectors, and classifying them with the converted vectors as the input of the K-means method.
Specifically, once the signature database has been formed, it can be used to mark track flows and convert them into vectors, which are classified as the input of the K-means method. The signature database contains the features that discriminate between classes, so subsequent classification performed with it improves classification accuracy while greatly improving classification efficiency.
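As an illustration of step S6, the sketch below marks flows with a keyword list and clusters the resulting vectors with a small hand-written K-means. The binary "keyword present" encoding, the choice of initial centroids, and all names are assumptions; any existing K-means implementation could be used instead.

```python
from typing import List, Sequence


def signature_vector(flow: str, keywords: Sequence[str]) -> List[float]:
    """Mark a flow: one component per keyword, 1.0 if the keyword occurs."""
    return [1.0 if kw in flow else 0.0 for kw in keywords]


def nearest(point: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))


def kmeans(points: List[List[float]], centroids: List[List[float]],
           iterations: int = 10) -> List[int]:
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        groups: List[List[List[float]]] = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for idx, group in enumerate(groups):
            if group:  # keep the previous centroid when a cluster is empty
                centroids[idx] = [sum(col) / len(group) for col in zip(*group)]
    return [nearest(p, centroids) for p in points]


keywords = ["GET", "HTTP", "EHLO", "MAIL"]
flows = ["GET /a HTTP/1.1", "GET /b HTTP/1.0", "EHLO example", "MAIL FROM:<a>"]
vectors = [signature_vector(f, keywords) for f in flows]
labels = kmeans(vectors, [vectors[0], vectors[2]])
```

Because the vectors are built only from signature-database keywords, the clustering operates on a short, discriminative representation rather than on raw flow bytes.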
Overall, refer to Fig. 2, which shows the overall flow of the keyword-based network path classification method in an embodiment and includes the specific implementation steps of steps S1 to S5.
In one embodiment, step S1 specifically comprises:
Step S1.1: inputting the mixed network track flows <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
Step S1.2: using <Cluster, FlowID, Flow> as the input of the track descending-order arrangement method: for each track flow Flow in each track cluster Cluster, calculating the length Flow_length of Flow, and then using quicksort to sort the flows by Flow_length in descending order to form a queue Sequence<Cluster, FlowID, Flow>;
Step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then counting the length of each IF to obtain IF_l, and the distance from each IF to the start of Flow_i to obtain IF_s.
Specifically, the flow statistical feature set has 18 groups of features, for example: average packet length, standard deviation of the inter-packet arrival time, total flow length (in bytes and/or packets), and the Fourier transform of the inter-packet arrival time. The flow-based statistical feature set is shown in Table 1. Track flows are stored in PCAP-format files; the 18 groups of statistical features of a flow can be obtained by simple statistical methods, and the flow is converted into a feature vector <FlowID, Vector>, where Vector contains the 18 groups of feature values shown in Table 1.
Table 1 Statistical feature set of flows
The input of the K-means clustering method is <FlowID, Vector>. The number of classes K can be preset; the clustering result is observed, the value of K is determined, and the track clusters <Cluster, FlowID> are obtained. The K-means clustering method can be an existing clustering method, and its detailed process is not described here.
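A few of the named statistical features can be computed as in the following Python sketch. It assumes the packets of one flow have already been parsed from the PCAP file into (timestamp, length) pairs, and it covers only four illustrative features rather than the full set of 18; the function name and dictionary keys are hypothetical.

```python
import statistics
from typing import Dict, List, Tuple


def flow_features(packets: List[Tuple[float, int]]) -> Dict[str, float]:
    """packets: (timestamp in seconds, length in bytes), in arrival order, non-empty."""
    lengths = [length for _, length in packets]
    times = [ts for ts, _ in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "mean_packet_length": statistics.fmean(lengths),
        "std_inter_arrival": statistics.pstdev(gaps) if gaps else 0.0,
        "total_bytes": float(sum(lengths)),
        "total_packets": float(len(packets)),
    }


features = flow_features([(0.0, 100), (1.0, 200), (3.0, 300)])
vector = list(features.values())  # one possible <FlowID, Vector> payload
```

A vector of this kind, extended to all features of Table 1, is what the K-means clustering method would receive for each flow.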
The idea of the present invention is that security protocol tracks of similar length pass through similar states, so the positions of the keywords they produce are also more concentrated. The present invention arranges tracks in descending order of length, groups adjacent tracks in pairs, compares security protocol tracks of similar total length with the Needleman-Wunsch method, aligns longer keywords preferentially, and finally outputs the IFs and records their positions. The specific algorithm is shown in Fig. 8.
Specifically, the Needleman-Wunsch method is common in bioinformatics and is one of the earliest methods to use dynamic programming to compare biological sequences; it finds homologous relationships between proteins or genes by aligning protein or gene sequences. The N-W method is used here because it can handle large amounts of structured and complex data.
In the embodiments of the present invention, the input of the message division method is security protocol flows, each of which contains multiple keywords, separators, and variable fields. For example, in report/dc00321, report is a keyword, / is a separator, and dc00321 is a variable field. The N-W method is commonly used in DNA and protein comparison studies. It fills a two-dimensional matrix using similarity rewards and gap penalties, selects a traceback path, and outputs the comparison result. For example, aligning report/dc00321 with report/dc04369 yields two parts: the fixed field IF report/dc0, and the variable fields VF 0321 and 4369. Because the separator does not change and the first three characters of the variable field vary little, the N-W method cannot distinguish them. If a three-way comparison is used, for example of report/dc00321, report/dc04369, and report/da45813, the IF of the comparison result is still report/dc0: multi-way comparison only accommodates the sequence that the majority share.
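The report/dc00321 example can be reproduced with a minimal Python sketch of the standard Needleman-Wunsch alignment. The scores (match +1, mismatch -1, gap -1), the minimum run length used to discard single-character coincidences, and the function names are assumptions for illustration and are not taken from the text.

```python
from typing import List, Optional, Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> List[Tuple[Optional[int], Optional[int]]]:
    """Global alignment; returns aligned index pairs (None marks a gap)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    while i > 0 or j > 0:  # traceback, preferring the diagonal move
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def fixed_fields(a: str, b: str, min_len: int = 2) -> List[Tuple[str, int, int]]:
    """Return (IF, IF_s, IF_l) for each run of identical aligned characters."""
    fields, run, start = [], [], 0
    for ia, ib in needleman_wunsch(a, b) + [(None, None)]:  # sentinel flushes the last run
        if ia is not None and ib is not None and a[ia] == b[ib]:
            if not run:
                start = ia
            run.append(a[ia])
        else:
            if len(run) >= min_len:
                fields.append(("".join(run), start, len(run)))
            run = []
    return fields


result = fixed_fields("report/dc00321", "report/dc04369")
```

With these assumed scores, the only run of at least two matching characters is report/dc0 at offset 0, which agrees with the fixed field described in the example.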
The track set FlowSet is input and first arranged in descending order of Flow length to obtain DFlowSet. DFlow1 and DFlow2 are then taken from DFlowSet in turn until DFlowSet is empty; each pair DFlow1 and DFlow2 is input to the Needleman-Wunsch method to obtain the IFs, and IF_l, IF_s, and VF are recorded.
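The ordering and pairing described above can be sketched in a few lines of Python. The built-in sort stands in for quicksort, and dropping a leftover flow when the set has an odd size is an assumption, since the text does not say how that case is handled; the function name is hypothetical.

```python
from typing import List, Tuple


def pair_by_length(flow_set: List[str]) -> List[Tuple[str, str]]:
    """Sort flows longest first, then take them off the queue two at a time."""
    dflow_set = sorted(flow_set, key=len, reverse=True)
    return [(dflow_set[i], dflow_set[i + 1]) for i in range(0, len(dflow_set) - 1, 2)]


pairs = pair_by_length(["aa", "bbbb", "c", "ddd"])
```

Each returned pair holds two flows of neighboring length, which is the input expected by the pairwise comparison.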
Preferably, the present invention also improves Needlman_Wunsch to increase classification accuracy. The scoring system of the improved Needlman_Wunsch method follows three principles: match as many characters as possible, preferentially align consecutive fields, and generate only gaps that correspond to the first sequence. M_{i,j} is the value in row i, column j of the scoring matrix and is calculated as shown in formula (2); w is the gap penalty; S_{i,j} is the similarity between the i-th character of message a and the j-th character of message b, scored according to formula 1. Backtracking starts from M_{i,j} and selects the maximum of the upper, left and upper-left cells; when cells hold equal values, the upper-left and upper cells are preferred. Because S_{i,j} scores consecutive matches, the longest and most similar fields are matched together first; gaps are generated only against the first sequence so that the positions of keywords can be read directly from the matching result. The mathematical formula of the improved scoring function S_{i,j} is given as formula (3).
$$M_{i,j} = \max\{\, M_{i-1,j-1} + S_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w \,\} \qquad \text{formula (2)}$$
In formula (3), a_{i-1} denotes the (i-1)-th character of message a, and b_{j-1} is defined analogously for message b; k and p are constants that control the consecutive-match score.
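The recurrence of formula (2) and the backtracking preference can be illustrated with a short sketch. It assumes a plain match/mismatch score for S_{i,j}, because formula (3) is not reproduced here; the names needleman_wunsch and fixed_fields, and the score values match=1, mismatch=-1 and w=-1, are illustrative choices, not values taken from the invention. The traceback follows the standard predecessor check and prefers the upper-left cell, then the upper cell.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, w=-1):
    """Globally align a and b; return both aligned strings ('-' marks a gap)."""
    n, m = len(a), len(b)

    def s(i, j):
        # Placeholder S_ij: a plain match/mismatch score, NOT formula (3).
        return match if a[i - 1] == b[j - 1] else mismatch

    # M[i][j] is the best score for aligning a[:i] with b[:j].
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * w
    for j in range(1, m + 1):
        M[0][j] = j * w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Formula (2).
            M[i][j] = max(M[i - 1][j - 1] + s(i, j),
                          M[i][j - 1] + w,
                          M[i - 1][j] + w)

    # Traceback from the bottom-right cell: upper-left first, then upper.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] + w:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def fixed_fields(al_a, al_b):
    """Return (IF, IF_s, IF_l) for each run of identical aligned characters.

    IF_s is the offset of the run from the start of the first sequence.
    """
    fields, run, start, pos = [], [], 0, 0
    for ca, cb in zip(al_a, al_b):
        if ca == cb and ca != "-":
            if not run:
                start = pos
            run.append(ca)
        elif run:
            fields.append(("".join(run), start, len(run)))
            run = []
        if ca != "-":
            pos += 1
    if run:
        fields.append(("".join(run), start, len(run)))
    return fields


if __name__ == "__main__":
    al_a, al_b = needleman_wunsch("report/dc00321", "report/dc04369")
    print(fixed_fields(al_a, al_b))
```

With these illustrative scores the first fixed field of the example pair is report/dc0 at offset 0. Short accidental matches inside the variable part can also appear as one-character fixed fields.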
In one embodiment, step S2 specifically includes:
Step S2.1: weight the length IF_l of the fixed field using the IF_l weighting method to obtain the weight IF_w of the IF. The weight assignment of the IF_l weighting method is as follows:
The larger the length IF_l of the fixed field, the higher the probability that a keyword occurs in the fixed field, and the larger the assigned weight: when the length of the fixed field is between 1 byte and 8 bytes, the weight is 0; between 9 bytes and 16 bytes, the weight is 1; between 17 bytes and 24 bytes, the weight is 2; and when the length is greater than or equal to 25 bytes, the weight is 3;
Step S2.2: apply the noise elimination method to the weights calculated in step S2.1 to remove background noise, obtaining the corrected IF_l weights;
Step S2.3: fit the corrected IF_l weights and the fixed field lengths with a preset B-spline curve to obtain the IF position distribution curve; the curve equation of the B-spline is shown in formula (1):
$$p(u) = \sum_{i=0}^{n} d_i \, N_{i,k}(u) \qquad \text{formula (1)}$$
In formula (1), d_i (i = 0, 1, ..., n) denote the control points, and N_{i,k}(u) (i = 0, 1, ..., n) denote the degree-k normalized B-spline basis functions.
Specifically, the larger the length of the fixed field (IF_l), the higher the probability that a keyword occurs in the fixed field, and the larger the assigned weight. When the length of the fixed field reaches 32 bytes, four consecutive characters appear in both tracks, and the weight is then set to the maximum value 3.
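The length-to-weight mapping of step S2.1 is a simple step function. The sketch below follows the byte thresholds stated in step S2.1; the function name if_weight and the error raised for lengths below 1 are illustrative assumptions.

```python
def if_weight(if_l):
    """Map a fixed-field length IF_l (in bytes) to its weight IF_w."""
    if if_l < 1:
        # Assumption: a fixed field has at least one byte.
        raise ValueError("IF_l must be at least 1")
    if if_l <= 8:
        return 0
    if if_l <= 16:
        return 1
    if if_l <= 24:
        return 2
    return 3  # 25 bytes or more


if __name__ == "__main__":
    for length in (4, 10, 20, 32):
        print(length, "->", if_weight(length))
```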
In the specific implementation process, the inherent shortcomings of Needlman_Wunsch can produce erroneous matches, which affect the experimental results. The flow chart of the noise elimination method is shown in Figure 9. The method first deletes the weights of IFs whose weight is 1, and then determines the data type of each IF: if the IF is a decimal number, IF_w = IF_w - k*0.192*IF_w, where k is 2; if the IF is a hexadecimal number, IF_w = IF_w - k*0.107*IF_w, where k is 1.5. Empirically, 0.192 is the average noise produced by two random decimal numbers, and 0.107 is the average noise produced by two random hexadecimal numbers.
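The correction rule can be sketched as below. The constants 0.192, 0.107, k = 2 and k = 1.5 come from the text; everything else is an assumption made for illustration: the function name denoise_weight, returning None to represent a deleted weight, deciding the data type by checking the characters of the IF, and leaving IFs of other types unchanged.

```python
DEC_DIGITS = set("0123456789")
HEX_DIGITS = set("0123456789abcdefABCDEF")


def denoise_weight(field, if_w):
    """Return the corrected IF_w, or None when the weight is deleted."""
    if if_w == 1:
        # Weights equal to 1 are deleted first.
        return None
    if field and all(ch in DEC_DIGITS for ch in field):
        # Decimal IF: IF_w = IF_w - k*0.192*IF_w with k = 2.
        return if_w - 2 * 0.192 * if_w
    if field and all(ch in HEX_DIGITS for ch in field):
        # Hexadecimal IF: IF_w = IF_w - k*0.107*IF_w with k = 1.5.
        return if_w - 1.5 * 0.107 * if_w
    # Assumption: IFs that are neither decimal nor hexadecimal are kept as-is.
    return if_w


if __name__ == "__main__":
    for field, weight in (("20200101", 2), ("dc00", 3), ("report/", 2)):
        print(field, weight, "->", denoise_weight(field, weight))
```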
In step S2.3, curve fitting means computing, from many discrete points, a distribution function that is convenient for computer processing; this distribution function can eliminate erroneous data and interference, compensate for missing data, and predict future trends.
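To make the B-spline curve of step S2.3 concrete, the sketch below evaluates a curve of the form sum of d_i times N_{i,k}(u) using the Cox-de Boor recursion for the basis functions. It only evaluates a curve for given control points; it does not perform the fitting itself. The names basis and bspline_point, the clamped knot vector and the sample control points are illustrative assumptions.

```python
def basis(i, k, u, knots):
    """Degree-k normalized B-spline basis function N_{i,k}(u) (Cox-de Boor)."""
    if k == 0:
        # Half-open spans, so the last knot itself lies outside the domain.
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + k] - knots[i]
    if span_left > 0:
        left = (u - knots[i]) / span_left * basis(i, k - 1, u, knots)
    span_right = knots[i + k + 1] - knots[i + 1]
    if span_right > 0:
        right = (knots[i + k + 1] - u) / span_right * basis(i + 1, k - 1, u, knots)
    return left + right


def bspline_point(u, control, k, knots):
    """Evaluate p(u) as the sum over i of d_i * N_{i,k}(u)."""
    return sum(d * basis(i, k, u, knots) for i, d in enumerate(control))


if __name__ == "__main__":
    # Five control points, degree 3, clamped knots: len(knots) = n + k + 2.
    control = [1.0, 2.0, 4.0, 3.0, 1.0]
    knots = [0, 0, 0, 0, 1, 2, 2, 2, 2]
    for u in (0.0, 0.5, 1.0, 1.5):
        print(u, bspline_point(u, control, 3, knots))
```

In a fitting setting the control points d_i would be chosen so that the curve approximates the weighted IF samples; a library routine would normally be used for that step.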
In one embodiment, step S3 specifically includes:
Step S3.1: compute the derivative f'(x) of the IF position distribution curve, take an array of points x_i spaced L apart within the domain, where L is small, and evaluate f'(x_i); then select the x_i satisfying f'(x_i) × f'(x_{i+1}) ≤ 0 to form an array, and use this array of x_i as the starting points of the Newton-CG method to solve for the extreme points;
Step S3.2: within each IF distribution interval, sort the IFs by IF_l from largest to smallest using the quicksort method to obtain the sorted IFs; then, starting from the first IF, select an unlabeled IF and compute the Levenshtein distance to each subsequent IF in turn; when the Levenshtein distance meets the threshold, label the IF with the known class, otherwise add the IF to a new class; repeat this Levenshtein-distance-based labeling step until all IFs are classified, and finally output the IF distribution intervals, IF types and IF quantities.
Specifically, the curve fitted by the B-spline is composed of complex polynomials, and the Newton-CG method is an optimization algorithm for finding the extrema of a curve: given a starting point and the first derivative, it iteratively approximates the objective function with a quadratic function and quickly obtains an approximate extremum that satisfies the required precision.
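The two stages of step S3.1, bracketing sign changes of f'(x) on a grid and then refining each candidate, can be sketched as follows. Note the substitution: a plain one-dimensional Newton iteration on f'(x) is used here in place of the Newton-CG method, and the derivatives are supplied as ordinary functions rather than taken from a fitted B-spline. All function names and the sine test curve are illustrative assumptions.

```python
import math


def bracket_starts(df, lo, hi, step):
    """Return grid points x_i with df(x_i) * df(x_{i+1}) <= 0."""
    count = int((hi - lo) / step)
    xs = [lo + i * step for i in range(count + 1)]
    return [x0 for x0, x1 in zip(xs, xs[1:]) if df(x0) * df(x1) <= 0]


def newton_extremum(df, d2f, x0, tol=1e-10, max_iter=50):
    """Refine a stationary point of f by Newton's method applied to df."""
    x = x0
    for _ in range(max_iter):
        curvature = d2f(x)
        if curvature == 0:
            break  # cannot take a Newton step on a flat second derivative
        delta = df(x) / curvature
        x -= delta
        if abs(delta) < tol:
            break
    return x


def find_extrema(df, d2f, lo, hi, step):
    """Bracket sign changes of df on a grid, then refine each start point."""
    return [newton_extremum(df, d2f, x0)
            for x0 in bracket_starts(df, lo, hi, step)]


if __name__ == "__main__":
    # Illustrative curve f(x) = sin(x): f'(x) = cos(x), f''(x) = -sin(x).
    extrema = find_extrema(math.cos, lambda x: -math.sin(x),
                           0.0, 2 * math.pi, 0.5)
    print(extrema)
```

A smaller grid spacing L reduces the chance of missing two extrema that fall between neighbouring grid points, at the cost of more derivative evaluations.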
Step S3.2 performs the IF classification statistics method, in which x_{a,i} denotes a minimum point and x_{b,i+1} denotes a maximum point. Experiments show that IFs are concentrated in the interval delimited by these extreme points, which the present invention calls the IF distribution interval. The IF classification statistics method takes the IF distribution intervals, IF_s, IF_l and the IFs as input. In the first step, the IFs in each IF distribution interval are sorted by IF_l from largest to smallest using the quicksort method. In the second step, starting from the first IF, an unlabeled IF is selected and the Levenshtein distance to each subsequent IF is computed in turn; when the Levenshtein distance meets the threshold, the IF is considered to belong to the known class and is labeled with that class, otherwise the IF is added to a new class. In the third step, the second step is repeated until all IFs are classified. Finally, the IF distribution intervals, IF types and IF quantities are output. The algorithm of the IF classification statistics method is shown in Figure 10.
The Levenshtein distance is a commonly used edit distance. The Levenshtein distance method takes two character strings A and B as input, converts string A into string B under the rule that each operation changes one character, and outputs the number of operations.
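The classification procedure can be sketched in a few lines. The sketch assumes that "meets the threshold" means a distance less than or equal to the threshold, and it uses Python's built-in sorted in place of an explicit quicksort; the function names and the sample IFs are illustrative.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def classify_ifs(ifs, threshold):
    """Group IFs whose distance to a class seed is within the threshold."""
    # First step: sort by IF_l, longest first.
    ordered = sorted(ifs, key=len, reverse=True)
    labels = [None] * len(ordered)
    classes = []
    for i, seed in enumerate(ordered):
        if labels[i] is not None:
            continue
        # The first unlabeled IF opens a new class.
        labels[i] = len(classes)
        members = [seed]
        # Second step: compare the seed with every later unlabeled IF.
        for j in range(i + 1, len(ordered)):
            if labels[j] is None and levenshtein(seed, ordered[j]) <= threshold:
                labels[j] = labels[i]
                members.append(ordered[j])
        classes.append(members)
    return classes


if __name__ == "__main__":
    sample = ["login", "report/dc0", "logon", "report/dc1", "x"]
    for members in classify_ifs(sample, threshold=1):
        print(len(members), members)
```

The number of classes gives the IF types of an interval, and the size of each class gives the IF quantity.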
In one embodiment, step S4 specifically includes:
Step S4.1: taking the IF distribution intervals, IF types and IF quantities as input, select the most numerous IF in each IF distribution interval, label all tracks according to the selected IFs and their IF distribution intervals, and convert them into IF distribution vectors IFVector;
Step S4.2: input the IF distribution vectors IFVector and obtain the second track clusters using the K-means method.
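Steps S4.1 and S4.2 can be illustrated with the sketch below. The exact encoding of IFVector is not spelled out in the text, so the binary encoding used here (1 when the selected IF starts inside its distribution interval, otherwise 0) is an assumption, as are the function names, the deterministic choice of initial centres and the sample tracks. The clustering loop is a bare-bones Lloyd iteration standing in for a library K-means.

```python
def if_vector(track, chosen):
    """Encode a track against a list of (start, end, IF) selections."""
    vector = []
    for lo, hi, field in chosen:
        pos = track.find(field, lo)
        # 1 when the IF starts inside the interval [lo, hi].
        vector.append(1 if pos != -1 and pos <= hi else 0)
    return vector


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, iters=20):
    """Minimal K-means; the first k distinct vectors are the initial centres."""
    centers = []
    for v in vectors:
        if list(v) not in centers:
            centers.append(list(v))
        if len(centers) == k:
            break
    labels = []
    for _ in range(iters):
        new_labels = []
        for v in vectors:
            best = min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    chosen = [(0, 0, "GET "), (0, 0, "report/")]
    tracks = ["GET /index", "GET /home", "report/dc00321", "report/dc04369"]
    vectors = [if_vector(t, chosen) for t in tracks]
    print(vectors, kmeans(vectors, k=2))
```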
In one embodiment, step S5 specifically includes:
S5.1: taking the second track clusters generated in step S4 as input, execute the track segmentation method of step S1 to obtain the corresponding fixed fields IF, the lengths IF_l of the fixed fields and the position information IF_s of the fixed fields;
S5.2: taking the fixed fields IF, the lengths IF_l of the fixed fields and the position information IF_s of the fixed fields obtained in step S5.1 as input, execute the IF distribution fitting method of step S2 to obtain the IF position distribution curve;
S5.3: taking the position distribution curve obtained in step S5.2 as input, execute the IF classification method of step S3 and output all IFs in each IF distribution interval;
S5.4: using the separator inference method, input all IFs in each distribution interval, label the corresponding tracks with the IFs, infer the separators from statistics on their occurrence at the head and tail of keywords, and use the separators to extract the keywords from the IFs;
S5.5: store the extracted keywords of the IFs and the inferred separators as the signature database for track flow classification.
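One possible reading of the separator inference in step S5.4 is sketched below: count the non-alphanumeric characters found at the head and tail of the IFs, take the most frequent one as the separator, and split the IFs on it to obtain keywords. The text does not give the exact statistic, so this rule, the function names and the sample IFs are all assumptions made for illustration.

```python
from collections import Counter


def infer_separator(ifs):
    """Return the most frequent non-alphanumeric head or tail character."""
    counts = Counter()
    for field in ifs:
        if not field:
            continue
        # A set avoids counting a one-character IF twice.
        for idx in {0, len(field) - 1}:
            ch = field[idx]
            if not ch.isalnum():
                counts[ch] += 1
    return counts.most_common(1)[0][0] if counts else None


def extract_keywords(ifs, sep):
    """Split every IF on the separator and keep the distinct tokens."""
    if sep is None:
        return list(ifs)
    keywords = []
    for field in ifs:
        for token in field.split(sep):
            if token and token not in keywords:
                keywords.append(token)
    return keywords


if __name__ == "__main__":
    sample = ["report/", "/login", "/status/"]
    separator = infer_separator(sample)
    print(separator, extract_keywords(sample, separator))
```

The resulting keywords and separator are the kind of entries that step S5.5 stores in the signature database.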
Embodiment two
This embodiment provides a keyword-based network path classification device. Referring to Figure 11, the device includes:
A track segmentation module, configured to preliminarily classify the input mixed-protocol tracks with a K-means method based on flow statistical features to obtain K first track clusters; and, within each cluster, to arrange the track flows in descending order of length, compare protocol tracks of similar length pairwise using the Needlman_Wunsch method, divide each track into fixed fields IF and variable fields VF, and calculate the length IF_l and the position information IF_s of each fixed field;
An IF distribution solving module, configured to weight the length IF_l of the fixed field using the IF_l weighting method to obtain the weight IF_w of the IF, and then, using a curve fitting method with IF_w and IF_s as input, to fit a curve to the IFs and obtain the IF position distribution curve;
An IF classification module, configured to solve the extrema of the fitted IF position distribution curve using a curve extremum solving method, extract the IFs in each extremum interval of the curve, perform IF classification statistics based on the Levenshtein distance, and output the types of IF contained in each extremum interval and the quantity of each type;
A track clustering module, configured to label all tracks according to the most numerous IF in each interval and then output the second track clusters using K-means clustering;
A keyword inference module, configured to select, from the second track clusters of the track clustering module, the track cluster corresponding to the target security protocol; then to feed the track cluster of the target security protocol in sequence through the track segmentation module, the IF distribution solving module and the IF classification module to obtain the types of IF contained in the extremum intervals corresponding to the target security protocol and the quantity of each type; then to infer the separators with the separator inference method by comparing the head and tail of adjacent IFs, split the keywords out of the IFs according to the separators, and finally form a signature database;
A track classification module, configured to label the track flows to be processed using the formed signature database, convert them into vectors, and classify them by using the converted vectors as the input of the k-means method.
In one implementation, the track segmentation module 201 is specifically configured to execute the following steps:
Step S1.1: input the mixed network track flows <Flow, FlowID>, label the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and use the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
Step S1.2: use <Cluster, FlowID, Flow> as the input of the track reverse-order arrangement method; for each track flow Flow in each track cluster Cluster, calculate the length Flow_length of the Flow, then use the quicksort method to sort the Flows by Flow_length in ascending order, forming a queue Sequence<Cluster, FlowID, Flow>;
Step S1.3: input Sequence<Cluster, FlowID, Flow> and take two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then use <Flow_i, Flow_j> as the input of the Needlman_Wunsch method to obtain the common fixed fields IF and variable fields VF of Flow_i and Flow_j; then count the length of each IF to obtain IF_l, and the distance of the IF from the start of Flow_i to obtain IF_s.
In one implementation, the IF distribution solving module 202 is specifically configured to execute the following steps:
Step S2.1: weight the length IF_l of the fixed field using the IF_l weighting method to obtain the weight IF_w of the IF. The weight assignment of the IF_l weighting method is as follows:
The larger the length IF_l of the fixed field, the higher the probability that a keyword occurs in the fixed field, and the larger the assigned weight: when the length of the fixed field is between 1 byte and 8 bytes, the weight is 0; between 9 bytes and 16 bytes, the weight is 1; between 17 bytes and 24 bytes, the weight is 2; and when the length is greater than or equal to 25 bytes, the weight is 3;
Step S2.2: apply the noise elimination method to the weights calculated in step S2.1 to remove background noise, obtaining the corrected IF_l weights;
Step S2.3: fit the corrected IF_l weights and the fixed field lengths with a preset B-spline curve to obtain the IF position distribution curve; the curve equation of the B-spline is shown in formula (1):
$$p(u) = \sum_{i=0}^{n} d_i \, N_{i,k}(u) \qquad \text{formula (1)}$$
In formula (1), d_i (i = 0, 1, ..., n) denote the control points, and N_{i,k}(u) (i = 0, 1, ..., n) denote the degree-k normalized B-spline basis functions.
In one implementation, the IF classification module 203 is specifically configured to execute the following steps:
Step S3.1: compute the derivative f'(x) of the IF position distribution curve, take an array of points x_i spaced L apart within the domain, where L is small, and evaluate f'(x_i); then select the x_i satisfying f'(x_i) × f'(x_{i+1}) ≤ 0 to form an array, and use this array of x_i as the starting points of the Newton-CG method to solve for the extreme points;
Step S3.2: within each IF distribution interval, sort the IFs by IF_l from largest to smallest using the quicksort method to obtain the sorted IFs; then, starting from the first IF, select an unlabeled IF and compute the Levenshtein distance to each subsequent IF in turn; when the Levenshtein distance meets the threshold, label the IF with the known class, otherwise add the IF to a new class; repeat this Levenshtein-distance-based labeling step until all IFs are classified, and finally output the IF distribution intervals, IF types and IF quantities.
In one implementation, the track clustering module 204 is specifically configured to execute the following steps:
Step S4.1: taking the IF distribution intervals, IF types and IF quantities as input, select the most numerous IF in each IF distribution interval, label all tracks according to the selected IFs and their IF distribution intervals, and convert them into IF distribution vectors IFVector;
Step S4.2: input the IF distribution vectors IFVector and obtain the second track clusters using the K-means method.
In one implementation, the keyword inference module 205 is specifically configured to execute the following steps:
S5.1: taking the second track clusters generated in step S4 as input, execute the track segmentation method of step S1 to obtain the corresponding fixed fields IF, the lengths IF_l of the fixed fields and the position information IF_s of the fixed fields;
S5.2: taking the fixed fields IF, the lengths IF_l of the fixed fields and the position information IF_s of the fixed fields obtained in step S5.1 as input, execute the IF distribution fitting method of step S2 to obtain the IF position distribution curve;
S5.3: taking the position distribution curve obtained in step S5.2 as input, execute the IF classification method of step S3 and output all IFs in each IF distribution interval;
S5.4: using the separator inference method, input all IFs in each distribution interval, label the corresponding tracks with the IFs, infer the separators from statistics on their occurrence at the head and tail of keywords, and use the separators to extract the keywords from the IFs;
S5.5: store the extracted keywords of the IFs and the inferred separators as the signature database for track flow classification.
The device described in embodiment two of the present invention is the device used to implement the keyword-based network path classification method of embodiment one. Based on the method described in embodiment one, a person skilled in the art can therefore understand the specific structure and variations of the device, so details are not repeated here. All devices used by the method of embodiment one of the present invention fall within the scope of protection of the present invention.
Embodiment three
Based on the same inventive concept, the present application also provides a computer-readable storage medium 300, referring to Figure 12, on which a computer program 311 is stored; when executed, the program implements the method of embodiment one.
The computer-readable storage medium described in embodiment three of the present invention is the computer-readable storage medium used to implement the keyword-based network path classification method of embodiment one. Based on the method described in embodiment one, a person skilled in the art can therefore understand the specific structure and variations of the computer-readable storage medium, so details are not repeated here. All computer-readable storage media used by the method of embodiment one of the present invention fall within the scope of protection of the present invention.
Embodiment four
Based on the same inventive concept, the present application also provides a computer device, referring to Figure 13, including a memory 401, a processor 402, and a computer program 403 stored on the memory and executable on the processor; the processor 402 implements the method of embodiment one when executing the program.
The computer device described in embodiment four of the present invention is the computer device used to implement the keyword-based network path classification method of embodiment one. Based on the method described in embodiment one, a person skilled in the art can therefore understand the specific structure and variations of the computer device, so details are not repeated here. All computer devices used by the method of embodiment one of the present invention fall within the scope of protection of the present invention.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.