CN110061869A

CN110061869A - Keyword-based network path classification method and device

Info

Publication number: CN110061869A
Application number: CN201910281096.6A
Authority: CN
Inventors: 孟博; 何旭东; 王德军; 李子茂
Original assignee: South Central University for Nationalities
Current assignee: South Central Minzu University
Priority date: 2019-04-09
Filing date: 2019-04-09
Publication date: 2019-07-26
Anticipated expiration: 2039-04-09
Also published as: CN110061869B

Abstract

The present invention provides a keyword-based network path classification method and device. The classification method first obtains first track clusters by combining flow statistical features with the K-means method. Each first track cluster is used as input to a track segmentation method, which divides each track into fixed fields (IF) and variation fields (VF) and calculates the length and position information of each fixed field. An IF distribution fitting method then performs curve fitting to obtain an IF position distribution curve. Next, an IF classification method obtains the types of IF contained in each extreme value interval and the number of IFs of each type; these are fed to a track clustering method, which outputs second track clusters. A keyword inference method then infers the separators and splits keywords out of the IFs according to the separators, finally forming a signature database. The invention can greatly improve classification efficiency while also improving classification accuracy.

Description

A keyword-based network path classification method and device

Technical field

The present invention relates to the field of information security technology, and in particular to a keyword-based network path classification method and device.

Background art

Network traffic classification is the basis for ensuring cyberspace security. Traffic classification identifies different types of network protocol flows and is of great importance for communication security, network management, network attack and defense, intrusion detection, and protocol reverse engineering.

With the development of the Internet, the 5G era of the Internet of Everything is approaching. Terminals such as computers, mobile phones, and sensors generate large volumes of traffic, and classifying and managing this traffic challenges existing traffic classification schemes. Traffic classification is essential to network management, for example for monitoring network resources, discovering and handling network failures in time, and guaranteeing network quality of service and efficiency. On the one hand, for security purposes, traffic classification, filtering, and detection of malicious activity all require knowledge of the types of application flows in the network, so that network operators can detect malicious flows and react quickly to potential incidents. On the other hand, new kinds of applications (for example, P2P, VoIP, and video streaming) have grown significantly on the Internet. These applications are particularly difficult to classify and usually have strict requirements on bandwidth (for example, P2P) or QoS (for example, the low latency and jitter needed by VoIP applications), which poses a challenge to network operators.

In the prior art, the widely used method for network traffic classification is deep packet inspection, which identifies the signatures and fingerprints contained in network protocol flows and classifies the flows by pattern matching.

In implementing the present invention, the applicant found that the prior art has at least the following technical problems:

Because new network protocols keep appearing and protocol versions keep changing, deep packet inspection requires the signature database to be maintained manually. In addition, signatures cannot be obtained directly for proprietary protocol flows or zero-day application protocol flows, so the classification efficiency of deep packet inspection drops sharply. As network traffic grows, traditional deep packet inspection is also limited by its high computational complexity and struggles to meet the demands of high-bandwidth networks.

That is, deep packet inspection in the prior art requires manual maintenance of the signature database, which is labor-intensive, time-consuming, and inefficient, and is difficult to apply to high-bandwidth networks. An efficient keyword-based network path classification method is therefore of great significance.

Summary of the invention

In view of this, the present invention provides a keyword-based network path classification method and device, to solve or at least partially solve the technical problems of heavy workload and low efficiency in prior-art methods.

A first aspect of the present invention provides a keyword-based network path classification method, comprising:

Step S1: using a K-means method based on flow statistical features, preliminarily classify the input mixed-protocol tracks to obtain K first track clusters; within each cluster, arrange the track flows in descending order of length, compare protocol tracks of similar length pairwise using the Needleman-Wunsch method, divide each track into fixed fields (IF) and variation fields (VF), and calculate the length IF_l and the position information IF_s of each fixed field;

Step S2: weight the fixed-field length IF_l using the IF_l weighting method to obtain the IF weight IF_w; then, with IF_w and IF_s as input, fit a curve to the IFs using a curve fitting method to obtain the IF position distribution curve;

Step S3: on the fitted IF position distribution curve, solve for the curve extrema using a curve extremum solving method, extract the IFs in each extreme value interval of the curve, and classify and count the IFs based on Levenshtein distance; then output the types of IF contained in each extreme value interval and the number of IFs of each type;

Step S4: mark all tracks according to the most numerous IF in each interval, then cluster with K-means and output the second track clusters;

Step S5: from the second track clusters of step S4, select the track cluster corresponding to the target security protocol; then, with that track cluster as input, execute steps S1 to S3 in turn to obtain the types of IF contained in the extreme value intervals corresponding to the target security protocol and the number of IFs of each type; then use the separator inference method to infer the separators by comparing the heads and tails of adjacent IFs, split keywords out of the IFs according to the separators, and finally form the signature database;

Step S6: mark the track flows to be processed using the signature database thus formed and convert them into vectors, then classify them by using the converted vectors as the input of the K-means method.

In one implementation, step S1 specifically includes:

Step S1.1: input the mixed network track flows <Flow, FlowID>, mark each network track with its flow statistical features to obtain the feature vectors <FlowID, Vector>, and use the feature vectors as input to the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;

Step S1.2: use <Cluster, FlowID, Flow> as input to the track inverted-order arrangement method; for each track flow Flow in each track cluster Cluster, calculate the length Flow_length of Flow, then use quicksort to sort the flows by Flow_length in ascending order, forming a queue Sequence<Cluster, FlowID, Flow>;

Step S1.3: input Sequence<Cluster, FlowID, Flow> and take two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then use <Flow_i, Flow_j> as input to the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then count the length of each IF to obtain IF_l, and the distance of each IF from the start of Flow_i to obtain IF_s.

In one implementation, step S2 specifically includes:

Step S2.1: weight the fixed-field length IF_l using the IF_l weighting method to obtain the IF weight IF_w. The weight assignment of the IF_l weighting method is:

$$\mathrm{IF\_w} = \begin{cases} 0, & 1\ \text{byte} \le \mathrm{IF\_l} \le 8\ \text{byte} \\ 1, & 9\ \text{byte} \le \mathrm{IF\_l} \le 16\ \text{byte} \\ 2, & 17\ \text{byte} \le \mathrm{IF\_l} \le 24\ \text{byte} \\ 3, & \mathrm{IF\_l} \ge 25\ \text{byte} \end{cases}$$

Here, the longer the fixed field, the higher the probability that it contains a keyword and the larger the assigned weight: a length of 1 byte to 8 byte gives weight 0, 9 byte to 16 byte gives weight 1, 17 byte to 24 byte gives weight 2, and 25 byte or more gives weight 3;

Step S2.2: apply the noise cancellation method to the weights calculated in step S2.1 to eliminate background noise, obtaining the corrected IF_l weights;

Step S2.3: fit the corrected IF_l weights and the fixed-field lengths with a preset B-spline curve to obtain the IF position distribution curve. The curve equation of the B-spline is given by formula (1):

$$p(u) = \sum_{i=0}^{n} d_i \, N_{i,k}(u) \qquad (1)$$

In formula (1), d_i (i = 0, 1, ..., n) are the control points and N_{i,k}(u) (i = 0, 1, ..., n) are the normalized B-spline basis functions of degree k.
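As an illustration of formula (1), the following minimal Python sketch evaluates a B-spline curve as the weighted sum of control points d_i and basis functions N_{i,k}(u), using the standard Cox-de Boor recursion. The function names, the clamped knot vector, and the sample control points are assumptions made for this example and are not specified in the text; the sketch shows curve evaluation only, not the fitting procedure.

```python
def bspline_basis(i, k, u, knots):
    """Degree-k B-spline basis function N_{i,k}(u) via the Cox-de Boor recursion."""
    if k == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + k] - knots[i]
    if span_left > 0:  # skip terms with a zero-width knot span
        left = (u - knots[i]) / span_left * bspline_basis(i, k - 1, u, knots)
    span_right = knots[i + k + 1] - knots[i + 1]
    if span_right > 0:
        right = (knots[i + k + 1] - u) / span_right * bspline_basis(i + 1, k - 1, u, knots)
    return left + right


def bspline_point(control_points, k, u, knots):
    """Evaluate p(u) = sum_i d_i * N_{i,k}(u) for scalar control points d_i."""
    return sum(d * bspline_basis(i, k, u, knots) for i, d in enumerate(control_points))


# Hypothetical example: four control points, cubic curve, clamped knot vector.
control_points = [0.0, 1.0, 2.0, 3.0]
knots = [0, 0, 0, 0, 1, 1, 1, 1]
value = bspline_point(control_points, 3, 0.5, knots)
```

With these equally spaced control points the curve is a straight line, so the value at u = 0.5 lies midway between the first and last control points.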

In one implementation, step S3 specifically includes:

Step S3.1: compute the derivative f'(x) of the IF position distribution curve, take sample points x_i over the domain at a small spacing L, evaluate f'(x_i), and select the x_i that satisfy f'(x_i) × f'(x_{i+1}) ≤ 0 to form an array; use the x_i in this array as starting points for the Newton-CG method to solve for the extreme points;

Step S3.2: within each IF distribution interval, sort the IFs by IF_l from largest to smallest using quicksort; then, starting from the first IF, select an unmarked IF and compute its Levenshtein distance to each subsequent IF in turn; when the Levenshtein distance meets the threshold, mark that IF as belonging to the known class, otherwise add it to a new class; repeat this Levenshtein-distance-based class marking until all IFs are classified, and finally output the IF distribution intervals, IF types, and IF counts.

In one implementation, step S4 specifically includes:

Step S4.1: with the IF distribution intervals, IF types, and IF counts as input, select the most numerous IF in each IF distribution interval, mark all tracks according to the selected IFs and their IF distribution intervals, and convert them into IF distribution vectors IFVector;

Step S4.2: input the IF distribution vectors IFVector and obtain the second track clusters using the K-means method.

In one implementation, step S5 specifically includes:

S5.1: with the second track clusters generated in step S4 as input, execute the track segmentation method of step S1 to obtain the corresponding fixed fields IF, fixed-field lengths IF_l, and fixed-field position information IF_s;

S5.2: with the fixed fields IF, fixed-field lengths IF_l, and fixed-field position information IF_s obtained in step S5.1 as input, execute the IF distribution fitting method of step S2 to obtain the IF position distribution curve;

S5.3: with the position distribution curve from step S5.2 as input, execute the IF classification method of step S3 and output all IFs in each IF distribution interval;

S5.4: using the separator inference method, input all IFs in each distribution interval, mark the corresponding tracks with the IFs, count the characters at the heads and tails of keywords, where separators appear, to infer the separators, and use the separators to extract the keywords from the IFs;

S5.5: store the keywords extracted from the IFs and the inferred separators as the signature database for track flow classification.

Based on the same inventive concept, a second aspect of the present invention provides a keyword-based network path classification device, comprising:

A track segmentation module, configured to preliminarily classify the input mixed-protocol tracks using a K-means method based on flow statistical features to obtain K first track clusters; and, within each cluster, to arrange the track flows in descending order of length, compare protocol tracks of similar length pairwise using the Needleman-Wunsch method, divide each track into fixed fields (IF) and variation fields (VF), and calculate the length IF_l and the position information IF_s of each fixed field;

An IF distribution solving module, configured to weight the fixed-field length IF_l using the IF_l weighting method to obtain the IF weight IF_w, and then, with IF_w and IF_s as input, to fit a curve to the IFs using a curve fitting method to obtain the IF position distribution curve;

An IF classification module, configured to solve for the extrema of the fitted IF position distribution curve using a curve extremum solving method, extract the IFs in each extreme value interval of the curve, classify and count the IFs based on Levenshtein distance, and output the types of IF contained in each extreme value interval and the number of IFs of each type;

A track clustering module, configured to mark all tracks according to the most numerous IF in each interval, then cluster with K-means and output the second track clusters;

A keyword inference module, configured to select, from the second track clusters of the track clustering module, the track cluster corresponding to the target security protocol; to pass that track cluster in turn through the track segmentation module, the IF distribution solving module, and the IF classification module to obtain the types of IF contained in the extreme value intervals corresponding to the target security protocol and the number of IFs of each type; and then to use the separator inference method to infer the separators by comparing the heads and tails of adjacent IFs, split keywords out of the IFs according to the separators, and finally form the signature database;

A track classification module, configured to mark the track flows to be processed using the signature database thus formed and convert them into vectors, and to classify them by using the converted vectors as the input of the K-means method.

In one implementation, the track segmentation module is specifically configured to perform the following steps:

Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed, implements the method of the first aspect.

Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method of the first aspect is implemented.

The one or more technical solutions above in the embodiments of the present application have at least one or more of the following technical effects:

The invention proposes a keyword-based network path classification method that takes mixed network tracks as input and outputs the type of the target protocol tracks and a signature database. First, the track segmentation method divides each track into fixed fields (IF) and variation fields (VF) and calculates the length IF_l and the position information IF_s of each fixed field. The IF (Invariable Field) distribution fitting method then fits a curve to the IFs to obtain the IF position distribution curve. Next, the IF classification method obtains the types of IF contained in each extreme value interval and the number of IFs of each type, and the track clustering method outputs the second track clusters. The keyword inference method then infers the separators and splits keywords out of the IFs according to them, finally forming the signature database. Finally, the track flows to be processed are marked using the signature database and converted into vectors, and the converted vectors are classified by the K-means method.

Compared with deep packet inspection in the prior art, the present invention first uses the track segmentation method to classify by statistical features, which yields a mixed track set whose classification is not yet very accurate. Then, through the IF (Invariable Field) distribution fitting method, the IF classification method, and the track clustering method, tracks are marked by the IFs in the peak intervals and converted into vectors. The IF distribution curve on which these peaks lie is obtained from the mixed track sets output by the K-means method in step S1, and the number and distance of each type of IF at the extrema are taken as input. All tracks are then marked according to the most numerous IF in each interval and clustered with K-means, which gives a fairly accurate clustering result; that is, step S4 yields a fairly accurate classification. The invention further infers the separators through the keyword inference method of step S5. From the peak-interval IFs and the clusters output by the K-means method (a cluster being one class), a cluster of very high purity can be output, which improves classification accuracy. Further, the invention forms the signature database from the IFs in these high-purity clusters; afterwards, it suffices to mark tracks with the signature database and classify them with K-means. This improves classification accuracy while greatly improving classification efficiency, and solves the technical problems of heavy workload and low efficiency in prior-art methods.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the keyword-based network path classification method in one embodiment;

Fig. 2 is the overall flow of the keyword-based network path classification method in an embodiment;

Fig. 3 is a schematic flow diagram of the track segmentation method in step S1;

Fig. 4 is a schematic flow diagram of the IF distribution fitting method in step S2;

Fig. 5 is a schematic flow diagram of the IF classification method in step S3;

Fig. 6 is a schematic flow diagram of the track clustering method in step S4;

Fig. 7 is a schematic flow diagram of the keyword inference method in step S5;

Fig. 8 is a schematic diagram of the track segmentation algorithm;

Fig. 9 is a flow diagram of the noise cancellation method;

Fig. 10 is a schematic of the code of the IF classification and counting algorithm;

Fig. 11 is a structural block diagram of the keyword-based network path classification device in one embodiment;

Fig. 12 is a structural diagram of the computer-readable storage medium in an embodiment of the present invention;

Fig. 13 is a structural diagram of the computer device in an embodiment of the present invention.

Detailed description of the embodiments

The embodiments of the present invention provide a keyword-based network path classification method that classifies the network tracks to be processed by means of a track segmentation method, an IF (Invariable Field) distribution fitting method, an IF classification method, a track clustering method, a keyword inference method, and the signature database formed by the keyword inference method. In other words, the invention forms a signature database from the IFs in high-purity clusters; afterwards it suffices to mark tracks with the signature database and classify them with K-means, which improves classification accuracy while greatly improving classification efficiency, and solves the technical problems of heavy workload and low efficiency in prior-art methods.

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Embodiment one

This embodiment provides a keyword-based network path classification method. Referring to Fig. 1, the method includes:

Step S1: using a K-means method based on flow statistical features, preliminarily classify the input mixed-protocol tracks to obtain K first track clusters; within each cluster, arrange the track flows in descending order of length, compare protocol tracks of similar length pairwise using the Needleman-Wunsch method, divide each track into fixed fields (IF) and variation fields (VF), and calculate the length IF_l and the position information IF_s of each fixed field.

Specifically, step S1 uses the track segmentation method to obtain the preliminarily classified first track clusters. The input of the track segmentation method is the mixed network track flows. First, the flows are preliminarily divided into clusters by the K-means method based on flow statistical features. Then, within each cluster, the track flows are arranged in descending order of length and protocol tracks of similar length are compared pairwise using the Needleman-Wunsch method, which marks the fixed fields in the tracks. Finally, each track is divided into fixed fields (IF) and variation fields (VF), and the length IF_l and position information IF_s of each IF are calculated, where IF_s is the distance from the fixed field to the first character of the track. The flow of the track segmentation method is shown in Fig. 3.

Specifically, step S2 uses the IF distribution fitting method to obtain the IF position distribution curve on the basis of step S1. The IF distribution fitting method is shown in Fig. 4. First, IF, IF_l, and IF_s are input. Then IF_l is weighted with the IF_l weighting method. Optionally, the noise cancellation method can be used to eliminate background noise. Finally, the curve fitting method fits a curve to the weighted IFs and outputs the IF distribution curve.

Step S3: on the fitted IF position distribution curve, solve for the curve extrema using a curve extremum solving method, extract the IFs in each extreme value interval of the curve, and classify and count the IFs based on Levenshtein distance; then output the types of IF contained in each extreme value interval and the number of IFs of each type.

Specifically, referring to Fig. 5, step S3 uses the IF classification method to obtain, on the basis of step S2, the types of IF contained in each extreme value interval and the number of IFs of each type. Each peak interval can contain several different IFs: because the input tracks are of different types, different tracks can contribute different IFs to the same peak interval. Step S3 therefore outputs all the IFs in each extreme value interval together with their IF_l and IF_s.

Step S4: mark all tracks according to the most numerous IF in each interval, then cluster with K-means and output the second track clusters.

Specifically, the track clustering method takes as input the set of IFs of each type in each distribution interval and selects the most numerous IF in each maximum region; it then marks all tracks according to the selected IFs and converts them into vectors; finally, it clusters the vectors with the K-means method and outputs the clustering result. The track clustering method is shown schematically in Fig. 6.
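The marking step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (one list of `(flow_id, if_string)` pairs per IF distribution interval), the function name `build_if_vectors`, and the binary 0/1 encoding are all hypothetical choices for the example, since the text does not fix the vector encoding. The resulting vectors would then be handed to any K-means implementation.

```python
from collections import Counter


def build_if_vectors(interval_ifs, flow_ids):
    """Mark each flow by the most frequent IF of every interval.

    interval_ifs: one non-empty list per IF distribution interval, holding
                  (flow_id, if_string) pairs observed in that interval.
    Returns the dominant IF of each interval and one 0/1 vector per flow.
    """
    # Most numerous IF string in each interval.
    dominant = [Counter(s for _, s in pairs).most_common(1)[0][0] for pairs in interval_ifs]
    vectors = {}
    for fid in flow_ids:
        vectors[fid] = [
            1 if (fid, dom) in pairs else 0  # 1: this flow carries the dominant IF here
            for pairs, dom in zip(interval_ifs, dominant)
        ]
    return dominant, vectors


# Hypothetical data: two intervals, three flows.
interval_ifs = [
    [("f1", "GET /"), ("f2", "GET /"), ("f3", "POST /")],
    [("f1", "HTTP/1.1"), ("f2", "HTTP/1.1"), ("f3", "HTTP/1.1")],
]
dominant, vectors = build_if_vectors(interval_ifs, ["f1", "f2", "f3"])
```

In this example the flows f1 and f2 receive identical vectors while f3 differs in the first component, which is the kind of separation the subsequent K-means step relies on.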

A network track contains many IFs and VFs, and the IF distribution differs between network tracks. Step S1 uses statistical features such as length, offset, and arrival time. These are features of the track as a whole; to identify which class a track actually belongs to, the features of the IFs it contains must be used. The IF features include the IF type, IF_l (the IF length), and IF_s (the distance from the IF to the first character of the track).

Step S5: from the second track clusters of step S4, select the track cluster corresponding to the target security protocol; then, with that track cluster as input, execute steps S1 to S3 in turn to obtain the types of IF contained in the extreme value intervals corresponding to the target security protocol and the number of IFs of each type; then use the separator inference method to infer the separators by comparing the heads and tails of adjacent IFs, split keywords out of the IFs according to the separators, and finally form the signature database.

Specifically, executing steps S1 to S3 in turn with the track cluster of the target security protocol as input means running the track segmentation method of step S1, the IF distribution fitting method, and the IF classification method. The keyword inference method is shown schematically in Fig. 7. First, the track clusters are input and the track cluster of the target security protocol is selected. Then the track cluster of that security protocol is fed to the track segmentation method, followed by the IF distribution fitting method and the IF classification method, to obtain the count of each type of IF at the peaks. Finally, the separator inference method infers the separators by comparing the heads and tails of adjacent IFs and splits keywords out of the IFs; the keywords and separators form the new signature database.
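The text does not spell out the separator inference rule in detail. The following sketch shows one simple reading of "comparing the heads and tails of IFs": count the non-alphanumeric characters that occur at the start or end of the fixed fields and take the most frequent one as the separator. The function names and the non-alphanumeric heuristic are assumptions made for this illustration and should not be taken as the patented procedure.

```python
from collections import Counter


def infer_separator(ifs):
    """Guess the separator as the most common non-alphanumeric character
    found at the head or tail of the fixed fields (hypothetical heuristic)."""
    counts = Counter()
    for field in ifs:
        for ch in (field[:1], field[-1:]):
            if ch and not ch.isalnum():
                counts[ch] += 1
    return counts.most_common(1)[0][0] if counts else None


def split_keywords(field, separator):
    """Split one fixed field into keywords at the inferred separator."""
    if separator is None:
        return [field]
    return [part for part in field.split(separator) if part]


# Hypothetical fixed fields extracted from one track cluster.
ifs = ["report/", "/login/", "status/"]
separator = infer_separator(ifs)
keywords = [kw for field in ifs for kw in split_keywords(field, separator)]
```

The keywords and the separator obtained this way are the two kinds of entries that the text stores in the signature database.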

Specifically, once the signature database has been formed, it can be used to mark track flows and convert them into vectors, which serve as the input of the K-means method for classification. The signature database contains the features that distinguish the classes, so subsequent classification can rely on it, which improves classification accuracy while greatly improving classification efficiency.

Fig. 2 shows the overall flow of the keyword-based network path classification method in one embodiment, including the specific implementation of steps S1 to S5.
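A minimal sketch of this marking step is given below, assuming the signature database is simply a list of keywords and that a flow is marked with a 0/1 entry per keyword. Both the data layout and the name `flow_to_vector` are hypothetical; the text only states that flows are marked with the signature database and converted into vectors.

```python
def flow_to_vector(payload, keywords):
    """Binary vector: 1 if the keyword occurs in the flow payload, else 0."""
    return [1 if kw in payload else 0 for kw in keywords]


# Hypothetical signature database and flows.
signature_keywords = ["report", "login", "status"]
flows = ["report/dc00321", "login/u27", "report/dc04369"]
vectors = [flow_to_vector(f, signature_keywords) for f in flows]
```

Flows that share the same keywords map to the same vector, so a K-means step on these vectors only has to separate a small number of distinct points.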

In one embodiment, step S1 specifically includes:

Specifically, the flow statistical feature set has 18 groups of features, for example: average packet length, standard deviation of inter-packet arrival time, total flow length (in bytes and/or packets), and the Fourier transform of the inter-packet arrival time. The flow statistical feature set is shown in Table 1. Track flows are stored in PCAP files; the 18 groups of statistical features of a flow can be obtained by simple statistics, and the flow is converted into a feature vector <FlowID, Vector>, where Vector contains the 18 groups of feature values listed in Table 1.

Table 1: Statistical feature set of flows

The input of the K-means clustering method is <FlowID, Vector>. The number of classes K can be preset, and the K value is then fixed after observing the clustering results, yielding the track clusters <Cluster, FlowID>. Any existing K-means clustering method can be used; its details are not described here.
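To make the feature vector concrete, the sketch below computes a few of the flow statistical features named above (average packet length, standard deviation of inter-packet arrival time, and total flow length in bytes and in packets). It covers only these four values, not the full set of 18 groups, and the input format (a list of `(timestamp, length)` tuples already extracted from a PCAP file) and the function name are assumptions made for the example.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """packets: list of (timestamp_seconds, length_bytes) tuples of one flow,
    in arrival order. Returns a small feature vector for that flow."""
    lengths = [length for _, length in packets]
    times = [t for t, _ in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-packet arrival times
    return [
        mean(lengths),                  # average packet length
        pstdev(gaps) if gaps else 0.0,  # std. dev. of inter-packet arrival time
        float(sum(lengths)),            # total flow length in bytes
        float(len(packets)),            # total flow length in packets
    ]


# Hypothetical flow with three packets.
features = flow_features([(0.0, 100), (0.5, 200), (1.5, 300)])
```

Vectors of this kind, one per FlowID, are what a K-means implementation would receive as <FlowID, Vector>.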

The idea of the invention is that security protocol tracks of similar length pass through similar states, so the positions of the keywords they produce are more concentrated. The invention arranges the tracks in descending order of length, groups adjacent tracks in pairs, uses the Needleman-Wunsch method to compare security protocol tracks of similar total length, preferentially aligns the longer keywords, and finally outputs the IFs and records their positions. The specific algorithm is shown in Fig. 8.

Specifically, the Needleman-Wunsch method is common in bioinformatics and was one of the earliest methods to use dynamic programming for comparing biological sequences; it finds homology between proteins or genes by aligning protein or gene sequences. The N-W method is used here because it can handle large amounts of structured and complex data.

In the embodiment of the present invention, the input of the message segmentation method is security protocol flows, each of which contains multiple keywords, separators, and variation fields. For example, in report/dc00321, report is a keyword, / is the separator, and dc00321 is a variation field. The N-W method is commonly used in DNA or protein comparison studies. It fills a two-dimensional matrix using similarity rewards and gap penalties, selects a traceback path, and outputs the alignment. For example, aligning report/dc00321 with report/dc04369 yields two parts: the fixed field IF report/dc0, and the variation fields VF 0321 and 4369. Because the separator does not change and the first three characters of the variation field change little, the N-W method cannot tell them apart from the keyword. Even with a three-way comparison, for example of report/dc00321, report/dc04369, and report/da45813, the IF of the comparison result is still report/dc0, since a multi-way comparison can only accommodate the sequences that are mostly identical.
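The report/dc00321 example can be reproduced with a short sketch of the classic Needleman-Wunsch algorithm. This is plain Needleman-Wunsch with fixed match, mismatch, and gap scores chosen for illustration; it is not the improved scoring system of the invention, and the function names and the `min_len` filter are assumptions of this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns the aligned index pairs (i, j) of a and b."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * gap
    for j in range(1, m + 1):
        M[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)
    # Traceback from the bottom-right cell, preferring the diagonal move.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def fixed_fields(a, b, min_len=2):
    """Return (IF, IF_s, IF_l) for each run of identical, contiguous aligned characters."""
    runs, run = [], []
    for i, j in needleman_wunsch(a, b):
        if a[i] == b[j] and (not run or (i, j) == (run[-1][0] + 1, run[-1][1] + 1)):
            run.append((i, j))
        else:
            if run:
                runs.append(run)
            run = [(i, j)] if a[i] == b[j] else []
    if run:
        runs.append(run)
    return [(a[r[0][0]:r[-1][0] + 1], r[0][0], len(r)) for r in runs if len(r) >= min_len]


fields = fixed_fields("report/dc00321", "report/dc04369")
```

Here IF_s is measured from the first character of the first sequence, and single-character coincidences (such as the shared digit 3) are dropped by the `min_len` filter, which plays a role loosely comparable to discarding low-weight IFs.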

The track set FlowSet is input and first arranged in descending order of Flow length to obtain DFlowSet. DFlow1 and DFlow2 are then taken from DFlowSet in turn until DFlowSet is empty; each pair DFlow1, DFlow2 is fed to the Needleman-Wunsch method to obtain the IFs, and IF_l, IF_s, and VF are recorded.
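The ordering and pairing described above can be sketched in a few lines. The function name `pair_by_length` is hypothetical, and dropping an unpaired last flow when the set has an odd size is an assumption of this example; the text does not say how a leftover flow is handled.

```python
def pair_by_length(flows):
    """Sort flows by length, longest first, and group neighbours two by two."""
    ordered = sorted(flows, key=len, reverse=True)  # corresponds to DFlowSet
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


# Hypothetical flow set of one cluster.
flow_set = ["report/dc00321", "login/u1", "report/dc04369", "login/u27"]
pairs = pair_by_length(flow_set)
```

Each resulting pair plays the role of (DFlow1, DFlow2) and would be passed to the alignment step.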

Preferably, the invention also improves Needleman-Wunsch to raise classification accuracy. The scoring system of the improved Needleman-Wunsch method follows three principles: match as much as possible, preferentially align consecutive fields, and produce gaps only against the first sequence. M_ij is the value in row i, column j of the scoring matrix and is calculated by formula (2), where w is the gap penalty and S_ij is the similarity between the i-th character of message a and the j-th character of message b; the scoring standard for S_ij is given by formula (3). Traceback starts from M_ij and selects the maximum among the cells above, to the left, and to the upper left of M_ij; when cells have the same value, the upper-left cell and then the upper cell are preferred. Because S_ij scores consecutive matches, the longest and most similar fields are preferentially matched together, and gaps are produced only against the first sequence so that the positions of the keywords can be read directly from the matching result. The improved scoring system S_ij is given by formula (3).

$$M_{ij} = \max\{\, M_{i-1,j-1} + S_{ij},\; M_{i,j-1} + w,\; M_{i-1,j} + w \,\} \qquad (2)$$

In formula (3), a_{i-1} is the (i-1)-th character of message a, and b_{j-1} is defined similarly; k and p are constants that control the score for consecutive matches.

In one embodiment, step S2 is specifically included:

Specifically, when the length of fixed field (IF_l) is bigger, the probability for occurring keyword in fixed field is bigger, point The weight matched is bigger.When the length of fixed field reaches 32byte, show there are continuous 4 characters to appear in two tracks On, maximum value 3 is set by weight at this time.
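The length-to-weight mapping of the IF_l weighting method can be written directly as a small function. The thresholds follow the ranges stated for the weighting method in the summary (1 to 8, 9 to 16, 17 to 24, and 25 or more); the function name `if_weight` and the error raised for non-positive lengths are assumptions of this sketch.

```python
def if_weight(if_l):
    """Map a fixed-field length IF_l to its weight IF_w (0 to 3)."""
    if if_l < 1:
        raise ValueError("IF_l must be at least 1")
    if if_l <= 8:
        return 0
    if if_l <= 16:
        return 1
    if if_l <= 24:
        return 2
    return 3


weights = [if_weight(n) for n in (1, 8, 9, 16, 17, 24, 25, 32)]
```

A length of 32 falls in the last range and therefore receives the maximum weight of 3, in line with the paragraph above.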

In a specific implementation, the inherent shortcomings of the Needleman-Wunsch method can produce false matches that affect the experimental results. The flow chart of the noise cancellation method is shown in Figure 9. The method first deletes the weights of the IFs whose weight is 1, and then determines the data type of each IF. If the IF is a decimal number, then IF_w = IF_w - k*0.192*IF_w, where k is 2; if the IF is a hexadecimal number, then IF_w = IF_w - k*0.107*IF_w, where k is 1.5. According to empirical calculation, 0.192 is the average noise generated by two random decimal numbers, and 0.107 is the average noise generated by two random hexadecimal numbers.
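The noise correction can be sketched as follows. This is an illustration under assumptions: the text does not say how an IF is represented or how its data type is determined, so the sketch treats an IF as a string and tests whether all of its characters are decimal or hexadecimal digits; "deleting" a weight of 1 is modelled as setting it to 0. The function name `cancel_noise` is hypothetical.

```python
import string


def cancel_noise(if_value: str, if_w: float) -> float:
    """Return the corrected weight IF_w for one fixed field."""
    # Step 1: weights equal to 1 are removed (modelled here as 0, an assumption).
    if if_w == 1:
        return 0.0
    # Step 2: decimal content, k = 2 and average noise 0.192.
    if if_value and all(c in string.digits for c in if_value):
        k = 2
        return if_w - k * 0.192 * if_w
    # Step 3: hexadecimal content, k = 1.5 and average noise 0.107.
    if if_value and all(c in string.hexdigits for c in if_value):
        k = 1.5
        return if_w - k * 0.107 * if_w
    # Other content is left unchanged (assumption).
    return if_w
```

The decimal test runs before the hexadecimal test because every decimal digit is also a valid hexadecimal digit.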

In step 2.3, curve fitting refers to computing, from a set of many discrete points, a distribution function that is convenient for computer processing. This distribution function can eliminate erroneous data and interference, compensate for missing data, and predict future trends.

In one embodiment, step S3 specifically includes:

Specifically, the curve obtained by B-spline fitting is composed of complex polynomials, and the Newton-CG method is an optimization algorithm for finding curve extrema. Given a starting point and the first derivative, the method iteratively approximates the objective function by a quadratic function and quickly obtains an approximate extremum that meets the required precision.
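The extremum search of step S3.1 (sample the derivative at a small interval L, keep the points where f'(x_i) × f'(x_i+1) ≤ 0, and refine from those starting points) can be sketched as follows. Note the substitutions: a plain Newton iteration on f' stands in for the Newton-CG method, and the derivatives are supplied as ordinary functions rather than taken from a fitted B-spline. The function name `find_extrema` and its parameters are hypothetical.

```python
from typing import Callable, List


def find_extrema(df: Callable[[float], float], d2f: Callable[[float], float],
                 start: float, stop: float, step: float,
                 tol: float = 1e-9, max_iter: int = 50) -> List[float]:
    """Locate extrema of f from its first (df) and second (d2f) derivatives."""
    n = int((stop - start) / step) + 1
    xs = [start + i * step for i in range(n)]
    # Starting points: sample points where the derivative changes sign.
    seeds = [xs[i] for i in range(n - 1) if df(xs[i]) * df(xs[i + 1]) <= 0]
    extrema: List[float] = []
    for x in seeds:
        for _ in range(max_iter):
            curvature = d2f(x)
            if curvature == 0:
                break  # Newton step undefined; keep the current estimate
            x_next = x - df(x) / curvature
            converged = abs(x_next - x) < tol
            x = x_next
            if converged:
                break
        # Two neighbouring seeds may converge to the same extremum.
        if all(abs(x - e) > 1e-6 for e in extrema):
            extrema.append(x)
    return extrema
```

For example, with the illustrative function f(x) = x^3 - 3x, calling `find_extrema(lambda x: 3 * x * x - 3, lambda x: 6 * x, -2.0, 2.0, 0.3)` yields two points close to -1 and 1.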

Step 3.2 performs the IF classification statistics method. x_a,i denotes a minimum point and x_b,i+1 denotes a maximum point. Experiments show that the IFs are concentrated within a particular interval, which the present invention refers to as the IF distribution interval. The IF classification statistics method takes the IF distribution intervals, IF_s, IF_l and IF as input. In the first step, the IFs in each IF distribution interval are sorted by IF_l from largest to smallest using the quicksort method, giving the sorted IFs. In the second step, starting from the first IF, an unlabeled IF is selected and the Levenshtein distance to each subsequent IF is calculated in turn; when the Levenshtein distance meets the threshold, the IF is considered to belong to a known class and is labeled with that class, otherwise the IF is added to a new class. In the third step, the second step is repeated until all IFs are classified. Finally, the IF distribution intervals, IF types and IF quantities are output. The algorithm of the IF classification statistics method is shown in Figure 10.

The Levenshtein distance is a common kind of edit distance. The Levenshtein distance method takes two character strings A and B as input, converts string A into string B under the rule that each operation changes one character, and outputs the number of operations.
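The Levenshtein distance and the classification procedure built on it can be sketched as follows. The distance function is the standard dynamic-programming formulation. The grouping function is a simplified reading of the three steps above: IFs are sorted by length in descending order, and each IF joins the first existing class whose representative is within the threshold, otherwise it starts a new class. Interpreting "meets the threshold" as "distance ≤ threshold", and the names `levenshtein` and `classify_ifs`, are assumptions.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def classify_ifs(ifs: List[str], threshold: int) -> Dict[str, int]:
    """Group the IFs of one distribution interval; return {class representative: count}."""
    ordered = sorted(ifs, key=len, reverse=True)  # longest IF first
    classes: List[List[str]] = []                 # first member is the representative
    for item in ordered:
        for members in classes:
            if levenshtein(item, members[0]) <= threshold:
                members.append(item)              # label with the known class
                break
        else:
            classes.append([item])                # start a new class
    return {members[0]: len(members) for members in classes}
```

The returned mapping corresponds to the IF types and IF quantities that the method outputs for one IF distribution interval.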

In one embodiment, step S4 specifically includes:

In one embodiment, step S5 specifically includes:

Embodiment two

This embodiment provides a keyword-based network path classification device. Referring to Figure 11, the device includes:

a keyword inference module, configured to select, from the second track clusters of the trajectory clustering module, the track cluster corresponding to the target security protocol; then to pass the track cluster of the target security protocol, as input, through the track segmentation module, the IF distribution solving module and the IF classification module in sequence, obtaining the types of IF contained in the extremum intervals corresponding to the target security protocol and the number of IFs of each type; and then to use the separator inference method to infer the separators by comparing the heads and tails of adjacent IFs, separate the keywords from the IFs according to the separators, and finally form a signature database;

In one implementation, the track segmentation module 201 is specifically configured to perform the following steps:

In one implementation, the IF distribution solving module 202 is specifically configured to perform the following steps:

In one implementation, the IF classification module 203 is specifically configured to perform the following steps:

In one implementation, the trajectory clustering module 204 is specifically configured to perform the following steps:

In one implementation, the keyword inference module 205 is specifically configured to perform the following steps:

The device described in Embodiment two of the present invention is the device used to implement the keyword-based network path classification method of Embodiment one. Based on the method described in Embodiment one, a person skilled in the art can understand the specific structure and variations of the device, so details are not repeated here. Any device used to carry out the method of Embodiment one falls within the scope of protection of the present invention.

Embodiment three

Based on the same inventive concept, the present application also provides a computer-readable storage medium 300, referring to Figure 12, on which a computer program 311 is stored; when executed, the program implements the method of Embodiment one.

The computer-readable storage medium described in Embodiment three of the present invention is the computer-readable storage medium used to implement the keyword-based network path classification method of Embodiment one. Based on the method described in Embodiment one, a person skilled in the art can understand the specific structure and variations of the computer-readable storage medium, so details are not repeated here. Any computer-readable storage medium used to carry out the method of Embodiment one falls within the scope of protection of the present invention.

Embodiment four

Based on the same inventive concept, the present application also provides a computer device, referring to Figure 13, comprising a memory 401, a processor 402, and a computer program 403 stored in the memory and executable on the processor; the processor 402 implements the method of Embodiment one when executing the program.

The computer device described in Embodiment four of the present invention is the computer device used to implement the keyword-based network path classification method of Embodiment one. Based on the method described in Embodiment one, a person skilled in the art can understand the specific structure and variations of the computer device, so details are not repeated here. Any computer device used to carry out the method of Embodiment one falls within the scope of protection of the present invention.

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from their spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A keyword-based network path classification method, characterized by comprising:

Step S1: performing a preliminary classification of the input mixed-protocol tracks using a K-means method based on flow statistical features to obtain K first track clusters; within each cluster, sorting the track flows in descending order of length, using the Needleman-Wunsch method to compare protocol tracks of similar length pairwise, dividing each track into fixed fields IF and variable fields VF, and calculating the length IF_l of each fixed field and the position information IF_s of the fixed field;

Step S2: weighting the length IF_l of each fixed field using the IF_l weighting method to obtain the weight IF_w of the IF; then, using a curve-fitting method with IF_w and IF_s as input, fitting a curve to the IFs to obtain the IF position distribution curve;

Step S3: for the fitted IF position distribution curve, solving for the curve extrema using a curve-extremum solving method, extracting the IFs in each extremum interval of the curve, and performing IF classification statistics based on the Levenshtein distance; then outputting the types of IF contained in each extremum interval and the number of IFs of each type;

Step S4: marking all tracks according to the most numerous IF in each interval, then clustering them using the K-Means method and outputting the second track clusters;

Step S5: selecting, from the second track clusters of step S4, the track cluster corresponding to the target security protocol; then, taking the track cluster of the target security protocol as input, performing steps S1 to S3 in sequence to obtain the types of IF contained in the extremum intervals corresponding to the target security protocol and the number of IFs of each type; then using the separator inference method to infer the separators by comparing the heads and tails of adjacent IFs, separating the keywords from the IFs according to the separators, and finally forming a signature database;

Step S6: using the resulting signature database to mark the track flows to be processed and convert them into vectors, and classifying them by using the converted vectors as the input of the k-means method.

2. The method as claimed in claim 1, characterized in that step S1 specifically includes:

Step S1.1: inputting the mixed network track flows <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;

Step S1.2: using <Cluster, FlowID, Flow> as the input of the track inverted-order arrangement method; for each track flow Flow in each first track cluster Cluster, calculating the length Flow_length of the Flow, then using the quicksort method to sort the Flows by Flow_length in ascending order and form a queue Sequence<Cluster, FlowID, Flow>;

Step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and variable fields VF of Flow_i and Flow_j; then counting the length of each IF to obtain IF_l, and the distance of each IF from the starting point of Flow_i to obtain IF_s.

3. The method as claimed in claim 1, characterized in that step S2 specifically includes:

Step S2.1: weighting the length IF_l of each fixed field using the IF_l weighting method to obtain the weight IF_w of the IF, the weight assignment of the IF_l weighting method being:

wherein the greater the length IF_l of a fixed field, the greater the probability that a keyword appears in the fixed field and the larger the weight assigned: when the length of the fixed field is between 1byte and 8byte, the weight is 0; when it is between 9byte and 16byte, the weight is 1; when it is between 17byte and 24byte, the weight is 2; and when it is greater than or equal to 25byte, the weight is 3;

Step S2.2: applying the noise cancellation method to the weights calculated in step S2.1 to eliminate background noise and obtain the corrected IF_l weights;

Step S2.3: fitting the corrected IF_l weights and the fixed field lengths using a preset B-spline curve to obtain the IF position distribution curve, the curve equation of the B-spline being given by formula (1):

In formula (1), d_i (i = 0, 1, ..., n) denotes the control points and N_i,k(u) (i = 0, 1, ..., n) denotes the normalized B-spline basis functions of degree k.

4. The method as claimed in claim 3, characterized in that step S3 specifically includes:

Step S3.1: computing the derivative f′(x) of the IF position distribution curve; taking an array of points x_i spaced at interval L over the domain, where L is small, and evaluating f′(x_i); then selecting the x_i that satisfy f′(x_i) × f′(x_i+1) ≤ 0 to form an array, and using this array of x_i as the starting points of the Newton-CG method to solve for the extreme points;

Step S3.2: sorting the IFs in each IF distribution interval by IF_l from largest to smallest using the quicksort method to obtain the sorted IFs; then, starting from the first IF, selecting an unlabeled IF and calculating the Levenshtein distance to each subsequent IF in turn; when the Levenshtein distance meets the threshold, labeling the IF with the known class, otherwise adding the IF to a new class; repeating this Levenshtein-distance-based labeling step until all IFs are classified; and finally outputting the IF distribution intervals, IF types and IF quantities.

5. The method as claimed in claim 4, characterized in that step S4 specifically includes:

Step S4.1: taking the IF distribution intervals, IF types and IF quantities as input, selecting the most numerous IF in each IF distribution interval, marking all tracks according to the selected IFs and IF distribution intervals, and converting them into IF distribution vectors IFVector;

6. The method as claimed in claim 1, characterized in that step S5 specifically includes:

S5.1: taking the second track clusters generated in step S4 as input, performing the track segmentation method of step S1 to obtain the corresponding fixed fields IF, the lengths IF_l of the fixed fields and the position information IF_s of the fixed fields;

S5.2: taking the fixed fields IF, the lengths IF_l and the position information IF_s obtained in step S5.1 as input, performing the IF distribution fitting method of step S2 to obtain the IF position distribution curve;

S5.3: taking the position distribution curve from step S5.2 as input, performing the IF classification method of step S3 and outputting all IFs in each IF distribution interval;

S5.4: using the separator inference method, inputting all IFs in each distribution interval, marking the corresponding tracks with the IFs, counting and inferring the separators on the basis that separators appear at the heads and tails of keywords, and extracting the keywords in the IFs with the aid of the separators;

S5.5: storing the keywords extracted from the IFs and the inferred separators as the signature database for track flow classification.

7. A keyword-based network path classification device, characterized by comprising:

a track segmentation module, configured to perform a preliminary classification of the input mixed-protocol tracks using a K-means method based on flow statistical features to obtain K first track clusters; and, within each cluster, to sort the track flows in descending order of length, use the Needleman-Wunsch method to compare protocol tracks of similar length pairwise, divide each track into fixed fields IF and variable fields VF, and calculate the length IF_l of each fixed field and the position information IF_s of the fixed field;

an IF distribution solving module, configured to weight the length IF_l of each fixed field using the IF_l weighting method to obtain the weight IF_w of the IF; and then, using a curve-fitting method with IF_w and IF_s as input, to fit a curve to the IFs to obtain the IF position distribution curve;

an IF classification module, configured to solve for the extrema of the fitted IF position distribution curve using a curve-extremum solving method, extract the IFs in each extremum interval of the curve, and perform IF classification statistics based on the Levenshtein distance; and then to output the types of IF contained in each extremum interval and the number of IFs of each type;

a trajectory clustering module, configured to mark all tracks according to the most numerous IF in each interval, then cluster them using the K-Means method and output the second track clusters;

a keyword inference module, configured to select, from the second track clusters of the trajectory clustering module, the track cluster corresponding to the target security protocol; then to pass the track cluster of the target security protocol, as input, through the track segmentation module, the IF distribution solving module and the IF classification module in sequence, obtaining the types of IF contained in the extremum intervals corresponding to the target security protocol and the number of IFs of each type; and then to use the separator inference method to infer the separators by comparing the heads and tails of adjacent IFs, separate the keywords from the IFs according to the separators, and finally form a signature database;

a track classification module, configured to use the resulting signature database to mark the track flows to be processed and convert them into vectors, and to classify them by using the converted vectors as the input of the k-means method.

8. The device as claimed in claim 7, characterized in that the track segmentation module is specifically configured to perform the following steps:

9. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed, the program implements the method as claimed in any one of claims 1 to 6.

10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method as claimed in any one of claims 1 to 6 when executing the program.