CN110519128A

CN110519128A - An operating system identification method based on random forest

Info

Publication number: CN110519128A
Application number: CN201910893976.9A
Authority: CN
Inventors: 范建存; 张子豪; 樊志甲; 李瀛
Original assignee: Xian Jiaotong University; Beijing NSFocus Information Security Technology Co Ltd
Current assignee: Xian Jiaotong University; Beijing NSFocus Information Security Technology Co Ltd
Priority date: 2019-09-20
Filing date: 2019-09-20
Publication date: 2019-11-29
Anticipated expiration: 2039-09-20
Also published as: CN110519128B

Abstract

The invention discloses an operating system identification method based on random forest. A Monte Carlo method is used to randomly sample a fingerprint library into a training set and a test set, which are then vectorized. Value-range attributes undergo data passivation (coarsening of their values) by binning. Following a preset layered architecture consisting of an operating system category identification layer, an operating system major version identification layer and an operating system detailed version identification layer, a random forest classifier is trained for each layer: multiple decision trees are built, and a tree is added to the random forest only if its test accuracy under its own out-of-bag estimate exceeds a preset accuracy threshold. The layered architecture is trained locally, with parameter tuning and data processing to improve model accuracy. Real probe traffic is then identified: every tree in the random forest gives a classification result, and under equal-weight voting the category with the most votes is taken as the final prediction. The method can effectively identify unknown fingerprints and improves identification accuracy.

Description

An operating system identification method based on random forest

Technical field

The invention belongs to the field of computer technology, and in particular relates to an operating system identification method based on random forest.

Background technique

With the rapid spread of the Internet, network security has become increasingly important. Operating system detection and identification is of great significance to the assessment and protection of network security, and is an important step in asset identification.

At present, most probing tools rely mainly on a library of known operating system fingerprints and use traditional static fingerprint matching, which has difficulty identifying unknown fingerprints. Introducing machine learning algorithms to further mine, from the features, the necessary and sufficient conditions of a fingerprint can effectively solve the unknown-fingerprint identification problem and ensures fingerprint reliability in a higher-dimensional space. Zou Tiezheng proposed an operating system identification method based on support vector machines that classifies by constructing a large number of binary classifiers. This method has limitations, however: as the number of operating system types grows, the training overhead of the support vector machine approach becomes too large and application performance becomes a bottleneck.

Summary of the invention

In view of the above deficiencies in the prior art, the technical problem to be solved by the invention is to provide an operating system identification method based on random forest that builds a highly reliable and easily extended random forest fingerprint structure and obtains a broadly applicable fingerprint model for identifying a wide range of software operating systems and Internet of Things devices. The method also supports training private models for a user's own environment.

The invention adopts the following technical solution:

An operating system identification method based on random forest, comprising the following steps:

S1. Using a Monte Carlo method, analyze a third-party fingerprint library to determine the feature attributes used for training, the value range of each attribute and the fingerprint sets most likely to occur; randomly sample the fingerprint library to form a training set and a test set; and vectorize the data of both sets;

S2. Because some feature attributes in the third-party fingerprint library are value ranges, apply data passivation to the value-range dimensions by binning;

S3. Based on the preset layered architecture, train a random forest classifier for each of the operating system category identification layer, the operating system major version identification layer and the operating system detailed version identification layer. Multiple decision trees are built; the test accuracy of each tree under its own out-of-bag estimate is compared with a preset accuracy threshold, and the tree is added to the random forest if it exceeds the threshold;

S4. Train the layered architecture locally, and tune parameters and process data to improve model accuracy;

S5. Perform identification on real probe traffic: every tree in the random forest gives a classification result, and under equal-weight voting the category with the most votes is selected as the final prediction.

Specifically, step S1 is built on the Nmap fingerprint library. Nmap sends 16 probe packets, which produce the corresponding response sequences, and each response sequence carries its own set of flag fields. The 16 packets yield 119 sequence flag fields in total; SEQ.SP and SEQ.ISR are removed and the remaining 117 flag fields are used as the feature attributes of the training data. Each operating system corresponds to one or more rule sets, and each rule set consists of the sequence groups SEQ, OPS, WIN, T1, IE, U1, ECN and T2-T7. The value of each flag field in a sequence group is either a value range or several alternative values. Fingerprint data is unpacked from each rule set as follows: first obtain the full set of values of the fingerprint sequence; if the full set contains more than 500 values, randomly select 4 samples from it; if it contains more than 10 and fewer than 100 values, randomly select 2 samples; if it contains fewer than 10 values, randomly select 1 sample. The Cartesian product of the samples drawn from all response sequences in the rule set forms the simulated data set for that rule. Nominal data is mapped to numbers by natural-number encoding for vectorization.

Further, sending 6 TCP SYN probe packets produces the four test response sequences SEQ, OPS, WIN and T1. SEQ holds the results of sequence analysis of the probe packets, including the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the TCP ISN counter rate (ISR), the IP ID sequence generation algorithms (TI, CI, II), the shared IP ID sequence Boolean (SS) and the TCP timestamp option algorithm (TS); OPS holds the TCP options received for each probe packet; WIN holds the TCP initial window size received for each probe packet; T1 holds the test values for packet 1, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP RST data checksum (RD) and TCP miscellaneous quirks (Q);

Sending 2 ICMP echo probe packets produces the IE sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG) and the ICMP response code (CD);

Sending 1 TCP SYN probe packet captures the features of TCP explicit congestion notification (ECN) and produces the ECN response sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP options (O) and TCP miscellaneous quirks (Q);

Sending 6 TCP probe packets produces the six response sequences T2-T7, each including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP options (O), the TCP RST data checksum (RD) and TCP miscellaneous quirks (Q);

Sending 1 UDP probe packet to a closed port produces the sequence U1, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the IP total length (IPL), whether the unused field of the port unreachable message is nonzero (UN), the returned probe IP total length (RIPL), the returned probe IP ID value (RID), the integrity of the returned probe IP checksum value (RIPCK), the integrity of the returned probe UDP checksum (RUCK) and the integrity of the returned UDP data (RUD).
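As a quick consistency check on the probe counts described above, the following Python sketch tabulates the probes and the response sequence groups they produce. The name `PROBE_PLAN` and its layout are illustrative assumptions and are not part of Nmap or of the claimed method; only the packet counts and group names come from the text.

```python
# Hypothetical table of the probe groups described in the text:
# (probe type, number of packets sent, response sequence groups produced).
PROBE_PLAN = [
    ("TCP SYN", 6, ["SEQ", "OPS", "WIN", "T1"]),
    ("ICMP echo", 2, ["IE"]),
    ("TCP SYN with ECN", 1, ["ECN"]),
    ("TCP", 6, ["T2", "T3", "T4", "T5", "T6", "T7"]),
    ("UDP to closed port", 1, ["U1"]),
]


def total_packets(plan):
    """Total number of probe packets sent."""
    return sum(count for _, count, _ in plan)


def sequence_groups(plan):
    """Flat list of response sequence groups, in probe order."""
    return [group for _, _, groups in plan for group in groups]


print(total_packets(PROBE_PLAN))         # 16
print(len(sequence_groups(PROBE_PLAN)))  # 13
```

The totals agree with the 16 probe packets and 13 response sequence groups stated in the description.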

Specifically, step S2 comprises:

S201. Analyze the features extracted from the third-party fingerprint library and, taking into account how the simulated data is generated by random sampling, determine the attributes that lose information because their values are sampled incompletely;

S202. According to the distribution of these attributes in the third-party fingerprint library, determine a reasonable number of bins and a binning scheme, apply binning passivation to these attributes in the original data set, and generate the corresponding passivation scale file;

S203. During subsequent training and prediction, bin the original training and test sets according to the same passivation scale.

Specifically, step S3 comprises:

S301. If the training set has n samples, draw n samples from the current training set with replacement by bootstrap sampling. About 63.2% of the samples in the original training set are drawn into the new sample set, which serves as the independent training set of the base learner; the roughly 36.8% of samples that are not drawn serve as the test set of that base learner for out-of-bag estimation;

S302. Compute the information entropy of the current sample set:

$$\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where D is the training sample set, n is the number of classes in the current sample set, and p_i is the proportion of samples of class i in the total sample;

S303. Randomly select a candidate attribute subset containing k attributes from the attribute set of the current sample set and compute the conditional entropy of each feature in the candidate subset. The conditional entropy formula is:

$$\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} p_i \, \mathrm{Ent}(D \mid A = a_i)$$

where D is the current sample set, A is a candidate splitting attribute of the sample set with n possible values (a₁, a₂, ..., a_n), and p_i is the proportion of samples for which feature A takes the value a_i;

S304. Compute the information gain of each feature. Because using the information gain ratio as the splitting criterion tends to favor continuous-valued features, the information gain of a continuous feature is corrected by subtracting log₂(N-1)/|D|, where N is the number of possible split points and |D| is the data set size. For feature A the information gain is gain(D, A) = Ent(D) - Ent(D | A). For a continuous attribute, the split point with the largest corrected information gain is selected as the best split point of that attribute;

S305. Among the features in the current candidate attribute subset, select those whose information gain is above the mean information gain of all features in the subset and above the preset information gain threshold, compute their information gain ratios, and choose the feature with the highest information gain ratio as the splitting feature;

S306. Split the data set according to the values of the splitting feature and recursively apply the above operations to each branch, stopping when any of the following three conditions holds: all training samples at the current node belong to one class, in which case the node is labeled with that class and becomes a leaf node; no attributes remain that can be used to split the samples at the current node; or the number of training samples at the current node is below the minimum sample count threshold.
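The entropy quantities used in steps S302 to S304 can be illustrated for a discrete feature with a minimal Python sketch. The function names and the toy data are assumptions made for illustration; this is not the patented implementation, only a direct transcription of Ent(D), Ent(D | A) and gain(D, A) as defined above.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Ent(D) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def conditional_entropy(values, labels):
    """Ent(D | A): entropy of each subset A = a_i, weighted by its share p_i."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    total = len(labels)
    return sum(len(s) / total * entropy(s) for s in subsets.values())


def information_gain(values, labels):
    """gain(D, A) = Ent(D) - Ent(D | A)."""
    return entropy(labels) - conditional_entropy(values, labels)


# Toy data (hypothetical): a TTL-like feature that separates the two classes
# perfectly, and a second feature that carries no information about them.
labels = ["Linux", "Linux", "Windows", "Windows"]
informative = [64, 64, 128, 128]
uninformative = [1, 2, 1, 2]
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

On this toy data the informative feature recovers the full one bit of class entropy, while the uninformative feature yields a gain of zero.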

Further, in step S303, for a discrete feature, the sample set is divided into subsets according to each value of the feature, the information entropy of each subset is computed, and the conditional entropy is then obtained. For a continuous feature, the attribute values are first sorted; a candidate cut point is placed at the midpoint of two consecutive values only where the class label changes and the difference between the two values exceeds the threshold set for that attribute. The continuous attribute is thereby converted into a categorical attribute for the calculation.
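The candidate cut point rule for continuous features can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `candidate_cut_points` and the parameter `min_gap` (standing in for the per-attribute threshold) are hypothetical, and the sample values are invented.

```python
def candidate_cut_points(values, labels, min_gap):
    """Midpoints between consecutive sorted values where the class label
    changes and the two values differ by more than min_gap."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v0, y0), (v1, y1) in zip(pairs, pairs[1:]):
        if y0 != y1 and (v1 - v0) > min_gap:
            cuts.append((v0 + v1) / 2)
    return cuts


values = [30, 32, 64, 128]
labels = ["A", "A", "B", "C"]
print(candidate_cut_points(values, labels, min_gap=5))   # [48.0, 96.0]
print(candidate_cut_points(values, labels, min_gap=40))  # [96.0]
```

Restricting candidates to label boundaries with a minimum gap keeps the number of split points N small, which also limits the log₂(N-1)/|D| correction applied to continuous features.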

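Further, in step S305, for feature A the information gain ratio is computed as

$$\mathrm{gain\_ratio}(D, A) = \frac{\mathrm{gain}(D, A)}{\mathrm{split\_info}(D, A)}$$

where the split information is

$$\mathrm{split\_info}(D, A) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

with p_i the proportion of samples for which feature A takes the value a_i.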
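The selection rule of step S305 can be sketched in Python as below. The function names, the dictionary-based interface and the example numbers are assumptions for illustration; the gain ratio follows the standard C4.5 definition (information gain divided by split information), which is what the surrounding text describes.

```python
import math
from collections import Counter


def split_info(values):
    """Split information: -sum_i p_i * log2(p_i) over the value shares of A."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(gain, values):
    """Information gain divided by split information (0 if A has one value)."""
    si = split_info(values)
    return gain / si if si > 0 else 0.0


def select_split_feature(gains, columns, gain_threshold):
    """Keep features whose gain exceeds both the mean gain of the candidate
    subset and gain_threshold, then pick the one with the highest gain ratio.
    Returns None when no feature qualifies."""
    mean_gain = sum(gains.values()) / len(gains)
    eligible = [f for f, g in gains.items()
                if g > mean_gain and g > gain_threshold]
    if not eligible:
        return None
    return max(eligible, key=lambda f: gain_ratio(gains[f], columns[f]))


# Hypothetical candidate subset with precomputed information gains.
gains = {"ttl": 1.0, "win": 0.2}
columns = {"ttl": [64, 64, 128, 128], "win": [1, 2, 1, 2]}
print(select_split_feature(gains, columns, gain_threshold=0.1))  # ttl
```

Filtering on information gain first and ranking by gain ratio second avoids choosing a feature whose ratio is high only because its split information is very small.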

Specifically, in step S3 the first layer, the operating system category layer, is constructed as follows. First, m records are drawn from the generated simulated data set by stratified sampling on the class label to form a new training set Mset, and test accuracy is obtained by out-of-bag estimation. According to the usage share of each operating system and the identification requirements, n major class labels {x₁, x₂, ..., x_n} are determined. The finest-grained labels in the original training set that belong to one of these n major classes are remapped to the coarse-grained class label x_i, and the labels of operating systems that belong to none of the n major classes are mapped to the type Others. The remapped data set is fed to the random forest method to train the OS category classifier. An accuracy threshold k1 is set for the operating system category classifier and the classifier is tested on the out-of-bag data. When the accuracy reaches the threshold, construction of the first-layer classifier is complete; if it does not, the algorithm parameters are adjusted or the amount of training data is increased and the classifier is retrained until the accuracy reaches the threshold;

The second layer, the major version identification layer, is constructed as follows. On the basis of the n major class labels defined by the first layer, the training set Mset is divided by major class into n groups of data, and the class labels in each group are mapped to the form "category plus major version number". Each group is trained with the random forest method, producing n major version classifiers. The accuracy threshold of the major version classifiers is set to k2 and testing uses out-of-bag estimation, with each group's test data assessing the accuracy of its own major version classifier. A classifier whose accuracy reaches the threshold is retained; one that does not is retrained, after parameter tuning or after enriching the training samples from the fingerprint library, until it reaches the threshold. The n major version classifiers make up the second layer of the layered architecture;

The third layer is the detailed version identification layer. Let v be the number of major version labels produced by the second-layer mapping. The training set Mset is divided by second-layer major version label into v groups of data, and each group is trained with the random forest algorithm, producing v detailed version classifiers. The accuracy threshold of the detailed version classifiers is set to k3 and each classifier is assessed by out-of-bag estimation. If the accuracy of a classifier is below the threshold, it is retrained, after parameter tuning or after resampling to enrich the data, until it meets the threshold. The accuracy thresholds should satisfy k1 > k2 > k3.
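The label remapping and data partitioning behind the three layers can be sketched in Python as follows. Everything here is an illustrative assumption: fine-grained labels are modeled as hypothetical `(family, major, detail)` tuples, and the helper names and sample data are invented. The sketch only shows how one training set yields the per-layer groups; the classifier training itself is omitted.

```python
from collections import defaultdict


def coarse_label(fine_label, major_classes):
    """Layer 1 target: the major class x_i, or "Others" if not among them."""
    family = fine_label[0]
    return family if family in major_classes else "Others"


def major_version_label(fine_label):
    """Layer 2 target: category plus major version number."""
    return f"{fine_label[0]} {fine_label[1]}"


def group_by(samples, key):
    """Partition (vector, fine_label) samples by key(fine_label)."""
    groups = defaultdict(list)
    for vector, label in samples:
        groups[key(label)].append((vector, label))
    return dict(groups)


# Hypothetical training samples: (feature vector, (family, major, detail)).
samples = [
    ([1], ("Linux", "2.6", "2.6.32")),
    ([2], ("Linux", "3", "3.10")),
    ([3], ("Windows", "10", "1809")),
    ([4], ("Plan9", "4", "4.0")),
]
major_classes = {"Linux", "Windows"}

by_category = group_by(samples, lambda lab: coarse_label(lab, major_classes))
by_major_version = group_by(samples, major_version_label)
print(sorted(by_category))       # groups that feed the layer 2 classifiers
print(sorted(by_major_version))  # groups that feed the layer 3 classifiers
```

Because each group is trained on its own, adding a new operating system only touches the classifiers whose groups change, which is the local-adjustment property the layered architecture relies on.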

Specifically, in step S4, probe sequence packets are sent to the target host to be probed, using the same packet specification as the third-party probing tool, to obtain the probe response sequences; the fingerprint is extracted from them, converted to a numeric vector and input to the layered model, and the corresponding prediction is obtained at the required identification granularity. To extend the model to special systems, their system labels and IP lists are obtained, probe packets are sent to those IPs to collect the fingerprint data of their responses, and the fingerprint data is mapped to vectors and stored in the training set. Once the newly added training data reaches the preset threshold, the model is retrained, which extends the fingerprint model to the special systems.

Specifically, in step S5, probes are sent to the target host using the same packet specification as Nmap probing: 6 TCP SYN probe packets to produce the four test response sequences SEQ, OPS, WIN and T1; 2 ICMP echo probe packets to produce the IE sequence; 1 TCP SYN probe packet to capture the features of TCP explicit congestion notification; 6 TCP probe packets to produce the six response sequences T2-T7; and 1 UDP probe packet to a closed port to produce the sequence U1. These 13 response sequences are mapped to a numeric vector in the manner of step S1, passivated accordingly and input to the layered model, which returns the corresponding prediction at the required identification granularity. During prediction, the test data first enters the top-layer classifier, which identifies the major operating system category; the data then enters the second-layer classifier for that category, which identifies the major version number within the category; finally it enters the third-layer detailed version classifier for that major version, which yields the final detailed operating system information.
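The cascaded prediction with equal-weight voting can be sketched in Python as below. This is a minimal illustration, assuming each trained tree is available as a callable that maps a feature vector to a label; the function names and the stand-in lambda "trees" are hypothetical and replace the trained C4.5 trees of the actual method.

```python
from collections import Counter


def majority_vote(trees, x):
    """Equal-weight vote: every tree casts one vote; most common label wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def predict_hierarchical(x, category_forest, major_forests, detail_forests):
    """Cascade: category, then major version within it, then detailed version."""
    category = majority_vote(category_forest, x)
    major = majority_vote(major_forests[category], x)
    detail = majority_vote(detail_forests[major], x)
    return category, major, detail


# Stand-in "trees" that ignore their input, for illustration only.
category_forest = [lambda x: "Linux", lambda x: "Linux", lambda x: "Windows"]
major_forests = {"Linux": [lambda x: "Linux 3"] * 3}
detail_forests = {"Linux 3": [lambda x: "Linux 3.10",
                              lambda x: "Linux 3.2",
                              lambda x: "Linux 3.10"]}

result = predict_hierarchical([0, 1, 2], category_forest,
                              major_forests, detail_forests)
print(result)  # ('Linux', 'Linux 3', 'Linux 3.10')
```

Stopping the cascade after the first or second call gives a coarser answer, which is how a caller could select the identification granularity.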

Compared with the prior art, the invention has at least the following advantages:

Traditional static fingerprint matching is weak at identifying unknown operating system fingerprints and has high false positive and false negative rates. To address this, the invention proposes an operating system identification method based on the random forest algorithm that improves identification accuracy and can also effectively identify unknown fingerprints. Building on the rule sets of a third-party fingerprint library, the random forest algorithm further mines the necessary and sufficient conditions of operating system fingerprints, which ensures the reliability of identification in a higher-dimensional space. The layered training architecture identifies operating systems at different granularities, and because each layer is trained independently, identification of a particular operating system can be adjusted locally, avoiding the heavy cost of repeated full retraining. Where the training data is imbalanced, the layered architecture allows data to be merged or split sensibly to balance the data set. The layered training architecture is also easy to extend: when a class is added to the model, only some of the classifiers need retraining rather than the whole model, which approximates incremental training.

Further, building a general operating system classifier with strong identification ability and wide coverage requires sample data from many kinds of systems, yet actually setting up every operating system environment is infeasible. Unpacking a third-party fingerprint library sensibly, determining the attribute features and then generating simulated data solves this problem effectively.

Further, some feature attributes in the third-party fingerprint library are value ranges, and random sampling can hardly draw every value in such a range. Binning is therefore used to passivate the value-range dimensions and reduce the information loss caused by random sampling.

Further, the classification model is built with the random forest algorithm, which combines multiple decision trees under a certain strategy and thereby turns weak fingerprint models into a strong one, significantly improving the accuracy and stability of operating system fingerprint identification. In addition, the bootstrap sampling of random forest supports out-of-bag estimation for verifying algorithm performance, which reduces computing cost compared with approaches such as k-fold cross-validation.
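The 63.2% and 36.8% figures for bootstrap sampling follow from the probability that a given sample is never drawn in n draws with replacement, (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. The Python sketch below checks this numerically; the function names and the sample size are illustrative assumptions.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in-bag list, out-of-bag list)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


def expected_oob_fraction(n):
    """Probability that a given sample is never drawn: (1 - 1/n) ** n."""
    return (1 - 1 / n) ** n


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(10_000, rng)
print(len(out_of_bag) / 10_000)       # empirically close to 0.368
print(expected_oob_fraction(10_000))  # approaches 1/e as n grows
```

Each tree therefore gets its own held-out test set at no extra training cost, which is why out-of-bag estimation is cheaper than retraining k times for k-fold cross-validation.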

Further, each time a splitting attribute is selected, an attribute subset is first chosen at random from the attribute set of the node and the best splitting attribute is then selected from that subset; this attribute perturbation reduces model variance. The conditional entropy of each attribute computed in S303 is used to obtain each attribute's information gain ratio, the selection criterion for the best splitting attribute, and also provides the basis for choosing the best split point of a continuous attribute.

Further, when each node chooses its best splitting attribute, the information gain ratio is computed and the attribute with the largest information gain ratio is chosen. Using the information gain ratio as the splitting criterion effectively reduces the adverse effect of the information gain criterion's preference for attributes with many possible values.

Further, a three-layer training architecture is built that identifies operating systems at three granularities: major category, major version and detailed version. The three layers are trained independently, so identification of a particular operating system can be adjusted locally, avoiding the heavy cost of repeated full retraining. Where the training data is imbalanced, the layered architecture allows data to be merged and split sensibly to balance the data set. The layered training architecture is easy to extend: when a system class is added to the model, only some of the classifiers need retraining rather than the whole model, which approximates incremental training.

Further, to add new fingerprint data and fingerprint classes, step S4 can send probe packets to target hosts of known system type to collect fingerprint data, map it to vector data and add it to the training set; once a certain amount has accumulated, the model is retrained, so that the fingerprint model can be maintained and extended over the long term.

Further, during prediction carefully constructed packets are sent to the target host to obtain the response sequences, which are vectorized and input to the three-layer random forest model. Equal-weight voting combines the classification results of the base learners, which effectively addresses the poor generalization a single learner may suffer from a wrong choice, reduces model variance and bias, and thus achieves effective operating system identification.

In conclusion the present invention judges that operating system unknown fingerprint recognition capability is weak for traditional static fingerprint matching, The high problem of rate of false alarm, rate of failing to report proposes a kind of operating system recognition methods based on random forests algorithm, can effectively identify not Know fingerprint, improves the accuracy rate of identification.Using order training method framework, varigrained recognition capability, each level difference point are provided Class device parallel training, it is independent of one another with first-level class device, it is easy to local directed complete set, increases the scalability and maintainability of model.

The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.

Description of the drawings

Fig. 1 is a structure diagram of the layered random forest;

Fig. 2 is a flow chart of random forest training;

Fig. 3 is a chart of accuracy on real traffic.

Specific embodiment

The invention provides an operating system identification method based on random forest, in which the random forest is built on the C4.5 decision tree algorithm, comprising the following steps:

S1. Data preparation and feature extraction: using a Monte Carlo method, analyze a third-party fingerprint library to determine the feature attributes used for training, the value range of each attribute and the fingerprint sets most likely to occur; draw a sufficiently large number of random samples from the fingerprint library to form a training set and a test set; and vectorize the data of both sets;

The construction is based on the Nmap fingerprint library and follows the Nmap OS detection principle:

Nmap sends 16 probe packets, which produce the corresponding response sequences, and each response sequence carries a number of flag fields. The system category is determined by comparing how well the flag fields of the probed dynamic fingerprint match those in the static fingerprint library.

Sending 6 TCP SYN probe packets produces the four test response sequences SEQ, OPS, WIN and T1. SEQ holds the results of sequence analysis of the probe packets, including the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the TCP ISN counter rate (ISR), the IP ID sequence generation algorithms (TI, CI, II), the shared IP ID sequence Boolean (SS) and the TCP timestamp option algorithm (TS). OPS holds the TCP options received for each probe packet. WIN holds the TCP initial window size received for each probe packet. T1 holds the test values for packet 1, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP RST data checksum (RD) and TCP miscellaneous quirks (Q).

Sending 2 ICMP echo probe packets produces the IE sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG) and the ICMP response code (CD).

Sending 1 TCP SYN probe packet captures the features of TCP explicit congestion notification (ECN) and produces the ECN response sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP options (O) and TCP miscellaneous quirks (Q).

Sending 6 TCP probe packets produces the six response sequences T2-T7, each including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP options (O), the TCP RST data checksum (RD) and TCP miscellaneous quirks (Q).

Sending the 16 packets above yields 119 sequence flag fields in total. After analyzing the flag fields, the method uses 117 of them (all except SEQ.SP and SEQ.ISR) as the feature attributes of the training data. Because real traffic from the many operating systems needed for training is hard to obtain, a large amount of simulated data is generated by laboratory data simulation for training the algorithm. The Nmap fingerprint library consists of many rule sets; each operating system corresponds to one or more rule sets, and each rule set consists of the 13 sequence groups SEQ, OPS, WIN, T1, IE, U1, ECN and T2-T7. The value of each flag field in a sequence group is either a value range or several alternative values.

Fingerprint data is unpacked from each rule set. First obtain the full set of values of the fingerprint sequence: if the full set contains more than 500 values, randomly select 4 samples from it; if it contains more than 10 and fewer than 100 values, randomly select 2 samples; if it contains fewer than 10 values, randomly select 1 sample. The Cartesian product of the samples drawn from all response sequences in the rule set forms the simulated data set for that rule. Because some flag fields are nominal data, they are mapped to numbers by natural-number encoding so that the algorithm can process them as vectors.

In this way, the simulated data sets generated from all rule sets of the fingerprint library together make up the data set required for training.
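The simulated-data generation of step S1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a rule set is modeled as a hypothetical dictionary from sequence group name to the full list of its possible values, the function names are invented, and the `default` sample count covers set sizes for which the description gives no count. Encoding from 0 upward is likewise an assumption about what natural-number encoding means here.

```python
import itertools
import random


def samples_per_sequence(n_values, default=1):
    """Sample counts stated in the text; `default` is an assumption for
    sizes the text does not cover."""
    if n_values > 500:
        return 4
    if 10 < n_values < 100:
        return 2
    if n_values < 10:
        return 1
    return default


def simulate_rule_set(rule_set, rng):
    """Return the Cartesian product of the per-group random samples."""
    names = sorted(rule_set)
    drawn = []
    for name in names:
        values = rule_set[name]
        k = min(samples_per_sequence(len(values)), len(values))
        drawn.append(rng.sample(values, k))
    return [dict(zip(names, combo)) for combo in itertools.product(*drawn)]


def natural_number_encode(column):
    """Map nominal values to 0, 1, 2, ... in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in column]


# Hypothetical rule set: 50 possible T1 values, 3 possible WIN values.
rule_set = {"T1": list(range(50)), "WIN": ["a", "b", "c"]}
rows = simulate_rule_set(rule_set, random.Random(0))
print(len(rows))                                     # 2 samples x 1 sample
print(natural_number_encode(["S", "A", "S", "Z"]))   # [0, 1, 0, 2]
```

Sampling a few values per sequence group before taking the product keeps the simulated set small, since the full product over every possible value would grow multiplicatively with each group.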

S2. Data preprocessing: some feature attributes in the third-party fingerprint library are value ranges, and random sampling can hardly draw every value in such a range, so binning is used to passivate the value-range dimensions and reduce the information loss caused by random sampling;

S201. Analyze the features extracted from the third-party fingerprint library and, taking into account how the simulated data is generated by random sampling, determine the attributes that lose information because their values are sampled incompletely.

S202. According to the distribution of these attributes in the third-party fingerprint library, determine a reasonable number of bins and a binning scheme (generally equal-width or equal-frequency binning), apply binning passivation to these attributes in the original data set, and generate the corresponding passivation scale file.

S203. During subsequent training and prediction, the original training and test sets must all be binned according to the same passivation scale.

Based on analysis of the Nmap fingerprint library, the 8 attributes ECN.T and T1.T-T7.T are binned with an interval width of 10.
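The width-10 binning of the eight time-to-live attributes can be sketched in Python as below. The function names, the dictionary form of the passivation scale and the sample record are assumptions for illustration; in particular, attribute values are assumed to be already parsed as integers, and the scale is kept in memory rather than written to a file.

```python
def make_scale(width, attributes):
    """Passivation scale: attribute name -> bin width (equal-width binning)."""
    return {name: width for name in attributes}


def passivate(record, scale):
    """Replace each scaled attribute by the index of its equal-width bin."""
    out = dict(record)
    for name, width in scale.items():
        if name in out:
            out[name] = out[name] // width
    return out


# The 8 attributes named in the text: ECN.T and T1.T through T7.T.
TTL_ATTRIBUTES = ["ECN.T"] + [f"T{i}.T" for i in range(1, 8)]
scale = make_scale(10, TTL_ATTRIBUTES)

record = {"T1.T": 64, "ECN.T": 61, "WIN.W1": 5840}  # hypothetical fingerprint
print(passivate(record, scale))  # {'T1.T': 6, 'ECN.T': 6, 'WIN.W1': 5840}
```

Applying the same `scale` object at training and prediction time mirrors the requirement in S203 that both use one unified passivation scale.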

S3. Layered training stage: based on the preset layered architecture, random forest classifiers are trained in parallel for the operating system category identification layer, the operating system major version identification layer and the operating system detailed version identification layer;

The training algorithm builds on an optimized C4.5 decision tree and uses the random forest algorithm to improve the prediction accuracy and generalization ability of the model, as shown in Fig. 2.

S301. If the training set has n samples, draw n samples from the current training set with replacement by bootstrap sampling. When the sample size is large enough, about 63.2% of the samples in the original training set are drawn into the new sample set, which serves as the independent training set of the base learner; the roughly 36.8% of samples that are not drawn can serve as the test set of that base learner for out-of-bag estimation.

S302, compute the information entropy of the current sample set, $\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$, where D is the training sample set, n is the number of classes contained in the current sample set, and $p_i$ is the proportion of samples of class i in the whole sample set.
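A minimal sketch of the entropy computation, assuming the class labels are given as a plain sequence; the function name `entropy` and the example labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two equally frequent classes give the maximum entropy for n = 2.
balanced = entropy(["Windows", "Windows", "Linux", "Linux"])
# A sample set containing a single class carries no uncertainty.
pure = entropy(["Linux", "Linux", "Linux"])
```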

S303, randomly select a candidate attribute subset containing k attributes from the attribute set of the current sample set, and compute the conditional entropy of each feature in the candidate subset. For a discrete feature, divide the sample set into subsets according to each value of the feature, compute the information entropy of each subset, and from these obtain the conditional entropy. For a continuous feature, first sort the values of the attribute; only where the class label changes and the difference between two consecutive values exceeds the threshold set for that attribute is the midpoint of the two consecutive values taken as a candidate split point, so that the continuous attribute is converted into a discrete attribute for the calculation.
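The candidate split point rule for continuous features can be sketched as follows. The function name `candidate_splits`, the parameter `min_gap` (standing in for the per-attribute threshold), and the sample values are assumptions made for illustration.

```python
def candidate_splits(values, labels, min_gap: float = 0.0):
    """Return midpoints between consecutive sorted values where the class
    label changes and the gap between the two values exceeds min_gap."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    splits = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and (v2 - v1) > min_gap:
            splits.append((v1 + v2) / 2)
    return splits


# Hypothetical attribute values with their class labels.
values = [64, 128, 60, 255, 130]
labels = ["Linux", "Windows", "Linux", "Other", "Windows"]
splits = candidate_splits(values, labels, min_gap=3)
```

Restricting candidates to label changes with a sufficiently large gap keeps the number of split points, and therefore the cost of evaluating them, small.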

Let D be the current sample set and A a candidate split attribute of that set, where A has n possible values $(a_1, a_2, \ldots, a_n)$ and $p_i$ is the proportion of samples whose value of feature A is $a_i$. The conditional entropy for feature A is then $\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} p_i \, \mathrm{Ent}(D \mid A = a_i)$.

S304, compute the information gain of each feature. The information gain of a feature is the empirical entropy of the data set minus the conditional entropy of that feature. Because using the information gain ratio as the split criterion is biased toward continuous-valued features, the information gain of a continuous feature is corrected by subtracting $\log_2(N-1)/|D|$, where N is the number of possible split points and |D| is the size of the data set.

For feature A, the information gain is $\mathrm{gain}(D, A) = \mathrm{Ent}(D) - \mathrm{Ent}(D \mid A)$.
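A minimal sketch of conditional entropy and information gain for a discrete feature, assuming the feature column and the labels are parallel sequences; all function and variable names are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(feature_values, labels) -> float:
    """Ent(D|A): entropy of each subset with A = a_i, weighted by its share p_i."""
    groups = defaultdict(list)
    for a, y in zip(feature_values, labels):
        groups[a].append(y)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(feature_values, labels) -> float:
    """gain(D, A) = Ent(D) - Ent(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


os_label = ["Linux", "Linux", "Windows", "Windows"]
informative = information_gain([1, 1, 0, 0], os_label)    # separates the classes
uninformative = information_gain([1, 0, 1, 0], os_label)  # tells nothing
```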

For a continuous attribute, the split point with the largest corrected information gain is selected as the best split point of that attribute.
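The selection of the best split point with the corrected gain can be sketched as below. Two assumptions are made that the text does not state: each candidate threshold divides the samples into two groups (value at most t, value above t), and no penalty is applied when there is only one candidate, since log2(N - 1) is undefined there. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_corrected_split(values, labels, splits):
    """Return (split_point, corrected_gain) with the largest corrected gain.

    The raw gain of each threshold is reduced by log2(N - 1) / |D|, where N
    is the number of candidate split points and |D| the number of samples.
    """
    n_samples = len(labels)
    n_splits = len(splits)
    # Assumption: no penalty when N <= 1, where log2(N - 1) is undefined.
    penalty = math.log2(n_splits - 1) / n_samples if n_splits > 1 else 0.0
    base = entropy(labels)
    best = None
    for t in splits:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        if not left or not right:
            continue  # threshold does not actually divide the data
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / n_samples
        corrected = base - cond - penalty
        if best is None or corrected > best[1]:
            best = (t, corrected)
    return best


values = [60, 64, 128, 130]
labels = ["Linux", "Linux", "Windows", "Windows"]
best = best_corrected_split(values, labels, splits=[62.0, 96.0, 129.0])
```

In this example the threshold 96.0 separates the classes perfectly, so its raw gain of 1 is reduced only by the penalty log2(2)/4 = 0.25.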

S305, select the features whose information gain is higher than the mean information gain of all features in the current candidate attribute subset and also higher than the configured information gain threshold, compute their information gain ratios, and choose the feature with the highest information gain ratio as the split feature.

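For feature A, the information gain ratio is $\mathrm{gain\_ratio}(D, A) = \mathrm{gain}(D, A) / \mathrm{split\_info}(D, A)$, where the split information is $\mathrm{split\_info}(D, A) = -\sum_{i=1}^{n} p_i \log_2 p_i$ and $p_i$ is the proportion of samples whose value of feature A is $a_i$.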
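The S305 selection rule can be sketched for discrete features as follows. One deliberate deviation is labeled here: features whose gain equals the mean are kept (not only those strictly above it), an assumption made so that the candidate set is not empty when all gains are equal. The names `choose_split_feature` and `min_gain` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(items) -> float:
    """Shannon entropy (base 2) of a sequence of hashable items."""
    total = len(items)
    counts = Counter(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(column, labels) -> float:
    """gain(D, A) = Ent(D) - Ent(D|A) for a discrete feature column."""
    groups = defaultdict(list)
    for a, y in zip(column, labels):
        groups[a].append(y)
    cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - cond


def choose_split_feature(columns: dict, labels, min_gain: float = 0.0):
    """Pick the feature with the highest gain ratio among those whose gain is
    at least the candidate-subset mean (assumption) and above min_gain."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    mean_gain = sum(gains.values()) / len(gains)
    best_name, best_ratio = None, -1.0
    for name, g in gains.items():
        if g < mean_gain or g <= min_gain:
            continue
        split_info = entropy(columns[name])  # entropy of the feature's own values
        if split_info == 0:
            continue  # a single-valued feature cannot divide the data
        ratio = g / split_info
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name, best_ratio


# Hypothetical candidate subset with k = 3 attributes.
columns = {"DF": [1, 1, 0, 0], "W": [10, 20, 30, 40], "Q": [0, 0, 0, 0]}
labels = ["Linux", "Linux", "Windows", "Windows"]
name, ratio = choose_split_feature(columns, labels)
```

Both DF and W have an information gain of 1, but W takes four distinct values, so its split information is 2 and its gain ratio only 0.5; the ratio therefore favors DF.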

S306, divide the data set according to the values of the split feature and recursively repeat the above operations on each branch, stopping the division when any of the following three conditions is met: all training samples at the current node belong to one class, in which case the node is labeled with that class and becomes a leaf node; the samples at the current node have no attributes left that can be used for division; or the number of training samples at the current node is below the minimum sample count threshold.

A single decision tree is built in this way, and a random forest can be regarded as a set of such decision trees. Further trees are built in the same manner; the test accuracy of each tree, obtained from its own out-of-bag estimate, is compared with the configured accuracy threshold, and the tree is added to the random forest only if its accuracy exceeds the threshold. At prediction time, every tree in the forest gives a classification result, and with equal-weight voting the class that receives the most votes is taken as the final prediction.
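Tree admission by out-of-bag accuracy and equal-weight voting can be sketched as follows. The trees are represented by plain callables, and the accuracy figures are made-up illustrative values; ties are resolved in favor of the label that was seen first, which is an assumption the text does not address.

```python
from collections import Counter


def admit_trees(trees_with_oob, accuracy_threshold: float):
    """Keep only the trees whose out-of-bag accuracy exceeds the threshold."""
    return [tree for tree, oob_acc in trees_with_oob if oob_acc > accuracy_threshold]


def forest_predict(forest, sample):
    """Equal-weight majority vote over the predictions of every tree."""
    votes = Counter(tree(sample) for tree in forest)
    return votes.most_common(1)[0][0]


# Stand-in "trees": callables returning a label, paired with an OOB accuracy.
candidates = [
    (lambda s: "Linux", 0.95),
    (lambda s: "Linux", 0.91),
    (lambda s: "Windows", 0.93),
    (lambda s: "Windows", 0.60),  # rejected: below the threshold
]
forest = admit_trees(candidates, accuracy_threshold=0.9)
prediction = forest_predict(forest, sample={"T1.T": 60})
```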

The random forest algorithm greatly improves generalization ability while preserving accuracy. However, training directly with the detailed operating system version as the class label requires a large amount of sample data, because there are so many operating system classes, and improving model accuracy through parameter tuning or data-driven enrichment of the sample data usually requires retraining on the full data set, which is very costly. Operating systems are themselves hierarchical, comprising category, major version, and detailed version. A hierarchical random forest architecture built on this label hierarchy therefore addresses these problems effectively and provides identification at different levels of granularity, as shown in Figure 1.

Construct the first layer, the operating system category layer. First draw m records from the generated simulated data set by stratified sampling on the class label to form a new training set Mset; the test accuracy can be obtained by out-of-bag estimation. According to the usage rate of each operating system and the identification requirements, determine n major class labels $\{x_1, x_2, \ldots, x_n\}$. The finest-grained labels in the original training set that belong to these n major classes are remapped to the coarse-grained class label $x_i$, and operating system labels that belong to none of the n major classes are mapped to the type Others. The remapped data set is fed to the random forest algorithm described above to train the OS major-category classifier. Because remapping greatly reduces the number of class labels, a relatively small number of samples suffices to train a high-accuracy operating system major-category classifier. Set the accuracy threshold k1 of the major-category classifier and test its accuracy on the out-of-bag data. When the accuracy reaches the threshold, construction of the first-layer classifier is complete; otherwise, adjust the algorithm parameters or increase the number of training samples and retrain until the accuracy meets the threshold.
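The label remapping for the first layer can be sketched as below. Matching by case-insensitive prefix and the particular choice of major classes are assumptions for illustration; the patent does not specify how a fine-grained label is matched to its major class.

```python
def remap_to_major_class(fine_label: str, major_classes) -> str:
    """Map a fine-grained OS label to its coarse class, or to 'Others'."""
    lowered = fine_label.lower()
    for major in major_classes:
        if lowered.startswith(major.lower()):
            return major
    return "Others"


MAJOR_CLASSES = ["Windows", "Linux"]  # hypothetical choice of the n major classes
fine_labels = ["Windows 7 SP1", "Linux 3.10", "FreeBSD 11.2", "Windows 10"]
coarse_labels = [remap_to_major_class(lbl, MAJOR_CLASSES) for lbl in fine_labels]
```

After remapping, the first-layer classifier only has to distinguish n + 1 classes, which is why fewer samples are needed than for detailed-version labels.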

Construct the second layer, the major version identification layer. On the basis of the n major class labels defined in the first layer, divide the training set Mset into n groups of data by major class and map each class label to the form "category plus major version number". Each group of data is trained with the random forest algorithm, producing n major-version classifiers. Set the accuracy threshold of the major-version classifiers to k2 and test by out-of-bag estimation, with each group of test data used to assess the accuracy of its own major-version classifier. If the accuracy reaches the threshold, the classifier is kept; otherwise it is retrained, after parameter tuning or after enriching the training samples again from the fingerprint database, until the threshold is met. These n major-version classifiers constitute the second layer of the layered architecture, the major version identification layer.

The third layer is the detailed version identification layer, which can be constructed flexibly according to the identification granularity required for each operating system. If the second-layer mapping produces v major-version labels, the training set Mset is divided into v groups of data by these labels, and each group is trained with the random forest algorithm, producing v detailed-version classifiers. Set the accuracy threshold of the detailed-version classifiers to k3 and assess each classifier by out-of-bag estimation. If the accuracy of a classifier is below the threshold, that classifier is retrained, after parameter tuning or after enriching the data by resampling, until the threshold is met.

The three-layer architecture keeps the training data of each classifier separate, so the classifiers can be trained in parallel, which reduces time overhead. Classifiers at the same level are independent of each other and are easy to manage and extend. At prediction time, the test data first enter the top-level classifier, which identifies the operating system major category; the data then enter the second-layer classifier for that category, which identifies the major version number within the category; finally, the data enter the third-layer detailed-version classifier for that major version, which yields the final detailed operating system information.
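The prediction cascade can be sketched as follows. The classifiers are stub callables standing in for trained random forests, the `granularity` parameter and the early stop when no lower-layer classifier exists are assumptions for illustration, and all labels are made up.

```python
def predict_hierarchical(sample, top_clf, major_clfs, detail_clfs, granularity=3):
    """Route a sample through the category, major-version and detailed layers.

    Stops early when the requested granularity is reached or when no
    classifier exists for the label predicted by the layer above.
    """
    category = top_clf(sample)
    if granularity == 1 or category not in major_clfs:
        return (category,)
    major = major_clfs[category](sample)
    if granularity == 2 or major not in detail_clfs:
        return (category, major)
    return (category, major, detail_clfs[major](sample))


# Stub classifiers standing in for trained random forests (illustrative only).
top = lambda s: "Windows" if s["T1.T"] > 100 else "Linux"
majors = {"Windows": lambda s: "Windows 10", "Linux": lambda s: "Linux 3"}
details = {"Windows 10": lambda s: "Windows 10 1809"}

full = predict_hierarchical({"T1.T": 120}, top, majors, details)
partial = predict_hierarchical({"T1.T": 60}, top, majors, details)
```

Because each layer is looked up by the label from the layer above, a classifier can be added or replaced without touching the others.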

Because the layers are tightly coupled at prediction time, the reliability of an upper layer's result directly affects the predictions of the layers below it. The higher a classifier sits in the hierarchy, the more reliable it must be, so the accuracy thresholds should satisfy k1 > k2 > k3.

The layered architecture partitions the training data by granularity across the layers. When a classifier needs adjusting, only that classifier has to be retrained on its own part of the training data, rather than retraining on all training samples, so the approach can be regarded approximately as a form of incremental training.

This approach supports both a sample-increment mode and a class-increment mode.

In sample-increment mode, an incremental sample is sent, according to its operating system category and major version number, to the corresponding classifiers in the three layers for identification. If a classifier identifies the sample correctly, nothing is done; otherwise the sample is added to the training set of the classifier that misidentified it. A tolerance threshold p on the number of samples is set for each classifier; if the number of samples newly added to a classifier's training set exceeds p, that classifier is retrained on the original training set together with the newly added samples.
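The sample-increment logic for one classifier can be sketched as follows. The class `IncrementalClassifier`, the toy trainer `train_lookup`, and the fingerprint identifiers are hypothetical; a real classifier would be a random forest trained as described in S3.

```python
class IncrementalClassifier:
    """Wraps one classifier and retrains once enough misclassified samples accumulate."""

    def __init__(self, train_fn, training_set, p: int):
        self.train_fn = train_fn              # callable: list of (x, y) -> predictor
        self.training_set = list(training_set)
        self.pending = []                     # misclassified incremental samples
        self.p = p                            # tolerated number of new samples
        self.model = train_fn(self.training_set)
        self.retrain_count = 0

    def offer(self, x, y):
        """Feed one labeled incremental sample; retrain when pending exceeds p."""
        if self.model(x) == y:
            return  # identified correctly, nothing to do
        self.pending.append((x, y))
        if len(self.pending) > self.p:
            self.training_set.extend(self.pending)
            self.pending.clear()
            self.model = self.train_fn(self.training_set)
            self.retrain_count += 1


def train_lookup(samples):
    """Toy trainer: memorizes samples and answers 'Others' for unseen inputs."""
    table = dict(samples)
    return lambda x: table.get(x, "Others")


clf = IncrementalClassifier(train_lookup, [("fp-a", "Linux")], p=1)
clf.offer("fp-a", "Linux")    # identified correctly, ignored
clf.offer("fp-b", "Windows")  # misidentified, 1 pending (not above p)
clf.offer("fp-c", "Windows")  # misidentified, 2 pending > p, triggers retraining
```

Only the classifier that made the mistakes is retrained, which is what keeps the cost local.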

In class-increment mode, if the major category of an incremental sample is not identified correctly by the top-level classifier, the sample is added to the top-level training set; when the number of newly added samples exceeds the threshold p, the classifier is retrained on the original training set together with the newly added samples. If the new class has no corresponding classifier in the second or third layer, a corresponding classifier must be added. Let q be the number of samples required for a new classifier; when the number of samples of the new class exceeds q, that sample set is used as the training set of the new classifier.

Operating system identification prediction is performed on real traffic. Because the attribute features and data of the training set are derived from analysis and simulation of the third-party fingerprint database, the real traffic used for prediction must be acquired in the same way as the fingerprints in the third-party fingerprint database were formed. Following the third-party probing method and using the same scale, probe sequence packets are sent to the target host to be detected to obtain the probe response sequences, from which a fingerprint is extracted, converted to a numerical vector, and input to the hierarchical model; the corresponding prediction result is obtained according to the required identification granularity.

For the problem of identifying special systems in private environments that do not appear in the third-party fingerprint database, the present invention supports fingerprint training in the private environment. The labels and IP list of the special systems are obtained, probe packets are sent to these IPs to obtain the fingerprint data in their responses, and the fingerprint data are mapped to vectors and stored in the training set. Once the newly added training data reach the configured threshold, the model is retrained, which extends the algorithm's fingerprint model to the special systems.

S5, perform identification prediction on real probe traffic.

Operating system identification prediction is performed on real traffic. Because the attribute features and data of the training set are derived from analysis and simulation of the Nmap fingerprint database, the real traffic used for prediction must be acquired in the same way as the fingerprints in the Nmap fingerprint database were formed. Following the Nmap probing method and using the same scale, the following are sent to the target host to be detected: 6 TCP SYN probe packets, which produce the four test response sequences SEQ, OPS, WIN, and T1; 2 ICMP echo probe packets, which produce the IE sequence; 1 TCP SYN probe packet, which yields the features describing TCP explicit congestion notification (ECN); 6 TCP probe packets, which produce the six response sequences T2 to T7; and 1 UDP probe packet to a closed port, which produces the sequence U1. These 13 response sequences are mapped to a numerical vector in the manner of step S1 and, after the corresponding binning passivation, input to the hierarchical model; the corresponding prediction result is obtained according to the required identification granularity.

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings may be arranged and designed in a variety of configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Effect verification

(1) Out-of-bag estimation on simulated data

The random forest algorithm samples with replacement using the Bootstrap method. When the sample size is large, 63.2% of the samples are drawn as the training set of a base learner, and the remaining 36.8% can be used as a validation set for out-of-bag (OOB) estimation. This method costs less than k-fold cross-validation and verifies the generalization performance of the model effectively.

The out-of-bag evaluation metrics are as follows:

Table 1. Macro metrics of the operating system category layer verification

Table 2. Detailed verification results of the operating system category layer

Table 3. Macro metrics for Windows major versions

Table 4. Detailed metrics for Windows major versions

Table 5. Macro metrics for Linux major versions

Table 6. Detailed metrics for Linux major versions

As shown in Tables 1 to 6, the out-of-bag data were used to verify the operating system category layer and the Windows major version identification layer, demonstrating the validity and feasibility of the method. The recall for Windows 8 is lower; a possible reason is that Windows 8 is a transitional system whose fingerprints do not differ markedly from those of other Windows systems, which makes identification difficult. Overall, however, the method is feasible and effective.

(2) Verification on real-environment traffic

Windows hosts and Linux servers on a network segment of an office network were probed by sending packets in the Nmap probing manner, and the fingerprints obtained were predicted by the algorithm model. The probe data used here were collected on April 15 and April 16 from Windows and Linux hosts on the same network segment. The verification results are as follows:

Table 7. Live host list 1

Table 8. Live host list 2

Referring to Table 7, Table 8, and Fig. 3: because the real system environment is limited, identification was performed on the Windows and Linux hosts actually present in the office network. The identification accuracy is high, and the method is feasible and effective.

In summary, the present invention constructs simulated data on the basis of a third-party fingerprint database and maps them to vectors for the algorithm to learn. To build a fingerprint database structure that is highly reliable and easy to extend and manage, the random forest algorithm combines multiple weak fingerprints into a strong fingerprint, improving both the accuracy of operating system identification and the number of identifiable classes. A hierarchical training architecture is designed to support long-term expansion and maintenance of the operating system fingerprint database built on random forest. Compared with binary classification algorithms, the multi-class algorithm effectively avoids the application bottleneck caused by training overhead in the operating system identification problem.

The above content merely illustrates the technical idea of the present invention and does not limit its scope of protection. Any change made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the claims of the present invention.

Claims

1. An operating system recognition method based on random forest, characterized by comprising the following steps:

S1, using the Monte Carlo method, analyze a third-party fingerprint database to determine the feature attributes used in training, the value range of each attribute, and the fingerprint sets most likely to occur; randomly sample and combine entries from the fingerprint database into a training set and a test set, and vectorize the data of the training set and the test set;

S2, apply data passivation by binning to the dimensions of the feature attributes in the third-party fingerprint database that take value ranges;

S3, based on the configured layered architecture, train random forest classifiers separately for the operating system category identification layer, the operating system major version identification layer, and the operating system detailed version identification layer; construct multiple decision trees, compare the test accuracy of each tree, obtained from its own out-of-bag estimate, with the configured accuracy threshold, and add the tree to the random forest if its accuracy is higher than the threshold;

S5, perform identification prediction on real probe traffic, in which every tree in the random forest gives a classification result and, with equal-weight voting, the class that receives the most votes is taken as the final prediction result.

2. The operating system recognition method based on random forest according to claim 1, characterized in that, in step S1, the construction is based on the Nmap fingerprint database; Nmap sends 16 data packets that produce corresponding response sequences, and each response sequence corresponds to different flag bits; the 16 data packets yield 119 sequence flag bits in total, and after removing the two flag bits SEQ.SP and SEQ.ISR, the remaining 117 flag bits are used as the feature attributes of the training data; each operating system corresponds to one or more rule sets, each rule set consists of the sequence groups SEQ, OPS, WIN, T1, IE, U1, ECN, and T2-T7, and the value of each flag bit in a sequence group is a range or several alternative values; fingerprint data are decomposed from each rule set by first obtaining the complete set of each fingerprint sequence: if the size of the complete set is greater than 500, 4 samples are selected at random from it; if the size is greater than 10 and less than 100, 2 samples are selected at random from it; if the size is less than 10, 1 sample is selected at random from it; the Cartesian product of the samples drawn from all response sequences in the rule set forms the simulated data set for that rule, and nominal data are mapped to numbers by natural-number encoding for vectorization.

3. The operating system recognition method based on random forest according to claim 2, characterized in that sending 6 TCP SYN probe packets produces the four test response sequences SEQ, OPS, WIN, and T1; SEQ is the sequence analysis result of the probe packets, including the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the TCP ISN counter rate (ISR), the IP ID sequence generation algorithm (TI, CI, II), the shared IP ID sequence Boolean (SS), and the TCP timestamp option algorithm (TS); OPS is the TCP options received for each probe packet, WIN is the TCP initial window size received for each probe packet, and T1 contains the test values for probe packet 1, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP RST data checksum (RD), and TCP miscellaneous quirks (Q);

sending 2 ICMP echo probe packets produces the IE sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), and the ICMP response code (CD);

sending 1 TCP SYN probe packet yields the features describing TCP explicit congestion notification and produces the ECN response sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP options (O), and TCP miscellaneous quirks (Q);

sending 6 TCP probe packets produces the six response sequences T2 to T7, each of which includes responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP options (O), the TCP RST data checksum (RD), and TCP miscellaneous quirks (Q);

sending 1 UDP probe packet to a closed port produces the sequence U1, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the IP total length (IPL), the unused port unreachable field nonzero (UN), the returned probe IP total length (RIPL), the returned probe IP ID value (RID), the integrity of the returned probe IP checksum value (RIPCK), the integrity of the returned probe UDP checksum (RUCK), and the integrity of the returned UDP data (RUD).

4. The operating system recognition method based on random forest according to claim 1, characterized in that step S2 specifically comprises:

S201, analyzing the features extracted from the third-party fingerprint database and, taking into account how the simulated data are generated by random sampling, identifying the attributes that lose information because sampling cannot cover all of their values;

S202, according to the distribution of these attributes in the third-party fingerprint database, determining a reasonable number of bins and a binning method, applying binning passivation to these attributes in the original data set, and generating the corresponding binning scale file;

S203, during subsequent training and prediction, binning the original training set and test set according to the same binning scale.

5. The operating system recognition method based on random forest according to claim 1, characterized in that step S3 specifically comprises:

S301, letting the training set contain n samples, drawing n times from the current training set by sampling with replacement according to the Bootstrap method to generate a sampling subset containing n samples; as n tends to infinity, the probability that a given sample of the original training set is never drawn is 36.8%, and the remaining 63.2% of samples are drawn into the new sampling subset, which is used as the independent training set of the base learner, while the 36.8% of samples not drawn are used as the test set of the base learner for out-of-bag estimation;

S302, computing the information entropy of the current sample set:

$$\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where D is the training sample set, n is the number of classes contained in the current sample set, and $p_i$ is the proportion of samples of class i in the whole sample set;

S303, randomly selecting a candidate attribute subset containing k attributes from the attribute set of the current sample set, and computing the conditional entropy of each feature in the candidate attribute subset, the conditional entropy formula being:

$$\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} p_i \, \mathrm{Ent}(D \mid A = a_i)$$

where D is the current sample set, A is a candidate split attribute of the sample set, A has n possible values $a_1, a_2, \ldots, a_n$, and $p_i$ is the proportion of samples whose value of feature A is $a_i$;

S304, computing the information gain of each feature; because using the information gain ratio as the split criterion is biased toward continuous-valued features, the information gain of a continuous feature is corrected by subtracting $\log_2(N-1)/|D|$, where N is the number of possible split points and |D| is the size of the data set; for feature A, the information gain is $\mathrm{gain}(D, A) = \mathrm{Ent}(D) - \mathrm{Ent}(D \mid A)$; for a continuous attribute, the split point with the largest corrected information gain is selected as the best split point of that attribute;

S305, selecting the features whose information gain is higher than the mean information gain of all features in the current candidate attribute subset and also higher than the configured information gain threshold, computing their information gain ratios, and choosing the feature with the highest information gain ratio as the split feature;

S306, dividing the data set according to the values of the split feature and recursively repeating the above operations on each branch, stopping the division when any of the following three conditions is met: all training samples at the current node belong to one class, in which case the node is labeled with that class and becomes a leaf node; the samples at the current node have no attributes left that can be used for division; or the number of training samples at the current node is below the minimum sample count threshold.

6. The operating system recognition method based on random forest according to claim 5, characterized in that, in step S303, for a discrete feature, the sample set is divided into subsets according to each value of the feature, the information entropy of each subset is computed, and the conditional entropy is then obtained; for a continuous feature, the values of the attribute are first sorted, and only where the class label changes and the difference between two consecutive values exceeds the threshold set for that attribute is the midpoint of the two consecutive values taken as a candidate split point, so that the continuous attribute is converted into a discrete attribute for the calculation.

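7. The operating system recognition method based on random forest according to claim 5, characterized in that, in step S305, for feature A, the information gain ratio is $\mathrm{gain\_ratio}(D, A) = \mathrm{gain}(D, A) / \mathrm{split\_info}(D, A)$, where the split information is $\mathrm{split\_info}(D, A) = -\sum_{i=1}^{n} p_i \log_2 p_i$ and $p_i$ is the proportion of samples whose value of feature A is $a_i$.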

8. The operating system recognition method based on random forest according to claim 1, characterized in that, in step S3, the first layer, the operating system category layer, is constructed as follows: m records are first drawn from the generated simulated data set by stratified sampling on the class label to form a new training set Mset, and the test accuracy is obtained by out-of-bag estimation; according to the usage rate of each operating system and the identification requirements, n major class labels $\{x_1, x_2, \ldots, x_n\}$ are determined; the finest-grained labels in the original training set that belong to these n major classes are remapped to the coarse-grained class label $x_i$, and operating system labels that belong to none of the n major classes are mapped to the type Others; the remapped data set is fed to the random forest method for training to obtain the OS major-category classifier; the accuracy threshold k1 of the major-category classifier is set and its accuracy is tested on the out-of-bag data; when the accuracy reaches the threshold, construction of the first-layer classifier is complete; otherwise, the algorithm parameters are adjusted or the number of training samples is increased and the classifier is retrained until the accuracy meets the threshold;

the second layer, the major version identification layer, is constructed as follows: on the basis of the n major class labels defined in the first layer, the training set Mset is divided into n groups of data by major class, and each class label is mapped to the form "category plus major version number"; each group of data is trained with the random forest method, producing n major-version classifiers; the accuracy threshold of the major-version classifiers is set to k2, testing is performed by out-of-bag estimation, and each group of test data is used to assess the accuracy of its own major-version classifier; if the accuracy reaches the threshold, the classifier is kept; otherwise it is retrained, after parameter tuning or after enriching the training samples again from the fingerprint database, until the threshold is met; the n major-version classifiers constitute the second layer of the layered architecture, the major version identification layer;

the third layer is the detailed version identification layer: if the second-layer mapping produces v major-version labels, the training set Mset is divided into v groups of data by these labels, and each group is trained with the random forest algorithm, producing v detailed-version classifiers; the accuracy threshold of the detailed-version classifiers is set to k3, and each detailed-version classifier is assessed by out-of-bag estimation; if the accuracy of a classifier is below the threshold, that classifier is retrained, after parameter tuning or after enriching the data by resampling, until the threshold is met; the accuracy thresholds are set such that k1 > k2 > k3.

9. The operating system recognition method based on random forest according to claim 1, characterized in that, in step S4, probe sequence packets are sent to the target host to be detected using the same scale as the third-party probing method to obtain the probe response sequences, from which a fingerprint is extracted, converted to a numerical vector, and input to the hierarchical model, and the corresponding prediction result is obtained according to the required identification granularity; the labels and IP list of special systems are obtained, probe packets are sent to the IPs to obtain the fingerprint data in their responses, and the fingerprint data are mapped to vectors and stored in the training set; after the newly added training data reach the configured threshold, the model is retrained, which extends the algorithm's fingerprint model to the special systems.

10. The operating system recognition method based on random forest according to claim 1, characterized in that, in step S5, using the same scale as the Nmap probing method, the following are sent to the target host to be detected: 6 TCP SYN probe packets, which produce the four test response sequences SEQ, OPS, WIN, and T1; 2 ICMP echo probe packets, which produce the IE sequence; 1 TCP SYN probe packet, which yields the features describing TCP explicit congestion notification; 6 TCP probe packets, which produce the six response sequences T2 to T7; and 1 UDP probe packet to a closed port, which produces the sequence U1; these 13 response sequences are mapped to a numerical vector in the manner of step S1 and, after the corresponding binning passivation, input to the hierarchical model, and the corresponding prediction result is obtained according to the required identification granularity; at prediction time, the test data enter the top-level classifier, which identifies the operating system major category; the data then enter the second-layer classifier for that category, which identifies the major version number within the category; finally, the data enter the third-layer detailed-version classifier for that major version, which yields the final detailed operating system information.