CN110519128B

CN110519128B - Random forest based operating system identification method

Info

Publication number: CN110519128B
Application number: CN201910893976.9A
Authority: CN
Inventors: 范建存; 张子豪; 樊志甲; 李瀛
Original assignee: Xian Jiaotong University; Beijing NSFocus Information Security Technology Co Ltd
Current assignee: Xian Jiaotong University; Beijing NSFocus Information Security Technology Co Ltd
Priority date: 2019-09-20
Filing date: 2019-09-20
Publication date: 2021-02-19
Anticipated expiration: 2039-09-20
Also published as: CN110519128A

Abstract

The invention discloses a random forest based operating system identification method. A Monte Carlo method is used to randomly sample a fingerprint library, assemble a training set and a test set, and vectorize them. Range-valued data are passivated by binning. Based on a set layered architecture, random forest classifiers are trained separately for an operating system category identification layer, an operating system major version identification layer, and an operating system detailed version identification layer; multiple decision trees are constructed, and a tree is added to the random forest only if its test accuracy under out-of-bag estimation is higher than a set accuracy threshold. The layered architecture is then trained incrementally at the local level, with parameter tuning to improve model accuracy. Finally, real probe traffic is identified and predicted: each tree in the random forest gives a classification result, and the category with the most votes under equal-weight voting is selected as the final prediction. The method can effectively identify unknown fingerprints and improves identification accuracy.

Description

Random forest based operating system identification method

Technical Field

The invention belongs to the technical field of computers, and particularly relates to an operating system identification method based on a random forest.

Background

With the rapid popularization of the internet, the importance of the field of network security is more and more prominent, and the detection and identification of the operating system have important significance for the evaluation and protection of the network security and are also important steps of asset identification.

At present, most detection tools rely on a library of known operating system fingerprints and make their decision by traditional static fingerprint matching, which has difficulty identifying unknown fingerprints. Introducing machine learning algorithms makes it possible to mine necessary and sufficient conditions for a fingerprint from the features, which can effectively address the identification of unknown fingerprints and ensures fingerprint reliability in a higher-dimensional space. Existing methods of this kind construct a large number of binary classifiers for classification, which is a limitation: as the number of operating system types grows, the training cost of the support vector machine method becomes too large and performance becomes a bottleneck in application.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the above deficiencies in the prior art, a random forest based operating system identification method that constructs a novel, highly reliable and easily extensible random forest fingerprint structure, obtains a fingerprint model with strong generality, and identifies a wide range of software operating systems and Internet of Things devices; the method also supports training a private model for a customer environment.

The invention adopts the following technical scheme:

A random forest based operating system identification method comprises the following steps:

S1, using a Monte Carlo method, determining from analysis of a third-party fingerprint library the feature attributes, the attribute value ranges, and the set of fingerprints most likely to occur that are used for training; randomly sampling the fingerprint library to assemble a training set and a test set; and vectorizing the data of the training set and the test set;

S2, for the feature attributes in the third-party fingerprint library whose values are ranges, performing data passivation on those range-valued dimensions by binning;

S3, based on a set layered architecture, training random forest classifiers separately for an operating system category identification layer, an operating system major version identification layer, and an operating system detailed version identification layer; constructing multiple decision trees; comparing the test accuracy of each tree, obtained by its own out-of-bag estimation, with a set accuracy threshold; and adding the tree to the random forest if its test accuracy is higher than the threshold;

S4, performing local incremental training of the layered architecture, with parameter tuning to improve model accuracy;

S5, identifying and predicting on real probe traffic: each tree in the random forest gives a classification result, and the category with the most votes under equal-weight voting is selected as the final prediction result.
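The voting rule of step S5 can be illustrated with a minimal Python sketch. The function name `equal_weight_vote` and the example labels are hypothetical; the source does not say how ties are resolved, so the tie behavior here is only an assumption.

```python
from collections import Counter

def equal_weight_vote(tree_predictions):
    """Return the label predicted by the most trees; every tree counts once."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_predictions)
    # Assumption: on a tie, whichever label Counter reports first is returned.
    label, _votes = counts.most_common(1)[0]
    return label

# Hypothetical per-tree outputs for one probed host.
votes = ["Linux", "Windows", "Linux", "Linux", "FreeBSD"]
winner = equal_weight_vote(votes)
```

Because every tree has the same weight, the result depends only on how many trees agree, not on any per-tree confidence score.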

Specifically, in step S1, the data set is constructed from the Nmap fingerprint library. Nmap sends 16 data packets, each of which produces a response sequence, and each response sequence carries its own flag bits. The 16 data packets yield 119 sequence flag bits in total; the two flag bits SEQ.SP and SEQ.ISR are removed, and the remaining 117 flag bits are used as the feature attributes of the training data. Each operating system corresponds to one or more rule sets; each rule set consists of the sequence groups SEQ, OPS, WIN, T1, IE, U1, ECN, and T2-T7, and the value of each flag bit in a sequence group is either a range or a list of alternative values. Fingerprint data are expanded from each rule set as follows. The full set of fingerprint sequences is obtained first. If the full set contains more than 500 sequences, 4 samples are randomly selected from it; if it contains more than 10 and fewer than 100 sequences, 2 samples are randomly selected; if it contains fewer than 10 sequences, 1 sample is randomly selected. The Cartesian product of the samples drawn for all response sequences in a rule set forms the simulated data set for that rule set. Nominal data are mapped to numbers by natural number coding for vectorization.

Further, sending 6 TCP SYN probe packets generates the four test response sequences SEQ, OPS, WIN, and T1. SEQ is the result of sequence analysis of the probe packets and includes the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the TCP ISN counter rate (ISR), the IP ID sequence generation algorithm (TI, CI, II), the shared IP ID sequence Boolean (SS), and the TCP timestamp option algorithm (TS). OPS holds the TCP options received for each probe packet, and WIN holds the TCP initial window size received for each probe packet. T1 contains the test values for packet 1: responsiveness (R), IP don't-fragment bit (DF), IP initial time-to-live (T), IP initial time-to-live guess (TG), TCP sequence number (S), TCP acknowledgment number (A), TCP flags (F), TCP RST data checksum (RD), and TCP miscellaneous quirks (Q);

sending 2 ICMP echo probe packets generates the IE sequence, which includes responsiveness (R), the don't-fragment bit (DFI), IP initial time-to-live (T), IP initial time-to-live guess (TG), and ICMP response code (CD);

sending 1 TCP SYN probe packet obtains the features describing TCP explicit congestion notification and generates the ECN response sequence, which includes responsiveness (R), the don't-fragment bit (DFI), IP initial time-to-live (T), IP initial time-to-live guess (TG), TCP initial window size (W), TCP options (O), and TCP miscellaneous quirks (Q);

sending 6 TCP probe packets generates the six response sequences T2-T7, each of which includes responsiveness (R), IP don't-fragment bit (DF), IP initial time-to-live (T), IP initial time-to-live guess (TG), TCP initial window size (W), TCP sequence number (S), TCP acknowledgment number (A), TCP flags (F), TCP options (O), TCP RST data checksum (RD), and TCP miscellaneous quirks (Q);

sending 1 UDP probe packet to a closed port generates the sequence U1, which includes responsiveness (R), IP don't-fragment bit (DF), IP initial time-to-live (T), IP initial time-to-live guess (TG), IP total length (IPL), unused port unreachable field nonzero (UN), returned probe IP total length (RIPL), returned probe IP ID value (RID), integrity of the returned probe IP checksum value (RIPCK), integrity of the returned probe UDP checksum (RUCK), and integrity of the returned UDP data (RUD).

Specifically, step S2 includes:

S201, analyzing the features extracted from the third-party fingerprint library and, taking into account how the simulated data are generated by random sampling, determining the attributes for which incomplete data extraction may cause information loss;

S202, determining a reasonable number of binning passivation intervals and a passivation mode according to the distribution of those attributes in the third-party fingerprint library, and applying binning passivation to the attributes in the original data set to generate the corresponding passivation scale file;

S203, in subsequent training and prediction, applying binning passivation to the original training set/test set according to the same unified passivation scale.

Specifically, step S3 includes:

S301, if the training set has n samples, drawing n samples from the current training set with replacement (Bootstrap sampling); about 63.2% of the samples in the original training set are drawn into the new sample subset, which serves as the independent training set of the base learner, and the roughly 36.8% of samples that are not drawn serve as the test set of the base learner for out-of-bag estimation;

S302, calculating the information entropy of the current sample set:

$$\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where D is the training sample set, n is the number of classes contained in the current sample set, and p_i is the proportion of samples of the i-th class among all samples;

S303, randomly selecting a candidate attribute subset containing k attributes from the attribute set of the current sample set, and calculating the conditional entropy of each feature in the candidate attribute subset, the conditional entropy being:

$$\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} p_i \, \mathrm{Ent}(D \mid A = a_i)$$

where D is the current sample set, A is a candidate partition attribute of the sample set with n possible values (a₁, a₂, ..., a_n), p_i is the proportion of samples in D whose value of A is a_i, and Ent(D | A = a_i) is the information entropy of that subset of samples;

S304, calculating the information gain of each feature; for a feature A, the information gain is Gain(D, A) = Ent(D) - Ent(D | A). Because the information gain ratio criterion is biased toward continuous-valued features, the information gain of a continuous-valued feature is corrected by subtracting log₂(N-1)/|D|, where N is the number of possible split points and |D| is the size of the data set; for a continuous attribute, the split point with the largest corrected information gain is selected as the optimal split point of the attribute;

S305, from the current candidate attribute subset, selecting the features whose information gain is higher than the average information gain of all features in the subset and higher than a set information gain threshold, calculating the information gain ratio of these features, and selecting the feature with the highest information gain ratio as the partition feature;

S306, partitioning the data set according to the values of the partition feature and recursively applying the above operations to each branch, terminating the partitioning at a node when any one of the following three conditions is met: all training samples of the current node belong to the same class, in which case the node is labeled with that class and becomes a leaf node; the samples of the current node have no remaining attribute available for partitioning; or the number of training samples at the current node is smaller than the minimum sample size threshold.

Further, in step S303, for a discrete feature, the sample set is divided into sub-sample sets according to each value of the feature, the information entropy of each sub-sample set is calculated, and the conditional entropy is then calculated. For a continuous feature, the attribute values are first sorted; the midpoint of two adjacent values is taken as a candidate split point only where the class label changes between them and the difference between the two values is greater than a threshold set for the attribute, so that the continuous attribute is handled as a discrete one.
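The candidate split point rule for continuous features can be sketched as follows. This is an illustrative reading, not the patented implementation: the "median of two values" is interpreted as their midpoint, and the function name, the `min_gap` parameter, and the sample TTL values are hypothetical.

```python
def candidate_split_points(values, labels, min_gap):
    """Midpoints between adjacent sorted values where the class label changes
    and the two values differ by more than min_gap."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    points = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and (v2 - v1) > min_gap:
            points.append((v1 + v2) / 2)
    return points

# Hypothetical initial-TTL values and operating system labels.
ttl = [64, 60, 128, 255, 130]
os_label = ["linux", "linux", "windows", "cisco", "windows"]
points = candidate_split_points(ttl, os_label, min_gap=1)
```

Restricting candidates to label boundaries with a minimum gap keeps the number of split points N small, which also keeps the cost of evaluating each continuous attribute low.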

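Further, in step S305, for a feature A, the information gain ratio is calculated as

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)}$$

where the split information is

$$\mathrm{SplitInfo}(D, A) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

with p_i the proportion of samples in D whose value of A is a_i.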
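A minimal sketch of the selection rule of step S305 follows, assuming the standard C4.5 definitions of information gain, split information, and gain ratio. The function names, the fallback of returning `None` when no feature qualifies, and the toy features are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of the value distribution in items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_and_ratio(labels, feature_values):
    """Return (information gain, gain ratio) of a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)  # entropy of the feature's own values
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

def choose_split_feature(labels, features, gain_threshold):
    """features: dict name -> list of values. Keep features whose gain exceeds
    both the mean gain and gain_threshold, then pick the highest gain ratio.
    Returning None when nothing qualifies is an assumption of this sketch."""
    stats = {name: gain_and_ratio(labels, vals) for name, vals in features.items()}
    mean_gain = sum(g for g, _ in stats.values()) / len(stats)
    eligible = [name for name, (g, _) in stats.items()
                if g > mean_gain and g > gain_threshold]
    if not eligible:
        return None
    return max(eligible, key=lambda name: stats[name][1])

labels = ["linux", "linux", "windows", "windows"]
features = {
    "ttl": [64, 64, 128, 128],   # separates the classes with two values
    "df": ["Y", "N", "Y", "N"],  # carries no class information
    "win": [1, 2, 3, 4],         # separates the classes but has many values
}
best = choose_split_feature(labels, features, gain_threshold=0.1)
```

In this toy data both `ttl` and `win` have the maximum information gain, but the gain ratio penalizes `win` for having one value per sample, so `ttl` is chosen.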

Specifically, in step S3, the first layer, the operating system category layer, is constructed as follows. From the generated simulated data set, m records are drawn by stratified sampling according to class label to form a new training set Mset, and test accuracy is obtained by out-of-bag estimation. According to the usage rate and identification requirements of each operating system, n major category labels {x₁, x₂, ..., x_n} are determined. The finest-grained labels in the original training set that belong to the n major categories are remapped to the coarse-grained category labels x_i, and operating system labels that do not belong to the n major categories are mapped to an "other" class. The mapped data set is fed to the random forest method for training to obtain the operating system major category classifier. A threshold accuracy k1 is set for the major category classifier, and the classifier is tested with the out-of-bag data. When the accuracy reaches the threshold, construction of the first-layer classifier is complete; if it does not, the algorithm parameters are adjusted or the amount of training data is increased and the classifier is retrained until the accuracy reaches the threshold;

the second layer, the major version identification layer, is constructed as follows. Based on the n major category labels of the first layer, the training set Mset is divided by major category into n groups of data, and the class labels in each group are mapped to the form "category + major version number". Each group is trained with the random forest method, producing n major version classifiers. A threshold accuracy k2 is set for the operating system major version classifiers, and out-of-bag estimation is used for testing: the accuracy of each major version classifier is evaluated on the test data of its own group. A classifier is kept when its accuracy reaches the threshold; otherwise it is retrained, with parameter tuning or with additional training samples drawn from the fingerprint library, until its accuracy reaches the threshold. The n major version classifiers together form the second-layer major version identification layer of the layered architecture;

the third layer is the detailed version identification layer. Let v be the number of major version labels produced by the second-layer mapping. The training set Mset is divided into v groups of data according to the second-layer major version labels, and each group is trained with the random forest algorithm, producing v detailed version classifiers. A threshold accuracy k3 is set for the detailed version classifiers, and each detailed version classifier is evaluated by out-of-bag estimation; if the accuracy of a classifier is below the threshold, it is retrained, with parameter tuning or with additional data obtained by resampling, until the threshold is met. The accuracy thresholds should satisfy k1 > k2 > k3.
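The label remapping that feeds the three layers can be sketched as below. The label representation (a tuple of category, major version, and fine-grained label), the function name, and the sample operating systems are hypothetical; the source does not specify a label format.

```python
from collections import defaultdict

def build_layer_datasets(samples, major_categories):
    """samples: list of (vector, (category, major, fine_label)) tuples.
    Returns the layer-1 data set, the per-category layer-2 data sets, and the
    per-major-version layer-3 data sets."""
    layer1 = []                  # vector -> category, or "other"
    layer2 = defaultdict(list)   # category -> [(vector, "category major")]
    layer3 = defaultdict(list)   # "category major" -> [(vector, fine_label)]
    for vector, (category, major, fine_label) in samples:
        if category not in major_categories:
            layer1.append((vector, "other"))
            continue
        layer1.append((vector, category))
        major_label = f"{category} {major}"
        layer2[category].append((vector, major_label))
        layer3[major_label].append((vector, fine_label))
    return layer1, dict(layer2), dict(layer3)

samples = [
    ([1], ("Linux", "2.6", "Linux 2.6.32")),
    ([2], ("Windows", "10", "Windows 10 1809")),
    ([3], ("Plan9", "4", "Plan9 4e")),
]
l1, l2, l3 = build_layer_datasets(samples, {"Linux", "Windows"})
```

Each group in `l2` and `l3` would then be handed to its own random forest, which is what allows a single classifier to be retrained without touching the others.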

Specifically, in step S4, using a scale, probe sequence packets are sent to the target host to be detected in the same manner as the third-party tool sends its probe packets, and the probe response sequences are obtained; the fingerprint is extracted from the response sequences, converted into a numerical vector, and input into the layered model, which returns a prediction at the required identification granularity. For a special system, its label and IP list are obtained, probe packets are sent to those IPs to obtain the fingerprint data of their responses, and the fingerprint data are mapped into vectors and stored in the training set; once the newly added training data reach a set threshold, the model is retrained, which extends the fingerprint model to the special system.

Specifically, in step S5, using a scale, probe packets are sent in the same manner as Nmap sends its probe packets: 6 TCP SYN probe packets are sent to generate the four test response sequences SEQ, OPS, WIN, and T1; 2 ICMP echo probe packets are sent to generate the IE sequence; 1 TCP SYN probe packet is sent to obtain the features describing TCP explicit congestion notification; 6 TCP probe packets are sent to generate the six response sequences T2 to T7; and 1 UDP probe packet is sent to a closed port to generate the sequence U1. The 13 response sequences are mapped into a numerical vector in the manner of step S1, the corresponding passivation is applied, and the vector is input into the layered model, which returns a prediction at the required identification granularity. During prediction, the test data first enter the top-layer classifier, which identifies the major category of the operating system; the data then enter the second-layer classifier for that category, which identifies the major version number within the category; finally, the data enter the third-layer detailed version classifier for that major version, which yields the final detailed operating system information.
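The three-layer prediction cascade described here can be sketched as a simple dispatch. The classifiers are represented as plain callables standing in for trained random forests; `predict_os`, the dictionary layout, and the stopping behavior when no finer classifier exists are all assumptions for illustration.

```python
def predict_os(vector, category_clf, major_clfs, detail_clfs):
    """Route a fingerprint vector through the three layers in turn.
    major_clfs maps category -> classifier; detail_clfs maps major version
    label -> classifier. Stops early if no finer classifier is available."""
    category = category_clf(vector)
    result = {"category": category}
    major_clf = major_clfs.get(category)
    if major_clf is None:  # e.g. the "other" class has no second-layer model
        return result
    major = major_clf(vector)
    result["major_version"] = major
    detail_clf = detail_clfs.get(major)
    if detail_clf is not None:
        result["detailed_version"] = detail_clf(vector)
    return result

# Stand-in classifiers; real ones would be trained random forests.
category_clf = lambda v: "Linux" if v[0] < 100 else "other"
major_clfs = {"Linux": lambda v: "Linux 2.6"}
detail_clfs = {"Linux 2.6": lambda v: "Linux 2.6.32"}
prediction = predict_os([64], category_clf, major_clfs, detail_clfs)
```

Returning a partial result when a layer is missing lets the caller report the coarsest granularity that could still be identified.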

Compared with the prior art, the invention has at least the following beneficial effects:

To address the weak unknown-fingerprint identification capability and the high false alarm and missed alarm rates of conventional operating system identification by static fingerprint matching, the invention provides an operating system identification method based on the random forest algorithm, which improves identification accuracy and can effectively identify unknown fingerprints. On the basis of the rule sets of a third-party fingerprint library, the random forest algorithm further mines necessary and sufficient conditions for operating system fingerprints, ensuring the reliability of identification in a higher-dimensional space. The layered training architecture allows operating systems to be identified at different granularities; because each layer is trained independently, identification of individual operating systems is easy to adjust locally, and the heavy overhead of repeated full training is avoided. For imbalanced training data, the layered architecture allows data to be merged or split sensibly to balance the data set. The layered training architecture is also easy to extend: when a new category is added to the model, only some of the classifiers need to be retrained rather than the whole model, which approximates incremental training.

Furthermore, building a general-purpose operating system classifier with strong identification capability and wide coverage requires sample data from many kinds of systems, but actually setting up environments for all of those operating systems is not feasible. Expanding the third-party fingerprint library in a reasonable way and generating simulated data once the attribute features are determined effectively solves this problem.

Furthermore, some feature attributes in the third-party fingerprint library are range values, and random sampling can hardly draw every value in a range; data passivation is therefore applied to the range-valued dimensions by binning, which reduces the information loss caused by random sampling.

Furthermore, the classification model is built with the random forest algorithm, which combines multiple decision trees under a given strategy so that weak fingerprint models are turned into a strong fingerprint model, greatly improving the accuracy and stability of operating system fingerprint identification. In addition, the bootstrap sampling used by the random forest allows the algorithm to be validated by out-of-bag estimation, which reduces computational cost compared with methods such as k-fold cross-validation.
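As a rough illustration of why no separate validation split is needed, the sketch below draws one bootstrap sample and collects the samples it leaves out, which can then test the tree trained on that sample. The function names are hypothetical, and `admit_tree` only mirrors the accuracy-threshold rule of step S3 in its simplest form.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (drawn indices, unused indices)."""
    drawn = [rng.randrange(n) for _ in range(n)]
    in_bag = set(drawn)
    out_of_bag = [i for i in range(n) if i not in in_bag]
    return drawn, out_of_bag

def admit_tree(oob_accuracy, threshold):
    """A tree joins the forest only if its out-of-bag accuracy beats the threshold."""
    return oob_accuracy > threshold

rng = random.Random(0)
drawn, out_of_bag = bootstrap_split(10_000, rng)
in_bag_fraction = len(set(drawn)) / 10_000  # expected to be close to 0.632
```

Each tree gets its own held-out set for free, so the evaluation cost is a single pass over the unused samples rather than k retrainings.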

Furthermore, when the partition attribute is selected, an attribute subset is randomly drawn from the attribute set of the node and the optimal partition attribute is chosen from that subset; this added attribute perturbation reduces model variance. Step S303 calculates the conditional entropy of each attribute, from which the information gain ratio is derived as the selection criterion for the optimal partition attribute; the conditional entropy also provides the basis for choosing the optimal split point of a continuous attribute.

Further, when each node selects the optimal partition attribute, the attribute with the largest information gain ratio is selected as the optimal partition attribute by calculating the information gain ratio. The information gain ratio is selected as the partition criterion, so that the adverse effect caused by preference of the information gain criterion on selecting the attribute with more values can be effectively reduced.

Furthermore, a three-layer layered training framework is constructed, the operating systems can be identified from three-layer granularity of large category, main version and detailed version of the system, the three-layer training is mutually independent, the identification of different operating systems is easy to be adjusted locally, and the heavy overhead caused by repeated full training is avoided; for the problem of unbalanced training data, data can be reasonably merged and decomposed through a layered architecture so as to balance a data set; the layered training framework is easy to expand, when the system categories of identification are added to the model, only part of classifiers need to be adjusted for training, the whole full-scale training of the model is not needed, and the effect of approximate incremental training is achieved.

Further, in order to expand new fingerprint data and fingerprint types, step S4 may send a probe packet to a target host of a known system type to acquire fingerprint data, and map the fingerprint data into vector data, add the vector data to a training set, accumulate the vector data to a certain magnitude, and retrain the vector data again to achieve the effect of maintaining and expanding the fingerprint model for a long time.

Furthermore, during prediction, crafted data packets are sent to the target host, the response sequences are obtained, vectorized, and input into the three-layer random forest model. Combining the classification results of the base learners by equal-weight voting effectively mitigates the poor generalization that a single, wrongly chosen learner may exhibit, reduces the variance and bias of the model, and achieves effective identification and prediction of the operating system.

In summary, to address the weak identification capability, high false alarm rate, and high missed alarm rate of conventional operating system identification by static fingerprint matching, the invention provides an operating system identification method based on the random forest algorithm that can effectively identify unknown fingerprints and improves identification accuracy. The layered training architecture provides identification at different granularities; the classifiers of each layer are trained in parallel, and classifiers at the same layer are independent of one another, so local adjustment is easy and the extensibility and maintainability of the model are improved.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a diagram of a hierarchical random forest structure;

FIG. 2 is a diagram of a random forest training process;

FIG. 3 is a diagram of accuracy on real traffic.

Detailed Description

The invention discloses an operating system identification method based on a random forest, which constructs the random forest based on a C4.5 decision tree algorithm and comprises the following steps:

S1, data preparation and feature extraction: using a Monte Carlo method, determining from analysis of a third-party fingerprint library the feature attributes, the attribute value ranges, and the set of fingerprints most likely to occur that are used for training; performing extensive random sampling of the fingerprint library to assemble a training set and a test set; and vectorizing the data of the training set and the test set;

The data set is constructed from the Nmap fingerprint library, following the detection principle of Nmap:

Nmap sends 16 data packets, each of which produces a response sequence; each response sequence carries several flag bits, and the system category is determined by comparing how well the detected dynamic fingerprint matches the flag bits in the static fingerprint library.

Sending 6 TCP SYN probe packets generates the four test response sequences SEQ, OPS, WIN, and T1. SEQ is the result of sequence analysis of the probe packets and includes the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the TCP ISN counter rate (ISR), the IP ID sequence generation algorithm (TI, CI, II), the shared IP ID sequence Boolean (SS), and the TCP timestamp option algorithm (TS). OPS holds the TCP options received for each probe packet. WIN holds the TCP initial window size received for each probe packet. T1 contains the test values for packet 1: responsiveness (R), IP don't-fragment bit (DF), IP initial time-to-live (T), IP initial time-to-live guess (TG), TCP sequence number (S), TCP acknowledgment number (A), TCP flags (F), TCP RST data checksum (RD), and TCP miscellaneous quirks (Q).

Sending 2 ICMP echo probe packets generates the IE sequence, which includes responsiveness (R), the don't-fragment bit (DFI), IP initial time-to-live (T), IP initial time-to-live guess (TG), and ICMP response code (CD).

Sending 1 TCP SYN probe packet obtains the features describing TCP explicit congestion notification and generates the ECN response sequence, which includes responsiveness (R), the don't-fragment bit (DFI), IP initial time-to-live (T), IP initial time-to-live guess (TG), TCP initial window size (W), TCP options (O), and TCP miscellaneous quirks (Q).

Sending 6 TCP probe packets generates the six response sequences T2-T7, each of which includes responsiveness (R), IP don't-fragment bit (DF), IP initial time-to-live (T), IP initial time-to-live guess (TG), TCP initial window size (W), TCP sequence number (S), TCP acknowledgment number (A), TCP flags (F), TCP options (O), TCP RST data checksum (RD), and TCP miscellaneous quirks (Q).

Sending the 16 data packets yields 119 sequence flag bits in total; after analysis of the flag bits, the method adopts 117 of them (all except SEQ.SP and SEQ.ISR) as the feature attributes of the training data. Because real traffic from a sufficient variety of operating systems is difficult to obtain, a large amount of simulated data is generated in the laboratory to train the algorithm. The Nmap fingerprint library consists of a number of rule sets; each operating system corresponds to one or more rule sets, each rule set consists of the 13 sequence groups SEQ, OPS, WIN, T1, IE, U1, ECN, and T2-T7, and the value of each flag bit in a sequence group is either a range or a list of alternative values.

The fingerprint data is disassembled from each rule set, firstly, a full set of fingerprint sequences is obtained, and if the number of the full set of sequences is more than 500, 4 samples are randomly selected from the full set of sequences; if the number of the sequence complete sets is more than 10 and less than 100, randomly selecting 2 samples in the complete sets; if the number of the sequence complete set is less than 10, randomly selecting 1 sample on the complete set. The cartesian product of the samples taken for all response sequences in the rule set forms the simulated data set under the rule. Because part of the flag bits are nominal data, in order to facilitate algorithm processing, the nominal data is mapped into numbers by adopting a natural number coding mode to carry out vectorization processing.

In the above way, the simulation data set generated by all rule sets of the fingerprint database forms the data set required by training.
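The generation of simulated data from one rule set can be sketched as below. This is only an illustration under stated assumptions: each sequence group is represented by a list of its already expanded candidate values, and the number of samples per group is supplied by the caller through the hypothetical `pick_count` function, because the thresholds in the text do not cover every set size. All names and sample values are invented for the example.

```python
import itertools
import random

def simulate_rule_set(rule_set, pick_count, rng):
    """rule_set: dict sequence-group name -> list of candidate values.
    Samples each group, then returns the Cartesian product as records."""
    names = sorted(rule_set)
    sampled = []
    for name in names:
        values = rule_set[name]
        k = min(pick_count(len(values)), len(values))
        sampled.append(rng.sample(values, k))
    # One simulated record per combination of the sampled values.
    return [dict(zip(names, combo)) for combo in itertools.product(*sampled)]

def natural_number_encode(records, names):
    """Map each nominal value to a natural number, per field, in first-seen order."""
    codebooks = {name: {} for name in names}
    vectors = []
    for record in records:
        row = []
        for name in names:
            book = codebooks[name]
            row.append(book.setdefault(record[name], len(book)))
        vectors.append(row)
    return vectors, codebooks

rng = random.Random(0)
rule = {
    "SEQ": [f"seq{i}" for i in range(20)],
    "OPS": ["ops0"],
    "WIN": [f"win{i}" for i in range(12)],
}
# Hypothetical count rule used only for this example.
records = simulate_rule_set(rule, lambda size: 2 if size > 10 else 1, rng)
vectors, codebooks = natural_number_encode(records, ["OPS", "SEQ", "WIN"])
```

With two samples for SEQ, one for OPS, and two for WIN, the Cartesian product yields 2 x 1 x 2 = 4 simulated records for this rule set.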

S2, data preprocessing: some feature attributes in the third-party fingerprint library are range values, and random sampling can hardly draw every value in a range, so data passivation is applied to the range-valued dimensions by binning to reduce the information loss caused by random sampling;

S201, analyzing the features extracted from the third-party fingerprint library and, taking into account how the simulated data are generated by random sampling, determining the attributes for which incomplete data extraction may cause information loss.

S202, according to the distribution characteristics of the attributes in the third-party fingerprint database, reasonable binning passivation interval number and passivation modes (generally equal-distance binning or equal-frequency binning) are determined, and binning passivation processing is conducted on the attributes in the original data set to generate corresponding passivation scale files.

S203, in subsequent training and prediction, binning passivation is applied to the original training set/test set according to the same unified passivation scale.

Based on analysis of the Nmap fingerprint library, binning passivation with an interval of 10 is applied to the 8 attributes ECN.T and T1.T to T7.T.
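A minimal sketch of applying one shared passivation scale to a record is shown below. It assumes that "an interval of 10" means equal-width bins of width 10 and that the scale can be represented as a mapping from attribute name to bin width; `SCALE`, `apply_scale`, and the sample record are hypothetical.

```python
# Assumed scale: equal-width bins of width 10 for the eight TTL attributes.
SCALE = {name: 10 for name in ["ECN.T"] + [f"T{i}.T" for i in range(1, 8)]}

def apply_scale(record, scale):
    """Replace each scaled attribute by its bin index; leave the others as-is."""
    return {
        name: (value // scale[name] if name in scale else value)
        for name, value in record.items()
    }

# Hypothetical vectorized record with two TTL attributes and one window size.
record = {"T1.T": 64, "ECN.T": 59, "T1.W": 8192}
binned = apply_scale(record, SCALE)
```

Using the same `SCALE` object at training and prediction time is what keeps the two consistent, which is the purpose of the unified passivation scale in S203.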

S3, based on a set layered architecture, training random forest classifiers in parallel for the operating system category identification layer, the operating system major version identification layer, and the operating system detailed version identification layer;

The training algorithm uses a random forest built on an optimized C4.5 decision tree to improve prediction accuracy and generalization ability, as shown in FIG. 2.

S301, if the training set has n samples, sampling is performed from the current training set for n times in a Bootstrap mode, under the condition that the sample size is large enough, about 63.2% of samples in the original training set are extracted into a new sample subset to be used as an independent training set of the base learner for training, and 36.8% of samples which are not extracted can be used as a test set of the base learner for out-of-package estimation.
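The 63.2% / 36.8% split follows from the probability that a given sample is missed in all n draws, (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A small sketch (function names are illustrative only) computes this value and performs one bootstrap draw:

```python
import math
import random

def oob_probability(n):
    """Probability that a given sample is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n

def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag
```

The in-bag indices train the base learner, while the out-of-bag indices form its held-out test set.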

S302, calculating the information entropy of the current sample set:

$$\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where D is the training sample set, n is the number of classes contained in the current sample set, and p_i is the proportion of samples of the i-th class among all samples.
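A minimal sketch of the entropy computation in S302, using the standard Shannon entropy in bits (the function name is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy Ent(D), in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)
```

A sample set split evenly between two classes has an entropy of 1 bit, and a pure sample set has an entropy of 0.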

S303, a candidate attribute subset containing k attributes is randomly selected from the attribute set of the current sample set, and the conditional entropy of each feature in the candidate attribute subset is calculated. For a discrete feature, the sample set is divided into sub-sample sets according to each value of the feature, the information entropy of each sub-sample set is calculated, and the conditional entropy is then obtained. For a continuous feature, the attribute values are first sorted; the midpoint of two adjacent values is taken as a candidate split point only where the class label changes and the difference between the two adjacent values is greater than the threshold set for the attribute, so that the continuous attribute is converted into a discrete attribute for calculation.
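The candidate split point rule for continuous features can be sketched as below. The name `min_gap` stands for the per-attribute threshold mentioned in the text and is a hypothetical parameter name; how ties between equal values are ordered is not specified in the text and is left to the sort.

```python
def candidate_split_points(values, labels, min_gap=0.0):
    """Midpoints between adjacent sorted values where the class label changes
    and the gap between the two values exceeds min_gap."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    points = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and (v2 - v1) > min_gap:
            points.append((v1 + v2) / 2)
    return points
```

Restricting candidates to label boundaries with a sufficient gap reduces the number of split points that must be evaluated for each continuous attribute.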

Let D be the current sample set and A a candidate partition attribute of the sample set, where A has n possible values (a_1, a_2, ..., a_n) and p_{a_i} is the proportion of samples in the sample set whose value of feature A is a_i. The conditional entropy of feature A is

$$\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} p_{a_i}\,\mathrm{Ent}(D \mid A = a_i)$$

S304, calculating the information gain of each feature. The information gain of a feature is obtained by subtracting the conditional entropy of the feature from the empirical entropy of the data set. Because the information gain ratio criterion is biased toward continuous-valued features, the information gain of a continuous-valued feature is corrected by subtracting log₂(N-1)/|D|, where N is the number of possible split points and |D| is the data set size.

For feature A, the information gain is Gain(D, A) = Ent(D) - Ent(D | A).

For the continuous attribute, the split point at which the corrected information gain is maximum is selected as the optimal split point for the attribute.
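The conditional entropy, information gain and the correction for continuous attributes can be sketched together. The function names are illustrative, and skipping the correction when fewer than two split points exist (where log₂(N-1) is undefined) is an assumption of this sketch, not something the text states.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """Ent(D | A): entropy of each value's sub-sample set, weighted by its share."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())

def information_gain(feature_values, labels):
    """Gain(D, A) = Ent(D) - Ent(D | A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

def corrected_gain(values, labels, threshold, n_split_points):
    """Gain of the binary split value <= threshold, minus log2(N - 1) / |D|."""
    sides = [v <= threshold for v in values]
    # ASSUMPTION: no penalty when N < 2, since log2(N - 1) is undefined there.
    penalty = math.log2(n_split_points - 1) / len(labels) if n_split_points > 1 else 0.0
    return information_gain(sides, labels) - penalty
```

For each continuous attribute, the corrected gain would be evaluated at every candidate split point and the maximum kept as the attribute's optimal split.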

S305, the features whose information gain is higher than the average information gain of all features in the current candidate attribute subset, and also higher than the set information gain precision threshold, are selected; their information gain ratios are calculated, and the feature with the highest information gain ratio is selected as the division feature.

For feature A, the information gain ratio is calculated as

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)}$$

where the splitting information is

$$\mathrm{SplitInfo}(D, A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

Here |D_i| is the number of samples whose value of feature A is a_i, and |D| is the data set size.
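The two-stage selection in S305 (shortlist by information gain, then rank by gain ratio) might look like the sketch below. All names are illustrative. Using `>=` against the average instead of a strict comparison is an assumption so that ties do not empty the shortlist, and returning a ratio of 0 when the splitting information is 0 is likewise an assumption.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())

def information_gain(feature_values, labels):
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - cond

def split_info(feature_values):
    """Splitting information: the entropy of the feature's own value distribution."""
    return entropy(feature_values)

def gain_ratio(feature_values, labels):
    info = split_info(feature_values)
    # ASSUMPTION: a feature with a single value gets ratio 0 instead of dividing by 0.
    return information_gain(feature_values, labels) / info if info > 0 else 0.0

def choose_split_feature(columns, labels, min_gain=0.0):
    """columns maps a feature name to its list of values, aligned with labels."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    mean_gain = sum(gains.values()) / len(gains)
    shortlist = [n for n, g in gains.items() if g >= mean_gain and g > min_gain]
    if not shortlist:
        return None
    return max(shortlist, key=lambda n: gain_ratio(columns[n], labels))
```

Shortlisting by gain first prevents a feature with tiny splitting information from winning on gain ratio alone.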

S306, the data set is divided according to the values of the division feature, and the above operations are applied recursively to each branch until any one of the following three conditions is met, at which point the division terminates: all training samples of the current node belong to one class, in which case the node is marked with that class and becomes a leaf node; the samples of the current node have no remaining attribute that can be used for division; or the number of training samples at the current node is smaller than the minimum sample size threshold.

A single decision tree is constructed by the above method, and a random forest can be regarded as a set of decision trees. Decision trees are constructed repeatedly in this manner; the out-of-bag test accuracy of each tree is compared with a set accuracy threshold, and the tree is added to the random forest if its test accuracy is higher than the threshold. During prediction, each tree in the forest gives a classification result, and the category with the most votes is selected as the final prediction result by equal-weight (flat) voting.
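The admission test and flat voting can be sketched as follows, with each tree represented as a plain callable that returns a class label. That representation and the function names are assumptions made for illustration.

```python
from collections import Counter

def admit_trees(candidates, oob_threshold):
    """candidates: list of (tree, oob_accuracy); keep trees above the threshold."""
    return [tree for tree, accuracy in candidates if accuracy > oob_threshold]

def majority_vote(forest, sample):
    """Each tree casts one equal-weight vote; the most common label wins.
    On a tie, the label that was voted for first is returned."""
    votes = Counter(tree(sample) for tree in forest)
    return votes.most_common(1)[0][0]
```

Trees whose out-of-bag accuracy does not exceed the threshold never enter the forest, so they cannot dilute the vote.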

The random forest algorithm guarantees accuracy while greatly improving generalization ability. However, if training is carried out directly with the detailed operating system version as the class label, a large amount of sample data is usually required because operating systems have many classes. When the model accuracy is improved by parameter tuning or by enriching the sample data in a data-driven manner, a full retraining is usually required, which is very costly. Operating system labels are hierarchical, comprising category, major version and detailed version. A hierarchical random forest architecture built on this label hierarchy can therefore effectively solve the problem while providing identification at different levels and granularities, as shown in FIG. 1.

Constructing the first layer, the operating system category layer: first, m pieces of data are drawn from the generated simulation data set by stratified sampling according to the category labels to form a new training set Mset, and the test accuracy is obtained by out-of-bag estimation. According to the usage rate and identification requirements of each operating system, n large-category labels {x_1, x_2, ..., x_n} are determined. The finest-grained labels in the original training set that belong to the n large categories are remapped to the coarse-grained category labels x_i, and operating system labels that do not belong to the n large categories are mapped to the Others type. The mapped data set is input into the random forest algorithm for training to obtain the operating system large-category classifier. Because remapping greatly reduces the number of class labels, a large-category classifier with higher accuracy can be trained with relatively few samples. A threshold accuracy k1 is set for the large-category classifier, and its accuracy is tested with the out-of-bag estimation data. When the accuracy meets the threshold requirement, construction of the first-layer classifier is complete; otherwise, the algorithm parameters are adjusted or the number of training samples is increased, and the classifier is retrained until the accuracy meets the threshold requirement.
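The label remapping for the first layer can be illustrated with a short sketch. The detailed labels and the `category_of` lookup below are made-up examples, not entries taken from any fingerprint library.

```python
def remap_to_category(detailed_labels, category_of, kept_categories, other="Others"):
    """Map finest-grained labels to coarse category labels; anything whose
    category is not among the kept large categories becomes `other`."""
    coarse = []
    for label in detailed_labels:
        category = category_of.get(label)
        coarse.append(category if category in kept_categories else other)
    return coarse

# Hypothetical example data for illustration only.
category_of = {
    "Windows 10 1809": "Windows",
    "Ubuntu 18.04": "Linux",
    "FreeBSD 11": "FreeBSD",
}
coarse = remap_to_category(
    ["Windows 10 1809", "Ubuntu 18.04", "FreeBSD 11"],
    category_of,
    kept_categories={"Windows", "Linux"},
)
```

Here `coarse` holds `["Windows", "Linux", "Others"]`, so the first-layer classifier only has to separate three classes instead of every detailed version.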

Constructing the second layer, the major version identification layer: based on the n large-category labels of the first layer, the training set Mset is divided into n groups of data by large category, the class labels of each group are mapped to the form "category + major version number", and each group is trained with the random forest algorithm to generate n major version classifiers. The threshold accuracy of the major version classifiers is set to k2, and out-of-bag estimation is used for testing: the accuracy of each major version classifier is evaluated on the test data of its own group. A classifier is retained when its accuracy meets the threshold requirement; otherwise it is retrained after adjusting parameters or enriching the training sample data from the fingerprint library again, until the accuracy meets the threshold requirement. The n major version classifiers constitute the second-layer major version identification layer of the hierarchical architecture.

The third layer is the detailed version identification layer, which can be constructed flexibly according to the identification granularity required for each operating system. Let v be the number of major version labels generated by the second-layer mapping. The training set Mset is divided into v groups of data according to the second-layer major version labels, and each group is trained with the random forest algorithm to generate v detailed version classifiers. The threshold accuracy of the detailed version classifiers is set to k3, and each detailed version classifier is evaluated by out-of-bag estimation. If the accuracy of a classifier is below the threshold requirement, it is retrained after parameter adjustment or after resampling to enrich the data, until the threshold requirement is met.

The three-layer architecture trains the training data in groups, so the classifiers can be trained in parallel, which reduces the time overhead. The classifiers in the same layer are mutually independent and are easy to manage and extend. During prediction, the test data first enters the top-layer classifier, which identifies the large category of the operating system; the data then enters the second-layer classifier for that category, which identifies the major version number within the category; finally, the data enters the third-layer detailed version classifier for that major version to obtain the final detailed operating system information.
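The prediction cascade through the three layers can be sketched as below. Classifiers are modeled as callables stored in dictionaries keyed by the label of the layer above; this representation, and stopping early with `None` when no lower-layer classifier exists, are assumptions of the sketch.

```python
def predict_hierarchical(sample, category_clf, major_clfs, detail_clfs):
    """Return (category, major_version, detailed_version) for one sample."""
    category = category_clf(sample)
    major_clf = major_clfs.get(category)
    if major_clf is None:
        # ASSUMPTION: e.g. the "Others" class has no second-layer classifier.
        return category, None, None
    major = major_clf(sample)
    detail_clf = detail_clfs.get(major)
    if detail_clf is None:
        return category, major, None
    return category, major, detail_clf(sample)
```

Because each lower-layer classifier is looked up by the upper-layer result, an error in the top layer routes the sample to the wrong classifier, which is why the upper layers need the highest reliability.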

During prediction, the upper and lower layers are closely coupled, and the reliability of the upper-layer recognition result directly affects the lower-layer prediction result. The upper-layer classifiers therefore have higher reliability requirements, and the accuracy thresholds are set to satisfy k1 > k2 > k3.

The layered architecture trains the training data in layered groups at different granularities. When a classifier is adjusted, only that classifier needs to be retrained on its own part of the training data, and the full training sample data does not need to be retrained, so the method can be regarded approximately as an incremental training mode.

The method supports a sample increment mode and a category increment mode.

For the sample increment mode, each incremental sample is sent, according to its operating system category and major version number, to the corresponding classifiers in the three layers for identification. If a classifier identifies the sample correctly, no processing is needed; otherwise, the incremental sample is added to the training set of the classifier that misidentified it. A fault-tolerant sample number threshold p is set for each classifier; if the number of samples newly added to the training set of a classifier exceeds p, the classifier is retrained with the original training set together with the newly added samples.

For the category increment mode, if the large category of an incremental sample cannot be correctly identified by the top-layer classifier, the sample is added to the top-layer training set, and when the number of newly added samples exceeds the threshold p, the classifier is retrained with the original training set and the newly added samples. If the newly added category has no corresponding classifier in the second or third layer, a corresponding classifier must be added; the number of samples required by the newly added classifier is set to q, and when the number of samples of the new category exceeds q, that sample set is used as the training set to train the newly added classifier.
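The sample increment mode for a single classifier can be sketched as a small wrapper that buffers misidentified samples and retrains once more than p of them have accumulated. The class name, the `retrain` callback and the callable predictor are all hypothetical constructs for illustration.

```python
class IncrementalSlot:
    """Wraps one classifier together with its training set and error buffer."""

    def __init__(self, predict, retrain, train_set, p):
        self.predict = predict        # callable: sample -> label
        self.retrain = retrain        # callable: training set -> new predict callable
        self.train_set = list(train_set)
        self.pending = []             # misidentified incremental samples
        self.p = p                    # fault-tolerant sample number threshold

    def observe(self, sample, true_label):
        """Feed one labeled incremental sample; return True if retraining ran."""
        if self.predict(sample) == true_label:
            return False              # correctly identified: nothing to do
        self.pending.append((sample, true_label))
        if len(self.pending) > self.p:
            # Retrain on the original training set plus the new samples.
            self.train_set.extend(self.pending)
            self.pending.clear()
            self.predict = self.retrain(self.train_set)
            return True
        return False
```

Only the classifier that made the errors is retrained, which is what keeps the cost local instead of requiring a full retraining of all layers.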

Operating system identification and prediction are performed on real traffic. Because the attribute features and data of the training set are obtained by analyzing the third-party fingerprint library and simulating from it, the real traffic to be predicted must be acquired in the same way as the fingerprints of the third-party fingerprint library are formed. Probe sequence packets are sent to the target host to be detected in the same manner as the third-party probe packets, using the same scale, to obtain the probe response sequences; a fingerprint is extracted from the response sequences, converted into a numerical vector and input into the hierarchical model, and the corresponding prediction result is obtained according to the required identification granularity.

To address the identification of proprietary systems in private environments that do not appear in the third-party fingerprint database, the invention supports fingerprint training in private environments. The labels and IP lists of the proprietary systems are obtained, probe data packets are sent to those IPs to obtain the fingerprint data of their responses, and the fingerprint data is mapped into vectors and stored in the training set. Retraining is performed after the newly added training data reaches a set threshold, thereby extending the algorithm's fingerprint model to the proprietary systems.

S5, identifying and predicting the real probe traffic.

Operating system identification and prediction are performed on real traffic. Because the attribute features and data of the training set are obtained by analysis of and simulation from the Nmap fingerprint database, the real traffic to be predicted is acquired in the same way as the Nmap fingerprints are formed. Following the Nmap probe sending mode and using the same scale, the following probes are sent to the target host to be detected: 6 TCP SYN probe packets, generating the four test response sequences SEQ, OPS, WIN and T1; 2 ICMP echo probe packets, generating the IE sequence; 1 TCP SYN probe packet, obtaining the features describing TCP explicit congestion notification (ECN); 6 TCP probe packets, generating the six response sequences T2 to T7; and 1 UDP probe packet to a closed port, generating the sequence U1. The 13 response sequences are mapped into numerical vectors in the manner of step S1, passivated accordingly, and input into the hierarchical model, and the corresponding prediction result is obtained according to the required identification granularity.
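As a quick consistency check on the probe plan above, the packet and response-sequence counts can be tallied. The table below only restates the numbers in the text; the variable names and the short probe descriptions are illustrative.

```python
# (probe description, packets sent, response sequences produced)
PROBE_PLAN = [
    ("TCP SYN sequence probes", 6, ["SEQ", "OPS", "WIN", "T1"]),
    ("ICMP echo probes", 2, ["IE"]),
    ("TCP SYN probe for explicit congestion notification", 1, ["ECN"]),
    ("TCP probes", 6, ["T2", "T3", "T4", "T5", "T6", "T7"]),
    ("UDP probe to a closed port", 1, ["U1"]),
]

total_packets = sum(count for _, count, _ in PROBE_PLAN)
all_sequences = [name for _, _, names in PROBE_PLAN for name in names]
```

The totals come to 16 packets and 13 response sequences, matching the figures given in the description and in the claims.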

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Effect verification

(1) Out-of-bag estimation verification on simulated data

The random forest algorithm samples with replacement in the Bootstrap manner. With a large sample size, about 63.2% of the samples are drawn as the training set of a base learner, and the remaining 36.8% can be used as a validation set for out-of-bag (OOB) estimation. Compared with k-fold cross-validation, this approach is inexpensive and can effectively verify the generalization performance of the model.

The out-of-package estimation evaluation index is as follows:

Table 1 Macro indicators of the operating system category layer verification

Table 2 Detailed indicators of the operating system category layer verification

Table 3 Windows major version macro indicators

Table 4 Windows major version detailed indicators

Table 5 Linux major version macro indicators

Table 6 Linux major version detailed indicators

As shown in Tables 1 to 6, effect verification on the operating system category layer and the Windows major version identification layer using the out-of-bag estimation data demonstrates the effectiveness and feasibility of the method. The recall for Windows 8 is low; a possible reason is that Windows 8 is a transitional system whose fingerprints differ little from those of other Windows systems, which makes it difficult to identify. Overall, however, the method is feasible and effective.

(2) Real-environment traffic verification

Windows hosts and Linux servers in a network segment of an office network were probed by sending packets in the Nmap probing manner, and the obtained fingerprints were predicted by the algorithm model. Probe data for Windows and Linux hosts in the same network segment were collected on April 15 and April 16, and the verification results are as follows:

Table 7 Surviving hosts list 1

Table 8 Surviving hosts list 2

Referring to Table 7, Table 8 and FIG. 3: because the available real system environments are limited, only real Windows and Linux hosts in the office network were identified. The identification accuracy is high, which shows that the method is feasible and effective.

In summary, based on a third-party fingerprint library, the invention constructs simulation data and maps it into vectors for algorithm learning. To build a resource fingerprint structure that is highly reliable and easy to extend and manage, a random forest algorithm is used to combine a plurality of weak fingerprints into a strong fingerprint, which improves the identification accuracy and the number of recognizable operating system categories. A layered training architecture is designed to enable long-term extension and maintenance of the operating system fingerprint library based on the random forest structure.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. An operating system identification method based on random forests is characterized by comprising the following steps:

2. The operating system identification method based on the random forest as claimed in claim 1, wherein in step S1, construction is performed based on the Nmap fingerprint database; Nmap sends 16 data packets to generate the corresponding response sequences, each response sequence corresponding to different flag bits; the 16 data packets yield 119 sequence flag bits in total, the two flag bits SEQ.SP and SEQ.ISR are removed, and the remaining 117 flag bits are used as characteristic attributes of the training data; each operating system corresponds to one or more rule sets, each rule set is composed of the sequences SEQ, OPS, WIN, T1, IE, U1, ECN and T2-T7, and the flag bit values of each sequence take the form of range values or parallel multiple values; the fingerprint data is disassembled from each rule set: the full set of fingerprint sequences is first obtained, and if the full set of a sequence contains more than 500 entries, 4 samples are randomly selected from it; if it contains more than 10 and fewer than 100 entries, 2 samples are randomly selected; if it contains fewer than 10 entries, 1 sample is randomly selected; the Cartesian product of the samples drawn for all response sequences in the rule set forms the simulated data set for that rule set, and the nominal data is mapped to numbers by natural-number coding for vectorization.

3. The random forest-based operating system identification method as claimed in claim 2, wherein sending 6 TCP SYN probe packets generates the four test response sequences SEQ, OPS, WIN and T1; SEQ is the sequence analysis result based on the probe packets, and comprises the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the TCP ISN counter rate (ISR), the IP ID sequence generation algorithm (TI, CI, II), the shared IP ID sequence Boolean value (SS) and the TCP timestamp option algorithm (TS); OPS is the TCP options received for each probe packet; WIN is the TCP initial window size received for each probe packet; T1 contains the test values for packet 1, including response (R), IP don't-fragment bit (DF), IP initial time-to-live (T), IP initial time-to-live guess (TG), TCP sequence number (S), TCP acknowledgement number (A), TCP flags (F), TCP RST data checksum (RD) and TCP miscellaneous quirks (Q);

sending 2 ICMP echo probe packets generates the IE sequence, comprising response (R), don't-fragment bit (DFI), IP initial time-to-live (T), IP initial time-to-live guess (TG) and ICMP response code (CD);

sending 1 TCP SYN probe packet obtains the features describing TCP explicit congestion notification and generates the ECN response sequence, comprising response (R), don't-fragment bit (DFI), IP initial time-to-live (T), IP initial time-to-live guess (TG), TCP initial window size (W), TCP options (O) and TCP miscellaneous quirks (Q);

sending 1 UDP probe packet to a closed port generates the sequence U1, comprising response (R), IP don't-fragment bit (DF), IP initial time-to-live (T), IP initial time-to-live guess (TG), IP total length (IPL), unused port unreachable field non-zero (UN), returned probe IP total length (RIPL), returned probe IP ID value (RID), integrity of the returned probe IP checksum value (RIPCK), integrity of the returned probe UDP checksum (RUCK) and integrity of the returned UDP data (RUD).

4. The random forest-based operating system identification method as claimed in claim 1, wherein the step S2 is specifically:

5. The random forest-based operating system identification method as claimed in claim 1, wherein the step S3 is specifically:

S301, if the training set has n samples, n draws are made from the current training set by sampling with replacement in the Bootstrap manner, generating a sampling subset containing n samples; as n tends to infinity, the probability that a given sample in the original training set is never drawn tends to 36.8%, so about 63.2% of the samples are drawn into the new sampling subset, which is used as the independent training set of a base learner, and the 36.8% of samples that are not drawn are used as the test set of the base learner for out-of-bag estimation;

S302, letting D be the current sample set, and calculating the information entropy of the current sample set:

$$\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where n is the number of classes contained in the current sample set, and p_i is the proportion of samples of the i-th class among all samples;

S303, randomly selecting a candidate attribute subset containing k attributes from the attribute set of the current sample set D, and calculating the conditional entropy of each feature in the candidate attribute subset, the conditional entropy formula being:

$$\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} p_{a_i}\,\mathrm{Ent}(D \mid A = a_i)$$

wherein A is a candidate partition attribute of the sample set, A has n possible values a_1, a_2, ..., a_n, and p_{a_i} is the proportion of samples in the sample set whose value of feature A is a_i;

S304, calculating the information gain of each feature of the current sample set D; because the information gain ratio as a division criterion favors continuous-valued features, the information gain of a continuous-valued feature is corrected by subtracting log₂(N-1)/|D|, where N is the number of possible split points and |D| is the size of the data set; for feature A, the information gain is Gain(D, A) = Ent(D) - Ent(D | A); for a continuous attribute, the split point with the maximum corrected information gain is selected as the optimal split point of the attribute;

6. The random forest-based operating system identification method as claimed in claim 5, wherein in step S303, for a discrete feature, the sample set is divided into sub-sample sets according to each value of the feature, the information entropy of each sub-sample set is calculated, and the conditional entropy is then calculated; for a continuous feature, the attribute values are sorted, and the midpoint of two adjacent values is taken as a candidate split point only where the class label changes and the difference between the two adjacent values is greater than the threshold set for the attribute, so that the continuous attribute is converted into a discrete attribute for calculation.

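7. The method of claim 5, wherein in step S305, the information gain ratio is calculated as

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)}$$

wherein the splitting information is

$$\mathrm{SplitInfo}(D, A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

wherein |D_i| is the number of samples taking the i-th value of feature A, and |D| is the size of the data set.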

8. The random forest-based operating system identification method as claimed in claim 1, wherein in step S3, the first layer, the operating system category layer, is constructed: m pieces of data are drawn from the generated simulation data set by stratified sampling according to the category labels to form a new training set Mset, and the test accuracy is obtained by out-of-bag estimation; n large-category labels {x_1, x_2, ..., x_n} are determined according to the usage rate and identification requirements of each operating system; the finest-grained labels in the original training set that belong to the n large categories are remapped to the coarse-grained category labels x_i, operating system labels that do not belong to the n large categories are mapped to the Others type, and the mapped data set is input into the random forest method for training to obtain the operating system large-category classifier; a threshold accuracy k1 is set for the large-category classifier, and the accuracy of the classifier is tested with the out-of-bag estimation data; when the accuracy meets the threshold requirement, construction of the first-layer classifier is complete, and if the accuracy does not meet the threshold, the algorithm parameters are adjusted or the number of training samples is increased for retraining so as to raise the accuracy to the threshold requirement;

9. The operating system identification method based on the random forest as claimed in claim 1, wherein in step S4, probe sequence packets are sent to the target host to be detected in the same manner as the third-party probe packets, using the same scale, to obtain the probe response sequences; a fingerprint is extracted from the probe response sequences, converted into a numerical vector and input into the hierarchical model, and the corresponding prediction result is obtained according to the required identification granularity; the labels and IP lists of proprietary systems are obtained, probe data packets are sent to those IPs to obtain the fingerprint data of their responses, the fingerprint data is mapped into vectors and stored in the training set, and retraining is performed after the newly added training data reaches a set threshold, thereby extending the algorithm's fingerprint model to the proprietary systems.

10. The operating system identification method based on the random forest as claimed in claim 1, wherein in step S5, in the same manner as Nmap probe packet sending and using the same scale, 6 TCP SYN probe packets are sent to the target host to be detected to generate the four test response sequences SEQ, OPS, WIN and T1; 2 ICMP echo probe packets are sent to generate the IE sequence; 1 TCP SYN probe packet is sent to obtain the features describing TCP explicit congestion notification; 6 TCP probe packets are sent to generate the six response sequences T2-T7; and 1 UDP probe packet is sent to a closed port to generate the sequence U1; the 13 response sequences are mapped into numerical vectors in the manner of step S1, passivated accordingly and input into the hierarchical model, and the corresponding prediction result is obtained according to the required identification granularity; during prediction, the test data enters the top-layer classifier, which identifies the large category of the operating system; the data then enters the second-layer classifier for that category, which identifies the major version number within the category; and the data finally enters the third-layer detailed version classifier for that major version to obtain the final detailed operating system information.