CN110519128A - An operating system identification method based on random forest - Google Patents
An operating system identification method based on random forest
- Publication number: CN110519128A
- Application number: CN201910893976.9A
- Authority: CN (China)
- Prior art keywords: data, tcp, training, operating system, sequence
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- H04L41/142: Network analysis or design using statistical or mathematical methods
- H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
- H04L41/147: Network analysis or design for predicting network behaviour
- H04L43/0847: Transmission error
- H04L43/10: Active monitoring, e.g. heartbeat, ping or trace-route
- H04L43/106: Active monitoring, e.g. heartbeat, ping or trace-route using time related information in packets, e.g. by adding timestamps
- H04L69/163: In-band adaptation of TCP data exchange; In-band control procedures
- H04L69/164: Adaptation or special uses of UDP protocol
Abstract
The invention discloses an operating system identification method based on random forest. Using the Monte Carlo method, a fingerprint database is randomly sampled and the samples are combined into a training set and a test set, which are then vectorized. The data are coarsened ("passivated") by binning. Following a preset layered architecture that consists of an operating system category identification layer, a major version identification layer and a detailed version identification layer, a random forest classifier is trained for each layer: multiple decision trees are built, and a tree is added to the random forest only if the test accuracy of its out-of-bag estimate is higher than the preset accuracy threshold. The layered architecture is retrained locally, with parameter tuning, to improve model accuracy. For identification, real probe traffic is fed to the model; every tree in the random forest gives a classification result and, under equal-weight voting, the class with the most votes is the final prediction. The method can effectively identify unknown fingerprints and improves identification accuracy.
Description
Technical field
The invention belongs to the field of computer technology, and in particular relates to an operating system identification method based on random forest.
Background
With the rapid spread of the Internet, network security has become increasingly important. Detecting and identifying operating systems is of great significance to the assessment and protection of network security, and is an important step in asset identification.
At present, most probing tools rely on a database of known operating system fingerprints and use traditional static fingerprint matching, which has difficulty identifying unknown fingerprints. Introducing machine learning algorithms to further mine the necessary and sufficient conditions of a fingerprint from its features can effectively solve the unknown-fingerprint problem and ensure the reliability of fingerprints in a higher dimension. Zou Tiezheng proposed an operating system identification method based on support vector machines, which classifies by constructing a large number of binary classifiers. However, this method has limitations: as the number of operating system types grows, the training overhead of the support vector machine method becomes excessive and application performance becomes a bottleneck.
Summary of the invention
In view of the above deficiencies in the prior art, the technical problem to be solved by the invention is to provide an operating system identification method based on random forest. The method builds a highly reliable, easily extensible random forest fingerprint structure and obtains a highly general fingerprint model for identifying a wide range of software operating systems and Internet of Things devices. It also supports training private models for a user's own environment.
The invention adopts the following technical solution:
An operating system identification method based on random forest, comprising the following steps:
S1, using the Monte Carlo method, analyze a third-party fingerprint database to determine the feature attributes used in training, the value range of each attribute and the fingerprint sets most likely to occur; randomly sample the fingerprint database and combine the samples into a training set and a test set; and vectorize the data of the training set and the test set;
S2, for the feature attributes in the third-party fingerprint database whose values are ranges, coarsen ("passivate") the value-range dimensions by binning;
S3, following a preset layered architecture that consists of an operating system category identification layer, an operating system major version identification layer and an operating system detailed version identification layer, train a random forest classifier for each layer: build multiple decision trees, compare the test accuracy of each tree's out-of-bag estimate with the preset accuracy threshold, and add the tree to the random forest if the accuracy is higher than the threshold;
S4, retrain the layered architecture locally, with parameter tuning, to improve model accuracy;
S5, perform identification on real probe traffic: each tree in the random forest gives a classification result and, under equal-weight voting, the class with the most votes is the final prediction.
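The equal-weight voting of step S5 can be illustrated with a minimal Python sketch. The function name `majority_vote` and the sample labels are hypothetical and only serve the illustration; the patent does not specify how ties are broken, so the tie rule below is an assumption.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the most trees; every tree has one equal vote."""
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    counts = Counter(tree_predictions)
    # Assumption of this sketch: ties go to the label that was seen first.
    return counts.most_common(1)[0][0]

# Hypothetical votes from five trees of one forest
votes = ["Linux", "Windows", "Linux", "Linux", "FreeBSD"]
print(majority_vote(votes))
```

Because each tree contributes exactly one vote, no tree can dominate the result regardless of its individual out-of-bag accuracy.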
Specifically, in step S1 the data set is built from the Nmap fingerprint database. Nmap sends 16 packets and generates the corresponding response sequences, each of which carries a different set of flag fields. The 16 packets yield 119 flag fields in total; after removing the two fields SEQ.SP and SEQ.ISR, the remaining 117 fields serve as the feature attributes of the training data. Each operating system corresponds to one or more rule sets, and each rule set consists of the SEQ, OPS, WIN, T1, IE, U1, ECN and T2-T7 sequence groups. The value of each field in a sequence group is either a value range or a list of alternative values. Fingerprint data are extracted from each rule set as follows: first obtain the full set of values of each fingerprint sequence; if the full set contains more than 500 values, randomly select 4 samples from it; if it contains more than 10 and fewer than 100 values, randomly select 2 samples; if it contains fewer than 10 values, randomly select 1 sample. The Cartesian product of the samples drawn from all response sequences in the rule set forms the simulated data set for that rule. Nominal data are mapped to numbers by natural-number encoding to complete the vectorization.
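The sampling, Cartesian product and natural-number encoding described above could be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the field names (`T1.DF`, `T1.T`, `WIN.W1`) and their values are hypothetical, and the sample count for full-set sizes that the text leaves unspecified (exactly 10, and 100 to 500) is an assumption.

```python
import itertools
import random

def sample_count(full_set_size):
    """Number of values to draw from one sequence's full value set.
    The >500 and <10 rules and the 10..100 rule follow the text; for the sizes
    the text does not cover (exactly 10, and 100 to 500) this sketch ASSUMES 2."""
    if full_set_size > 500:
        return 4
    if full_set_size < 10:
        return 1
    return 2

def simulate_rule_set(rule_set, rng):
    """rule_set maps a field name to the full list of values it may take."""
    names = sorted(rule_set)
    drawn = []
    for name in names:
        values = rule_set[name]
        k = min(sample_count(len(values)), len(values))
        drawn.append(rng.sample(values, k))
    # Cartesian product of the per-field samples gives the simulated records
    return [dict(zip(names, combo)) for combo in itertools.product(*drawn)]

def natural_number_encode(rows, names):
    """Map each nominal value to 0, 1, 2, ... per attribute, in first-seen order."""
    codebooks = {name: {} for name in names}
    vectors = []
    for row in rows:
        vec = []
        for name in names:
            book = codebooks[name]
            vec.append(book.setdefault(row[name], len(book)))
        vectors.append(vec)
    return vectors, codebooks

# Hypothetical rule set with three fields
rule_set = {
    "T1.DF": ["Y"],
    "T1.T": list(range(0x3B, 0x46)),       # 11 values, so 2 samples are drawn
    "WIN.W1": ["2000", "4000", "FFFF"],    # 3 values, so 1 sample is drawn
}
rows = simulate_rule_set(rule_set, random.Random(0))
vectors, codebooks = natural_number_encode(rows, sorted(rule_set))
print(len(rows), vectors)
```

The code books returned by `natural_number_encode` would have to be reused at prediction time so that the same nominal value always maps to the same number.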
Further, sending 6 TCP SYN probe packets generates the four test response sequences SEQ, OPS, WIN and T1. SEQ is the result of sequence analysis of the probe packets and includes the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the TCP ISN counter rate (ISR), the IP ID sequence generation algorithm (TI, CI, II), the shared IP ID sequence Boolean (SS) and the TCP timestamp option algorithm (TS). OPS is the TCP options received for each probe packet, and WIN is the TCP initial window size received for each probe packet. T1 contains the test values for packet 1, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP RST data checksum (RD) and TCP miscellaneous quirks (Q);
Sending 2 ICMP echo probe packets generates the IE sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG) and the ICMP response code (CD);
Sending 1 TCP SYN probe packet obtains the features that describe TCP explicit congestion notification and generates the ECN response sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP options (O) and TCP miscellaneous quirks (Q);
Sending 6 TCP probe packets generates the six response sequences T2-T7, one per packet. Each sequence includes responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP options (O), the TCP RST data checksum (RD) and TCP miscellaneous quirks (Q);
Sending 1 UDP probe packet to a closed port generates the U1 sequence, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the IP total length (IPL), whether the unused field of the port-unreachable message is nonzero (UN), the returned probe IP total length (RIPL), the returned probe IP ID value (RID), the integrity of the returned probe IP checksum value (RIPCK), the integrity of the returned probe UDP checksum (RUCK) and the integrity of the returned UDP data (RUD).
Specifically, step S2 is as follows:
S201, analyze the features extracted from the third-party fingerprint database and, taking into account how the simulated data are generated by random sampling, determine the attributes that lose information because sampling cannot cover all of their values;
S202, according to the distribution of these attributes in the third-party fingerprint database, determine a reasonable number of bins and a coarsening method, apply binning to these attributes in the original data set, and generate the corresponding coarsening scale file;
S203, during subsequent training and prediction, bin the original training set and test set using the same coarsening scale.
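Steps S201 to S203 amount to fitting bin edges once, storing them, and reusing the stored edges for every later data set. The following Python sketch illustrates the idea under stated assumptions: equal-width bins are used because the text does not name a binning rule, a JSON string stands in for the scale file, and the attribute name `T1.T` and its values are hypothetical.

```python
import bisect
import json

def build_scale(values, n_bins):
    """Equal-width bin edges over the observed range (n_bins - 1 inner edges).
    Equal-width binning is an assumption of this sketch."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def apply_scale(value, edges):
    """Index of the bin containing value; out-of-range values fall into the end bins."""
    return bisect.bisect_right(edges, value)

# Hypothetical initial time-to-live values seen in the fingerprint database
train_ttl = [0x3B, 0x40, 0x45, 0x7B, 0x80, 0x85, 0xFA, 0xFF]
scale = {"T1.T": build_scale(train_ttl, 4)}

scale_file = json.dumps(scale)            # stands in for the stored scale file
edges = json.loads(scale_file)["T1.T"]    # reloaded at training or prediction time

binned_train = [apply_scale(v, edges) for v in train_ttl]
print(binned_train)
print(apply_scale(0x41, edges))           # an unseen value lands in the same bin as its neighbours
```

Reusing the stored edges is what lets a value that was never sampled, such as 0x41 here, be treated the same way as the sampled values around it.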
Specifically, step S3 is as follows:
S301, if the training set has n samples, draw n samples with replacement from the current training set using the bootstrap method. About 63.2% of the samples in the original training set are drawn into the new sample set, which serves as the independent training set of the base learner; the remaining 36.8% of samples, which are never drawn, serve as the test set of that base learner for out-of-bag estimation;
S302, compute the information entropy of the current sample set:
$$\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
where D is the training sample set, n is the number of classes in the current sample set, and p_i is the proportion of samples of class i in the total;
S303, randomly select a candidate attribute subset of k attributes from the attribute set of the current sample set, and compute the conditional entropy of each feature in the candidate subset:
$$\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} p_i \, \mathrm{Ent}(D \mid A = a_i)$$
where D is the current sample set, A is a candidate splitting attribute of the sample set with n possible values (a_1, a_2, ..., a_n), and p_i is the proportion of samples whose feature A takes the value a_i;
S304, compute the information gain of each feature. For feature A, the information gain is gain(D, A) = Ent(D) - Ent(D | A). Because using the information gain ratio as the splitting criterion is biased toward continuous-valued features, the information gain of a continuous attribute is corrected by subtracting log_2(N-1)/|D|, where N is the number of possible split points and |D| is the data set size. For a continuous attribute, the split point with the largest corrected information gain is chosen as the best split point of the attribute;
S305, select the features whose information gain is higher than the mean information gain of all features in the current candidate attribute subset and higher than the preset information gain threshold, compute their information gain ratios, and choose the feature with the highest information gain ratio as the splitting feature;
S306, split the data set according to the values of the splitting feature and recursively apply the above operations to each branch. Splitting stops when one of the following three conditions is met: all training samples at the current node belong to one class, in which case the node is labeled with that class and becomes a leaf node; no attributes remain that can be used to split the samples at the current node; or the number of training samples at the current node is below the minimum sample count threshold.
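The 63.2% and 36.8% figures in S301 follow from the probability that a given sample is never drawn in n draws with replacement, (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. The short Python sketch below checks this numerically and shows the out-of-bag admission rule from step S3; the function names and the accuracy values are hypothetical.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag

def admit_tree(oob_accuracy, threshold):
    """A tree joins the forest only if its out-of-bag accuracy is higher than the threshold."""
    return oob_accuracy > threshold

rng = random.Random(0)
n = 10_000
in_bag, oob = bootstrap_split(n, rng)

print(len(set(in_bag)) / n, len(oob) / n)   # close to 0.632 and 0.368
print((1 - 1 / n) ** n)                     # probability that one sample is never drawn
print(admit_tree(0.93, 0.90), admit_tree(0.85, 0.90))
```

Since the out-of-bag samples are unseen by the tree that is being evaluated, they act as a per-tree test set without a separate hold-out split.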
Further, in step S303, for a discrete feature, partition the samples into subsets by each value of the feature, compute the information entropy of each subset, and then obtain the conditional entropy. For a continuous feature, first sort the attribute values; only where the class label changes and the difference between two consecutive values is greater than the threshold set for the attribute, take the median (midpoint) of the two consecutive values as a candidate cut point, which converts the continuous attribute into a categorical attribute for the computation.
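The candidate cut point rule for continuous features can be sketched in a few lines of Python. The function name `candidate_cut_points`, the parameter `min_gap` (standing for the per-attribute threshold) and the sample values are hypothetical.

```python
def candidate_cut_points(values, labels, min_gap):
    """Midpoints between consecutive sorted values where the class label changes
    and the two values differ by more than min_gap."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v0, y0), (v1, y1) in zip(pairs, pairs[1:]):
        if y0 != y1 and (v1 - v0) > min_gap:
            cuts.append((v0 + v1) / 2)
    return cuts

# Hypothetical initial time-to-live values and their operating system labels
ttl = [64, 60, 128, 126, 255, 62]
os_label = ["Linux", "Linux", "Windows", "Windows", "Cisco", "Linux"]
cuts = candidate_cut_points(ttl, os_label, min_gap=1)
print(cuts)
```

Restricting candidates to label boundaries keeps the number of split points N small, which also keeps the correction term log_2(N-1)/|D| from S304 small.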
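Further, in step S305, for feature A the information gain ratio is computed as
$$\mathrm{gain\_ratio}(D, A) = \frac{\mathrm{gain}(D, A)}{\mathrm{split\_info}(D, A)}$$
where the split information is
$$\mathrm{split\_info}(D, A) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
and p_i is the proportion of samples whose feature A takes the value a_i.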
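As a worked illustration of S302 to S305 for discrete features, the following Python sketch computes entropy, conditional entropy, information gain and gain ratio, then picks a splitting feature. All function and feature names are hypothetical. One deliberate deviation is labeled in the code: the comparison against the mean gain uses "greater than or equal" so that a lone candidate still qualifies, which is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    groups = defaultdict(list)
    for a, y in zip(feature_values, labels):
        groups[a].append(y)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())

def gain(feature_values, labels):
    return entropy(labels) - conditional_entropy(feature_values, labels)

def gain_ratio(feature_values, labels):
    # split_info is the entropy formula applied to the attribute's own values
    split_info = entropy(feature_values)
    return gain(feature_values, labels) / split_info if split_info > 0 else 0.0

def choose_split(features, labels, min_gain):
    """Keep features whose gain reaches the mean gain and exceeds min_gain,
    then return the one with the highest gain ratio (None if none qualifies).
    ASSUMPTION: '>=' against the mean, so a single candidate can still qualify."""
    gains = {name: gain(vals, labels) for name, vals in features.items()}
    mean_gain = sum(gains.values()) / len(gains)
    eligible = [n for n, g in gains.items() if g >= mean_gain and g > min_gain]
    if not eligible:
        return None
    return max(eligible, key=lambda n: gain_ratio(features[n], labels))

# Hypothetical candidate subset with two discrete features
labels = ["Linux", "Linux", "Windows", "Windows"]
features = {"T1.DF": ["Y", "Y", "N", "N"], "T1.R": ["Y", "Y", "Y", "Y"]}
print(gain(features["T1.DF"], labels), gain_ratio(features["T1.DF"], labels))
print(choose_split(features, labels, min_gain=0.01))
```

Filtering on gain before ranking by gain ratio avoids choosing a feature whose ratio is high only because its split information is very small.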
Specifically, in step S3 the first layer, the operating system category layer, is built as follows. First, m records are drawn from the generated simulated data set by stratified sampling on the class label to form a new training set Mset; test accuracy is obtained by out-of-bag estimation. According to the usage share of each operating system and the identification requirements, n major class labels {x_1, x_2, ..., x_n} are determined. The finest-grained labels in the original training set that belong to these n major classes are remapped to the coarse-grained class label x_i, and operating system labels that do not belong to any of the n major classes are mapped to the type Others. The mapped data set is fed to the random forest method to train the OS category classifier. An accuracy threshold k1 is set for the OS category classifier, and the classifier is tested on the out-of-bag data. When the accuracy reaches the threshold, construction of the first-layer classifier is complete; otherwise, the algorithm parameters are adjusted or the number of training samples is increased, and the classifier is retrained until the accuracy reaches the threshold;
The second layer, the major version identification layer, is built on the n major class labels produced by the first layer. The training set Mset is divided into n groups by major class, and the class labels are mapped to the form "class plus major version number". Each group of data is trained with the random forest method, producing n major version classifiers. The accuracy threshold of the major version classifiers is set to k2, and testing uses out-of-bag estimation; each group of test data evaluates the accuracy of its own major version classifier. A classifier is kept when its accuracy reaches the threshold; otherwise it is retrained, either by tuning parameters or by enriching the training samples from the fingerprint database, until it reaches the threshold. The n major version classifiers make up the second layer of the layered architecture;
The third layer is the detailed version identification layer. Let v be the number of major version labels produced by the second-layer mapping. The training set Mset is divided into v groups by second-layer major version label, and each group is trained with the random forest algorithm, producing v detailed version classifiers. The accuracy threshold of the detailed version classifiers is set to k3, and each classifier is evaluated by out-of-bag estimation. If the accuracy of a classifier is below the threshold, it is retrained by tuning parameters or by resampling to enrich the data until it meets the threshold. The accuracy thresholds should satisfy k1 > k2 > k3.
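The label remapping that drives the three layers can be sketched as below. This is an illustration only: the "Family Major.Minor" label format, the function names and the sample labels are assumptions of the sketch, since the text does not fix a label syntax.

```python
from collections import defaultdict

def to_layer_labels(fine_label, major_classes):
    """Split a fine-grained label into (layer 1, layer 2, layer 3) labels.
    ASSUMPTION: labels look like 'Family Major.Minor', e.g. 'Linux 3.10'."""
    family, _, version = fine_label.partition(" ")
    if family not in major_classes:
        return "Others", None, None
    major_version = version.split(".")[0]
    return family, f"{family} {major_version}", fine_label

def split_for_layer2(fine_labels, major_classes):
    """Group the layer 2 labels by layer 1 class: one training group per major class."""
    groups = defaultdict(list)
    for label in fine_labels:
        category, major, _ = to_layer_labels(label, major_classes)
        if major is not None:
            groups[category].append(major)
    return dict(groups)

def thresholds_valid(k1, k2, k3):
    """The accuracy thresholds of the three layers should satisfy k1 > k2 > k3."""
    return k1 > k2 > k3

# Hypothetical fine-grained labels and major classes
fine_labels = ["Linux 3.10", "Linux 4.15", "Windows 10.1809", "Solaris 11.4"]
major_classes = {"Linux", "Windows"}
layer2_groups = split_for_layer2(fine_labels, major_classes)
print(to_layer_labels("Solaris 11.4", major_classes))
print(layer2_groups)
print(thresholds_valid(0.95, 0.90, 0.85))
```

Each group in `layer2_groups` would train one major version classifier, so adding a new operating system family only touches the first layer and the new family's own group.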
Specifically, in step S4, probe sequence packets are sent to the destination host to be probed at the same scale as the third-party tool's probing, the probe response sequences are obtained, and fingerprints are extracted from them, converted into numerical vectors and fed into the layered model, which returns the prediction at the corresponding identification granularity. To extend the model, the labels and IP lists of special systems are obtained; probe packets are sent to those IPs, the fingerprint data of their responses are collected, mapped to vectors and stored in the training set. Once the newly added training data reach a preset threshold, the model is retrained so that the fingerprint model is extended to cover the special systems.
Specifically, in step S5, at the same scale as Nmap's probing, 6 TCP SYN probe packets are sent to the destination host to generate the four test response sequences SEQ, OPS, WIN and T1; 2 ICMP echo probe packets are sent to generate the IE sequence; 1 TCP SYN probe packet is sent to obtain the features that describe TCP explicit congestion notification; 6 TCP probe packets are sent to generate the six response sequences T2-T7; and 1 UDP probe packet is sent to a closed port to generate the U1 sequence. These 13 response sequences are mapped to numerical vectors as in step S1 and, after the corresponding coarsening, fed into the layered model, which returns the prediction at the corresponding identification granularity. During prediction, the test data first enter the top-level classifier, which identifies the major class of the operating system; the data then enter the second-layer classifier for that class, which identifies the major version number within the class; finally they enter the third-layer detailed version classifier for that major version, which yields the final detailed operating system information.
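The cascade through the three layers can be sketched as a simple dispatch: the output of one layer selects the classifier used in the next. In the Python sketch below the classifiers are stand-in functions rather than trained random forests, and every name and value is hypothetical.

```python
def predict_hierarchical(vector, layer1, layer2, layer3):
    """Route one fingerprint vector through the three layers.
    layer1: callable returning a major class.
    layer2: dict mapping a major class to a major version classifier.
    layer3: dict mapping a major version to a detailed version classifier."""
    category = layer1(vector)
    if category not in layer2:            # e.g. 'Others' has no finer classifier
        return category, None, None
    major = layer2[category](vector)
    if major not in layer3:
        return category, major, None
    return category, major, layer3[major](vector)

# Stand-in classifiers keyed on a single hypothetical feature (initial TTL bin)
layer1 = lambda v: "Linux" if v[0] == 0 else "Others"
layer2 = {"Linux": lambda v: "Linux 4" if v[1] > 5 else "Linux 3"}
layer3 = {"Linux 4": lambda v: "Linux 4.15", "Linux 3": lambda v: "Linux 3.10"}

print(predict_hierarchical([0, 7], layer1, layer2, layer3))
print(predict_hierarchical([3, 7], layer1, layer2, layer3))
```

Returning `None` for the finer layers when no classifier exists lets the caller report the coarsest granularity that could be identified.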
Compared with the prior art, the invention has at least the following advantages:
Traditional static fingerprint matching is weak at identifying unknown operating system fingerprints and suffers from high false-positive and false-negative rates. To address this, the invention proposes an operating system identification method based on the random forest algorithm, which improves identification accuracy and can also identify unknown fingerprints effectively. Building on the rule sets of a third-party fingerprint database, the random forest algorithm further mines the necessary and sufficient conditions of operating system fingerprints, ensuring the reliability of identification in a higher dimension. The layered training architecture identifies operating systems at different granularities; each layer is trained independently, so the identification of individual operating systems is easy to adjust locally, avoiding the heavy cost of repeated full retraining. For imbalanced training data, the layered architecture allows data to be merged or split sensibly to balance the data set. The layered training architecture is also easy to extend: when a class is added to the model, only some of the classifiers need retraining rather than the whole model, which approximates incremental training.
Further, building a general operating system classifier with strong identification ability and wide coverage requires sample data from all kinds of systems, but actually setting up every operating system environment is infeasible. Decomposing a third-party fingerprint database sensibly and generating simulated data after determining the attribute features effectively solves this problem.
Further, some feature attributes in the third-party fingerprint database are value ranges, and random sampling can hardly draw every value in such a range. Binning the value-range dimensions therefore reduces the information loss caused by random sampling.
Further, the classification model is built with the random forest algorithm, which combines multiple decision trees under a certain strategy to turn weak fingerprint models into a strong one, markedly improving the accuracy and stability of operating system fingerprint identification. In addition, the bootstrap sampling of random forest supports verifying algorithm performance by out-of-bag estimation, which reduces computational cost compared with methods such as k-fold cross-validation.
Further, each time a splitting attribute is chosen, an attribute subset is first selected at random from the attribute set of the node and the optimal splitting attribute is then chosen from that subset; this attribute perturbation reduces model variance. From the conditional entropy of each attribute computed in S303, the information gain ratio of each attribute can be obtained as the selection criterion for the optimal splitting attribute; the conditional entropy also provides the basis for choosing the optimal split point of a continuous attribute.
Further, when each node chooses the optimal splitting attribute, the information gain ratio is computed and the attribute with the largest ratio is chosen. Using the information gain ratio as the splitting criterion effectively reduces the adverse effect of the information gain criterion's preference for attributes with many possible values.
Further, the three-layer training architecture identifies operating systems at three granularities: major class, major version and detailed version. The three layers are trained independently, so the identification of individual operating systems is easy to adjust locally, avoiding the heavy cost of repeated full retraining. For imbalanced training data, the layered architecture allows data to be merged or split sensibly to balance the data set. The layered training architecture is easy to extend: when a system class is added to the model, only some of the classifiers need retraining rather than the whole model, which approximates incremental training.
Further, to add new fingerprint data and fingerprint classes, step S4 can send probe packets to target hosts whose system type is known, collect the fingerprint data, map it to vectors, and add it to the training set. Once a certain amount has accumulated, the model is retrained, which provides long-term maintenance and extension of the fingerprint model.
Further, at prediction time carefully crafted packets are sent to the target host to obtain the response sequences, which are vectorized and fed to the three-layer random forest model. Combining the classification results of the base learners by equal-weight voting effectively addresses the poor generalization that a single learner may suffer from a wrong choice, reduces model variance and bias, and thereby achieves effective operating system identification.
In conclusion the present invention judges that operating system unknown fingerprint recognition capability is weak for traditional static fingerprint matching,
The high problem of rate of false alarm, rate of failing to report proposes a kind of operating system recognition methods based on random forests algorithm, can effectively identify not
Know fingerprint, improves the accuracy rate of identification.Using order training method framework, varigrained recognition capability, each level difference point are provided
Class device parallel training, it is independent of one another with first-level class device, it is easy to local directed complete set, increases the scalability and maintainability of model.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Brief description of the drawings
Fig. 1 is a structure diagram of the hierarchical random forest;
Fig. 2 is a diagram of the random forest training process;
Fig. 3 is a chart of the accuracy on real traffic.
Detailed description of the embodiments
The operating system identification method based on random forest of the present invention builds the random forest on the C4.5 decision tree algorithm and comprises the following steps:
S1, data preparation and feature extraction: using the Monte Carlo method, the third-party fingerprint database is analyzed to determine the feature attributes used for training, the value range of each attribute, and the set of fingerprints most likely to occur; a sufficiently large number of random samples are drawn from the fingerprint database and combined into a training set and a test set, and the data of both sets are vectorized;
The data are built from the Nmap fingerprint database, following Nmap's OS detection principle:
Nmap sends 16 packets, each producing a corresponding response sequence, and each response sequence corresponds to a number of flags. The system category is determined by comparing how well the flags of the probed dynamic fingerprint match those in the static fingerprint database.
Sending 6 TCP SYN probe packets produces the four test response sequences SEQ, OPS, WIN, and T1. SEQ holds the results of sequence analysis over the probe packets, including the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the TCP ISN counter rate (ISR), the IP ID sequence generation algorithm (TI, CI, II), the shared IP ID sequence Boolean (SS), and the TCP timestamp option algorithm (TS). OPS holds the TCP options received in response to each probe packet. WIN holds the TCP initial window size received in response to each probe packet. T1 holds the test values for packet 1, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP RST data checksum (RD), and TCP miscellaneous quirks (Q).
Sending 2 ICMP echo probe packets produces the IE sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), and the ICMP response code (CD).
Sending 1 TCP SYN probe packet captures the characteristics of the target's TCP explicit congestion notification (ECN) support and produces the ECN response sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP options (O), and TCP miscellaneous quirks (Q).
Sending 6 TCP probe packets produces the six response sequences T2-T7. Each sequence includes responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP options (O), the TCP RST data checksum (RD), and TCP miscellaneous quirks (Q).
Sending 1 UDP probe packet to a closed port produces the sequence U1, including responsiveness (R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the IP total length (IPL), the unused port unreachable field nonzero (UN), the returned probe IP total length (RIPL), the returned probe IP ID value (RID), the integrity of the returned probe IP checksum value (RIPCK), the integrity of the returned probe UDP checksum (RUCK), and the integrity of the returned UDP data (RUD).
The 16 packets sent above yield 119 sequence flags in total. After analyzing the flags, this method uses 117 of them (all except SEQ.SP and SEQ.ISR) as the feature attributes of the training data. Because it is difficult to obtain real traffic from enough operating systems to meet the training needs, a large amount of simulated data is generated by laboratory data simulation for training the algorithm. The Nmap fingerprint database consists of many rule sets; each operating system corresponds to one or more rule sets, and each rule set consists of the 13 sequence groups SEQ, OPS, WIN, T1, IE, U1, ECN, and T2-T7. The value of each flag in a sequence group is either a range or a list of alternative values.
Fingerprint data is extracted from each rule set by first expanding each response sequence into the full set of fingerprint sequences it allows. If the full set contains more than 500 sequences, 4 samples are drawn from it at random; if it contains more than 10 and fewer than 100, 2 samples are drawn at random; if it contains fewer than 10, 1 sample is drawn at random. The Cartesian product of the samples drawn from all response sequences in the rule set forms the simulated data set for that rule set. Because some flags hold nominal data, the nominal values are mapped to numbers by natural-number encoding so that the algorithm can process them, and the data are then vectorized.
In this way, the simulated data sets generated from all rule sets in the fingerprint database make up the data set required for training.
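The sampling, Cartesian-product, and encoding steps of S1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names, the dictionary layout of a rule set, and the fallback of 2 samples for set sizes the text does not cover (exactly 10, and 100 through 500) are all assumptions.

```python
import itertools
import random


def sample_count(n_values):
    """How many samples to draw from one sequence group's full value set.

    The thresholds follow the text. Sizes the text leaves open (exactly 10,
    and 100 through 500) fall back to 2 here, which is an assumption.
    """
    if n_values > 500:
        return 4
    if n_values < 10:
        return 1
    return 2


def simulate_rule_set(rule_set, rng):
    """Build simulated records for one rule set.

    rule_set maps a sequence-group name (e.g. "T1") to the list of
    fingerprint values that group allows (hypothetical layout).
    """
    names = sorted(rule_set)
    picked = []
    for name in names:
        values = rule_set[name]
        k = min(sample_count(len(values)), len(values))
        picked.append(rng.sample(values, k))
    # The Cartesian product of the per-group samples gives the records.
    return [dict(zip(names, combo)) for combo in itertools.product(*picked)]


def natural_number_encode(records, column):
    """Map the nominal values of one column to 0, 1, 2, ... in sorted order."""
    codes = {v: i for i, v in enumerate(sorted({r[column] for r in records}))}
    return [codes[r[column]] for r in records], codes
```

Passing a seeded `random.Random` instance as `rng` keeps the simulated data reproducible between runs.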
S2, data preprocessing: some feature attributes in the third-party fingerprint database take a range of values, and random sampling is unlikely to draw every value in the range, so binning is used to passivate (coarsen) these range-valued dimensions and reduce the information loss caused by random sampling;
S201, the features extracted from the third-party fingerprint database are analyzed together with the way the simulated data are generated by random sampling, to determine the attributes that lose information because the sampling is incomplete.
S202, based on the distribution of these attributes in the third-party fingerprint database, a reasonable number of bins and a binning method (usually equal-width or equal-frequency binning) are chosen; the attributes are binned in the original data set and a corresponding passivation scale file is generated.
S203, during subsequent training and prediction, both the original training set and the test set must be binned according to the same passivation scale.
Based on analysis of the Nmap fingerprint database, the 8 attributes ECN.T and T1.T to T7.T are binned with an interval width of 10.
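Equal-width binning with a width of 10 might look like the sketch below. It assumes the time-to-live values have already been parsed into integers and that each value is replaced by the lower edge of its bin; the function names and the record layout are hypothetical.

```python
def passivate(value, width=10):
    """Return the lower edge of the equal-width bin that contains value."""
    return (value // width) * width


def passivate_record(record, columns, width=10):
    """Apply the same bin scale to the chosen columns of one fingerprint record."""
    return {k: passivate(v, width) if k in columns else v for k, v in record.items()}


# The eight TTL attributes named in the text: ECN.T and T1.T to T7.T.
TTL_COLUMNS = {"ECN.T"} | {f"T{i}.T" for i in range(1, 8)}
```

Using one shared `width` for training and prediction plays the role of the unified passivation scale required by S203.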
S3, hierarchical training stage: based on the configured layered architecture, random forest classifiers are trained in parallel for the operating system category identification layer, the operating system major version identification layer, and the operating system detailed version identification layer;
The training algorithm builds on an optimized C4.5 decision tree and uses the random forest algorithm to improve the model's prediction accuracy and generalization ability, as shown in Fig. 2.
S301, if the training set has n samples, n samples are drawn from the current training set with replacement (Bootstrap sampling). When the sample size is sufficiently large, about 63.2% of the samples in the original training set are drawn into the new sample set, which serves as the independent training set of the base learner; the 36.8% of samples that are not drawn can serve as the test set of that base learner for out-of-bag estimation.
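The Bootstrap split into in-bag and out-of-bag samples can be illustrated with a short sketch. The function name and the choice of n = 10000 are assumptions made for the example.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)
in_bag, oob = bootstrap_split(10000, rng)
# Fraction of distinct original samples that ended up in the bag.
unique_fraction = len(set(in_bag)) / 10000
```

For a large n, `unique_fraction` should land close to 0.632, with the out-of-bag indices making up the rest.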
S302, the information entropy of the current sample set is calculated as $Ent(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$, where D is the training sample set, n is the number of classes in the current sample set, and $p_i$ is the proportion of samples of class i in the total sample.
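As a small illustration of S302, the Shannon entropy of a list of class labels can be computed as below. The function name is an assumption; the formula is the standard one used by C4.5.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a sample set."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())
```

A set with two equally frequent classes has entropy 1 bit, and a set with a single class has entropy 0.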
S303, a candidate attribute subset containing k attributes is selected at random from the attribute set of the current sample set, and the conditional entropy of each feature in the candidate subset is calculated. For a discrete feature, the sample set is partitioned into subsets by the feature's values, the information entropy of each subset is calculated, and the conditional entropy is then derived. For a continuous feature, the values are first sorted; a candidate cut point is placed at the midpoint of two consecutive values only where the class label changes and the difference between the two values exceeds the threshold set for that attribute, which converts the continuous attribute into a categorical one.
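The candidate cut-point rule for continuous features can be sketched as follows, assuming numeric values and a per-attribute gap threshold passed in as `min_gap` (a hypothetical parameter name).

```python
def candidate_cut_points(values, labels, min_gap=0.0):
    """Midpoints between consecutive sorted values where the class label
    changes and the two values differ by more than min_gap."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and (v2 - v1) > min_gap:
            cuts.append((v1 + v2) / 2)
    return cuts
```

Restricting cut points to label changes keeps the number of candidate splits small, which is what makes the later gain computation affordable.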
Let D be the current sample set and A a candidate split attribute of that set with n possible values $(a_1, a_2, \ldots, a_n)$, and let $p_i$ be the proportion of samples in the set whose value of A is $a_i$. The conditional entropy with respect to feature A is then $Ent(D \mid A) = \sum_{i=1}^{n} p_i \, Ent(D_i)$, where $D_i$ is the subset of D in which A takes the value $a_i$.
S304, the information gain of each feature is calculated: the empirical entropy of the data set minus the conditional entropy of the feature gives the feature's information gain. Because using the information gain ratio as the split criterion tends to favor continuous-valued features, the information gain of a continuous feature is corrected by subtracting $\log_2(N-1)/|D|$, where N is the number of possible split points and |D| is the size of the data set.
For feature A, the information gain is $gain(D, A) = Ent(D) - Ent(D \mid A)$.
For a continuous attribute, the split point with the largest corrected information gain is selected as the attribute's optimal split point.
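A rough sketch of the information gain and its correction for continuous features is given below. The function names are assumptions, and skipping the penalty when N is 1 or less (where the logarithm is undefined) is a choice made here, not something the text specifies.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def conditional_entropy(values, labels):
    """Weighted entropy of the label subsets induced by each feature value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def info_gain(values, labels):
    """gain(D, A) = Ent(D) - Ent(D | A)."""
    return entropy(labels) - conditional_entropy(values, labels)


def corrected_continuous_gain(values, labels, cut, n_split_points):
    """Gain of the binary split (value <= cut) minus log2(N - 1) / |D|."""
    sides = [v <= cut for v in values]
    # Assumption: no penalty when N <= 1, since log2(0) is undefined.
    penalty = log2(n_split_points - 1) / len(labels) if n_split_points > 1 else 0.0
    return info_gain(sides, labels) - penalty
```

The best split point of a continuous attribute would then be the candidate cut with the largest corrected gain.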
S305, among the features in the current candidate attribute subset, those whose information gain is higher than the mean information gain of all features in the subset and also higher than the configured information gain precision threshold are selected; their information gain ratios are calculated, and the feature with the highest information gain ratio is chosen as the split feature.
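For feature A, the information gain ratio is $gain\_ratio(D, A) = gain(D, A) / split\_info(D, A)$, where the split information is $split\_info(D, A) = -\sum_{i=1}^{n} p_i \log_2 p_i$ and $p_i$ is the proportion of samples in D whose value of A is $a_i$.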
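The selection rule of S305 can be sketched for discrete features as follows. The names are hypothetical, and one detail deviates from the text on purpose: a feature is eligible when its gain is at least the subset mean (rather than strictly higher), so that a lone candidate is not excluded.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(items):
    """Shannon entropy (base 2) of a list of hashable items."""
    total = len(items)
    return -sum(c / total * log2(c / total) for c in Counter(items).values())


def info_gain(values, labels):
    """Entropy of the labels minus their conditional entropy given values."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - cond


def choose_split(columns, labels, min_gain=0.0):
    """Pick the split feature from a candidate subset.

    columns maps feature name -> list of discrete values. Eligibility uses
    "at least the mean" (an assumption, see the lead-in) and gain > min_gain.
    """
    gains = {name: info_gain(vals, labels) for name, vals in columns.items()}
    mean_gain = sum(gains.values()) / len(gains)
    best_name, best_ratio = None, float("-inf")
    for name, gain in gains.items():
        if gain < mean_gain or gain <= min_gain:
            continue
        split_info = entropy(columns[name])  # entropy of the feature's own values
        if split_info == 0:
            continue  # a single-valued feature cannot split the node
        ratio = gain / split_info
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name, best_ratio
```

In a toy case where a two-valued feature and a unique-per-row identifier both separate the classes perfectly, the identifier has twice the split information and therefore half the gain ratio, so the two-valued feature wins.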
S306, the data set is partitioned according to the values of the split feature, and the above operations are applied recursively to each branch until one of the following three conditions stops the splitting: all training samples at the current node belong to the same class, in which case the node is labeled with that class and becomes a leaf node; no attribute remains that can be used to split the samples at the current node; or the number of training samples at the current node falls below the minimum sample count threshold.
A single decision tree is built in this way, and a random forest can be regarded as a collection of such trees. Decision trees are built repeatedly in the same manner; the test accuracy of each tree on its own out-of-bag estimate is compared with the configured accuracy threshold, and the tree is added to the random forest only if it exceeds the threshold. At prediction time, every tree in the forest gives a classification result, and equal-weight voting selects the class with the most votes as the final prediction.
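The admission test and the equal-weight vote can be sketched as below. Trees are modeled as plain callables that return a class label, which is an assumption made to keep the example self-contained; ties are broken in favor of the label that was voted first.

```python
from collections import Counter


def admit_trees(trees_with_oob_accuracy, accuracy_threshold):
    """Keep only trees whose out-of-bag accuracy exceeds the threshold."""
    return [tree for tree, acc in trees_with_oob_accuracy if acc > accuracy_threshold]


def majority_vote(trees, sample):
    """Each tree casts one vote; return the label with the most votes."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]
```

For example, with three admitted trees voting Linux, Windows, and Linux, the forest predicts Linux.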
The random forest algorithm greatly improves generalization while maintaining accuracy. However, training directly with the detailed operating system version as the class label requires a large amount of sample data because there are so many classes, and improving model accuracy by tuning parameters or by enriching the sample data usually requires full retraining, which is very costly. Operating systems are inherently hierarchical, with a category, a major version, and a detailed version. Exploiting this label hierarchy, a hierarchical random forest architecture is proposed that effectively addresses these problems and provides recognition at different granularities for the different levels, as shown in Fig. 1.
To build the first layer, the operating system category layer, m records are first drawn from the generated simulated data set by stratified sampling on the class label to form a new training set Mset; test accuracy can be obtained by out-of-bag estimation. Based on the usage rate of each operating system and the identification requirements, n category labels {x1, x2, ..., xn} are chosen. The finest-grained labels in the original training set that belong to these n categories are remapped to the corresponding coarse-grained category label xi, and the labels of operating systems outside the n categories are mapped to the type Others. The remapped data set is fed to the random forest algorithm described above to train the OS category classifier. Because remapping greatly reduces the number of label types, a relatively small sample size is enough to train an operating system category classifier with high accuracy. An accuracy threshold k1 is set for the category classifier and its accuracy is tested with the out-of-bag data. When the accuracy reaches the threshold, construction of the first-layer classifier is complete; otherwise, the algorithm parameters are adjusted or the training sample size is increased and the classifier is retrained until the accuracy reaches the threshold.
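The first-layer label remapping can be illustrated with a small sketch. It assumes, purely for the example, that detailed labels are strings that begin with the category name; the function name is hypothetical.

```python
def remap_to_category(detailed_label, categories):
    """Map a fine-grained OS label to its coarse category, or to "Others"."""
    for category in categories:
        if detailed_label.startswith(category):
            return category
    return "Others"
```

With categories Windows and Linux, a label such as "Windows 10 1809" maps to Windows, while "FreeBSD 11.2" maps to Others.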
To build the second layer, the major version identification layer, the training set Mset is divided into n groups by category on the basis of the n category labels of the first layer, and the class labels in each group are mapped to the form category plus major version number. Each group is trained with the random forest algorithm, producing n major version classifiers. The accuracy threshold of the major version classifiers is set to k2 and out-of-bag estimation is used for testing; each group's test data evaluates the accuracy of its own major version classifier. A classifier whose accuracy reaches the threshold is retained; otherwise it is retrained, after tuning parameters or drawing additional training samples from the fingerprint database, until it reaches the threshold. These n major version classifiers make up the second layer, the major version identification layer, of the layered architecture.
The third layer is the detailed version identification layer, which can be built flexibly according to the identification granularity required for each operating system. If the second-layer mapping produces v major version labels, the training set Mset is divided into v groups by those labels, and each group is trained with the random forest algorithm, producing v detailed version classifiers. The accuracy threshold of the detailed version classifiers is set to k3, and each classifier is evaluated by out-of-bag estimation. If the accuracy of a classifier is below the threshold, it is retrained, after tuning parameters or resampling to enrich its data, until it meets the threshold.
The three-layer architecture partitions the training data, so the classifiers can be trained in parallel, reducing time overhead. Classifiers at the same level are independent of one another, which makes them easy to manage and extend. At prediction time, the test data first enters the top-level classifier, which identifies the operating system category; the data then enters the second-layer classifier for that category, which identifies the major version number within the category; finally it enters the third-layer detailed version classifier for that major version, which yields the final detailed operating system information.
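The three-stage prediction cascade can be sketched as follows. Classifiers are modeled as callables and the per-layer lookups as dictionaries; these structures, the function name, and the early return when a lower-layer classifier is missing are assumptions for illustration.

```python
def predict_hierarchical(sample, category_clf, major_clfs, detail_clfs):
    """Route a sample through category, major version, and detailed version layers.

    major_clfs maps category -> classifier; detail_clfs maps major version
    label -> classifier. A missing entry stops the cascade at that layer.
    """
    category = category_clf(sample)
    result = {"category": category}
    major_clf = major_clfs.get(category)
    if major_clf is None:
        return result
    major = major_clf(sample)
    result["major_version"] = major
    detail_clf = detail_clfs.get(major)
    if detail_clf is not None:
        result["detailed_version"] = detail_clf(sample)
    return result
```

Stopping early when no lower-layer classifier exists reflects the idea that the third layer is built only where finer granularity is required.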
Because the layers are tightly coupled at prediction time, the reliability of an upper layer's result directly affects the prediction of the layer below it. The higher the layer, the more reliable its classifier must be, so the accuracy thresholds should satisfy k1 > k2 > k3.
The layered architecture partitions the training data by granularity. When a classifier needs adjusting, only that classifier is retrained on its share of the training data, without retraining on all training samples, so the approach can be regarded as an approximate form of incremental training.
This method supports both a sample-incremental mode and a class-incremental mode.
In sample-incremental mode, each incremental sample is sent to the corresponding classifiers of the three layers according to its operating system category and major version number. If a classifier identifies the sample correctly, nothing is done; otherwise the sample is added to the training set of the classifier that misidentified it. A tolerance threshold p on the number of samples is set for each classifier; once the number of samples newly added to a classifier's training set exceeds p, that classifier is retrained on its original training set plus the new samples.
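The sample-incremental bookkeeping for a single classifier can be sketched as below. The class name, its method, and the callable classifier are hypothetical; the retraining itself is left out.

```python
class IncrementalBuffer:
    """Collects misclassified samples for one classifier until retraining is due."""

    def __init__(self, threshold_p):
        self.threshold_p = threshold_p
        self.pending = []

    def observe(self, classifier, sample, true_label):
        """Record the sample if it is misclassified.

        Returns True once more than threshold_p samples are pending, which
        signals that the classifier should be retrained.
        """
        if classifier(sample) == true_label:
            return False
        self.pending.append((sample, true_label))
        return len(self.pending) > self.threshold_p
```

After a retrain, the caller would merge `pending` into the classifier's training set and clear the buffer.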
In class-incremental mode, if the category of an incremental sample is not correctly identified by the top-level classifier, the sample is added to the top-level training set; once the number of new samples exceeds the threshold p, the top-level classifier is retrained on its original training set plus the new samples. If the new class has no corresponding classifier in the second or third layer, a new classifier must be added; if that classifier needs q samples, it is trained once the number of samples of the new class exceeds q, using those samples as its training set.
S4, local retraining of the layered architecture, with parameter tuning and data processing to improve model accuracy;
Operating system identification is performed on real traffic. Because the attribute features and data of the training set come from analysis and simulation of the third-party fingerprint database, the real traffic used for prediction should be collected in the same way the fingerprints in that database were formed. Following the third party's probing procedure and using the same scale, probe sequence packets are sent to the target host to be identified and the probe response sequences are obtained; fingerprints are extracted from them, converted to numeric vectors, and fed to the hierarchical model, which returns the corresponding prediction at its identification granularity.
For systems that are specific to a private environment and do not appear in the third-party fingerprint database, the present invention supports fingerprint training in the private environment. The labels and IP lists of the special systems are obtained, probe packets are sent to those IPs to collect the fingerprint data of their responses, and the fingerprint data are mapped to vectors and stored in the training set. Once the newly added training data reach the configured threshold, the model is retrained, extending the algorithm's fingerprint model to the special systems.
S5, identification prediction is performed on real probe traffic.
Operating system identification is performed on real traffic. Because the attribute features and data of the training set come from analysis and simulation of the Nmap fingerprint database, the real traffic used for prediction should be collected in the same way Nmap fingerprints are formed. Following Nmap's probing procedure and using the same scale, the following probes are sent to the target host to be identified: 6 TCP SYN probe packets to produce the four test response sequences SEQ, OPS, WIN, and T1; 2 ICMP echo probe packets to produce the IE sequence; 1 TCP SYN probe packet to capture the characteristics of the target's TCP explicit congestion notification (ECN) support; 6 TCP probe packets to produce the six response sequences T2-T7; and 1 UDP probe packet to a closed port to produce the sequence U1. These 13 response sequences are mapped to a numeric vector as in step S1, passivated accordingly, and fed to the hierarchical model, which returns the corresponding prediction at its identification granularity.
To make the purpose, technical solution, and advantages of the embodiments of the present invention clearer, the technical solution in the embodiments is described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and shown in the drawings here can be arranged and designed in a variety of different configurations. Therefore, the detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Effect verification
(1) Out-of-bag estimation on simulated data
The random forest algorithm samples with replacement in the Bootstrap manner. When the sample size is large, about 63.2% of the samples are drawn into the training set of a base learner, and the remaining 36.8% can be used as a validation set for out-of-bag (OOB) estimation. This approach costs less than k-fold cross-validation and can still verify the generalization performance of the model effectively.
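The 63.2% and 36.8% figures follow from the probability that a given sample is never selected in n independent draws with replacement from n samples. As a brief worked derivation:

```latex
P(\text{sample never drawn}) = \left(1 - \frac{1}{n}\right)^{n}
\;\xrightarrow{\;n \to \infty\;}\; \frac{1}{e} \approx 0.368,
\qquad
P(\text{sample drawn at least once}) \approx 1 - 0.368 = 0.632 .
```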
The out-of-bag evaluation metrics are as follows:
Table 1. Macro metrics of the operating system category layer
Table 2. Detailed metrics of the operating system category layer
Table 3. Macro metrics of Windows major versions
Table 4. Detailed metrics of Windows major versions
Table 5. Macro metrics of Linux major versions
Table 6. Detailed metrics of Linux major versions
As shown in Tables 1 to 6, the out-of-bag data were used to verify the operating system category layer and the Windows major version identification layer, demonstrating the validity and feasibility of this method. The recall for Windows 8 is lower; a possible reason is that Windows 8 is a transitional system whose fingerprints do not differ clearly from those of other Windows systems, which makes it hard to identify. Overall, however, the method is feasible and effective.
(2) Verification on real-environment traffic
Windows hosts and Linux servers on a network segment of an office network were probed following the Nmap probing procedure, and the collected fingerprints were predicted by the algorithm model. Probe data collected on April 15 and April 16 from Windows and Linux hosts on the same segment are used here; the verification results are as follows:
Table 7. Live host list 1
Table 8. Live host list 2
Referring to Table 7, Table 8, and Fig. 3: because the real system environment is limited, only the Windows and Linux hosts actually present on the office network were identified. The recognition accuracy is high, and the method is feasible and effective.
In conclusion the present invention, on the basis of third party's fingerprint base, constructing analog data are simultaneously mapped as algorithm study
Vector, in order to construct the resource dactylotype of high reliablity, easy expansion management, using random forests algorithm by a plurality of weak fingerprint
It is combined into strong fingerprint, with lifting operating system recognition accuracy and identifiable categorical measure, and designs order training method framework, with
Realize the long expansion and maintenance to the operation system fingerprint library constructed based on random forest, multi-classification algorithm is compared to two classification
Algorithm can effectively avoid in operating system identification problem because training expense due to bring using bottleneck.
The above merely illustrates the technical idea of the present invention and does not limit its scope of protection. Any change made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the claims of the present invention.
Claims (10)
1. An operating system identification method based on random forest, characterized by comprising the following steps:
S1, using the Monte Carlo method, analyzing a third-party fingerprint database to determine the feature attributes used for training, the value range of each attribute, and the set of fingerprints most likely to occur, drawing random samples from the fingerprint database and combining them into a training set and a test set, and vectorizing the data of the training set and the test set;
S2, using binning to passivate the dimensions of those feature attributes in the third-party fingerprint database that take a range of values;
S3, based on the configured layered architecture, training random forest classifiers separately for an operating system category identification layer, an operating system major version identification layer, and an operating system detailed version identification layer; building multiple decision trees, comparing the test accuracy of each tree on its own out-of-bag estimate with a configured accuracy threshold, and adding the tree to the random forest if it exceeds the threshold;
S4, local retraining of the layered architecture, with parameter tuning and data processing to improve model accuracy;
S5, performing identification prediction on real probe traffic, wherein every tree in the random forest gives a classification result and equal-weight voting selects the class with the most votes as the final prediction.
2. The operating system identification method based on random forest according to claim 1, characterized in that in step S1 the data are built from the Nmap fingerprint database: Nmap sends 16 packets, each producing a corresponding response sequence, and each response sequence corresponds to different flags; the 16 packets yield 119 sequence flags in total, and after the two flags SEQ.SP and SEQ.ISR are removed, the remaining 117 flags are used as the feature attributes of the training data; each operating system corresponds to one or more rule sets, each rule set consists of the sequence groups SEQ, OPS, WIN, T1, IE, U1, ECN, and T2-T7, and the value of each flag in a sequence group is either a range or a list of alternative values; fingerprint data is extracted from each rule set by first obtaining the full set of fingerprint sequences; if the full set contains more than 500 sequences, 4 samples are drawn from it at random; if it contains more than 10 and fewer than 100, 2 samples are drawn at random; if it contains fewer than 10, 1 sample is drawn at random; the Cartesian product of the samples drawn from all response sequences in the rule set forms the simulated data set for that rule set, and nominal data are mapped to numbers by natural-number encoding for vectorization.
3. The operating system recognition method based on random forest according to claim 2, characterized in that sending 6 TCP
SYN probe packets produces four test response sequences, SEQ, OPS, WIN and T1. SEQ is the sequence analysis result of the
probe packets, including the TCP ISN sequence predictability index (SP), the TCP ISN greatest common divisor (GCD), the
TCP ISN counter rate (ISR), the IP ID sequence generation algorithm (TI, CI, II), the shared IP ID sequence Boolean (SS)
and the TCP timestamp option algorithm (TS); OPS is the TCP options received for each probe packet; WIN is the TCP
initial window size received for each probe packet; T1 contains the test values for packet 1, including responsiveness
(R), the IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP
sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP RST data checksum (RD) and TCP
miscellaneous quirks (Q);
Sending 2 ICMP echo probe packets produces the IE sequence, including responsiveness (R), the don't-fragment bit (DFI),
the IP initial time-to-live (T), the IP initial time-to-live guess (TG) and the ICMP response code (CD);
Sending 1 TCP SYN probe packet obtains the features that describe TCP explicit congestion notification and produces the
ECN response sequence, including responsiveness (R), the don't-fragment bit (DFI), the IP initial time-to-live (T), the
IP initial time-to-live guess (TG), the TCP initial window size (W), the TCP options (O) and TCP miscellaneous quirks (Q);
Sending 6 TCP probe packets produces the six response sequences T2-T7, each of which includes responsiveness (R), the
IP don't-fragment bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the TCP initial
window size (W), the TCP sequence number (S), the TCP acknowledgment number (A), the TCP flags (F), the TCP options (O),
the TCP RST data checksum (RD) and TCP miscellaneous quirks (Q);
Sending 1 UDP probe packet to a closed port produces sequence U1, including responsiveness (R), the IP don't-fragment
bit (DF), the IP initial time-to-live (T), the IP initial time-to-live guess (TG), the IP total length (IPL), the
non-zero unused field of the port unreachable message (UN), the returned probe IP total length (RIPL), the returned
probe IP ID value (RID), the integrity of the returned probe IP checksum value (RIPCK), the integrity of the returned
probe UDP checksum (RUCK) and the integrity of the returned UDP data (RUD).
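To make the structure of these response sequences concrete, the sketch below parses one test line into a name and a field dictionary. It assumes the `NAME(KEY=VALUE%KEY=VALUE...)` text layout used in Nmap-style fingerprint databases; the claim itself does not specify a serialization, and `parse_test_line` and `EXPECTED_SEQUENCES` are hypothetical names introduced here.

```python
# The 13 response sequences named in the claim.
EXPECTED_SEQUENCES = [
    "SEQ", "OPS", "WIN", "ECN",
    "T1", "T2", "T3", "T4", "T5", "T6", "T7",
    "U1", "IE",
]


def parse_test_line(line):
    """Split 'T1(R=Y%DF=Y)' into ('T1', {'R': 'Y', 'DF': 'Y'})."""
    name, _, rest = line.strip().partition("(")
    body = rest[:-1] if rest.endswith(")") else rest
    fields = {}
    for item in body.split("%"):
        if item:
            key, _, value = item.partition("=")
            fields[key] = value  # an empty value (e.g. 'Q=') is kept as ''
    return name, fields


# Illustrative line only; the field values are made up for the example.
name, fields = parse_test_line("T1(R=Y%DF=Y%T=40%TG=40%S=O%A=S+%F=AS%RD=0%Q=)")
```

Each parsed dictionary can then be flattened into one nominal feature per `sequence.field` pair before encoding.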
4. The operating system recognition method based on random forest according to claim 1, characterized in that step S2 is
specifically:
S201, analyze the features extracted from the third-party fingerprint database and, taking into account the
random-sampling generation of the simulated data, determine the attributes that lose information because the data
extraction is incomplete;
S202, according to the distribution of those attributes in the third-party fingerprint database, determine a reasonable
number of bins and a binning (passivation) mode, apply binning passivation to those attributes in the original data
set, and generate the corresponding passivation scale file;
S203, during subsequent training and prediction, apply binning passivation to the original training set and test set
according to the same passivation scale.
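Steps S202 and S203 amount to deriving bin edges once, persisting them, and reusing the same edges at training and prediction time. The sketch below assumes equal-width bins and a JSON scale file; the claim leaves both the binning mode and the file format open, so these choices and the function names are illustrative only.

```python
import json
from bisect import bisect_right


def make_scale(values, n_bins):
    """Interior cut points of n_bins equal-width bins for one attribute."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + width * i for i in range(1, n_bins)]


def apply_scale(value, edges):
    """Index of the bin that `value` falls into under the shared scale."""
    return bisect_right(edges, value)


# Build the scale from reference data, then serialize it as the "scale file".
scale = {"SP": make_scale([0, 100], 4)}
scale_file_text = json.dumps(scale)

# Later, training and prediction both load and apply the same scale.
loaded = json.loads(scale_file_text)
binned = [apply_scale(v, loaded["SP"]) for v in (10, 25, 260)]
```

Values outside the range seen when the scale was built fall into the first or last bin, which keeps unseen test values on the same scale as the training data.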
5. The operating system recognition method based on random forest according to claim 1, characterized in that step S3 is
specifically:
S301, if the training set has n samples, draw n times from the current training set by sampling with replacement in
the Bootstrap manner, producing a sampling subset of n samples; as n tends to infinity, the probability that a given
sample of the original training set is never drawn tends to 36.8%, and the remaining 63.2% of the samples are drawn
into the new sampling subset, which serves as the independent training set of the base learner; the 36.8% of samples
that were not drawn serve as the test set of that base learner for out-of-bag estimation;
S302, calculate the information entropy of the current sample set:
$$\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
where D is the training sample set, n is the number of classes contained in the current sample set, and p_i is the
proportion of the total sample taken up by the samples of class i;
S303, randomly select a candidate attribute subset containing k attributes from the attribute set of the current sample
set, and calculate the conditional entropy of each feature in the candidate attribute subset; the conditional entropy
formula is:
$$\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} p_i \, \mathrm{Ent}(D \mid A = a_i)$$
where D is the current sample set, A is a candidate split attribute of the sample set with n possible values a_1,
a_2, ..., a_n, and p_i is the proportion of samples in which feature A takes the value a_i;
S304, calculate the information gain of each feature; because using the information gain ratio as the split criterion
is biased toward selecting continuous-valued features, the information gain of a continuous-valued feature is corrected
by subtracting log2(N-1)/|D|, where N is the number of possible split points and |D| is the data set size; for feature
A, the information gain is gain(D, A) = Ent(D) - Ent(D | A); for a continuous attribute, the split point with the
largest corrected information gain is selected as the best split point of that attribute;
S305, select the features whose information gain is higher than the mean information gain of all features in the
current candidate attribute subset and higher than the set information gain precision threshold, calculate their
information gain ratios, and select the feature with the highest information gain ratio as the split feature;
S306, divide the data set according to the values of the split feature and recursively apply the above operations to
each branch; splitting stops when one of the following three conditions is met: all training samples at the current
node belong to one class, in which case the node is labeled with that class and becomes a leaf node; the samples at
the current node have no attributes left to split on; the number of training samples at the current node is below the
minimum sample count threshold.
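The 36.8% / 63.2% figures in step S301 follow from the probability (1 - 1/n)^n that one sample is missed in n draws with replacement, which tends to 1/e as n grows. The sketch below checks this numerically and shows one bootstrap split; `bootstrap_split` and `prob_never_drawn` are hypothetical helper names, and the seed is arbitrary.

```python
import random


def prob_never_drawn(n):
    """Probability that one fixed sample is missed by n draws with replacement."""
    return (1 - 1 / n) ** n


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (drawn indices, out-of-bag indices)."""
    drawn = [rng.randrange(n) for _ in range(n)]
    in_bag = set(drawn)
    out_of_bag = [i for i in range(n) if i not in in_bag]
    return drawn, out_of_bag


limit = prob_never_drawn(10**6)  # close to 1/e, i.e. about 0.368
drawn, out_of_bag = bootstrap_split(1000, random.Random(0))
oob_fraction = len(out_of_bag) / 1000
```

The out-of-bag indices are the ones a base learner never saw during training, which is why they can serve as its test set.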
6. The operating system recognition method based on random forest according to claim 5, characterized in that in step
S303, for a discrete feature, the sample set is divided into subsets according to each value of the feature, the
information entropy of each subset is calculated, and the conditional entropy is then obtained; for a continuous
feature, the continuous attribute is first sorted, and only where the class label changes and the difference between
the two consecutive values is greater than the threshold set for that attribute is the midpoint of the two consecutive
values taken as a candidate cut point, so that the continuous attribute is converted into a discrete attribute for the
calculation.
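The candidate cut-point rule for continuous features can be sketched as below: sort by value, and emit a midpoint only where the class label changes and the gap exceeds the attribute's threshold. The name `candidate_cut_points` and the parameter `min_gap` are assumptions for this example.

```python
def candidate_cut_points(values, labels, min_gap):
    """Midpoints between consecutive sorted values that satisfy both conditions."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    cuts = []
    for (v0, y0), (v1, y1) in zip(pairs, pairs[1:]):
        # Keep a cut only at a label change with a large enough value gap.
        if y0 != y1 and (v1 - v0) > min_gap:
            cuts.append((v0 + v1) / 2)
    return cuts


values = [3, 1, 10, 2, 11]
labels = ["b", "a", "b", "a", "a"]
cuts = candidate_cut_points(values, labels, min_gap=0.5)
no_cuts = candidate_cut_points(values, labels, min_gap=1)
```

Filtering this way reduces the number of split points that have to be scored, compared with testing every gap between consecutive values.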
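7. The operating system recognition method based on random forest according to claim 5, characterized in that in step
S305, for feature A, the information gain ratio is calculated as
$$\mathrm{gain\_ratio}(D, A) = \frac{\mathrm{gain}(D, A)}{\mathrm{split\_info}(D, A)}$$
where the split information is
$$\mathrm{split\_info}(D, A) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
with p_i the proportion of samples in which feature A takes the value a_i, as in step S303.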
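The entropy, conditional entropy, information gain and gain ratio used in steps S302 to S305 can be computed as in the following sketch for a discrete feature. It follows the standard C4.5-style definitions; the function names are hypothetical, and returning 0.0 when the split information is zero is a choice made for this example rather than something the claims state.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (base 2) of the empirical distribution of `items`."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())


def conditional_entropy(values, labels):
    """Entropy of the labels after partitioning the samples by feature value."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def gain(values, labels):
    return entropy(labels) - conditional_entropy(values, labels)


def gain_ratio(values, labels):
    split_info = entropy(values)  # entropy of the feature's own value distribution
    return gain(values, labels) / split_info if split_info > 0 else 0.0


feature = ["x", "x", "y", "y"]
labels = ["a", "a", "b", "b"]
constant = ["x", "x", "x", "x"]
```

Here `feature` separates the two classes perfectly, so both its gain and its gain ratio equal 1, while `constant` carries no information and scores 0.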
8. The operating system recognition method based on random forest according to claim 1, characterized in that in step S3,
The first layer, the operating system category layer, is constructed: m data items are first drawn from the generated
simulated data set by stratified sampling on the class labels to form a new training set Mset, and test accuracy is
obtained by out-of-bag estimation; according to the usage share of each operating system and the identification
requirements, n major operating system class labels {x1, x2, ..., xn} are determined; the finest-grained labels of the
original training set that belong to these n major classes are remapped to the coarse-grained class labels xi, and
operating system labels that do not belong to these n major classes are mapped to the type Others; the mapped data
set is fed to the random forest method for training to obtain the operating system major-class classifier; an accuracy
threshold k1 is set for the operating system major-class classifier, and the classifier's accuracy is tested on the
out-of-bag data; when the accuracy reaches the threshold, construction of the first-layer classifier is complete; if
it does not, the algorithm parameters are adjusted or the number of training samples is increased, and the classifier
is retrained until the accuracy reaches the threshold;
The second layer, the major version identification layer, is constructed: on the basis of the n major class labels
from the first layer, the training set Mset is divided by major class into n groups of data, and the class labels of
each group are mapped to the form of class plus major version number; each group is trained with the random forest
method, producing n major version classifiers; the accuracy threshold of the operating system major version
classifiers is set to k2, testing uses out-of-bag estimation, and each group of test data evaluates the accuracy of
its own major version classifier; a classifier whose accuracy reaches the threshold is kept; otherwise it is
retrained, by parameter tuning or by enriching the training samples from the fingerprint database, until it reaches
the threshold; the n major version classifiers make up the second layer, the major version identification layer, of
the hierarchical architecture;
The third layer is the detailed version identification layer: let v be the number of major version labels produced by
the second-layer mapping; the training set Mset is divided by second-layer major version label into v groups of data,
and each group is trained with the random forest algorithm, producing v detailed version classifiers; the accuracy
threshold of the detailed version classifiers is set to k3, and each detailed version classifier is evaluated by
out-of-bag estimation; if the accuracy of a classifier is below the threshold, that classifier is retrained, by
parameter tuning or by resampling to enrich the data, until the threshold is met; the accuracy thresholds should
satisfy k1 > k2 > k3.
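The label remapping behind the three layers can be sketched as follows, leaving out the random forest training itself. It assumes each fine-grained label is a `(family, major version, detailed version)` tuple; that layout, the helper names and the sample labels are all illustrative.

```python
from collections import defaultdict


def layer1_label(fine, major_classes):
    """Coarse label: the OS family, or 'Others' if it is not a chosen major class."""
    family = fine[0]
    return family if family in major_classes else "Others"


def layer2_label(fine):
    """Class plus major version number, e.g. 'Linux 3'."""
    return f"{fine[0]} {fine[1]}"


def split_by(samples, key):
    """Group (vector, fine_label) pairs by a label-mapping function."""
    groups = defaultdict(list)
    for vector, fine in samples:
        groups[key(fine)].append((vector, fine))
    return dict(groups)


def thresholds_ok(k1, k2, k3):
    """The claim requires k1 > k2 > k3."""
    return k1 > k2 > k3


majors = {"Linux", "Windows"}
samples = [
    ([0], ("Linux", "3", "3.10")),
    ([1], ("Linux", "4", "4.15")),
    ([2], ("QNX", "6", "6.5")),
]
by_family = split_by(samples, lambda fine: layer1_label(fine, majors))
by_major = split_by(by_family["Linux"], layer2_label)
```

Each group in `by_family` would train one major version classifier, and each group in `by_major` one detailed version classifier.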
9. The operating system recognition method based on random forest according to claim 1, characterized in that in step
S4, probe sequence packets are sent to the target host to be detected using the same packet-sending scale and mode as
the third-party probe, the probe response sequences are obtained, a fingerprint is extracted from them, converted into
a numeric vector and input into the hierarchical model, and the corresponding prediction result is obtained according
to the required identification granularity; for special systems, a label and an IP list are obtained, probe packets
are sent to those IPs to obtain their response fingerprint data, the fingerprint data are mapped to vectors and stored
in the training set, and once the newly added training data reach the set threshold, the model is retrained so that
the algorithm's fingerprint model is extended to cover the special systems.
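The retraining trigger described above can be sketched as a small buffer that collects newly labeled fingerprint vectors and signals when the configured threshold is reached. `FingerprintBuffer` and its methods are hypothetical names; probing the hosts and retraining the model are outside this sketch.

```python
class FingerprintBuffer:
    """Collects new (vector, label) pairs until a retraining threshold is hit."""

    def __init__(self, retrain_threshold):
        self.retrain_threshold = retrain_threshold
        self.pending = []

    def add(self, vector, label):
        """Store one new sample; return True when retraining should start."""
        self.pending.append((vector, label))
        return len(self.pending) >= self.retrain_threshold

    def drain(self):
        """Hand over the collected samples and reset the buffer."""
        batch, self.pending = self.pending, []
        return batch


buffer = FingerprintBuffer(retrain_threshold=2)
first = buffer.add([1, 0, 3], "special-os")
second = buffer.add([1, 2, 3], "special-os")
batch = buffer.drain()
```

Once `add` returns True, the drained batch would be appended to the training set before the hierarchical model is retrained.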
10. The operating system recognition method based on random forest according to claim 1, characterized in that in step
S5, using the same packet-sending scale and mode as Nmap probing, the following are sent to the target host to be
detected: 6 TCP SYN probe packets, producing the four test response sequences SEQ, OPS, WIN and T1; 2 ICMP echo probe
packets, producing the IE sequence; 1 TCP SYN probe packet, obtaining the features that describe TCP explicit
congestion notification; 6 TCP probe packets, producing the six response sequences T2-T7; and 1 UDP probe packet to a
closed port, producing sequence U1; these 13 response sequences are mapped to a numeric vector in the manner of step
S1 and, after the corresponding passivation, input into the hierarchical model, and the corresponding prediction
result is obtained according to the required identification granularity; during prediction, the test data first enter
the top-layer classifier, which identifies the operating system major class; the data then enter the second-layer
classifier for that class, which identifies the major version number within the class; finally they enter the
third-layer detailed version classifier for that major version, yielding the final detailed operating system
information.
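The top-down routing at prediction time can be sketched as below. The classifiers are represented as plain callables so the example stays self-contained; in practice each would be a trained random forest. The function name, the dictionary layout and the stub classifiers are assumptions for illustration.

```python
def predict_hierarchical(vector, layer1, layer2, layer3):
    """Route one feature vector through the three classifier layers.

    layer1: callable returning the OS major class.
    layer2: dict mapping a major class to its major version classifier.
    layer3: dict mapping a major version label to its detailed version classifier.
    """
    family = layer1(vector)
    major_clf = layer2.get(family)
    if major_clf is None:  # e.g. 'Others' has no finer classifier
        return family, None, None
    major = major_clf(vector)
    detail_clf = layer3.get(major)
    detail = detail_clf(vector) if detail_clf is not None else None
    return family, major, detail


# Stub classifiers standing in for trained models.
layer1 = lambda v: "Linux" if v[0] == 0 else "Others"
layer2 = {"Linux": lambda v: "Linux 3"}
layer3 = {"Linux 3": lambda v: "Linux 3.10"}

linux_result = predict_hierarchical([0, 1], layer1, layer2, layer3)
other_result = predict_hierarchical([9, 1], layer1, layer2, layer3)
```

Stopping at the first layer that has no finer classifier is one way to honor the required identification granularity: the caller receives the most specific answer available.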
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910893976.9A CN110519128B (en) | 2019-09-20 | 2019-09-20 | Random forest based operating system identification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910893976.9A CN110519128B (en) | 2019-09-20 | 2019-09-20 | Random forest based operating system identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110519128A (en) | 2019-11-29 |
CN110519128B (en) | 2021-02-19 |
Family
ID=68633106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910893976.9A Expired - Fee Related CN110519128B (en) | 2019-09-20 | 2019-09-20 | Random forest based operating system identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110519128B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111385297A (en) * | 2020-03-04 | 2020-07-07 | 西安交通大学 | Wireless device fingerprint identification method, system, device and readable storage medium |
CN111431872A (en) * | 2020-03-10 | 2020-07-17 | 西安交通大学 | Two-stage Internet of things equipment identification method based on TCP/IP protocol characteristics |
CN112202718A (en) * | 2020-09-03 | 2021-01-08 | 西安交通大学 | XGboost algorithm-based operating system identification method, storage medium and device |
CN112245728A (en) * | 2020-06-03 | 2021-01-22 | 北京化工大学 | Respirator false positive alarm signal identification method and system based on integrated tree |
CN112929364A (en) * | 2021-02-05 | 2021-06-08 | 上海观安信息技术股份有限公司 | Data leakage detection method and system based on ICMP tunnel analysis |
CN113569929A (en) * | 2021-07-15 | 2021-10-29 | 北京淇瑀信息科技有限公司 | Internet service providing method and device based on small sample expansion and electronic equipment |
CN113627761A (en) * | 2021-07-30 | 2021-11-09 | 中铁一局集团第二工程有限公司 | Parallel evaluation method for prediction of water inrush probability of geotechnical engineering |
CN114095235A (en) * | 2021-11-17 | 2022-02-25 | 恒安嘉新(北京)科技股份公司 | System identification method, apparatus, computer device and medium |
CN114169426A (en) * | 2021-12-02 | 2022-03-11 | 安徽庐峰交通科技有限公司 | Beidou position data-based highway traffic potential safety hazard investigation method |
CN114662557A (en) * | 2022-02-10 | 2022-06-24 | 北京墨云科技有限公司 | Host operating system identification method and device based on machine learning |
CN117395162A (en) * | 2023-12-12 | 2024-01-12 | 中孚信息股份有限公司 | Method, system, device and medium for identifying operating system by using encrypted traffic |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160164866A1 (en) * | 2014-12-09 | 2016-06-09 | Duo Security, Inc. | System and method for applying digital fingerprints in multi-factor authentication |
CN108846275A (en) * | 2018-04-11 | 2018-11-20 | 哈尔滨工程大学 | Unknown Method of Detecting Operating System based on RIPPER algorithm |
CN110213124A (en) * | 2019-05-06 | 2019-09-06 | 清华大学 | Passive operation system identification method and device based on the more sessions of TCP |
Non-Patent Citations (2)
Title |
---|
HIREN J et al.: "Improving ZigBee Device Network Authentication Using Ensemble Decision Tree Classifiers With Radio Frequency Distinct Native Attribute Fingerprinting", 《IEEE》 * |
YI Yunhui (易运晖): "Research on Passive Operating System Identification Technology Based on Decision Trees" (基于决策树的被动操作系统识别技术研究) * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111385297A (en) * | 2020-03-04 | 2020-07-07 | 西安交通大学 | Wireless device fingerprint identification method, system, device and readable storage medium |
CN111431872A (en) * | 2020-03-10 | 2020-07-17 | 西安交通大学 | Two-stage Internet of things equipment identification method based on TCP/IP protocol characteristics |
CN112245728A (en) * | 2020-06-03 | 2021-01-22 | 北京化工大学 | Respirator false positive alarm signal identification method and system based on integrated tree |
CN112202718A (en) * | 2020-09-03 | 2021-01-08 | 西安交通大学 | XGboost algorithm-based operating system identification method, storage medium and device |
CN112202718B (en) * | 2020-09-03 | 2021-08-13 | 西安交通大学 | XGboost algorithm-based operating system identification method, storage medium and device |
CN112929364B (en) * | 2021-02-05 | 2023-03-24 | 上海观安信息技术股份有限公司 | Data leakage detection method and system based on ICMP tunnel analysis |
CN112929364A (en) * | 2021-02-05 | 2021-06-08 | 上海观安信息技术股份有限公司 | Data leakage detection method and system based on ICMP tunnel analysis |
CN113569929A (en) * | 2021-07-15 | 2021-10-29 | 北京淇瑀信息科技有限公司 | Internet service providing method and device based on small sample expansion and electronic equipment |
CN113569929B (en) * | 2021-07-15 | 2024-03-01 | 北京淇瑀信息科技有限公司 | Internet service providing method and device based on small sample expansion and electronic equipment |
CN113627761A (en) * | 2021-07-30 | 2021-11-09 | 中铁一局集团第二工程有限公司 | Parallel evaluation method for prediction of water inrush probability of geotechnical engineering |
CN113627761B (en) * | 2021-07-30 | 2024-03-01 | 中铁一局集团第二工程有限公司 | Parallel evaluation method for geotechnical engineering water inrush probability prediction |
CN114095235A (en) * | 2021-11-17 | 2022-02-25 | 恒安嘉新(北京)科技股份公司 | System identification method, apparatus, computer device and medium |
CN114095235B (en) * | 2021-11-17 | 2024-03-19 | 恒安嘉新(北京)科技股份公司 | System identification method, device, computer equipment and medium |
CN114169426A (en) * | 2021-12-02 | 2022-03-11 | 安徽庐峰交通科技有限公司 | Beidou position data-based highway traffic potential safety hazard investigation method |
CN114662557A (en) * | 2022-02-10 | 2022-06-24 | 北京墨云科技有限公司 | Host operating system identification method and device based on machine learning |
CN117395162A (en) * | 2023-12-12 | 2024-01-12 | 中孚信息股份有限公司 | Method, system, device and medium for identifying operating system by using encrypted traffic |
CN117395162B (en) * | 2023-12-12 | 2024-02-23 | 中孚信息股份有限公司 | Method, system, device and medium for identifying operating system by using encrypted traffic |
Also Published As
Publication number | Publication date |
---|---|
CN110519128B (en) | 2021-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110519128A (en) | A kind of operating system recognition methods based on random forest | |
CN105022960B (en) | Multiple features mobile terminal from malicious software detecting method and system based on network traffics | |
Shao et al. | Community detection based on distance dynamics | |
CN104317681B (en) | For the behavioral abnormal automatic detection method and detecting system of computer system | |
Kirkpatrick et al. | Searching for evolutionary patterns in the shape of a phylogenetic tree | |
CN106470133B (en) | System pressure testing method and device | |
Li et al. | A comparative analysis of evolutionary and memetic algorithms for community detection from signed social networks | |
Xiang et al. | A clustering-based surrogate-assisted multiobjective evolutionary algorithm for shelter location problem under uncertainty of road networks | |
CN107683586A (en) | Method and apparatus for rare degree of the calculating in abnormality detection based on cell density | |
CN108038052A (en) | Automatic test management method, device, terminal device and storage medium | |
CN110347501A (en) | A kind of service testing method, device, storage medium and electronic equipment | |
CN108470022A (en) | A kind of intelligent work order quality detecting method based on operation management | |
CN104618304B (en) | Data processing method and data handling system | |
Ma et al. | Decomposition‐based multiobjective evolutionary algorithm for community detection in dynamic social networks | |
CN109670653A (en) | A kind of method and device predicted based on industrial model predictive engine | |
CN107046586A (en) | A kind of algorithm generation domain name detection method based on natural language feature | |
CN106682507A (en) | Virus library acquiring method and device, equipment, server and system | |
CN114266342A (en) | Internal threat detection method and system based on twin network | |
CN110011990A (en) | Intranet security threatens intelligent analysis method | |
He et al. | Genetic algorithm with ensemble learning for detecting community structure in complex networks | |
Xie et al. | Exploring express delivery networks in China based on complex network theory | |
Wetzig et al. | Unsupervised anomaly alerting for iot-gateway monitoring using adaptive thresholds and half-space trees | |
CN109977131A (en) | A kind of house type matching system | |
CN106611191A (en) | Decision tree classifier construction method based on uncertain continuous attributes | |
Shao et al. | Community detection via local dynamic interaction |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210219 |