CN110097122A - Host identification model performance optimization method based on fingerprint simplification
- Publication number
- CN110097122A (application number CN201910364190.8A)
- Authority
- CN
- China
- Prior art keywords
- fingerprint
- host
- value
- vector
- characteristic information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention discloses a host identification model performance optimization method based on fingerprint simplification. The method comprises the following steps: Step 1, acquire network traffic and preprocess it; Step 2, extract feature information; Step 3, extract host fingerprints; Step 4, vectorize the host fingerprints and perform feature selection on the vectorized features to obtain a minimal feature subset; Step 5, de-vectorize the minimal feature subset according to the vectorization rules to obtain the simplified fingerprint. The invention builds on the vectorization step required by an SVM in machine learning: feature selection is performed on the vectorized fingerprint, the features are ranked, and the feature set is grown one feature at a time and verified until the minimal feature vector is found; the simplified fingerprint is then obtained by de-vectorization. Experiments show that, with nearly the same recognition accuracy, the simplified fingerprint achieves better recognition efficiency than the complete fingerprint, which raises the overall speed of host identification.
Description
Technical field
The present invention relates to a fingerprint simplification method, and in particular to a host identification model performance optimization method that simplifies fingerprints built from multi-dimensional host feature information.
Background art
Host identification refers to obtaining feature information that can be used to identify a host, by techniques such as passive network monitoring and active probing, and forming a host fingerprint from it; a host to be identified is then recognized by collecting its feature information and matching it against the host fingerprints. Current research classifies host feature information into three categories. The first category is hardware feature information, such as the MAC address, the hard disk serial number, the clock skew rate and the International Mobile Equipment Identity. The second category is software environment feature information, such as the operating system type and version installed on the host, the browser type and version installed by the user, and the other software installed by the user together with the environment that software requires. The third category is network behavior feature information, such as the traffic characteristics when the user goes online with the host, which websites the user frequently visits and the user's common operations, the account information used when visiting websites, and the account information used when sending mail.
There is already a great deal of research on host identification technology, but shortcomings remain. Many current host identification methods use multiple feature items, but they all include every feature item without simplification. This guarantees identification accuracy, but when the number of fingerprint samples in the fingerprint database is very large, identification becomes inefficient.
Summary of the invention
The object of the present invention is to provide a host identification model performance optimization method based on fingerprint simplification that can significantly improve recognition efficiency.
The object of the present invention is achieved through the following technical solution:
A host identification model performance optimization method based on fingerprint simplification comprises the following steps:
Step 1: acquire network traffic and preprocess it by separation and screening;
Step 2: extract from the traffic the feature information used for host identification, the feature information including host timestamp information, time-varying characteristics of host traffic, hardware information, software information and network behavior information;
Step 3: extract host fingerprints using the hardware host fingerprint extraction method, the software environment host fingerprint extraction method and the network behavior host fingerprint extraction method together;
Step 4: vectorize the host fingerprints, converting them according to the corresponding vectorization rules into a vector format that can be input to an SVM, and input the converted vectors to the SVM; for the input host fingerprint vectors, perform feature selection using the optimized CHI algorithm, ranking the feature information in the vectors by the value the optimized CHI algorithm computes; add feature information one item at a time in ranked order and verify experimentally, using the SVM to check the host identification accuracy after each addition; the set with the fewest vector dimensions whose accuracy is nearly the same as that of the complete fingerprint is the minimal feature subset;
Step 5: de-vectorize the minimal feature subset according to the vectorization rules to obtain the simplified fingerprint.
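The ranking and incremental verification in Step 4 can be sketched roughly as follows. This is only an illustration under several assumptions: "CHI" is read here as the ordinary chi-square statistic between feature presence and host label (the "optimized" variant is not specified in the text), and the accuracy check is passed in as a callback. The helper `loo_1nn_accuracy` is a hypothetical stand-in used so the sketch runs without external libraries; it is not an SVM, and in the described method the callback would be an SVM-based accuracy measurement. All function names are illustrative.

```python
from collections import Counter
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[float]]


def chi_square(column: Sequence[float], labels: Sequence[str]) -> float:
    """Chi-square statistic between feature presence (value != 0) and label.

    Assumption: plain chi-square, not the patent's unspecified optimized CHI.
    """
    n = len(labels)
    present = [v != 0 for v in column]
    row_totals = Counter(present)
    col_totals = Counter(labels)
    observed = Counter(zip(present, labels))
    score = 0.0
    for p, row_total in row_totals.items():
        for c, col_total in col_totals.items():
            expected = row_total * col_total / n
            score += (observed[(p, c)] - expected) ** 2 / expected
    return score


def rank_features(X: Matrix, y: Sequence[str]) -> List[int]:
    """Return feature indices sorted by descending chi-square score."""
    scores = [chi_square([row[d] for row in X], y) for d in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda d: -scores[d])


def select_minimal(
    X: Matrix,
    y: Sequence[str],
    evaluate: Callable[[Matrix, Sequence[str], List[int]], float],
    tolerance: float = 0.0,
) -> List[int]:
    """Add ranked features one at a time until accuracy matches the full set."""
    order = rank_features(X, y)
    full_accuracy = evaluate(X, y, order)
    chosen: List[int] = []
    for d in order:
        chosen.append(d)
        if evaluate(X, y, chosen) >= full_accuracy - tolerance:
            return chosen
    return order


def loo_1nn_accuracy(X: Matrix, y: Sequence[str], dims: List[int]) -> float:
    """Stand-in evaluator (NOT an SVM): leave-one-out 1-nearest-neighbour."""
    hits = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((xi[d] - X[j][d]) ** 2 for d in dims),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)
```

The `tolerance` parameter is one way to express "nearly the same accuracy"; the source does not state a numeric threshold, so any value chosen for it is a judgment call.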
Compared with the prior art, the present invention has the following advantage:
The method of the invention builds on the vectorization step required by an SVM in machine learning. Feature selection is performed on the vectorized fingerprint; the features are ranked, and the feature set is grown one feature at a time and verified until the minimal feature vector is found. The simplified fingerprint is then obtained by de-vectorization. Experiments show that, with nearly the same recognition accuracy, the simplified fingerprint achieves better recognition efficiency than the complete fingerprint, raising the overall speed of host identification.
Brief description of the drawings
Fig. 1 is a flow chart of host fingerprint simplification based on the optimized CHI algorithm.
Fig. 2 is a graph of identification accuracy against number of features for the 10 hosts.
Detailed description of the embodiments
The technical solution of the present invention is further described below with reference to the drawings, but is not limited thereto; any modification or equivalent replacement of the technical solution that does not depart from its spirit and scope shall fall within the protection scope of the present invention.
The present invention provides a host identification model performance optimization method based on fingerprint simplification. As shown in Fig. 1, the method comprises the following steps:
Step 1: Traffic preprocessing:
This step mainly separates and screens the traffic. Separation handles traffic that carries timestamp information apart from traffic that does not. Screening builds on zero-copy capture and uses multiple threads to take only the first several packets of each flow for processing; later packets of the same flow are not processed. There is no fixed requirement on the number of packets: more packets give more accurate results but lower processing efficiency, fewer packets give higher efficiency, and the number can be adjusted to specific needs. This avoids processing large numbers of packets with identical features and improves data processing efficiency.
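A minimal sketch of the screening and separation logic is given below. It is an illustration only: the `Packet` record, the 5-tuple flow key and the default of three packets per flow are assumptions made for the example, and the zero-copy capture and multi-threading mentioned above are deliberately left out.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record; real captures carry more fields."""
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    has_timestamp: bool  # whether the packet carries timestamp information


def flow_key(pkt: Packet) -> tuple:
    """Identify a flow by its 5-tuple (direction-sensitive in this sketch)."""
    return (pkt.src, pkt.dst, pkt.sport, pkt.dport, pkt.proto)


def screen(packets: Iterable[Packet], keep_per_flow: int = 3) -> Iterator[Packet]:
    """Yield only the first `keep_per_flow` packets of every flow."""
    seen = defaultdict(int)
    for pkt in packets:
        key = flow_key(pkt)
        if seen[key] < keep_per_flow:
            seen[key] += 1
            yield pkt


def separate(packets: Iterable[Packet]) -> Tuple[List[Packet], List[Packet]]:
    """Split packets into (with timestamp, without timestamp)."""
    with_ts: List[Packet] = []
    without_ts: List[Packet] = []
    for pkt in packets:
        (with_ts if pkt.has_timestamp else without_ts).append(pkt)
    return with_ts, without_ts
```

Raising `keep_per_flow` corresponds to the accuracy side of the trade-off described above, and lowering it to the efficiency side.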
Step 2: Feature information extraction:
This includes extraction of host timestamp information, time-varying characteristics of host traffic, hardware information, software information and network behavior information. Feature information is extracted along three dimensions: hardware, software and network behavior. A custom tool based on libpcap processes the packets to obtain the feature information in them that is discriminative for host identification, and the feature information of each protocol is extracted by a corresponding plug-in, which improves the extensibility of the system.
Step 3: Host fingerprint extraction:
This mainly comprises the hardware host fingerprint extraction method, the software environment host fingerprint extraction method and the network behavior host fingerprint extraction method. Its main function is to find, among the feature information obtained, the effective feature information to use as the host fingerprint. The host feature information of each dimension requires a corresponding host fingerprint extraction algorithm. For the host fingerprint extraction methods of the different dimensions, the invention proposes a fingerprint simplification method based on the optimized CHI algorithm, which improves host identification efficiency without reducing host identification accuracy.
Step 4: Feature vectorization:
Because the fingerprint format in the host fingerprint database cannot be input directly to the SVM for verification, the features of the host fingerprint must first be vectorized, converting the meaningful characters in the fingerprint into a numerical form the SVM accepts, i.e. the host fingerprint is digitized. The specific vectorization method is as follows:
(a) When the value of a feature in the fingerprint is a specific number A, one dimension of the vector represents the value of that feature, and the value of that dimension is A/max(A);
(b) When a feature has only two possible states, present or absent, one dimension of the vector represents it, with 1 for present and 0 for absent;
(c) When the value of a feature is a string sequence combining N attributes, N dimensions of the vector record the N attributes, each set to 1 if the attribute is present and 0 if it is absent;
(d) When the value of a feature is not a specific number but any value in a range [a, b], one dimension of the vector represents the fingerprint item, and its value is the value within the interval;
(e) When the value type of a fingerprint item is not fixed, for example it may be either the N-attribute string sequence of (c) or the specific number A of (a), N+1 dimensions of the input vector record the value of the fingerprint item. When the value is a number, the last of the N+1 dimensions is A/max(A); when the value is an N-attribute string sequence, the positions of attributes that appear are set to 1, those that do not appear are set to 0, and the last dimension is 0;
(f) If a fingerprint item has no value, all of its dimensions in the input vector are set to 0.
Following these rules, each attribute in the host fingerprint is represented numerically: MSS, TTL, WIN, LEN, WS and similar attributes that have a specific magnitude are set to their numeric values; S, SACK, NOP and DF are marked 0 or 1 according to whether the attribute appears; fragmentation and TC have specific values and trends to compare; and C and U have specific content to match.
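Rules (a) to (f) can be illustrated with the sketch below. The schema, the item names `OPTS`, `RATIO` and `MIXED`, and the maxima are hypothetical values chosen for the example and do not come from the source. One further assumption is labeled in the code: under rule (e), when the value is numeric, the N attribute dimensions are filled with 0, which the text does not state explicitly.

```python
from typing import Any, Dict, List, Optional, Tuple

# (item name, rule kind, parameter); every entry here is illustrative only.
SCHEMA: List[Tuple[str, str, Any]] = [
    ("TTL", "numeric", 255),                    # rule (a): A / max(A)
    ("DF", "flag", None),                       # rule (b): present / absent
    ("OPTS", "attrs", ["MSS", "SACK", "NOP"]),  # rule (c): N attribute bits
    ("RATIO", "range", (0.0, 1.0)),             # rule (d): value in [a, b]
    ("MIXED", "mixed", (["X", "Y"], 14)),       # rule (e): N bits + 1 numeric
]


def width(kind: str, param: Any) -> int:
    """Number of vector dimensions occupied by one fingerprint item."""
    if kind == "attrs":
        return len(param)
    if kind == "mixed":
        return len(param[0]) + 1
    return 1


def encode_item(kind: str, param: Any, value: Optional[Any]) -> List[float]:
    """Encode a single fingerprint item according to rules (a)-(f)."""
    if value is None:                           # rule (f): no value, all zeros
        return [0.0] * width(kind, param)
    if kind == "numeric":
        return [value / param]
    if kind == "flag":
        return [1.0 if value else 0.0]
    if kind == "attrs":
        return [1.0 if a in value else 0.0 for a in param]
    if kind == "range":
        low, high = param
        if not low <= value <= high:
            raise ValueError("value outside the declared range")
        return [float(value)]
    if kind == "mixed":
        attrs, max_value = param
        if isinstance(value, (int, float)):
            # Assumption: attribute bits are zero when the value is numeric.
            return [0.0] * len(attrs) + [value / max_value]
        return [1.0 if a in value else 0.0 for a in attrs] + [0.0]
    raise ValueError("unknown rule kind: " + kind)


def vectorize(fingerprint: Dict[str, Any], schema=SCHEMA) -> List[float]:
    """Concatenate the encodings of all items in schema order."""
    vector: List[float] = []
    for name, kind, param in schema:
        vector.extend(encode_item(kind, param, fingerprint.get(name)))
    return vector
```

For instance, `vectorize({"TTL": 64, "DF": True})` fills the TTL and DF dimensions and leaves every other dimension at 0 under rule (f).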
Step 5: De-vectorization is carried out by applying the rules of Step 4 in reverse.
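The source gives no further detail on this step. One possible reading, sketched below, is that each selected vector dimension is mapped back to the fingerprint item that produced it, so that the simplified fingerprint is the list of items owning at least one selected dimension. The `layout` structure and the function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Layout = Sequence[Tuple[str, int]]  # (fingerprint item name, number of dimensions)


def dimension_owners(layout: Layout) -> List[str]:
    """For every vector dimension, name the fingerprint item it belongs to."""
    owners: List[str] = []
    for name, n_dims in layout:
        owners.extend([name] * n_dims)
    return owners


def devectorize(selected_dims: Iterable[int], layout: Layout) -> List[str]:
    """Map selected dimensions back to fingerprint items, in layout order."""
    owners = dimension_owners(layout)
    kept = {owners[d] for d in selected_dims}
    return [name for name, _ in layout if name in kept]
```

A numeric item encoded as A/max(A) could likewise be restored by multiplying the stored value by max(A), provided the same maximum used during vectorization is kept.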
Step 6: experimental verification:
For an SVM, under the same experimental environment and the same parameter settings, recognition efficiency usually depends only on the vector dimension: the larger the dimension, the lower the efficiency. Improving recognition efficiency therefore means reducing the vector dimension. Because the experiments for the method of the invention were run in the same environment with unchanged parameters, the feature sub-vector with fewer dimensions can be taken to identify faster.
(a) Obtaining and verifying the performance of the minimal feature subset:
The experiment was verified on 10 hosts, and the average over the 10 hosts was used to determine the minimal feature subset. Fig. 2 shows how host identification accuracy changes with the number of features. The polyline represents identification accuracy; the abscissa is the number of features and the ordinate is the accuracy. The figure shows that, when identifying hosts, accuracy reaches 0.9 at 7 features, and the accuracy of the complete feature set is also 0.9. With 7 features the accuracy is thus already nearly the same as with the complete feature set, while the number of host feature items is reduced by 5 and identification efficiency improves. Beyond 7 features the accuracy changes little.
The figure also shows that when there are few features, each added feature raises the average host accuracy substantially. At 6 features the accuracy has already reached 0.8, against 0.9 for the complete feature set; from 7 features on, the accuracy reaches 0.9, essentially the same as with the complete feature set. With similar identification accuracy, the number of features at this point is roughly half of the original, so identifying with the minimal feature subset removes feature items and raises identification speed at similar accuracy.
(b) Verifying the host identification performance of the simplified fingerprint:
Table 1 gives the host identification results for the simplified fingerprint items and for the complete fingerprint database. The simplified fingerprint retains seven fingerprint items (TTL, WIN, MSS, WS, LEN, TC, C). The table shows that the simplified and complete fingerprints give the same accuracy. On a data set of the same size, however, the processing time is 1.3 s after simplification against 1.8 s for the complete fingerprint, i.e. 0.5 s faster (about 28% less processing time), so the simplified fingerprint is more efficient. Comparing accuracy and processing time shows that the simplified fingerprint does not noticeably reduce host identification accuracy while improving performance.
Table 1 Host identification performance of the simplified fingerprint and the complete fingerprint database
Claims (5)
1. A host identification model performance optimization method based on fingerprint simplification, characterized in that the method comprises the following steps:
Step 1: acquire network traffic and preprocess it;
Step 2: extract from the traffic the feature information used for host identification;
Step 3: extract host fingerprints;
Step 4: vectorize the host fingerprints, converting them according to the corresponding vectorization rules into a vector format that can be input to an SVM, and input the converted vectors to the SVM; for the input host fingerprint vectors, perform feature selection using the optimized CHI algorithm, ranking the feature information in the vectors by the value the optimized CHI algorithm computes; add feature information one item at a time in ranked order and verify experimentally, using the SVM to check the host identification accuracy after each addition; the set with the fewest vector dimensions whose accuracy is nearly the same as that of the complete fingerprint is the minimal feature subset;
Step 5: de-vectorize the minimal feature subset according to the vectorization rules to obtain the simplified fingerprint.
2. The host identification model performance optimization method based on fingerprint simplification according to claim 1, characterized in that the preprocessing comprises separation and screening, wherein separation handles traffic that carries timestamp information apart from traffic that does not, and screening, building on zero-copy capture, uses multiple threads to take the first several packets of each flow for processing.
3. The host identification model performance optimization method based on fingerprint simplification according to claim 1, characterized in that the feature information includes host timestamp information, time-varying characteristics of host traffic, hardware information, software information and network behavior information.
4. The host identification model performance optimization method based on fingerprint simplification according to claim 1, characterized in that the host fingerprints are extracted using the hardware host fingerprint extraction method, the software environment host fingerprint extraction method and the network behavior host fingerprint extraction method together.
5. The host identification model performance optimization method based on fingerprint simplification according to claim 1, characterized in that the vectorization is carried out as follows:
(a) When the value of a feature in the fingerprint is a specific number A, one dimension of the vector represents the value of that feature, and the value of that dimension is A/max(A);
(b) When a feature has only two possible states, present or absent, one dimension of the vector represents it, with 1 for present and 0 for absent;
(c) When the value of a feature is a string sequence combining N attributes, N dimensions of the vector record the N attributes, each set to 1 if the attribute is present and 0 if it is absent;
(d) When the value of a feature is not a specific number but any value in a range [a, b], one dimension of the vector represents the fingerprint item, and its value is the value within the interval;
(e) When the value type of a fingerprint item is not fixed, for example it may be either the N-attribute string sequence of (c) or the specific number A of (a), N+1 dimensions of the input vector record the value of the fingerprint item. When the value is a number, the last of the N+1 dimensions is A/max(A); when the value is an N-attribute string sequence, the positions of attributes that appear are set to 1, those that do not appear are set to 0, and the last dimension is 0;
(f) If a fingerprint item has no value, all of its dimensions in the input vector are set to 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910364190.8A CN110097122A (en) | 2019-04-30 | 2019-04-30 | A kind of host identification model performance optimization method simplified based on fingerprint |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110097122A true CN110097122A (en) | 2019-08-06 |
Family
ID=67446710
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910364190.8A Pending CN110097122A (en) | 2019-04-30 | 2019-04-30 | A kind of host identification model performance optimization method simplified based on fingerprint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110097122A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190806 |