CN110097122A - A method for optimizing host identification model performance based on fingerprint simplification

Info

Publication number
CN110097122A
Authority
CN
China
Prior art keywords
fingerprint
host
value
vector
characteristic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910364190.8A
Other languages
Chinese (zh)
Inventor
杨武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Talent Information Technology Co Ltd
Original Assignee
Harbin Talent Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-04-30
Filing date
2019-04-30
Publication date
2019-08-06
Application filed by Harbin Talent Information Technology Co Ltd
Priority to CN201910364190.8A
Publication of CN110097122A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a host identification model performance optimization method based on fingerprint simplification. The method comprises the following steps. Step 1: acquire network traffic and preprocess it. Step 2: extract characteristic information. Step 3: extract host fingerprints. Step 4: vectorize the host fingerprints and perform feature selection on the vectorized features to obtain a minimal feature subset. Step 5: apply inverse vectorization to the minimal feature subset according to the vectorization rules to obtain the simplified fingerprint. The invention builds on the vectorization step that SVM requires in machine learning: it performs feature selection on the vectorized fingerprint, ranks the features, enlarges the feature set step by step while verifying accuracy to find the minimal feature vector, and obtains the simplified fingerprint by inverse vectorization. Experiments verify that, with recognition accuracy nearly the same, the simplified fingerprint offers better recognition efficiency than the complete fingerprint, improving the overall speed of host identification.

Description

A method for optimizing host identification model performance based on fingerprint simplification
Technical field
The present invention relates to a fingerprint simplification method, and in particular to a host identification model performance optimization method that simplifies fingerprints built from multi-dimensional host characteristic information.
Background art
Host identification refers to obtaining, through techniques such as passive network monitoring and active probing, characteristic information that can be used to identify a host, and forming a host fingerprint from it; a host to be identified is then recognized by collecting its characteristic information and matching it against the stored host fingerprints. In current research, host characteristic information falls into three classes by type. The first class is hardware characteristic information, such as the hardware MAC address, the hard disk serial number, the clock skew rate, and the International Mobile Equipment Identity (IMEI). The second class is software environment characteristic information, such as the type and version number of the operating system installed on the host, the type and version number of the browser installed by the user, and the other software installed by the user together with the environment that software requires. The third class is network behavior characteristic information, such as the traffic characteristics produced when the user goes online with the host, which websites the user visits frequently and the operations commonly performed there, the account information used when visiting websites, and the account information used when sending mail.
Host identification technology has been studied extensively, but current research still has shortcomings. Many existing host identification methods use multiple characteristic items, yet all of them include every characteristic item without simplification. This safeguards identification accuracy, but when the fingerprint base holds a very large number of fingerprint samples, identification efficiency is low.
Summary of the invention
The object of the present invention is to provide a host identification model performance optimization method based on fingerprint simplification that significantly improves identification efficiency.
The object of the present invention is achieved through the following technical solution:
A host identification model performance optimization method based on fingerprint simplification, comprising the following steps:
Step 1: Acquire network traffic and preprocess it by separation and screening.
Step 2: Extract from the traffic the characteristic information used for host identification. The characteristic information includes host timestamp information, host traffic time-varying characteristics, hardware information, software information, and network behavior information.
Step 3: Extract the host fingerprint using the hardware host fingerprint extraction method, the software environment host fingerprint extraction method, and the network behavior host fingerprint extraction method together.
Step 4: Vectorize the host fingerprint, converting it according to the corresponding vectorization rules into a vector format that can be input to an SVM, and input the converted vector to the SVM. For the input host fingerprint vectors, use the optimized CHI algorithm to perform feature selection, and rank the characteristic information in the vectors by the values the optimized CHI algorithm computes. Then add characteristic information one item at a time in ranked order and verify by experiment: each time an item is added, use the SVM to verify the accuracy of the host identification result. The set with the fewest vector components whose accuracy is nearly the same as that of the complete fingerprint is the minimal feature subset.
Step 5: Apply inverse vectorization to the minimal feature subset according to the vectorization rules to obtain the simplified fingerprint.
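The ranking and incremental verification in Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the source does not define its "optimized CHI" variant, so the sketch uses a plain chi-square score, and a leave-one-out nearest-neighbour check stands in for the SVM verification. The function names, the `tolerance` parameter, and the sample data are all hypothetical.

```python
from typing import Callable, List, Sequence


def chi_square_scores(X: List[Sequence[float]], y: List[str]) -> List[float]:
    """Score every vector dimension with a plain chi-square statistic.

    Observed = sum of the (non-negative) feature values inside each class;
    expected = class share * total of that feature over all samples.
    """
    n, dims = len(X), len(X[0])
    classes = sorted(set(y))
    share = {c: sum(1 for label in y if label == c) / n for c in classes}
    scores = []
    for j in range(dims):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = share[c] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def loo_1nn_accuracy(X, y, dims) -> float:
    """Leave-one-out 1-nearest-neighbour accuracy on the chosen dimensions.

    Placeholder evaluator only; the source verifies each subset with an SVM.
    """
    hits = 0
    for i, row in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((row[d] - X[k][d]) ** 2 for d in dims),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)


def minimal_feature_subset(X, y, evaluate: Callable = loo_1nn_accuracy,
                           tolerance: float = 0.0) -> List[int]:
    """Add ranked dimensions one at a time until accuracy matches the full set."""
    scores = chi_square_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    full_accuracy = evaluate(X, y, ranked)
    for k in range(1, len(ranked) + 1):
        if evaluate(X, y, ranked[:k]) >= full_accuracy - tolerance:
            return ranked[:k]
    return ranked


# Toy data: dimension 0 separates the two hosts, dimension 1 is constant,
# dimension 2 is noise.
X = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
y = ["A", "A", "A", "B", "B", "B"]
scores = chi_square_scores(X, y)
subset = minimal_feature_subset(X, y)
```

On this toy data the discriminative dimension is ranked first and is kept on its own. Passing an SVM-based function as `evaluate` would bring the sketch closer to the verification the source describes.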
Compared with the prior art, the present invention has the following advantage:
The method of the invention builds on the vectorization step that SVM requires in machine learning. It performs feature selection on the vectorized fingerprint, ranks the features, enlarges the feature set step by step while verifying accuracy to find the minimal feature vector, and obtains the simplified fingerprint by inverse vectorization. Experiments verify that, with recognition accuracy nearly the same, the simplified fingerprint offers better recognition efficiency than the complete fingerprint, improving the overall speed of host identification.
Description of the drawings
Fig. 1 is a flow chart of host fingerprint simplification based on the optimized CHI algorithm.
Fig. 2 shows the relationship between recognition accuracy and number of features for 10 hosts.
Specific embodiments
The technical solution of the present invention is further described below with reference to the drawings, but it is not limited to this description. Any modification or equivalent replacement of the technical solution that does not depart from its spirit and scope shall fall within the protection scope of the present invention.
The present invention provides a host identification model performance optimization method based on fingerprint simplification. As shown in Fig. 1, the method includes the following steps:
Step 1: Traffic preprocessing:
Preprocessing consists mainly of separating and screening the traffic. Separation handles traffic that carries timestamp information apart from traffic that does not. Screening, built on zero-copy capture and multithreaded processing, takes only the first several packets of each flow for processing and leaves the subsequent packets of that flow unprocessed. There is no fixed requirement on the number of packets: a larger number is more accurate but slower to process, a smaller number is more efficient, and the value can be adjusted to specific needs. This avoids processing large numbers of packets with identical characteristics and improves data processing efficiency.
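The separation and screening logic can be sketched as below. This is only an illustration of the two operations: it leaves out the zero-copy capture and multithreading that the source relies on, and the `Packet` type, the direction-independent flow key, and the default `first_n=3` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical, already-decoded packet record."""
    src: str
    sport: int
    dst: str
    dport: int
    proto: str
    has_timestamp: bool  # True if the packet carries timestamp information


def flow_key(p: Packet) -> tuple:
    """Direction-independent key, so both directions count as one flow."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (p.proto,) + (a + b if a <= b else b + a)


def screen(packets: Iterable[Packet], first_n: int = 3) -> Iterator[Packet]:
    """Yield only the first `first_n` packets of every flow."""
    seen = defaultdict(int)
    for p in packets:
        key = flow_key(p)
        if seen[key] < first_n:
            seen[key] += 1
            yield p


def separate(packets: Iterable[Packet]) -> Tuple[List[Packet], List[Packet]]:
    """Split packets into those with and those without timestamp information."""
    with_ts, without_ts = [], []
    for p in packets:
        (with_ts if p.has_timestamp else without_ts).append(p)
    return with_ts, without_ts


packets = (
    [Packet("10.0.0.1", 1234, "10.0.0.2", 80, "tcp", True)] * 5
    + [Packet("10.0.0.2", 80, "10.0.0.1", 1234, "tcp", False)]  # same flow, reverse direction
    + [Packet("10.0.0.3", 5555, "10.0.0.2", 53, "udp", False)] * 2
)
kept = list(screen(packets, first_n=3))
with_ts, without_ts = separate(kept)
```

Raising `first_n` trades processing time for more evidence per flow, which is the accuracy and efficiency trade-off the paragraph above describes.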
Step 2: Characteristic information extraction:
This step covers the extraction of host timestamp information, host traffic time-varying characteristics, hardware information, software information, and network behavior information. Characteristic information is extracted along three dimensions (hardware, software, and network behavior). A custom tool based on libpcap processes the packets and obtains the characteristic information in them that is discriminative for host identification. Characteristic information belonging to different protocols is extracted by corresponding plug-ins, which improves the scalability of the system.
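As an illustration of the kind of per-packet extraction such a tool might perform, the sketch below decodes a raw IPv4/TCP packet with Python's standard `struct` module and pulls out header fields named like the fingerprint items that appear later in the description. It does not use libpcap and is not the custom tool of the source. Reading LEN as the IP total length and S as the SYN flag is an assumption, as are the function name and the hand-built sample packet.

```python
import struct


def parse_ipv4_tcp(packet: bytes) -> dict:
    """Extract header fields from a raw packet that starts at the IPv4 header."""
    (ver_ihl, _tos, total_len, _ident, flags_frag, ttl, proto,
     _csum, _src, _dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    if ver_ihl >> 4 != 4 or proto != 6:
        raise ValueError("not an IPv4/TCP packet")
    tcp = packet[(ver_ihl & 0x0F) * 4:]
    (_sport, _dport, _seq, _ack, offset_byte, flags, window,
     _tcp_csum, _urgent) = struct.unpack("!HHIIBBHHH", tcp[:20])
    features = {
        "TTL": ttl,
        "LEN": total_len,                        # assumption: IP total length
        "DF": int(bool(flags_frag & 0x4000)),    # Don't Fragment bit
        "WIN": window,
        "S": int(bool(flags & 0x02)),            # assumption: SYN flag
        "MSS": None, "WS": None, "SACK": 0, "NOP": 0,
    }
    options = tcp[20:(offset_byte >> 4) * 4]
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                            # end of option list
            break
        if kind == 1:                            # NOP is a single byte
            features["NOP"] = 1
            i += 1
            continue
        if i + 1 >= len(options) or options[i + 1] < 2:
            break                                # malformed option, stop parsing
        length = options[i + 1]
        body = options[i + 2:i + length]
        if kind == 2 and len(body) == 2:         # maximum segment size
            features["MSS"] = struct.unpack("!H", body)[0]
        elif kind == 3 and len(body) == 1:       # window scale shift
            features["WS"] = body[0]
        elif kind == 4:                          # SACK permitted
            features["SACK"] = 1
        i += length
    return features


# Hand-built SYN packet: 20-byte IP header, 20-byte TCP header, 12 option bytes.
tcp_options = (struct.pack("!BBH", 2, 4, 1460) + bytes([1])
               + bytes([3, 3, 7]) + bytes([1, 1]) + bytes([4, 2]))
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 52, 1, 0x4000, 64, 6, 0,
                        bytes([192, 168, 0, 2]), bytes([192, 168, 0, 1]))
tcp_header = struct.pack("!HHIIBBHHH", 40000, 80, 0, 0, 0x80, 0x02, 65535, 0, 0)
features = parse_ipv4_tcp(ip_header + tcp_header + tcp_options)
```

A real deployment would receive packets from a capture library and would need to strip the link-layer header first; both are outside the scope of this sketch.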
Step 3: Host fingerprint extraction:
This step mainly comprises the hardware host fingerprint extraction method, the software environment host fingerprint extraction method, and the network behavior host fingerprint extraction method. Its main function is to find, among the characteristic information obtained, the effective characteristic information that serves as the host fingerprint. The host characteristic information of each dimension requires its own host fingerprint extraction algorithm. For the host fingerprint extraction methods of the different dimensions, the invention proposes a fingerprint simplification method based on the optimized CHI algorithm, which improves host identification efficiency without reducing host identification accuracy.
Step 4: Feature vectorization:
Because fingerprints in the format stored in the host fingerprint base cannot be input directly to the SVM for verification, the fingerprint features must first be vectorized: the characters that carry concrete meaning in the fingerprint are converted into a numeric form the SVM accepts, that is, the host fingerprint is digitized. The specific vectorization rules are as follows:
(a) When the value of a characteristic item in the fingerprint is a specific number A, one dimension of the vector represents it, and the value of that dimension is A/max(A);
(b) When the value of a characteristic item has only two possible states, present or absent, one dimension of the vector represents it, with 1 for present and 0 for absent;
(c) When the value of a characteristic item is a string sequence combining N attributes, N dimensions of the vector record the N attributes, with 1 for an attribute that appears and 0 for one that does not;
(d) When the value of a characteristic item is not a specific number but an arbitrary value in the range [a, b], one dimension of the vector represents the fingerprint item, and its value is the value within the interval;
(e) When the value type of a fingerprint item is not fixed, for example it may be either the N-attribute string sequence of (c) or the specific number A of (a), N+1 dimensions of the input vector record the fingerprint item. When the value is a number, the last of the N+1 dimensions is A/max(A); when the value is an N-attribute string sequence, the positions of attributes that appear are set to 1, the positions of those that do not are set to 0, and the last dimension is 0;
(f) If a fingerprint item has no value, all of its dimensions in the input vector are set to 0.
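Rules (a) to (f) can be sketched as a small encoder. This is a minimal illustration under assumptions: the schema layout, the item names `OPTS`, `RATE`, and `MIX`, and the fixed `max` values are hypothetical, and for rule (e) the sketch assumes the N attribute positions are 0 when the value is numeric, which the source does not state.

```python
from typing import Any, Dict, List

# Hypothetical schema: (item name, rule kind, parameters).
SCHEMA = [
    ("TTL", "numeric", {"max": 255}),                            # rule (a)
    ("DF", "flag", {}),                                          # rule (b)
    ("OPTS", "set", {"attrs": ["MSS", "SACK", "NOP", "WS"]}),    # rule (c)
    ("RATE", "range", {"low": 0.0, "high": 1.0}),                # rule (d)
    ("MIX", "mixed", {"attrs": ["X", "Y"], "max": 100}),         # rule (e)
]


def vectorize_item(kind: str, spec: Dict[str, Any], value: Any) -> List[float]:
    """Encode one fingerprint item; a missing value gives all zeros (rule f)."""
    if kind == "numeric":
        return [0.0] if value is None else [value / spec["max"]]
    if kind == "flag":
        return [1.0 if value else 0.0]
    if kind == "set":
        present = set(value or ())
        return [1.0 if a in present else 0.0 for a in spec["attrs"]]
    if kind == "range":
        return [0.0] if value is None else [float(value)]
    if kind == "mixed":
        n = len(spec["attrs"])
        if value is None:
            return [0.0] * (n + 1)
        if isinstance(value, (int, float)):
            # Assumption: attribute positions stay 0 for a numeric value.
            return [0.0] * n + [value / spec["max"]]
        present = set(value)
        return [1.0 if a in present else 0.0 for a in spec["attrs"]] + [0.0]
    raise ValueError(f"unknown rule kind: {kind}")


def vectorize(fingerprint: Dict[str, Any]) -> List[float]:
    """Concatenate the encodings of all schema items in schema order."""
    vector: List[float] = []
    for name, kind, spec in SCHEMA:
        vector.extend(vectorize_item(kind, spec, fingerprint.get(name)))
    return vector


# RATE is absent, so rule (f) fills its dimension with 0.
vector = vectorize({"TTL": 51, "DF": True, "OPTS": ["MSS", "WS"], "MIX": 25})
```

Keeping the schema as data means the same table can later be reused to map selected dimensions back to fingerprint items.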
According to the above vectorization rules, each attribute in the host fingerprint is expressed numerically. MSS, TTL, WIN, LEN, WS, and similar items that have a specific size are set to their numerical values. S, SACK, NOP, and DF use 0 and 1 to mark whether the attribute appears, including whether fragmentation is allowed. TC has a specific value and a trend to compare. C and U have specific content to be matched.
Step 5: Inverse vectorization is carried out by applying the rules of Step 4 in reverse.
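One possible reading of this step is sketched below: each selected vector dimension is mapped back to the fingerprint item that produced it, and the items with at least one surviving dimension form the simplified fingerprint. The source gives no further detail, so this interpretation, the function names, and the sample layout are assumptions.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (fingerprint item, number of vector dimensions it occupies).
LAYOUT: List[Tuple[str, int]] = [("TTL", 1), ("WIN", 1), ("OPTS", 4), ("DF", 1)]


def dimension_owners(layout: Sequence[Tuple[str, int]]) -> List[str]:
    """Return, for every vector dimension, the item that produced it."""
    owners: List[str] = []
    for name, width in layout:
        owners.extend([name] * width)
    return owners


def simplify_fingerprint(layout: Sequence[Tuple[str, int]],
                         selected_dims: Sequence[int]) -> List[str]:
    """List the items that own at least one selected dimension, in layout order."""
    owners = dimension_owners(layout)
    kept = {owners[d] for d in selected_dims}
    return [name for name, _ in layout if name in kept]


# Dimensions 0, 3 and 1 belong to TTL, OPTS and WIN respectively.
simplified = simplify_fingerprint(LAYOUT, [0, 3, 1])
```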
Step 6: Experimental verification:
For an SVM, under the same experimental environment and the same parameter settings, recognition efficiency usually depends only on the vector dimension: the larger the dimension, the lower the efficiency. Improving recognition efficiency therefore means reducing the vector dimension. Because the experiments for the method of the invention were run in the same environment with unchanged parameters, a feature sub-vector with fewer dimensions can be taken to be recognized faster.
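As general background rather than a statement from the source, the dependence on dimension can be read off the standard kernel SVM decision function. Here n_sv (the number of support vectors), d (the vector dimension), and the other symbols are notation introduced for this note; the cost estimate assumes a kernel such as the linear or RBF kernel, whose evaluation takes time proportional to d.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{n_{sv}} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad
\text{cost per prediction} = O(n_{sv} \cdot d)
```

With the model and parameters held fixed, reducing d therefore reduces the work per kernel evaluation, which is consistent with the claim above.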
(a) Obtaining the minimal feature subset and verifying its performance:
The experiment is verified on 10 hosts, and the average over the 10 hosts is used to determine the minimal feature subset. Fig. 2 plots host recognition accuracy against the number of features. The polyline in the figure is the recognition accuracy; the abscissa is the number of features and the ordinate is the accuracy. The figure shows that, in identifying the hosts, accuracy reaches 0.9 when the number of features is 7, and the accuracy of the complete feature set is also 0.9. With 7 features, recognition accuracy is therefore already nearly the same as with the complete set, while the number of host characteristic items is reduced by 5, so identification efficiency improves. Beyond 7 features, accuracy changes little.
The figure also shows that, while the number of features is small, each added feature raises the average host accuracy substantially. With 6 features in the subset, accuracy has already reached 0.8, against 0.9 for the complete feature set; from 7 features on, accuracy reaches 0.9, essentially the same as the recognition accuracy of the complete feature set. At comparable recognition accuracy, the number of features at this point is close to half of the original number, so the minimal feature subset reduces the characteristic items and raises the identification speed while keeping similar accuracy.
(b) Verifying host recognition performance with the simplified fingerprint:
Table 1 gives the results of host identification experiments on the simplified fingerprint items and on the complete fingerprint base, where the simplified fingerprint retains seven fingerprint items (TTL, WIN, MSS, WS, LEN, TC, C). The table shows that the simplified and complete fingerprints give the same accuracy. In processing time on a data set of the same scale, however, the simplified fingerprint takes 1.3 s, which is 0.5 s faster than the 1.8 s of the complete fingerprint, so the simplified fingerprint is more efficient. Comparing accuracy and processing time shows that the simplified fingerprint does not noticeably reduce host identification accuracy while processing speed improves.
Table 1. Host recognition performance of the simplified fingerprint and the complete fingerprint base
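The reported figures can be put in relative terms with a short calculation. The times of 1.8 s and 1.3 s come from the text; the total of 12 characteristic items is inferred from the statement that 7 items remain after a reduction of 5, and the symbol names are introduced here for illustration.

```latex
\Delta t = 1.8\,\mathrm{s} - 1.3\,\mathrm{s} = 0.5\,\mathrm{s},
\qquad
\frac{\Delta t}{t_{\mathrm{complete}}} = \frac{0.5}{1.8} \approx 27.8\%,
\qquad
\frac{t_{\mathrm{complete}}}{t_{\mathrm{simplified}}} = \frac{1.8}{1.3} \approx 1.38

\frac{12 - 7}{12} = \frac{5}{12} \approx 41.7\% \ \text{fewer characteristic items}
```

On these numbers, removing about 42% of the items shortens processing time by about 28% on the data set used in the experiment.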

Claims (5)

1. A host identification model performance optimization method based on fingerprint simplification, characterized in that the method comprises the following steps:
Step 1: Acquire network traffic and preprocess it;
Step 2: Extract from the traffic the characteristic information used for host identification;
Step 3: Extract the host fingerprint;
Step 4: Vectorize the host fingerprint, converting it according to the corresponding vectorization rules into a vector format that can be input to an SVM, and input the converted vector to the SVM; for the input host fingerprint vectors, use the optimized CHI algorithm to perform feature selection, and rank the characteristic information in the vectors by the values the optimized CHI algorithm computes; add characteristic information one item at a time in ranked order and verify by experiment, using the SVM to verify the accuracy of the host identification result each time an item is added; the set with the fewest vector components whose accuracy is nearly the same as that of the complete fingerprint is the minimal feature subset;
Step 5: Apply inverse vectorization to the minimal feature subset according to the vectorization rules to obtain the simplified fingerprint.
2. The host identification model performance optimization method based on fingerprint simplification according to claim 1, characterized in that the preprocessing includes separation and screening, wherein separation handles traffic that carries timestamp information apart from traffic that does not, and screening, on the basis of zero-copy and multithreaded processing, takes the first several packets of each flow in the traffic for processing.
3. The host identification model performance optimization method based on fingerprint simplification according to claim 1, characterized in that the characteristic information includes host timestamp information, host traffic time-varying characteristics, hardware information, software information, and network behavior information.
4. The host identification model performance optimization method based on fingerprint simplification according to claim 1, characterized in that the host fingerprint is extracted using the hardware host fingerprint extraction method, the software environment host fingerprint extraction method, and the network behavior host fingerprint extraction method together.
5. The host identification model performance optimization method based on fingerprint simplification according to claim 1, characterized in that the vectorization is carried out as follows:
(a) When the value of a characteristic item in the fingerprint is a specific number A, one dimension of the vector represents it, and the value of that dimension is A/max(A);
(b) When the value of a characteristic item has only two possible states, present or absent, one dimension of the vector represents it, with 1 for present and 0 for absent;
(c) When the value of a characteristic item is a string sequence combining N attributes, N dimensions of the vector record the N attributes, with 1 for an attribute that appears and 0 for one that does not;
(d) When the value of a characteristic item is not a specific number but an arbitrary value in the range [a, b], one dimension of the vector represents the fingerprint item, and its value is the value within the interval;
(e) When the value type of a fingerprint item is not fixed, for example it may be either the N-attribute string sequence of (c) or the specific number A of (a), N+1 dimensions of the input vector record the fingerprint item; when the value is a number, the last of the N+1 dimensions is A/max(A); when the value is an N-attribute string sequence, the positions of attributes that appear are set to 1, the positions of those that do not are set to 0, and the last dimension is 0;
(f) If a fingerprint item has no value, all of its dimensions in the input vector are set to 0.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910364190.8A CN110097122A (en) 2019-04-30 2019-04-30 A kind of host identification model performance optimization method simplified based on fingerprint

Publications (1)

Publication Number Publication Date
CN110097122A 2019-08-06

Family

ID=67446710


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190806
