CN112633353A

CN112633353A - Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm

Info

Publication number: CN112633353A
Application number: CN202011506245.3A
Authority: CN
Inventors: 杨家海; 段晨鑫; 王之梁
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-04-09
Anticipated expiration: 2040-12-18
Also published as: CN112633353B

Abstract

The invention belongs to the technical field of computer network management, and particularly relates to an Internet of things equipment identification method based on packet length probability distribution and a k-nearest neighbor algorithm. On the basis of fully mining the traffic characteristics of different Internet of things devices, the method takes the length probability distribution of the network data packets generated by a communicating device within a certain time as a single feature, designs a classifier based on the k-nearest neighbor algorithm, and uses it to classify and identify the type of the source device generating the traffic, in particular the specific type of Internet of things device. The method can effectively determine whether the source device generating the traffic is an Internet of things device and, if so, which known specific device type it belongs to. Compared with existing methods for similar tasks, the method not only achieves higher identification accuracy, but also improves performance indicators such as operation efficiency, robustness, extensibility, and adaptability to special scenarios.

Description

Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm

Technical Field

The invention belongs to the technical field of computer network management, and particularly relates to an Internet of things equipment identification method based on packet length probability distribution and a k-nearest neighbor algorithm.

Background

With the rapid development of Internet of things technology, a large number of Internet of things devices of different types have been deployed in many areas of production and daily life, such as smart homes, smart cities, and industrial control systems. While Internet of things devices bring great convenience, they also pose new challenges for network management. Unlike general-purpose networked devices such as smartphones and laptops, Internet of things devices typically have only limited computing and communication capabilities, and therefore require customized network management strategies such as resource allocation and reservation, quality-of-service management, access control, and anomaly detection. As a concrete example, when a security vulnerability is disclosed in a certain type of Internet of things device, the network administrator must immediately determine whether vulnerable devices of that type exist in the current network, so that they are not compromised and further exploited by attackers. Meeting such network management needs relies on techniques that can quickly and accurately identify, from the traffic itself, the type of source device that generated it.

The most direct way to identify an Internet of things device is to observe identifying information present in the device traffic, such as the OUI (Organizationally Unique Identifier) field in the MAC address, the domain name in a DNS request, the owner of an IP address, or the User-Agent field of an HTTP request. However, this approach has limited applicability, because some vendors produce multiple device types and because encrypted traffic is widespread, and it often incurs large and unpredictable identification delays while waiting for a particular packet to appear. Therefore, the current mainstream approach to classifying and identifying Internet of things devices combines feature engineering with machine learning algorithms. However, even though existing methods can achieve high classification accuracy, they still lack several other properties that are strongly required in practical scenarios, as listed below:

1. Operation efficiency: since a device classification system is typically run as an online system that handles real-time traffic, its runtime efficiency should be as high as possible, and its overhead on computing resources should be as low as possible. However, existing methods tend to extract many different types of features from the traffic, and many of these features rely on deep inspection and matching of the packet payload, which makes the system less efficient and consumes more computing resources.

2. Robustness: many existing methods are evaluated in a relatively clean network environment. In a real network, however, easily confused devices (such as different device types produced by the same manufacturer and the same device type produced by different manufacturers), the scanning traffic that is ubiquitous in networks, and similar factors can severely test system performance. Therefore, a device identification system should be as robust as possible, so that it can still achieve high classification accuracy under various kinds of interference.

3. Extensibility: Internet of things technology is still developing rapidly, which means that new device types keep emerging; in addition, already deployed device types may be disclosed to have security vulnerabilities at any time. Therefore, a device classification system should be extensible, so that when a new device type needs to be identified, the system can be extended with as little interference as possible to the running system. However, many current device identification methods use supervised machine learning, which requires retraining and replacing the original model each time the set of device types changes. Another class of methods trains one binary classifier per device type; however, this still requires an additional training process, as well as additional handling when different classifiers give contradictory results.

4. Adaptability to special scenarios: many existing classification methods perform well when sufficient training data is available. In real scenarios, however, obtaining a large amount of labeled data is difficult, which means that a system must be able to adapt to small-sample learning. In another typical scenario, a large amount of training data is easy to collect but labeling it is time-consuming and laborious, which requires that the classification system can easily be switched to a semi-supervised learning mode, making full use of both labeled and unlabeled data to obtain better classification accuracy.

Disclosure of Invention

The invention aims to provide an Internet of things equipment identification method based on packet length probability distribution and a k-nearest neighbor algorithm, so as to provide properties that existing traffic-based Internet of things device identification methods generally fail to offer. While ensuring high classification accuracy, the method achieves high operation efficiency, low resource overhead, and robustness against the various potential interference factors encountered in actual operation; it can be conveniently extended with new device types to be identified; and it adapts to the special scenarios of small-sample learning and semi-supervised learning.

The invention provides an Internet of things equipment identification method based on packet length probability distribution and a k nearest neighbor algorithm, which has two different schemes, wherein:

the first scheme comprises the following steps:

(1) collecting the traffic of the Internet of things device to be identified in real time to obtain a network data packet set, wherein each element of the network data packet set is a two-tuple consisting of the length and the direction of a network data packet;

(2) performing feature extraction on the network data packet set in the step (1), wherein the feature extraction method comprises the following steps:

(2-1) dividing the network data packet set into a plurality of groups according to a set time interval;

(2-2) within each group, merging the data packets that have the same length and direction into the same category, and counting the number of data packets in each category;

(2-3) for each group, calculating the proportion of the number of data packets in each category to the total number of data packets in the group, and recording this proportion as the probability of the corresponding (length, direction) two-tuple, thereby obtaining the probability distribution over the data packet categories, which is the feature of the network data packet set;

(3) traversing all the Internet of things equipment to be identified, returning to the step (1), obtaining the characteristics of the network data packet set corresponding to all the Internet of things equipment to be identified, and forming a network data packet set characteristic set;

(4) inputting the network data packet set feature set of step (3) into a k-nearest neighbor classifier, wherein the distance metric of the k-nearest neighbor classifier is the total variation distance or the Hellinger distance:

total variation distance:

Hellinger distance:

and the k-nearest neighbor classifier outputs a classification result of the type of the Internet of things equipment to be identified, so that the identification of the Internet of things equipment based on packet length probability distribution and a k-nearest neighbor algorithm is realized.
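
Steps (2-1) to (2-3) above can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not part of the original text: packets are assumed to arrive as `(timestamp, length, direction)` tuples, the direction is encoded as the strings `"out"` and `"in"`, and the function name `packet_length_distributions` is hypothetical.

```python
from collections import Counter, defaultdict


def packet_length_distributions(packets, interval_s):
    """Compute one packet length probability distribution per time window.

    packets: iterable of (timestamp_s, length_bytes, direction) tuples, where
             direction is "out" or "in" from the device's point of view
             (this input format is an assumption of the sketch).
    Returns {window_index: {(length, direction): probability}}.
    """
    counts = defaultdict(Counter)
    for ts, length, direction in packets:
        window = int(ts // interval_s)             # step (2-1): group by time interval
        counts[window][(length, direction)] += 1   # step (2-2): count per category

    features = {}
    for window, counter in counts.items():
        total = sum(counter.values())
        # step (2-3): relative frequency as the probability of each two-tuple
        features[window] = {key: n / total for key, n in counter.items()}
    return features


# Hypothetical traffic: four packets in the first 5-minute window, two in the second.
packets = [
    (0.5, 60, "out"), (1.0, 60, "out"), (2.0, 1500, "in"), (3.0, 60, "out"),
    (301.0, 90, "out"), (302.0, 90, "in"),
]
features = packet_length_distributions(packets, interval_s=300)
```

Each value of `features` is a sparse representation of one feature: categories that never occur in a window are simply absent and count as probability zero.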

Between the step (3) and the step (4) in the first scheme, the following steps may be further included:

(1) inputting the network data packet set feature set into a DBSCAN clustering algorithm, wherein the distance metric of the DBSCAN clustering algorithm is the same as that of the k-nearest neighbor classifier in step (4) of the first scheme, and the DBSCAN clustering algorithm outputs the feature clusters and the feature outliers obtained by clustering the feature set;

(2) calculating the geometric center point of each cluster obtained in step (1);

(3) inputting the feature outliers of step (1) and the geometric center points of step (2), as a new feature set, into the k-nearest neighbor classifier of step (4) of the first scheme.

The second scheme of the method comprises the following steps:

(4) marking the characteristics of the network data packet set in the step (2), wherein the marking content is the type of the to-be-identified Internet of things equipment generating the Internet of things traffic;

(5) traversing all the Internet of things equipment to be identified, repeating the step (1), the step (2) and the step (4) to obtain the characteristics of the network data packet set corresponding to all the Internet of things equipment to be identified, and forming a network data packet set characteristic set containing the type of the Internet of things equipment to be identified;

(6) inputting the network data packet set feature set of step (3) and the network data packet set feature set of step (5), which contains the type of the Internet of things device to be identified, into a k-nearest neighbor classifier, wherein the distance metric of the k-nearest neighbor classifier is the total variation distance or the Hellinger distance:

total variation distance:

Hellinger distance:

and the k-nearest neighbor classifier outputs the classification result of the network data packet set feature set of step (3);

(7) taking the classification result of the network data packet set feature set in the step (6) as a mark of the network data packet set feature set in the step (3) to obtain a network data packet set feature set containing the type of the equipment of the Internet of things to be identified;

(8) merging the network data packet set characteristic set containing the type of the equipment of the Internet of things to be identified in the step (5) and the network data packet set characteristic set containing the type of the equipment of the Internet of things to be identified in the step (7) to obtain a final network data packet set characteristic set;

(9) inputting the final network data packet set feature set of step (8) into the k-nearest neighbor classifier of step (6), and outputting the identification result, thereby realizing Internet of things equipment identification based on packet length probability distribution and a k-nearest neighbor algorithm.

The Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm provided by the invention has the advantages that:

On the basis of fully mining the traffic characteristics of different Internet of things devices, the method takes the length probability distribution of the network data packets generated by a communicating device within a certain time as a single feature, designs a classifier based on the k-nearest neighbor algorithm, and uses it to classify and identify the type of the source device generating the traffic, in particular the specific type of Internet of things device. The method can effectively determine whether the source device generating the traffic is an Internet of things device and, if so, which known specific device type it belongs to. Compared with existing methods for similar tasks, the method not only achieves higher identification accuracy, but also improves performance indicators such as operation efficiency, robustness, extensibility, and adaptability to special scenarios.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The flow diagram of the method for identifying the internet of things equipment based on the packet length probability distribution and the k-nearest neighbor algorithm is shown in fig. 1, and the method has two different schemes, wherein:

the first scheme comprises the following steps:

(4) inputting the network data packet set feature set in the step (3) into a k-nearest neighbor classifier, wherein the distance measurement mode of the k-nearest neighbor classifier is total variation distance or Hellinger distance:

total variation distance:

Hellinger distance:

The second scheme of the method comprises the following steps:

(6) inputting the network data packet set feature set of step (3) and the network data packet set feature set of step (5), which contains the type of the Internet of things device to be identified, into a k-nearest neighbor classifier, wherein the distance metric of the k-nearest neighbor classifier is the total variation distance or the Hellinger distance:

total variation distance:

Hellinger distance:

When the method is implemented and a new Internet of things device to be identified joins, steps (1) to (3) can be repeated to obtain the network data packet set feature set of the new device. This new feature set is then merged with the previous feature set, and the remaining steps of the method are carried out on the merged feature set, so that the type of the newly added Internet of things device is identified. Compared with the prior art, the method is therefore extensible to new Internet of things devices to be identified.

Because the method does not need any training process, when a known device type to be classified needs to be added or deleted, only the feature vector samples of the corresponding device type need to be added to or deleted from the known feature set of the system. Throughout system operation, the feature set used as the reference for similarity comparison can be treated as a configurable parameter, and operation and maintenance personnel can modify it without interrupting the system, which realizes hot updating of the system.
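
The idea that the reference feature set is plain, editable data rather than a trained model can be illustrated with a minimal sketch. The class name `ReferenceFeatureSet` and its methods are hypothetical and only serve the illustration; concerns of a real deployment such as concurrent access or persistence are deliberately left out.

```python
class ReferenceFeatureSet:
    """Hypothetical container for the known feature set used as comparison reference."""

    def __init__(self):
        self._by_type = {}  # device type -> list of feature distributions

    def add_samples(self, device_type, features):
        """Add reference features for a (possibly new) device type; no retraining needed."""
        self._by_type.setdefault(device_type, []).extend(features)

    def remove_type(self, device_type):
        """Stop recognising a device type by dropping its reference features."""
        self._by_type.pop(device_type, None)

    def samples(self):
        """Return all (device_type, feature) pairs currently used for comparison."""
        return [(t, f) for t, feats in self._by_type.items() for f in feats]


reference = ReferenceFeatureSet()
reference.add_samples("camera", [{(60, "out"): 1.0}])
reference.add_samples("plug", [{(90, "out"): 0.5, (90, "in"): 0.5}])
reference.remove_type("camera")   # takes effect for the next classification
```

Since the classifier only reads this set at query time, adding or removing entries changes the recognised device types immediately.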

Given a period of traffic from a device of unknown type, the method counts the number of data packets of each distinct length and direction in the traffic. From the perspective of the device, the direction of a data packet is either outbound (sent) or inbound (received). Because of the minimum frame length and the maximum transmission unit of the network, the data packet length takes values in a finite interval; the most common maximum transmission unit setting for Ethernet is 1500 bytes, so considering the two-tuple of these two attributes yields a feature vector whose dimension does not exceed 3000. The method treats the (length, direction) two-tuple of a data packet as a discrete random variable over a finite sample space, and estimates its probability distribution from the observed frequencies, using this distribution as the sole feature for classification. For performance and robustness reasons, it is also feasible to use only the length distribution of the packets sent by the device, in which case the feature dimension is halved and the feature is not affected by packets that other traffic senders can construct arbitrarily.
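
The dimension arithmetic above can be made concrete with a small sketch that lays a distribution out as a fixed-length vector. The index layout (outbound lengths first, then inbound lengths), the assumption that lengths lie in 1..1500, the `"out"`/`"in"` encoding, and the name `to_vector` are all choices of this sketch, not details given in the original text.

```python
MAX_LEN = 1500  # assumed upper bound on packet length (common Ethernet MTU)


def to_vector(distribution, outbound_only=False):
    """Lay out a {(length, direction): probability} mapping as a dense vector.

    Full variant: 2 * MAX_LEN = 3000 dimensions (outbound block, then inbound block).
    Outbound-only variant: MAX_LEN = 1500 dimensions, renormalised over sent packets.
    """
    if outbound_only:
        kept = {k: v for k, v in distribution.items() if k[1] == "out"}
        total = sum(kept.values())
        kept = {k: v / total for k, v in kept.items()} if total > 0 else {}
        dim = MAX_LEN
    else:
        kept = dict(distribution)
        dim = 2 * MAX_LEN

    vec = [0.0] * dim
    for (length, direction), prob in kept.items():
        if not 1 <= length <= MAX_LEN:
            raise ValueError(f"packet length {length} outside assumed range")
        offset = 0 if direction == "out" else MAX_LEN
        vec[offset + length - 1] = prob
    return vec


dist = {(60, "out"): 0.5, (90, "out"): 0.25, (1500, "in"): 0.25}
full = to_vector(dist)                       # 3000-dimensional feature
sent = to_vector(dist, outbound_only=True)   # 1500-dimensional feature
```

In practice most entries are zero, so a sparse mapping may be the more economical representation; the dense form is shown only to make the 3000 and 1500 dimension counts explicit.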

The k-nearest neighbor classifier used in the method does not need a training process; it is only necessary to add the example features that serve as the comparison reference to the feature set. When a new instance feature to be classified is input, the k-nearest neighbor classifier computes the distance metric between it and each instance in the feature set, and outputs as the classification result the majority device type, or the device type of the closest instance, among the k instances with the smallest distance. Empirically, k only needs to be a small value no greater than 5, and is usually set to 1 to avoid the case where no majority can be determined among the k nearest samples.
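
A minimal sketch of such a training-free classifier is given below. It assumes features are sparse `{(length, direction): probability}` mappings and uses the standard total variation distance with a factor of one half; the tie-breaking rule (the closest neighbor among the tied types wins) is one reading of "majority or closest" and is an assumption, as is the function name `knn_classify`.

```python
from collections import Counter


def total_variation(p, q):
    """Standard total variation distance between two sparse distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def knn_classify(sample, references, k=1, distance=total_variation):
    """Classify `sample` against `references`, a list of (device_type, feature) pairs.

    Returns (device_type, distance_to_the_deciding_neighbour).
    """
    if not references:
        raise ValueError("reference feature set is empty")
    ranked = sorted(
        ((distance(sample, feat), label) for label, feat in references),
        key=lambda pair: pair[0],
    )
    nearest = ranked[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    tied = {label for label, n in votes.items() if n == top}
    # Majority vote; if several types tie, the closest neighbour among them decides.
    for dist, label in nearest:
        if label in tied:
            return label, dist


references = [
    ("camera", {(60, "out"): 1.0}),
    ("plug", {(90, "out"): 1.0}),
    ("plug", {(90, "out"): 0.5, (60, "out"): 0.5}),
]
sample = {(60, "out"): 0.9, (90, "out"): 0.1}
label, dist = knn_classify(sample, references)  # k defaults to 1
```

With k = 1 no vote is needed at all, which matches the remark that k is usually set to 1.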

The distance metric between sample features is the core of the k-nearest neighbor classifier. In the method, since the feature of a sample is a probability distribution, a metric that measures the similarity between two probability distributions must be selected. The scenario of Internet of things device classification and identification also places special requirements on the choice of metric. Given two k-dimensional discrete probability distributions P = {p₁, p₂, …, p_k} and Q = {q₁, q₂, …, q_k}, the method requires the similarity metric to satisfy four requirements. First, the metric must be symmetric, that is, the result must not depend on the order in which the two feature vectors are input. Second, the metric should have low computational complexity, so as to avoid excessive computational overhead that would reduce the efficiency of the algorithm. Third, the metric should not consider similarity between different dimensions of the feature vector, that is, the value in each dimension of one vector is only combined with the value in the same dimension of the other vector, because the method treats each data packet category with a specific length and direction as an independent attribute. Finally, the value of the metric should lie within a bounded range, which makes it possible to attach a confidence judgment to the classification result: when the distance exceeds a certain threshold, the classifier has low confidence in the current classification result, and the device is more likely to be of an unknown type. Distance metrics satisfying these conditions include the total variation distance and the Hellinger distance, defined as follows:

Total variation distance

Hellinger distance
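
For reference, the standard textbook definitions of these two distances for the distributions P and Q above are given below. They are stated here as an assumption: they are the usual normalized forms, and the expressions used by the method may differ from them by a constant factor.

```latex
d_{\mathrm{TV}}(P, Q) = \frac{1}{2} \sum_{i=1}^{k} \lvert p_i - q_i \rvert
\qquad
d_{\mathrm{H}}(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^{2}}
```

In these forms both distances are symmetric, are computed dimension by dimension in time linear in k, and take values in [0, 1], with 0 for identical distributions and 1 for distributions with disjoint support.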

Both metrics can be used in actual deployment; the Hellinger distance, which is the analogue of the Euclidean distance in probability space, is the more commonly used. In the cumulative distribution curve of the minimum Hellinger distance given by the k-nearest neighbor classifier in the experimental evaluation of the method, the distance of most samples is below 0.2. Therefore, in practical operation, a final distance value above 0.3 implies low confidence, and the possibility that the sample comes from a device type not in the known feature set should be considered.
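
The threshold rule can be sketched as follows, using the standard Hellinger distance normalized to [0, 1] (an assumption, since the exact expression is not reproduced in the text). The 0.3 threshold is the value mentioned above; the function names and the convention of returning `None` for a low-confidence result are hypothetical.

```python
import math

UNKNOWN_THRESHOLD = 0.3  # distance above which the result is treated as low confidence


def hellinger(p, q):
    """Standard Hellinger distance between two sparse distributions, in [0, 1]."""
    keys = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(squared / 2.0)


def classify_with_confidence(sample, references, threshold=UNKNOWN_THRESHOLD):
    """1-nearest-neighbour classification that flags likely unknown device types.

    references: list of (device_type, feature) pairs.
    Returns (device_type or None, minimum_distance).
    """
    label, dist = min(
        ((lbl, hellinger(sample, feat)) for lbl, feat in references),
        key=lambda pair: pair[1],
    )
    if dist > threshold:
        return None, dist  # possibly a device type not in the known feature set
    return label, dist


references = [("camera", {(60, "out"): 1.0})]
known = classify_with_confidence({(60, "out"): 1.0}, references)
unknown = classify_with_confidence({(90, "out"): 1.0}, references)
```

Here the first sample matches the reference exactly and is accepted, while the second shares no packet category with any reference and is flagged.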

In the method, the asymptotic time complexity of the k-nearest neighbor classifier is O(nk), where n is the number of samples in the known feature set and k is the dimension of the sample feature vector. The method further improves its running performance mainly by reducing the value of n. Using the DBSCAN clustering algorithm with the same distance metric as the k-nearest neighbor classifier, the method clusters the samples in the feature set that belong to the same device type, and retains only the geometric centers of the clusters and the outliers as the final feature set. This effectively removes similar and redundant samples from the feature set and improves the operation efficiency of the algorithm.
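
The reduction step can be sketched in pure Python as below. The DBSCAN routine is a compact textbook version written for this illustration (a library implementation could be used instead), the geometric center is taken to be the per-category mean of the cluster members, and the parameter values `eps` and `min_samples` are hypothetical since the text does not specify them.

```python
def total_variation(p, q):
    """Standard total variation distance between two sparse distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def dbscan(points, eps, min_samples, distance):
    """Minimal DBSCAN: returns a cluster id (0, 1, ...) per point, or -1 for outliers."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]  # each point counts as its own neighbour
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1            # provisionally an outlier
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # former outlier becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


def centroid(distributions):
    """Per-category mean of several distributions (itself a distribution)."""
    keys = set().union(*distributions)
    return {k: sum(d.get(k, 0.0) for d in distributions) / len(distributions) for k in keys}


def reduce_reference_set(samples_by_type, eps, min_samples, distance=total_variation):
    """Keep only cluster centers and outliers for each device type."""
    reduced = []
    for device_type, samples in samples_by_type.items():
        labels = dbscan(samples, eps, min_samples, distance)
        clusters = {}
        for lbl, sample in zip(labels, samples):
            if lbl == -1:
                reduced.append((device_type, sample))
            else:
                clusters.setdefault(lbl, []).append(sample)
        for members in clusters.values():
            reduced.append((device_type, centroid(members)))
    return reduced


samples_by_type = {
    "camera": [
        {(60, "out"): 1.0},
        {(60, "out"): 0.98, (90, "out"): 0.02},
        {(60, "out"): 0.96, (90, "out"): 0.04},
        {(1500, "in"): 1.0},   # unlike the others, kept as an outlier
    ]
}
reduced = reduce_reference_set(samples_by_type, eps=0.1, min_samples=2)
```

In this toy input the four reference samples shrink to two, one cluster center and one outlier, which is the reduction in n that the paragraph above refers to.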

Experiments show that the method adapts well to small-sample learning without any modification, and still maintains high classification accuracy. For scenarios that require semi-supervised learning, the method adopts the pseudo-label technique: a large amount of unlabeled data is classified using a small amount of labeled data, the classification results are used as pseudo-labels for that data, and the data with real labels and the data with pseudo-labels together form the final feature set that serves as the comparison reference for classifying unknown data. In this way the unlabeled data in the training data set is fully utilized, and the classification accuracy is further improved.
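
The pseudo-label procedure can be sketched as follows. The sketch assumes a 1-nearest-neighbor rule with the standard total variation distance and sparse `{(length, direction): probability}` features; the function names `nearest_label` and `build_pseudo_labeled_set` are hypothetical.

```python
def total_variation(p, q):
    """Standard total variation distance between two sparse distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def nearest_label(sample, references):
    """Device type of the reference feature closest to `sample` (1-nearest neighbour)."""
    return min(references, key=lambda ref: total_variation(sample, ref[1]))[0]


def build_pseudo_labeled_set(labeled, unlabeled):
    """Label the unlabeled features with the labeled ones, then merge both sets."""
    pseudo = [(nearest_label(sample, labeled), sample) for sample in unlabeled]
    return labeled + pseudo


labeled = [
    ("camera", {(60, "out"): 1.0}),
    ("plug", {(90, "out"): 1.0}),
]
unlabeled = [
    {(60, "out"): 0.8, (90, "out"): 0.2},
    {(90, "out"): 0.7, (60, "out"): 0.3},
]
final_set = build_pseudo_labeled_set(labeled, unlabeled)

# The merged set is then the comparison reference for new, unknown traffic.
new_sample = {(60, "out"): 0.75, (90, "out"): 0.25}
predicted = nearest_label(new_sample, final_set)
```

Because the pseudo-labeled features lie closer to realistic traffic than the two idealized labeled ones, the new sample finds a much nearer neighbor in the merged set than it would in the labeled set alone.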

In verification on a traffic data set covering nearly 70 different Internet of things devices and several common non-Internet of things devices, the method reaches a high classification accuracy, close to 100%, at all tested sampling intervals (5 minutes, 15 minutes, and 30 minutes), and both its classification accuracy and its operation efficiency are superior to those of existing methods for similar tasks.

The data set used to verify the method contains the scanning traffic and easily confused devices that are common in real networks. The method still maintains high accuracy under these potential interference factors, whereas the classification accuracy of other existing methods drops noticeably, which demonstrates the superior robustness of the method.

The method can be well adapted to special scenes needing small sample learning and semi-supervised learning, and can still obtain higher classification accuracy under the condition that only a small number of samples and a small number of labeled data exist.

Claims

1. An Internet of things equipment identification method based on packet length probability distribution and a k-nearest neighbor algorithm is characterized by comprising the following steps:

(2-3) for each group, calculating the proportion of the number of data packets in each category to the total number of data packets in the group, and recording this proportion as the probability of the corresponding (length, direction) two-tuple, thereby obtaining the probability distribution over the data packet categories, which is the feature of the network data packet set;

total variation distance:

Hellinger distance:

and the k-nearest neighbor classifier outputs a classification result of the type of the Internet of things equipment to be identified, so that the identification of the Internet of things equipment based on packet length probability distribution and a k-nearest neighbor algorithm is realized.

2. The internet of things equipment identification method according to claim 1, wherein the following steps are further included between the step (3) and the step (4):

3. An Internet of things equipment identification method based on packet length probability distribution and a k-nearest neighbor algorithm is characterized by comprising the following steps:

(6) inputting the network data packet set feature set of step (3) and the network data packet set feature set of step (5), which contains the type of the Internet of things device to be identified, into a k-nearest neighbor classifier, wherein the distance metric of the k-nearest neighbor classifier is the total variation distance or the Hellinger distance:

total variation distance:

Hellinger distance: