CN112633353A - Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm - Google Patents

Info

Publication number
CN112633353A
Authority
CN
China
Prior art keywords
data packet
network data
internet
packet set
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011506245.3A
Other languages
Chinese (zh)
Other versions
CN112633353B (en)
Inventor
杨家海
段晨鑫
王之梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202011506245.3A
Publication of CN112633353A
Application granted
Publication of CN112633353B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Y INFORMATION AND COMMUNICATION TECHNOLOGY SPECIALLY ADAPTED FOR THE INTERNET OF THINGS [IoT]
    • G16Y30/00 IoT infrastructure

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of computer network management, and in particular relates to an Internet of things (IoT) device identification method based on packet length probability distribution and the k-nearest neighbor algorithm. Building on a thorough analysis of the traffic characteristics of different IoT devices, the method uses the probability distribution of the lengths of the network data packets generated by a communicating device within a given time period as the single feature, and designs a k-nearest neighbor classifier that uses this feature to classify and identify the type of the source device generating the traffic, in particular the specific type of IoT device. The method can effectively determine whether the source device generating the traffic is an IoT device and, if so, which known device type it belongs to. Compared with existing methods for similar tasks, the method not only achieves higher identification accuracy but also improves other performance indicators such as operation efficiency, robustness, expandability, and adaptability to special scenarios.

Description

Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm
Technical Field
The invention belongs to the technical field of computer network management, and particularly relates to an Internet of things equipment identification method based on packet length probability distribution and a k-nearest neighbor algorithm.
Background
With the rapid development of Internet of things (IoT) technology, a great number of IoT devices of different types have been deployed in many areas of production and daily life, such as smart homes, smart cities, and industrial control systems. While bringing great convenience, IoT devices also pose new challenges for network management. Unlike general-purpose networked devices such as smartphones and laptops, IoT devices typically have only limited computing and communication capabilities, and therefore require customized network management strategies such as resource allocation and reservation, quality of service management, access control, and anomaly detection. As a concrete scenario, when a security vulnerability in some type of IoT device is disclosed, the network administrator needs to find out immediately whether vulnerable devices of that type exist in the current network, in order to prevent them from being compromised and exploited by attackers. Meeting such network management needs relies on techniques that can quickly and accurately identify, from traffic, the type of the source device that generated it.
The most direct way to identify an IoT device is to look for identifying information in its traffic, such as the OUI (Organizationally Unique Identifier) field of the MAC address, the domain name in a DNS request, the owner of an IP address, or the User-Agent field of an HTTP request. However, because some vendors produce multiple device types and encrypted traffic is widespread, this approach has limited applicability, and because it must wait for a particular packet, it often incurs large and unpredictable identification delays. The prevailing paradigm for classifying and identifying IoT devices therefore combines feature engineering with machine learning algorithms. Yet even though existing methods can achieve high classification accuracy, they still lack several other properties that are important in practical scenarios, as listed below:
1. Operation efficiency: since a device classification system is typically run online on real-time traffic, its run-time efficiency should be as high as possible and its consumption of computing resources as low as possible. However, existing methods tend to extract many different types of features from the traffic, and many of these features rely on deep inspection and matching of the packet payload, which makes the system less efficient and more resource-hungry.
2. Robustness: many existing methods are evaluated in a relatively clean network environment. In a real network, however, easily confused devices (such as different device types from the same manufacturer, or the same device type from different manufacturers) and the scanning traffic that is ubiquitous in networks can severely test system performance. A device identification system should therefore be as robust as possible, so that it still achieves high classification accuracy under various kinds of interference.
3. Expandability: IoT technology is still developing rapidly, which means that new device types appear continuously; in addition, device types that are already deployed may be found to have security vulnerabilities. A device classification system should therefore be expandable, so that when a new device type needs to be identified the system can be extended with as little interference to the running system as possible. However, many current device identification methods use supervised machine learning, which requires retraining and replacing the original system each time the set of device types is updated. Another class of methods trains one binary classifier per device type, but this still requires an additional training process, as well as additional handling when different classifiers give contradictory results.
4. Adaptability to special scenarios: many existing classification methods perform well when training data is sufficient. In real scenarios, however, obtaining a large amount of labeled data is difficult, so the system needs to adapt well to few-shot (small-sample) learning. In another typical scenario, a large amount of training data is easy to collect but labeling it is time-consuming and laborious, which requires that the classification system can easily be switched to a semi-supervised learning mode that makes full use of both labeled and unlabeled data to obtain better classification accuracy.
Disclosure of Invention
The invention aims to provide an Internet of things device identification method based on packet length probability distribution and the k-nearest neighbor algorithm that supplies the properties which existing traffic-based Internet of things device identification methods generally fail to provide. While ensuring high classification accuracy, the method achieves high operation efficiency and low resource overhead at run time, robustness against various potential interference factors, expandability that makes it easy to add new device types to be identified, and the ability to adapt to the special scenarios of few-shot learning and semi-supervised learning.
The invention provides an Internet of things device identification method based on packet length probability distribution and the k-nearest neighbor algorithm, which has two different schemes, wherein:
the first scheme comprises the following steps:
(1) collecting, in real time, the traffic of an Internet of things device to be identified to obtain a network data packet set, wherein each element of the set is a 2-tuple consisting of the length and the direction of a network data packet;
(2) performing feature extraction on the network data packet set of step (1), as follows:
(2-1) dividing the network data packet set into a plurality of groups according to a set time interval;
(2-2) within each group, merging data packets having the same length and direction into one category, and counting the number of data packets in each category;
(2-3) for each group, calculating the ratio of the number of data packets in each category to the total number of data packets in the group, and recording this ratio as the probability of the corresponding (length, direction) 2-tuple, thereby obtaining the probability distribution over the categories, which is the feature of the network data packet set;
(3) traversing all Internet of things devices to be identified and repeating steps (1) and (2) for each, thereby obtaining the features of the network data packet sets of all devices to be identified, which together form a network data packet set feature set;
(4) inputting the network data packet set feature set of step (3) into a k-nearest neighbor classifier whose distance metric is the total variation distance or the Hellinger distance, where, for two discrete probability distributions P = {p_1, p_2, …, p_k} and Q = {q_1, q_2, …, q_k} (k here being the feature dimension):
total variation distance:
$$\delta(P, Q) = \frac{1}{2} \sum_{i=1}^{k} \left| p_i - q_i \right|$$
Hellinger distance:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}$$
and the k-nearest neighbor classifier outputs the classification result for the type of the Internet of things device to be identified, thereby realizing Internet of things device identification based on packet length probability distribution and the k-nearest neighbor algorithm.
Between step (3) and step (4) of the first scheme, the following steps may further be included:
(1) inputting the network data packet set feature set into the DBSCAN clustering algorithm, whose distance metric is the same as that of the k-nearest neighbor classifier in step (4) of the first scheme; after clustering the feature set, the DBSCAN clustering algorithm outputs the feature clusters and the feature outliers;
(2) calculating the geometric center point of each cluster obtained in step (1);
(3) inputting the feature outliers of step (1) and the geometric center points of step (2), as a new feature set, into the k-nearest neighbor classifier of step (4) of the first scheme.
The second scheme of the method comprises the following steps:
(1) collecting, in real time, the traffic of an Internet of things device to be identified to obtain a network data packet set, wherein each element of the set is a 2-tuple consisting of the length and the direction of a network data packet;
(2) performing feature extraction on the network data packet set of step (1), as follows:
(2-1) dividing the network data packet set into a plurality of groups according to a set time interval;
(2-2) within each group, merging data packets having the same length and direction into one category, and counting the number of data packets in each category;
(2-3) for each group, calculating the ratio of the number of data packets in each category to the total number of data packets in the group, and recording this ratio as the probability of the corresponding (length, direction) 2-tuple, thereby obtaining the probability distribution over the categories, which is the feature of the network data packet set;
(3) traversing all Internet of things devices to be identified and repeating steps (1) and (2) for each, thereby obtaining the features of the network data packet sets of all devices to be identified, which together form a network data packet set feature set;
(4) labeling the features of the network data packet set of step (2), the label being the type of the Internet of things device that generated the traffic;
(5) traversing all Internet of things devices to be identified and repeating steps (1), (2) and (4), thereby obtaining a labeled network data packet set feature set in which each feature carries the type of its source device;
(6) inputting the feature set of step (3) and the labeled feature set of step (5) into a k-nearest neighbor classifier whose distance metric is the total variation distance or the Hellinger distance, where, for two discrete probability distributions P = {p_1, p_2, …, p_k} and Q = {q_1, q_2, …, q_k} (k here being the feature dimension):
total variation distance:
$$\delta(P, Q) = \frac{1}{2} \sum_{i=1}^{k} \left| p_i - q_i \right|$$
Hellinger distance:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}$$
the k-nearest neighbor classifier outputs the classification result for the feature set of step (3);
(7) using the classification result of step (6) as the labels of the feature set of step (3), thereby obtaining a further labeled network data packet set feature set;
(8) merging the labeled feature set of step (5) with the labeled feature set of step (7) to obtain a final network data packet set feature set;
(9) inputting the final feature set of step (8) into the k-nearest neighbor classifier of step (6), which outputs the identification result, thereby realizing Internet of things device identification based on packet length probability distribution and the k-nearest neighbor algorithm.
The Internet of things device identification method based on packet length probability distribution and the k-nearest neighbor algorithm provided by the invention has the following advantages:
Building on a thorough analysis of the traffic characteristics of different Internet of things devices, the method uses the probability distribution of the lengths of the network data packets generated by a communicating device within a given time period as the single feature, and designs a k-nearest neighbor classifier that uses this feature to classify and identify the type of the source device generating the traffic, in particular the specific type of Internet of things device. The method can effectively determine whether the source device generating the traffic is an Internet of things device and, if so, which known device type it belongs to. Compared with existing methods for similar tasks, the method not only achieves higher identification accuracy but also improves other performance indicators such as operation efficiency, robustness, expandability, and adaptability to special scenarios.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The flow chart of the Internet of things device identification method based on packet length probability distribution and the k-nearest neighbor algorithm is shown in FIG. 1. The method has two different schemes, wherein:
the first scheme comprises the following steps:
(1) collecting, in real time, the traffic of an Internet of things device to be identified to obtain a network data packet set, wherein each element of the set is a 2-tuple consisting of the length and the direction of a network data packet;
(2) performing feature extraction on the network data packet set of step (1), as follows:
(2-1) dividing the network data packet set into a plurality of groups according to a set time interval;
(2-2) within each group, merging data packets having the same length and direction into one category, and counting the number of data packets in each category;
(2-3) for each group, calculating the ratio of the number of data packets in each category to the total number of data packets in the group, and recording this ratio as the probability of the corresponding (length, direction) 2-tuple, thereby obtaining the probability distribution over the categories, which is the feature of the network data packet set;
(3) traversing all Internet of things devices to be identified and repeating steps (1) and (2) for each, thereby obtaining the features of the network data packet sets of all devices to be identified, which together form a network data packet set feature set;
(4) inputting the network data packet set feature set of step (3) into a k-nearest neighbor classifier whose distance metric is the total variation distance or the Hellinger distance, where, for two discrete probability distributions P = {p_1, p_2, …, p_k} and Q = {q_1, q_2, …, q_k} (k here being the feature dimension):
total variation distance:
$$\delta(P, Q) = \frac{1}{2} \sum_{i=1}^{k} \left| p_i - q_i \right|$$
Hellinger distance:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}$$
and the k-nearest neighbor classifier outputs the classification result for the type of the Internet of things device to be identified, thereby realizing Internet of things device identification based on packet length probability distribution and the k-nearest neighbor algorithm.
Between step (3) and step (4) of the first scheme, the following steps may further be included:
(1) inputting the network data packet set feature set into the DBSCAN clustering algorithm, whose distance metric is the same as that of the k-nearest neighbor classifier in step (4) of the first scheme; after clustering the feature set, the DBSCAN clustering algorithm outputs the feature clusters and the feature outliers;
(2) calculating the geometric center point of each cluster obtained in step (1);
(3) inputting the feature outliers of step (1) and the geometric center points of step (2), as a new feature set, into the k-nearest neighbor classifier of step (4) of the first scheme.
The second scheme of the method comprises the following steps:
(1) collecting, in real time, the traffic of an Internet of things device to be identified to obtain a network data packet set, wherein each element of the set is a 2-tuple consisting of the length and the direction of a network data packet;
(2) performing feature extraction on the network data packet set of step (1), as follows:
(2-1) dividing the network data packet set into a plurality of groups according to a set time interval;
(2-2) within each group, merging data packets having the same length and direction into one category, and counting the number of data packets in each category;
(2-3) for each group, calculating the ratio of the number of data packets in each category to the total number of data packets in the group, and recording this ratio as the probability of the corresponding (length, direction) 2-tuple, thereby obtaining the probability distribution over the categories, which is the feature of the network data packet set;
(3) traversing all Internet of things devices to be identified and repeating steps (1) and (2) for each, thereby obtaining the features of the network data packet sets of all devices to be identified, which together form a network data packet set feature set;
(4) labeling the features of the network data packet set of step (2), the label being the type of the Internet of things device that generated the traffic;
(5) traversing all Internet of things devices to be identified and repeating steps (1), (2) and (4), thereby obtaining a labeled network data packet set feature set in which each feature carries the type of its source device;
(6) inputting the feature set of step (3) and the labeled feature set of step (5) into a k-nearest neighbor classifier whose distance metric is the total variation distance or the Hellinger distance, where, for two discrete probability distributions P = {p_1, p_2, …, p_k} and Q = {q_1, q_2, …, q_k} (k here being the feature dimension):
total variation distance:
$$\delta(P, Q) = \frac{1}{2} \sum_{i=1}^{k} \left| p_i - q_i \right|$$
Hellinger distance:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}$$
the k-nearest neighbor classifier outputs the classification result for the feature set of step (3);
(7) using the classification result of step (6) as the labels of the feature set of step (3), thereby obtaining a further labeled network data packet set feature set;
(8) merging the labeled feature set of step (5) with the labeled feature set of step (7) to obtain a final network data packet set feature set;
(9) inputting the final feature set of step (8) into the k-nearest neighbor classifier of step (6), which outputs the identification result, thereby realizing Internet of things device identification based on packet length probability distribution and the k-nearest neighbor algorithm.
When the method is implemented and a new Internet of things device to be identified joins, steps (1) to (3) of the method can be repeated to obtain the network data packet set feature set of the new device. This new feature set is then merged with the previous feature set, and the remaining steps of the method are carried out on the merged feature set, thereby identifying the type of the newly added device. Compared with the prior art, the method is therefore expandable with respect to new Internet of things devices to be identified.
Because the method requires no training process, adding or deleting a known device type only requires adding or deleting the feature vector samples of that device type in the system's known feature set. Throughout system operation, the feature set used as the reference for similarity comparison can be treated as a configurable parameter, which operation and maintenance personnel can modify without affecting the running system, thereby realizing hot updates of the system.
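The add-and-delete behaviour described above can be illustrated with a minimal sketch. The patent does not specify how the reference feature set is stored or swapped, so the `ReferenceSet` class, its method names, and the lock-plus-immutable-snapshot design are all hypothetical choices made for this example.

```python
import threading

class ReferenceSet:
    """Hypothetical holder for the known feature set used as comparison reference.

    Readers take an immutable snapshot, so a classifier that is in the middle
    of a lookup is not disturbed when an operator adds or removes samples.
    """

    def __init__(self, samples=()):
        self._lock = threading.Lock()
        self._samples = tuple(samples)  # tuple of (feature, device_type) pairs

    def snapshot(self):
        # Returns the current immutable tuple; later updates replace it.
        return self._samples

    def add(self, feature, device_type):
        with self._lock:
            self._samples = self._samples + ((feature, device_type),)

    def remove_type(self, device_type):
        with self._lock:
            self._samples = tuple(s for s in self._samples if s[1] != device_type)

refs = ReferenceSet([([0.8, 0.2], "camera")])
before = refs.snapshot()
refs.add([0.1, 0.9], "plug")    # a new device type becomes recognisable
after_add = refs.snapshot()
refs.remove_type("camera")      # a retired device type is dropped
after_remove = refs.snapshot()
```

No retraining step appears anywhere in the sketch, which is the point the paragraph makes: the reference samples are the whole model.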
Given a period of traffic from a device of unknown type, the method counts the packets of each length and direction in the traffic. From the device's perspective, a packet's direction is either sent or received. Because of the minimum frame length and the maximum transmission unit (MTU) of the network, the packet length takes values in a bounded interval; the most common Ethernet MTU setting is 1500 bytes, so considering the 2-tuple of these two attributes yields a feature vector of dimension not exceeding 3000. The method treats the (length, direction) 2-tuple of a packet as a discrete random variable on this finite sample space and estimates its probability distribution from relative frequencies; this distribution is the only feature used for classification. For performance and robustness reasons, it is also feasible to use only the length distribution of the packets sent by the device, in which case the feature dimension is halved and the feature is not affected by packets that other traffic senders can construct arbitrarily.
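The feature extraction just described can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the packet tuple layout `(timestamp, length, direction)`, the `"out"`/`"in"` direction labels, the function name, and the sparse dictionary representation (only 2-tuples that actually occur are stored) are all hypothetical.

```python
from collections import Counter, defaultdict

def packet_length_distributions(packets, window_seconds=300.0, outbound_only=False):
    """Return one probability distribution per time window, in time order.

    packets: iterable of (timestamp_seconds, length_bytes, direction) tuples,
             where direction is "out" (sent by the device) or "in" (received).
    Each distribution is a dict mapping (length, direction) to a probability.
    """
    windows = defaultdict(Counter)
    for timestamp, length, direction in packets:
        if outbound_only and direction != "out":
            continue  # optional variant: use only packets sent by the device
        # Step (2-1): assign the packet to a fixed-length time window.
        window_index = int(timestamp // window_seconds)
        # Step (2-2): count packets sharing the same (length, direction) pair.
        windows[window_index][(length, direction)] += 1

    features = []
    for window_index in sorted(windows):
        counts = windows[window_index]
        total = sum(counts.values())
        # Step (2-3): relative frequency serves as the probability estimate.
        features.append({key: n / total for key, n in counts.items()})
    return features

packets = [
    (0.5, 60, "out"), (1.0, 60, "out"), (2.0, 1500, "in"), (3.0, 60, "out"),
    (301.0, 90, "out"), (302.0, 90, "in"),
]
features = packet_length_distributions(packets)
outbound = packet_length_distributions(packets, outbound_only=True)
```

With the 5-minute window assumed here, the first four packets fall into one window and the last two into the next, so `features` holds two distributions.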
The k-nearest neighbor classifier used in the method requires no training process; example features serving as the comparison reference only need to be added to the feature set. When a new instance feature to be classified is input, the classifier computes the distance metric between it and each instance in the feature set, selects the k instances with the smallest distances, and outputs as the classification result either the device type held by the majority of these k instances or the device type of the closest one. Empirically, k only needs to be a small value no greater than 5, and is usually set to 1 to avoid the situation in which no majority can be determined among the k nearest samples.
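A minimal sketch of such a classifier is given below, assuming features are stored as sparse dictionaries keyed by (length, direction). The function names and the tie-breaking rule (fall back to the closest neighbour when the vote is tied) are assumptions for illustration; the total variation distance is used only because it is short to write, and any metric with the same signature could be passed in.

```python
from collections import Counter

def total_variation(p, q):
    """Total variation distance between two sparse distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)

def knn_classify(sample, reference_set, k=1, distance=total_variation):
    """Return (device_type, smallest_distance) for one sample.

    reference_set: list of (distribution, device_type) pairs.
    """
    if not reference_set:
        raise ValueError("reference_set must contain at least one sample")
    scored = sorted(
        ((distance(sample, feature), label) for feature, label in reference_set),
        key=lambda pair: pair[0],
    )
    nearest = scored[:k]
    votes = Counter(label for _, label in nearest)
    best = max(votes.values())
    # Ties are broken in favour of the closest neighbour, because `nearest`
    # is ordered by increasing distance.
    for _, label in nearest:
        if votes[label] == best:
            return label, nearest[0][0]

reference = [
    ({(60, "out"): 0.75, (1500, "in"): 0.25}, "camera"),
    ({(90, "out"): 0.5, (90, "in"): 0.5}, "plug"),
]
sample = {(60, "out"): 0.7, (1500, "in"): 0.3}
label, dist = knn_classify(sample, reference)
tied_label, _ = knn_classify(sample, reference, k=2)
```

Returning the smallest distance alongside the label keeps the information needed for a later confidence check.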
The distance metric between sample features is the core of the k-nearest neighbor classifier. Since in this method the feature of a sample is a probability distribution, a metric that measures the similarity between two probability distributions must be chosen, and the Internet of things device classification scenario places special requirements on that choice. Given two k-dimensional discrete probability distributions P = {p_1, p_2, …, p_k} and Q = {q_1, q_2, …, q_k} (here k denotes the feature dimension, not the number of neighbors), the method requires the similarity metric to satisfy four requirements. First, the metric must be symmetric, i.e. the result must not depend on the order in which the two feature vectors are input. Second, the metric should have low computational complexity, so as not to incur excessive computational overhead and reduce the efficiency of the algorithm. Third, the metric should not consider similarity across different dimensions of the feature vector, i.e. the value in each dimension is compared only with the value in the same dimension of the other vector, because the method treats each packet of a specific length and direction as an independent feature. Finally, the result of the metric should lie within a bounded range, which makes it possible to attach a confidence judgment to the classification result: when the distance exceeds a certain threshold, the classifier has low confidence in the current classification result, and the device is more likely to be of an unknown type. Distance metrics that satisfy these conditions include the total variation distance and the Hellinger distance, defined as follows:
Total variation distance
$$\delta(P, Q) = \frac{1}{2} \sum_{i=1}^{k} \left| p_i - q_i \right|$$
Hellinger distance
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}$$
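As a small worked illustration, the two metrics can be written in a few lines, assuming the standard normalised definitions (half the sum of absolute differences for the total variation distance, and the Euclidean distance between square-root vectors divided by √2 for the Hellinger distance) and dense probability vectors of equal length. The function names are illustrative only.

```python
import math

def total_variation(p, q):
    """Half the L1 distance between two probability vectors; lies in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def hellinger(p, q):
    """Euclidean distance between square-root vectors over sqrt(2); lies in [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)

p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]
tv_pq = total_variation(p, q)     # 0.5 * (0 + 0.5 + 0.5) = 0.5
h_pq = hellinger(p, q)            # sqrt((0 + 0.5 + 0.5) / 2), about 0.707
h_disjoint = hellinger([1.0, 0.0], [0.0, 1.0])  # distributions with no overlap
```

Both functions compare each dimension only with the same dimension of the other vector, are symmetric in their arguments, cost one pass over the vectors, and are bounded, which matches the four requirements listed above.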
Both of these metrics can be used in actual deployment; the Hellinger distance, which is the analogue of the Euclidean distance in probability space, is the more commonly used. In the cumulative distribution curve of the minimum Hellinger distance given by the k-nearest neighbor classifier in the experimental evaluation of the method, the distance for most samples is below 0.2. In practical operation, a final distance value above 0.3 therefore implies low confidence, and the possibility that the sample comes from a device type not in the known feature set needs to be considered.
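The confidence check can be sketched as a thin wrapper around the classifier output. The 0.3 threshold comes from the paragraph above; the function name, the `"unknown"` label, and the decision to replace the predicted label outright (rather than merely flag it) are assumptions made for this example.

```python
def apply_confidence_threshold(predicted_label, min_distance, threshold=0.3):
    """Return (label, confident) given the nearest-neighbour distance.

    A distance above the threshold is treated as low confidence and the
    sample is reported as coming from a possibly unknown device type.
    """
    if min_distance > threshold:
        return "unknown", False
    return predicted_label, True

close_match = apply_confidence_threshold("camera", 0.12)
far_match = apply_confidence_threshold("camera", 0.45)
```

Because the Hellinger distance is bounded between 0 and 1, a single fixed threshold of this kind is meaningful across device types.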
In the method, the asymptotic time complexity of the k-nearest neighbor classifier is O(nk), where n is the number of samples in the known feature set and k is the dimension of the sample feature vector. The method improves its running performance mainly by reducing n. Samples in the feature set that belong to the same device type are clustered with the DBSCAN clustering algorithm, using the same distance measure as the k-nearest neighbor classifier, and only the geometric center of each cluster and the outliers in the clustering result are kept as the final feature set. This effectively removes similar and redundant samples from the feature set and improves the running efficiency of the algorithm.
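The reduction step can be illustrated with a small, dependency-free sketch. It assumes fixed-length probability vectors, the standard Hellinger distance, and illustrative values for the DBSCAN parameters `eps` and `min_pts`, none of which are specified in the text; `dbscan` and `reduce_feature_set` are hypothetical names.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two equal-length probability vectors."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def dbscan(points, eps, min_pts, distance=hellinger):
    """Minimal DBSCAN: one label per point, -1 marks an outlier."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally an outlier
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def reduce_feature_set(samples, eps=0.1, min_pts=3):
    """Keep the outliers plus one geometric center per cluster.

    samples must all belong to the same device type.
    """
    labels = dbscan(samples, eps, min_pts)
    reduced = [s for s, lbl in zip(samples, labels) if lbl == -1]
    for c in sorted(set(labels) - {-1}):
        members = [s for s, lbl in zip(samples, labels) if lbl == c]
        reduced.append([sum(col) / len(members) for col in zip(*members)])
    return reduced


# Three near-identical samples and one outlier of the same device type.
samples = [[0.5, 0.5, 0.0], [0.52, 0.48, 0.0], [0.48, 0.52, 0.0], [0.0, 0.0, 1.0]]
reduced = reduce_feature_set(samples)
```

Because the arithmetic mean of probability vectors is itself a probability vector, each cluster center can be compared with new samples using the same distance. Here the four samples shrink to two entries: the outlier and one center.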
Experiments show that the method adapts well to small-sample learning scenarios without modification and still maintains high classification accuracy. For scenarios that require semi-supervised learning, the method adopts a pseudo-label technique: a small amount of labeled data is used to classify a large amount of unlabeled data, and the classification results are taken as pseudo labels for that data. The data with real labels and the data with pseudo labels together form the final feature set, which serves as the comparison baseline for classifying unknown data. The unlabeled data in the training data set is thereby fully used, and the classification accuracy is further improved.
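The pseudo-label procedure can be sketched as follows. The example assumes fixed-length probability vectors, the standard Hellinger distance and a single nearest neighbor; `nearest_label` and `pseudo_label` are hypothetical helper names, and the device types are made up for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two equal-length probability vectors."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def nearest_label(sample, labeled):
    """Label of the labeled sample closest to `sample`."""
    return min(labeled, key=lambda item: hellinger(sample, item[0]))[1]


def pseudo_label(labeled, unlabeled):
    """Classify unlabeled features with the labeled ones and merge both sets.

    labeled is a list of (feature, device_type) pairs; unlabeled is a list
    of features. The returned list is the final feature set.
    """
    pseudo = [(feat, nearest_label(feat, labeled)) for feat in unlabeled]
    return labeled + pseudo


# A small labeled set and a larger pool of unlabeled features (illustrative).
labeled = [([1.0, 0.0], "camera"), ([0.0, 1.0], "plug")]
unlabeled = [[0.9, 0.1], [0.2, 0.8]]
final_set = pseudo_label(labeled, unlabeled)

# New, unknown data is then classified against the merged feature set.
prediction = nearest_label([0.6, 0.4], final_set)
```

The merged set gives the classifier more reference points per device type than the labeled data alone, which is the intended benefit of pseudo labels; the sketch does not filter out low-confidence pseudo labels, which a practical system might choose to do.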
Verification on a traffic data set consisting of nearly 70 different Internet of Things devices and several common non-Internet of Things devices shows that the classification accuracy of the method is high, close to 100%, at each of the sampling intervals tested (5 minutes, 15 minutes and 30 minutes), and that both its classification accuracy and its running efficiency are superior to those of existing methods for similar tasks.
The data set used to verify the method contains scanning traffic and easily confused devices that are common in real networks. The method maintains high accuracy in the presence of these potential interference factors, whereas the classification accuracy of other existing methods drops markedly, which illustrates the superior robustness of the method.
The method adapts well to special scenarios that require small-sample learning or semi-supervised learning, and still achieves high classification accuracy when only a small number of samples or a small amount of labeled data is available.

Claims (3)

1. An Internet of things equipment identification method based on packet length probability distribution and a k-nearest neighbor algorithm is characterized by comprising the following steps:
(1) collecting the traffic of the Internet of things equipment to be identified in real time to obtain a network data packet set, wherein each element in the network data packet set is a two-tuple consisting of the length and the direction of a network data packet;
(2) performing feature extraction on the network data packet set in the step (1), wherein the feature extraction method comprises the following steps:
(2-1) dividing the network data packet set into a plurality of groups according to a set time interval;
(2-2) merging the data packets with the same length and direction in the network data packet set into the same category, and counting the number of data packets of each category in each group of the network data packet set;
(2-3) respectively calculating the proportion of the number of data packets of each category in each group of the network data packet set to the total number of data packets in that group, and recording the proportion as the probability of the corresponding network data packet two-tuple, thereby obtaining the probability distribution of the data packets of different categories, namely the characteristics of the network data packet set;
(3) traversing all the Internet of things equipment to be identified, returning to the step (1), obtaining the characteristics of the network data packet set corresponding to all the Internet of things equipment to be identified, and forming a network data packet set characteristic set;
(4) inputting the network data packet set feature set in the step (3) into a k-nearest neighbor classifier, wherein the distance measurement mode of the k-nearest neighbor classifier is the total variation distance or the Hellinger distance:
total variation distance:
Figure FDA0002845021910000011
Hellinger distance:
Figure FDA0002845021910000012
and the k-nearest neighbor classifier outputs a classification result of the type of the Internet of things equipment to be identified, so that the identification of the Internet of things equipment based on packet length probability distribution and a k-nearest neighbor algorithm is realized.
2. The internet of things equipment identification method according to claim 1, wherein the following steps are further included between the step (3) and the step (4):
(1) inputting the network data packet set feature set into a DBSCAN clustering algorithm, wherein the distance measurement mode of the DBSCAN clustering algorithm is the same as the distance measurement mode of the k-nearest neighbor classifier in the step (4) of claim 1, and the DBSCAN clustering algorithm outputs network data packet set feature clusters and feature outliers after the feature set is clustered;
(2) calculating the geometric center point of each cluster obtained in the step (1);
(3) inputting the feature outliers in step (1) and the geometric center points in step (2) as a new feature set into the k-nearest neighbor classifier of step (4) of claim 1.
3. An Internet of things equipment identification method based on packet length probability distribution and a k-nearest neighbor algorithm is characterized by comprising the following steps:
(1) collecting the traffic of the Internet of things equipment to be identified in real time to obtain a network data packet set, wherein each element in the network data packet set is a two-tuple consisting of the length and the direction of a network data packet;
(2) performing feature extraction on the network data packet set in the step (1), wherein the feature extraction method comprises the following steps:
(2-1) dividing the network data packet set into a plurality of groups according to a set time interval;
(2-2) merging the data packets with the same length and direction in the network data packet set into the same category, and counting the number of data packets of each category in each group of the network data packet set;
(2-3) respectively calculating the proportion of the number of data packets of each category in each group of the network data packet set to the total number of data packets in that group, and recording the proportion as the probability of the corresponding network data packet two-tuple, thereby obtaining the probability distribution of the data packets of different categories, namely the characteristics of the network data packet set;
(3) traversing all the Internet of things equipment to be identified, returning to the step (1), obtaining the characteristics of the network data packet set corresponding to all the Internet of things equipment to be identified, and forming a network data packet set characteristic set;
(4) marking the characteristics of the network data packet set in the step (2), wherein the marking content is the type of the to-be-identified Internet of things equipment generating the Internet of things traffic;
(5) traversing all the Internet of things equipment to be identified, repeating the step (1), the step (2) and the step (4) to obtain the characteristics of the network data packet set corresponding to all the Internet of things equipment to be identified, and forming a network data packet set characteristic set containing the type of the Internet of things equipment to be identified;
(6) inputting the network data packet set feature set of the step (3) and the network data packet set feature set containing the type of the Internet of things equipment to be identified of the step (5) into a k-nearest neighbor classifier, wherein the distance measurement mode of the k-nearest neighbor classifier is the total variation distance or the Hellinger distance:
total variation distance:
Figure FDA0002845021910000021
Hellinger distance:
Figure FDA0002845021910000022
and the k-nearest neighbor classifier outputs the classification result of the network data packet set feature set in the step (3);
(7) taking the classification result of the network data packet set feature set in the step (6) as a mark of the network data packet set feature set in the step (3) to obtain a network data packet set feature set containing the type of the equipment of the Internet of things to be identified;
(8) merging the network data packet set characteristic set containing the type of the equipment of the Internet of things to be identified in the step (5) and the network data packet set characteristic set containing the type of the equipment of the Internet of things to be identified in the step (7) to obtain a final network data packet set characteristic set;
(9) inputting the final network data packet set feature set in the step (8) into the k-nearest neighbor classifier in the step (6), and outputting an identification result, so that the Internet of things equipment identification based on packet length probability distribution and a k-nearest neighbor algorithm is realized.
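The feature extraction of steps (2-1) to (2-3) recited in the claims above can be sketched in Python as follows. The sketch assumes that each captured packet also carries a timestamp in seconds, which is needed to split the set by time interval but is not part of the two-tuple named in the claims; the function name `extract_features` and the direction strings are hypothetical.

```python
from collections import Counter, defaultdict


def extract_features(packets, interval):
    """Packet-length probability distribution per time window.

    packets: iterable of (timestamp_seconds, length, direction) tuples.
    interval: window size in seconds, e.g. 300 for 5 minutes.
    Returns {window_index: {(length, direction): probability}}.
    """
    # Steps (2-1) and (2-2): group by window, count each (length, direction).
    counts = defaultdict(Counter)
    for timestamp, length, direction in packets:
        counts[int(timestamp // interval)][(length, direction)] += 1

    # Step (2-3): turn the counts of each window into proportions.
    features = {}
    for window, counter in counts.items():
        total = sum(counter.values())
        features[window] = {key: n / total for key, n in counter.items()}
    return features


# Hypothetical capture from one device, split into 5-minute windows.
packets = [
    (0.0, 60, "out"),
    (1.0, 60, "out"),
    (2.0, 1500, "in"),
    (3.0, 60, "out"),
    (301.0, 100, "in"),
]
features = extract_features(packets, interval=300)
```

Each window yields one sparse distribution, so a single device contributes one feature per sampling interval; categories that never occur in a window are simply absent and are treated as probability zero when distances are computed.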
CN202011506245.3A 2020-12-18 2020-12-18 Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm Active CN112633353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011506245.3A CN112633353B (en) 2020-12-18 2020-12-18 Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011506245.3A CN112633353B (en) 2020-12-18 2020-12-18 Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm

Publications (2)

Publication Number Publication Date
CN112633353A true CN112633353A (en) 2021-04-09
CN112633353B CN112633353B (en) 2022-06-24

Family

ID=75317350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011506245.3A Active CN112633353B (en) 2020-12-18 2020-12-18 Internet of things equipment identification method based on packet length probability distribution and k nearest neighbor algorithm

Country Status (1)

Country Link
CN (1) CN112633353B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150296043A1 (en) * 2014-04-15 2015-10-15 Smarty Lab Co., Ltd. DYNAMIC IDENTIFICATION SYSTEM AND METHOD FOR IoT DEVICES
CN109474691A (en) * 2018-12-03 2019-03-15 北京神州绿盟信息安全科技股份有限公司 A kind of method and device of internet of things equipment identification
CN110445689A (en) * 2019-08-15 2019-11-12 平安科技(深圳)有限公司 Identify the method, apparatus and computer equipment of internet of things equipment type
CN111026090A (en) * 2019-12-26 2020-04-17 浙江力石科技股份有限公司 Internet of things equipment fault identification method, system and device and storable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MARKUS MIETTINEN 等: "IoT Sentinel Demo: Automated Device-Type Identification for Security Enforcement in IoT", 《2017 IEEE 37TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS)》 *
张立 等: "物联网终端智能识别系统设计与实现", 《重庆邮电大学学报(自然科学版)》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113726809A (en) * 2021-09-07 2021-11-30 国网湖南省电力有限公司 Internet of things equipment identification method based on flow data
CN113726809B (en) * 2021-09-07 2023-07-18 国网湖南省电力有限公司 Internet of things equipment identification method based on flow data
CN114615020A (en) * 2022-02-15 2022-06-10 中国人民解放军战略支援部队信息工程大学 Method and system for quickly identifying network equipment based on feature reduction and dynamic weighting
CN114615020B (en) * 2022-02-15 2023-05-26 中国人民解放军战略支援部队信息工程大学 Method and system for rapidly identifying network equipment based on feature reduction and dynamic weighting

Also Published As

Publication number Publication date
CN112633353B (en) 2022-06-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant