CN117579397A - Internet of things privacy leakage detection method and device based on small sample ensemble learning - Google Patents

Internet of things privacy leakage detection method and device based on small sample ensemble learning

Info

Publication number
CN117579397A
CN117579397A
Authority
CN
China
Prior art keywords
value
candidate
flow
solution
random
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410064005.4A
Other languages
Chinese (zh)
Other versions
CN117579397B (en)
Inventor
王滨
周少鹏
方璐
毕志城
鲁天阳
朱伟康
刘帅
王旭
张峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202410064005.4A
Publication of CN117579397A
Application granted
Publication of CN117579397B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/20Network architectures or network communication protocols for network security for managing network security; network security policies in general
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/12Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Security & Cryptography (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • General Health & Medical Sciences (AREA)

Abstract

The application provides an internet of things privacy disclosure detection method and device based on small sample ensemble learning. The method comprises the following steps: acquiring an original data set, wherein the original data set comprises a plurality of network flows; preprocessing the plurality of network flows in the original data set to obtain a plurality of flow characteristics; selecting a plurality of candidate flow characteristics from all flow characteristics, wherein each candidate flow characteristic comprises a plurality of flow characteristic values; for each candidate flow characteristic, performing dimension reduction on the plurality of flow characteristic values of the candidate flow characteristic to obtain a target flow characteristic, wherein the target flow characteristic comprises part of the characteristic values of the candidate flow characteristic; and sending a plurality of target flow characteristics corresponding to the plurality of candidate flow characteristics to the management equipment, where the management equipment trains at least two classification models based on the plurality of target flow characteristics, and the at least two classification models are used for detecting whether privacy leakage exists in the internet of things equipment. Through the technical scheme of the application, whether privacy information leakage exists in the internet of things equipment can be detected, and the security of data is ensured.

Description

Internet of things privacy leakage detection method and device based on small sample ensemble learning
Technical Field
The application relates to the technical field of information security, in particular to an internet of things privacy disclosure detection method and device based on small sample ensemble learning.
Background
The internet of things (Internet of Things, IoT for short) refers to collecting, in real time, any object or process that needs to be connected and interacted with, through various devices and technologies such as information sensors, radio frequency identification, global positioning systems, infrared sensors and laser scanners; gathering information such as sound, light, heat, electricity, mechanics, chemistry, biology and position; and accessing it through all kinds of possible networks, so as to realize ubiquitous connection between objects and between objects and people, and to realize intelligent perception, identification and management of objects and processes. The internet of things is an information carrier based on the internet, telecommunication networks and the like, which allows all ordinary physical objects that can be independently addressed to form an interconnected network.
Any device in the internet of things can be an internet of things device; internet of things devices may include smart home devices (such as smart speakers, robotic vacuum cleaners, smart home gateways, etc.), industrial intelligent gateways, life safety devices, and the like. With the rapid development of the internet of things, more and more attacks target internet of things devices. An attacker can attack an internet of things device to obtain sensitive information or private information from it, so that the sensitive information or private information is leaked and the data is exposed to potential security risks.
Disclosure of Invention
In view of the above, the application provides an internet of things privacy disclosure detection method and device based on small sample ensemble learning, which can detect whether the internet of things equipment has privacy information disclosure or not and ensure the security of data.
The application provides an internet of things privacy leakage detection method based on small sample ensemble learning, which is applied to internet of things equipment and comprises the following steps:
acquiring an original data set, wherein the original data set comprises a plurality of network flows, and the network flows are mirror image network flows when the internet of things equipment sends the network flows to external equipment;
preprocessing a plurality of network flows in the original data set to obtain a plurality of flow characteristics; selecting a plurality of candidate flow characteristics from all flow characteristics, wherein each candidate flow characteristic comprises a plurality of flow characteristic values;
for each candidate flow characteristic, performing dimension reduction on a plurality of flow characteristic values of the candidate flow characteristic to obtain a target flow characteristic, wherein the target flow characteristic comprises part of characteristic values of the candidate flow characteristic;
transmitting a plurality of target flow characteristics corresponding to the plurality of candidate flow characteristics to a management device, and training at least two classification models by the management device based on the plurality of target flow characteristics;
The at least two classification models are used for detecting whether privacy leakage exists in the internet of things equipment.
The application provides an internet of things privacy leakage detection device based on small sample ensemble learning, the device is applied to internet of things equipment, and the device includes:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring an original data set, the original data set comprises a plurality of network traffic which are mirror image network traffic when the internet of things equipment sends the network traffic to external equipment;
the processing module is used for preprocessing a plurality of network flows in the original data set to obtain a plurality of flow characteristics; selecting a plurality of candidate flow characteristics from all flow characteristics, wherein each candidate flow characteristic comprises a plurality of flow characteristic values; for each candidate flow characteristic, performing dimension reduction on a plurality of flow characteristic values of the candidate flow characteristic to obtain a target flow characteristic, wherein the target flow characteristic comprises part of characteristic values of the candidate flow characteristic;
the sending module is used for sending a plurality of target flow characteristics corresponding to the candidate flow characteristics to the management equipment, and the management equipment trains at least two classification models based on the target flow characteristics;
The at least two classification models are used for detecting whether privacy leakage exists in the internet of things equipment.
The application provides an electronic device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is used for executing machine executable instructions to realize the detection method for privacy disclosure of the Internet of things based on small sample ensemble learning.
The present application provides a machine-readable storage medium storing machine-executable instructions executable by a processor; the processor is configured to execute the machine-executable instruction to implement the method for detecting privacy disclosure of the internet of things based on small sample ensemble learning.
The application provides a computer program which is stored in a machine-readable storage medium, and when being executed by a processor, the computer program causes the processor to realize the method for detecting privacy leakage of the internet of things based on small sample ensemble learning.
According to the technical scheme, in the embodiment of the application, the at least two classification models are trained, and whether the Internet of things equipment has privacy leakage or not is detected through the at least two classification models, so that whether the Internet of things equipment has privacy information leakage or not can be detected, the data of the Internet of things equipment are protected, the leakage of sensitive information or privacy information of the Internet of things equipment is reduced, and the safety of the data is guaranteed. And selecting part of flow characteristics from all the flow characteristics as candidate flow characteristics, and performing dimension reduction on a plurality of flow characteristic values of the candidate flow characteristics to obtain target flow characteristics, so that a data set of a small sample is constructed, and can represent all network flows of an original data set, thereby improving the processing speed of a classification model and reducing the training complexity of the classification model. The mode of detecting whether privacy leakage exists in the internet of things equipment based on at least two classification models is called ensemble learning, and prediction performance of the models can be improved through ensemble learning.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments of the present application or by the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from these drawings.
Fig. 1 is a flow diagram of an internet of things privacy disclosure detection method based on small sample ensemble learning;
fig. 2 is a flow chart of an internet of things privacy disclosure detection method based on small sample ensemble learning;
fig. 3 is a schematic structural diagram of an internet of things privacy disclosure detection device based on small sample ensemble learning;
fig. 4 is a hardware configuration diagram of an electronic device in an embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations including one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Furthermore, depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
The embodiment of the application provides a method for detecting privacy leakage of the internet of things based on small sample ensemble learning, which can be applied to internet of things equipment. Referring to fig. 1, which is a flow diagram of the method, the method comprises the following steps:
step 101, acquiring an original data set, where the original data set may include a plurality of network traffic, where the plurality of network traffic is mirrored when the internet of things device sends the network traffic to an external device.
Step 102, preprocessing a plurality of network flows in the original data set to obtain a plurality of flow characteristics.
Step 103, selecting a plurality of candidate flow characteristics from all the flow characteristics (namely selecting a part of the flow characteristics as candidate flow characteristics), wherein each candidate flow characteristic comprises a plurality of flow characteristic values.
Step 104, for each candidate flow characteristic, performing dimension reduction on a plurality of flow characteristic values of the candidate flow characteristic to obtain a target flow characteristic, wherein the target flow characteristic comprises part of characteristic values of the candidate flow characteristic.
Step 105, a plurality of target flow characteristics corresponding to the candidate flow characteristics are sent to the management device, and the management device trains at least two classification models based on the target flow characteristics. Illustratively, at least two classification models are used to detect whether privacy disclosure exists for the internet of things device.
Illustratively, preprocessing a plurality of network traffic in an original dataset to obtain a plurality of traffic characteristics may include, but is not limited to: deleting abnormal network traffic in the original data set to obtain a first data set; converting the non-numerical data in the first data set into numerical data to obtain a second data set; normalizing the numerical data in the second data set to obtain a third data set, wherein the third data set comprises a plurality of network flows, and each network flow comprises normalized numerical data; and selecting a plurality of network traffic without privacy disclosure from all network traffic in the third data set, and determining a plurality of traffic characteristics corresponding to the plurality of network traffic.
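As an illustration only, the preprocessing steps listed above (cleaning, conversion of non-numerical data, normalization, selection of flows without privacy disclosure) can be sketched in Python; the pandas/scikit-learn APIs, the column layout and the label convention (0 = no privacy disclosure) are assumptions of the sketch and are not part of the claimed method.

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, MinMaxScaler

    def preprocess(raw: pd.DataFrame, label_col: str = "privacy_label") -> pd.DataFrame:
        """Sketch of the preprocessing: clean, encode, normalize, keep positive samples."""
        # 1) Delete abnormal traffic: drop null values and repeated records -> first data set
        df = raw.dropna().drop_duplicates()

        # 2) Convert non-numerical (symbolic) columns into numerical data -> second data set
        for col in df.columns:
            if col != label_col and df[col].dtype == object:
                df[col] = LabelEncoder().fit_transform(df[col].astype(str))

        # 3) Min-max normalize every numerical column into [0, 1] -> third data set
        feature_cols = [c for c in df.columns if c != label_col]
        df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])

        # 4) Keep only flows without privacy disclosure (assumed label 0) for training
        return df[df[label_col] == 0].drop(columns=[label_col])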
Illustratively, selecting a plurality of candidate flow characteristics from all flow characteristics may include, but is not limited to: clustering all flow characteristics through a clustering algorithm to obtain a plurality of clustering clusters; for each cluster, the cluster includes a portion of the flow features of all of the flow features, the centroid of the cluster is an average of all of the flow features within the cluster, and a distance between each of the flow features within the cluster and the centroid of the cluster is less than or equal to a distance threshold. For each cluster, the centroid of the cluster may be determined as a candidate flow feature, or a flow feature within the cluster that is the smallest distance from the centroid may be determined as a candidate flow feature.
Illustratively, selecting a plurality of candidate flow characteristics from all flow characteristics may include, but is not limited to: generating N initial random solutions, wherein each initial random solution comprises K numerical solutions corresponding to K flow characteristics, K is the total number of the flow characteristics, and N and K are positive integers; for each numerical solution, the numerical solution may be a first value indicating that the initial random solution does not select the corresponding flow characteristic or a second value indicating that the initial random solution selects the corresponding flow characteristic. Selecting an initial random solution with the minimum fitness value from N initial random solutions as a current optimal random solution based on the fitness value corresponding to each initial random solution; and optimizing each initial random solution based on the current optimal random solution to obtain N candidate random solutions corresponding to the N initial random solutions. Generating a target random solution based on the N candidate random solutions, and selecting a plurality of candidate flow characteristics from all flow characteristics based on the target random solution; the flow characteristics corresponding to the second valued numerical solution in the target random solution can be used as candidate flow characteristics.
Illustratively, optimizing each initial random solution based on the current optimal random solution to obtain N candidate random solutions corresponding to the N initial random solutions may include, but is not limited to: determining a current solution variation amount based on the initial random solution, the current optimal random solution, the uniform random number and an initial solution variation amount corresponding to the initial random solution for each initial random solution; determining an intermediate random solution based on the initial random solution and the current solution variation; if the solution update ending condition is satisfied, the intermediate random solution can be used as a candidate random solution corresponding to the initial random solution; if the solution update end condition is not met, based on the fitness value corresponding to each intermediate random solution, the intermediate random solution with the smallest fitness value is used as the current optimal random solution, and the current optimal random solution is subjected to local search to obtain an updated current optimal random solution; and updating the intermediate random solution into an initial random solution, and then returning to execute the operation of determining the current solution variation based on the initial random solution, the current optimal random solution, the uniform random number and the initial solution variation corresponding to the initial random solution.
Illustratively, generating a target random solution based on the N candidate random solutions may include, but is not limited to: removing at least one candidate random solution with a large fitness value based on the fitness value corresponding to each candidate random solution; determining a local preferred solution based on the remaining candidate random solutions; optimizing the rest candidate random solutions based on the local preferred solutions to obtain optimized intermediate random solutions; if the output condition is met, selecting an optimal random solution from the intermediate random solutions, and carrying out local search on the optimal random solution to obtain a target random solution; and if the output condition is not met, taking the intermediate random solution as a candidate random solution, and returning to execute the operation of removing at least one candidate random solution with a large fitness value based on the fitness value corresponding to each candidate random solution.
Illustratively, the dimension reduction of the plurality of flow characteristic values of the candidate flow characteristic to obtain the target flow characteristic may include, but is not limited to: performing at least one of de-duplication dimension reduction, de-correlation dimension reduction and specified-type feature dimension reduction on the plurality of flow characteristic values of the candidate flow characteristic to obtain the target flow characteristic. De-duplication dimension reduction means: if the similarity between a first flow characteristic value and a second flow characteristic value is greater than a similarity threshold, the second flow characteristic value is removed from the candidate flow characteristic. De-correlation dimension reduction means: if the correlation between a third flow characteristic value and a fourth flow characteristic value is greater than a correlation threshold, the fourth flow characteristic value is removed from the candidate flow characteristic. Specified-type feature dimension reduction means: if the feature type of a fifth flow characteristic value is in a non-privacy list, the fifth flow characteristic value is removed from the candidate flow characteristic; the specified type is a feature type without privacy information, and the non-privacy list is used for recording feature types without privacy information.
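A minimal sketch of the three dimension-reduction operations is given below, interpreting the per-value similarity and correlation tests column-wise over a matrix of candidate flow characteristics; the cosine-similarity and Pearson-correlation measures, the thresholds and the contents of the non-privacy list are assumptions, since the text does not fix them.

    import numpy as np

    def reduce_dimensions(X: np.ndarray, feature_types: list[str],
                          non_privacy_list: set[str],
                          sim_thr: float = 0.99, corr_thr: float = 0.95):
        """Sketch of de-duplication, de-correlation and removal of non-privacy
        feature types, applied column-wise (rows = candidate flow characteristics)."""
        keep = list(range(X.shape[1]))

        # Specified-type reduction: drop columns whose type carries no privacy information
        keep = [j for j in keep if feature_types[j] not in non_privacy_list]

        # De-duplication: drop a column almost identical (cosine similarity) to a kept one
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        dedup = []
        for j in keep:
            if all(cos(X[:, j], X[:, k]) <= sim_thr for k in dedup):
                dedup.append(j)

        # De-correlation: drop a column whose Pearson correlation with a kept one is too high
        final = []
        for j in dedup:
            if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= corr_thr for k in final):
                final.append(j)

        return X[:, final], [feature_types[j] for j in final]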
The method may further include: after sending the plurality of target traffic characteristics corresponding to the plurality of candidate traffic characteristics to the management device, receiving at least two classification models sent by the management device, and detecting whether privacy leakage exists in the internet of things device based on the at least two classification models. Detecting whether the internet of things device has privacy disclosure based on the at least two classification models may include, but is not limited to: inputting the network traffic to be detected of the internet of things device into each classification model to obtain the classification result output by each classification model, where the classification result may be a first value or a second value, the first value indicating that privacy leakage exists and the second value indicating that privacy leakage does not exist. On this basis: if the number of first values is larger than the number of second values, it can be determined that privacy leakage exists in the internet of things device, and if the number of first values is smaller than the number of second values, it can be determined that privacy leakage does not exist in the internet of things device; here the number of classification models is odd. Or, a first score value is determined based on the weight coefficient value of each classification model outputting the first value, and a second score value is determined based on the weight coefficient value of each classification model outputting the second value; if the first score value is larger than the second score value, it can be determined that privacy leakage exists in the internet of things device, and if the first score value is smaller than the second score value, it can be determined that privacy leakage does not exist in the internet of things device.
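The two decision rules just described (simple majority over an odd number of models, and weighted scoring) can be sketched as follows; the weight values in the example are illustrative only.

    def majority_vote(results: list[int]) -> bool:
        """results: classification result per model, 1 = leakage, 0 = no leakage.
        Assumes an odd number of models so a tie cannot occur."""
        ones = sum(r == 1 for r in results)
        return ones > len(results) - ones

    def weighted_vote(results: list[int], weights: list[float]) -> bool:
        """Score the 1-votes and the 0-votes with each model's weight coefficient."""
        score_leak = sum(w for r, w in zip(results, weights) if r == 1)
        score_safe = sum(w for r, w in zip(results, weights) if r == 0)
        return score_leak > score_safe

    # Illustrative weights in decreasing order (e.g. IF > LOF > OC-SVM, as in the
    # first classification mode described below)
    print(weighted_vote([1, 0, 1], [0.5, 0.3, 0.2]))  # True -> privacy leakage suspected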
Illustratively, the at least two classification models may include, but are not limited to: an IF classification model, an OC-SVM classification model, and an LOF classification model; IF the network flow in the original data set supports the first classification mode, the weight coefficient value of the IF classification model is larger than that of the LOF classification model, and the weight coefficient value of the LOF classification model is larger than that of the OC-SVM classification model; IF the network traffic in the original data set supports the second classification mode, the weight coefficient value of the OC-SVM classification model is greater than the weight coefficient value of the IF classification model, and the weight coefficient value of the IF classification model is greater than the weight coefficient value of the LOF classification model.
The first classification mode may indicate that the number of classification labels of the network traffic is greater than a label-number threshold and the circulation count of the network traffic is greater than a circulation-count threshold; the second classification mode may indicate that the number of classification labels of the network traffic is not greater than the label-number threshold and the circulation count of the network traffic is not greater than the circulation-count threshold.
Illustratively, for each classification model, the management device trains the classification model based on the target traffic characteristics, which may include, but is not limited to: inputting the target flow characteristics into a model to be trained to obtain a prediction classification result output by the model to be trained; taking a predicted classification result of the target flow characteristics and a real classification result of the target flow characteristics as input parameters of a fitness function to obtain loss values corresponding to the target flow characteristics; adjusting network parameters of the model to be trained based on the loss value to obtain an adjusted model; if the adjusted model is converged, the adjusted model is used as a classification model; and if the adjusted model is not converged, taking the adjusted model as a model to be trained, and returning to execute the operation of inputting the target flow characteristics into the model to be trained.
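A sketch of this training loop in the abstract is given below; the model, optimizer and convergence criterion are placeholders, since the text does not specify them.

    def train_classifier(model, features, labels, fitness_fn, optimizer,
                         max_rounds=100, tol=1e-4):
        """Sketch only: predict -> loss from the fitness function -> adjust
        parameters -> stop once the adjusted model has converged."""
        prev_loss = float("inf")
        for _ in range(max_rounds):
            predictions = model.predict(features)      # predicted classification results
            loss = fitness_fn(predictions, labels)     # compare with real classification results
            optimizer.step(model, loss)                # adjust network parameters of the model
            if abs(prev_loss - loss) < tol:            # placeholder convergence criterion
                break
            prev_loss = loss
        return model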
According to the above technical scheme, in the embodiment of the application, at least two classification models are trained, and whether privacy leakage exists in the internet of things equipment is detected through the at least two classification models, so that privacy information leakage of the internet of things equipment can be detected; when privacy leakage exists, the internet of things equipment can be securely upgraded or taken offline, so that the data of the internet of things equipment are protected, leakage of sensitive or private information of the internet of things equipment is reduced, and data security is ensured. In addition, part of the flow characteristics are selected from all flow characteristics as candidate flow characteristics, and dimension reduction is performed on the plurality of flow characteristic values of the candidate flow characteristics to obtain target flow characteristics, so that a small-sample data set is constructed that can still represent all network flows of the original data set, thereby improving the processing speed of the classification models and reducing their training complexity. Detecting whether privacy leakage exists in the internet of things equipment based on at least two classification models is called ensemble learning, and the prediction performance of the models can be improved through ensemble learning.
The following describes the technical solution of the embodiment of the present application in conjunction with a specific application scenario.
In the embodiment of the present application, a method for detecting privacy leakage of internet of things based on small sample ensemble learning is provided, and referring to fig. 2, which is a schematic flow chart of the method, the method may include the following steps:
step 201, the internet of things device acquires an original data set, where the original data set may include a plurality of network traffic, and the plurality of network traffic is mirrored network traffic when the internet of things device sends the network traffic to an external device.
The internet of things equipment may also be referred to as an internet of things terminal, and may include, but is not limited to, smart home devices (such as smart speakers, robotic vacuum cleaners, smart home gateways, etc.), industrial intelligent gateways, life safety devices, cameras, and the like; this embodiment does not limit the type of internet of things equipment.
For the internet of things device, if the internet of things device needs to perform internet of things privacy disclosure detection, the internet of things device can adopt the internet of things privacy disclosure detection method to realize detection.
For example, an internet of things device that needs to perform privacy disclosure detection may mirror the network traffic each time it sends network traffic to an external device; that is, the device mirrors the traffic and stores the mirrored copy in the original data set, so that a mirrored copy is added to the original data set for every transmission, and the original data set may include a plurality of network flows.
For each network flow, the network traffic may be network traffic data based on the HTTP protocol, the HTTPS protocol, an internet of things protocol, etc., where the network traffic data may be text data or other types of data. For example, the network traffic data may include device information, user information, key information, case information, etc.; the content of the data is not limited.
In summary, an original data set may be obtained, and the original data set includes a plurality of network traffic.
Step 202, when the original data set meets training conditions of the classification model, the internet of things device may preprocess a plurality of network flows in the original data set to obtain a plurality of flow characteristics.
For example, if the number of network traffic in the original data set reaches the number threshold, the original data set satisfies the training condition of the classification model, or if the collection duration of the network traffic in the original data set reaches the duration threshold, the original data set satisfies the training condition of the classification model, and the training condition is not limited.
When the original data set meets the training conditions of the classification model, the preprocessing of the original data set can be performed, such as data cleaning, de-duplication, data conversion and data normalization; the preprocessing process is not limited.
For example, when the internet of things device preprocesses the original data set, the following steps may be adopted:
step 2021, deleting the abnormal network traffic in the original data set to obtain the first data set.
For example, data cleaning and de-duplication may be performed on the original data set; data cleaning and de-duplication refer to operations such as removing null values and removing repeated records from the original data set, where null values and repeated records are treated as abnormal network traffic. That is, the first data set is obtained by deleting the abnormal network traffic in the original data set.
By deleting the abnormal network traffic in the original data set, the influence of the abnormal network traffic on the training of the classification model is avoided, namely the performance reduction of the classification model caused by the abnormal network traffic is avoided.
For example, data cleaning and de-duplication are optional steps: if data cleaning and de-duplication are performed, the abnormal network traffic in the original data set is deleted to obtain the first data set; if they are not performed, the original data set is directly used as the first data set. In summary, a first data set may be obtained.
Step 2022, converting the non-numeric data in the first data set into numeric data to obtain a second data set.
For example, the first data set may be subjected to data conversion, where the data conversion refers to converting non-numeric data into numeric data, where the non-numeric data may also be referred to as symbol data, where the symbol data may be a character, a character string, a chinese character, a punctuation mark, a special symbol, and the like, and the data conversion process is not limited.
For example, the data conversion is an optional step, if the data conversion is performed, non-numeric data in the first data set is converted into numeric data to obtain the second data set, and if the data conversion is not performed, the first data set is directly used as the second data set. In summary, a second data set may be obtained.
Step 2023, normalizing the numerical data in the second data set to obtain a third data set, where the third data set includes a plurality of network flows, and each network flow includes normalized numerical data.
For example, the second data set may be data normalized, where data normalization refers to scaling the data values to within a preset range, such as scaling the data values to within the [0,1] range. By scaling the data values to the preset range, the third data set can be prevented from being biased towards features with larger values, the influence of features with larger values on the training of the classification model is avoided, performance degradation of the classification model is avoided, and the performance of the classification model is improved.
For example, the second data set may include a plurality of network traffic, each network traffic being a network traffic of numerical data, and the numerical data may include a plurality of data values, and thus, for each network traffic, the network traffic may include a plurality of data values. Based on this, each data value of the network traffic may be data normalized, e.g., each data value of the network traffic may be scaled to within a range of [0,1 ]. After the above-mentioned processing is performed on all network traffic in the second data set, a third data set may be obtained, i.e. the third data set comprises a plurality of network traffic, and each network traffic comprises normalized numerical data.
In data normalization of each data value of the network traffic, the data value may be normalized using the following formula: x' = (x − x_min) / (x_max − x_min). Wherein x_max represents the maximum data value in the second data set, x_min represents the minimum data value in the second data set, x represents the data value to be normalized, and x' represents the normalized data value corresponding to x. Based on the above processing, all numerical data in the second data set can be normalized to the range [0,1].
For example, the data normalization is an optional step, if the data normalization is performed, the numerical data in the second data set is normalized to obtain a third data set, and if the data normalization is not performed, the second data set is directly used as the third data set. In summary, a third data set may be obtained.
Step 2024, selecting a plurality of network traffic (i.e. network traffic of the positive sample) without privacy disclosure from all the network traffic of the third data set, and determining a plurality of traffic characteristics corresponding to the plurality of network traffic.
For example, the third data set may include a plurality of network traffic, the tag values of which may be distinguished, the network traffic may be referred to as a positive sample for network traffic for which there is no privacy disclosure (i.e., the network traffic has no privacy information, no privacy disclosure occurs), and the tag value of the network traffic may be 0. For network traffic with privacy disclosure (i.e., the network traffic has privacy information, privacy disclosure occurs), the network traffic may be referred to as a negative sample, and the tag value of the network traffic may be 1.
On the basis, the network traffic without privacy disclosure can be selected from all the network traffic in the third data set, and training of the classification model is performed based on the network traffic without privacy disclosure. For example, since the classification model is trained in the early stage of the network life cycle, the normal network traffic is the type that should be learned, so for the training stage of the classification model, in order to reduce the number of records and speed up the training, the network traffic without privacy disclosure may be selected from all the network traffic in the third data set. After the classification model training is completed, the classification model can be optimized, and in the optimization process of the classification model, normal network traffic and privacy disclosure network traffic can be distinguished, so that the optimization process is not limited.
For each network flow without privacy disclosure, the flow characteristic corresponding to that network flow is determined. For example, the network flow includes normalized numerical data (i.e., a plurality of data values), and the network flow can be converted into a flow vector, i.e., the data values are vectorized to obtain the flow vector corresponding to the network flow; this process is not limited. After the flow vector is obtained, the flow characteristic corresponding to the flow vector can be determined: for example, the flow vector can be used directly as the flow characteristic, or the flow characteristic can be obtained after processing the flow vector with some algorithm; this process is not limited either.
Obviously, for a plurality of network traffic without privacy disclosure, after each network traffic is processed in the above manner, a plurality of traffic characteristics corresponding to the plurality of network traffic can be obtained. Wherein for each flow characteristic the flow characteristic may comprise a plurality of flow characteristic values.
For example, a plurality of network traffic without privacy disclosure may be selected from all network traffic in the third data set, and a plurality of traffic characteristics corresponding to the plurality of network traffic may be determined. Alternatively, instead of selecting the network traffic from the third data set, traffic characteristics corresponding to all the network traffic of the third data set (i.e., the network traffic without privacy disclosure and the network traffic with privacy disclosure) may be determined. In summary, a plurality of flow characteristics may be derived and used to train the classification model.
Step 203, the internet of things device selects a plurality of candidate flow characteristics from all the flow characteristics.
For example, in order to reduce the training complexity of the classification model, reduce the training time of the classification model, reduce the resource overhead of the classification model, and reduce the feature quantity used for training the classification model, feature selection may be performed on all flow features, that is, a plurality of flow features may be selected from all flow features as candidate flow features. By means of feature selection, irrelevant features or attributes can be removed, features which obviously influence data classification are reserved, classification accuracy can be improved by feature selection, and learning of a classification model is quickened.
In this embodiment, in order to implement feature selection, a clustering manner may be used to select representative candidate flow features from all flow features, and/or a local search algorithm may be used to select representative candidate flow features from all flow features. Since the number of candidate flow features is relatively small and the number of candidate flow features can be controlled, the candidate flow features are referred to as a dataset of small samples.
In one possible implementation manner, when a plurality of candidate flow characteristics are selected from all flow characteristics in a clustering manner, all flow characteristics can be clustered through a clustering algorithm to obtain a plurality of clusters. For example, a K-means clustering algorithm is adopted to cluster all flow characteristics to obtain K clusters, where K is a positive integer. For example, if 50 candidate flow features are required, the value of K is configured to be 50; if 80 candidate flow features are required, the value of K is configured to be 80, and so on. Based on the value of K, after all flow characteristics are input into the K-means clustering algorithm, the K-means clustering algorithm can cluster all flow characteristics to obtain K clusters; the feature clustering process is not limited.
For example, based on the K-means clustering algorithm, K random centroids are first selected, one for each of the K clusters, and each feature point is assigned to the cluster whose centroid is closest to it according to the similarity (e.g., distance-based similarity) between the feature point and each centroid. After all feature points have been assigned to their closest clusters, a new centroid can be calculated for each cluster, the centroid representing the average of all feature points in the cluster. Then, the process of assigning feature points to the new centroids and recalculating the centroids is repeated until the values of the centroids are stable. Thus, K clusters can be obtained.
After clustering all flow characteristics through a clustering algorithm to obtain a plurality of clusters, for each cluster, the cluster can comprise a plurality of flow characteristics, the mass center of the cluster is the average value of all flow characteristics in the cluster, and the distance between each flow characteristic in the cluster and the mass center of the cluster is smaller than or equal to a distance threshold. For example, all traffic characteristics may be clustered into cluster 1, cluster 2, cluster 3, cluster 1 may include a plurality of traffic characteristics, and the centroid of cluster 1 is the average of all traffic characteristics within cluster 1, and the distance between each traffic characteristic within cluster 1 and the centroid of cluster 1 is less than or equal to a distance threshold. Cluster 2 may include a plurality of flow features, and the centroid of cluster 2 is an average of all flow features within cluster 2, and the distance between each flow feature within cluster 2 and the centroid of cluster 2 is less than or equal to a distance threshold. Cluster 3 may include a plurality of flow features, and the centroid of cluster 3 is an average of all flow features within cluster 3, and the distance between each flow feature within cluster 3 and the centroid of cluster 3 is less than or equal to a distance threshold.
For each cluster, the centroid of the cluster may be determined as a candidate flow feature, or a flow feature within the cluster that is the smallest distance from the centroid may be determined as a candidate flow feature. For example, for cluster 1, the centroid of cluster 1 is determined as candidate traffic feature 1, or the traffic feature within cluster 1 having the smallest distance from the centroid is determined as candidate traffic feature 1. For cluster 2, the centroid of cluster 2 is determined as candidate traffic feature 2, or the traffic feature within cluster 2 that is the smallest distance from the centroid is determined as candidate traffic feature 2. For cluster 3, the centroid of cluster 3 is determined as candidate traffic feature 3, or the traffic feature within cluster 3 that is the smallest distance from the centroid is determined as candidate traffic feature 3.
In summary, the internet of things device may select K candidate flow features from all flow features by using a K-means clustering algorithm, where for each candidate flow feature, the candidate flow feature may include a plurality of flow feature values. For example, the candidate flow characteristic 1 includes a plurality of flow characteristic values, the candidate flow characteristic 2 includes a plurality of flow characteristic values, and the candidate flow characteristic 3 includes a plurality of flow characteristic values.
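A sketch of the clustering-based selection, assuming scikit-learn's KMeans; both options (taking the centroid itself, or taking the real flow feature closest to the centroid) are shown.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_candidates(features: np.ndarray, k: int, use_centroid: bool = False) -> np.ndarray:
        """features: one row per flow feature. Returns k candidate flow features."""
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        if use_centroid:
            # Option 1: the centroid (mean of the cluster) is the candidate feature
            return km.cluster_centers_
        # Option 2: the real flow feature closest to its cluster centroid is the candidate
        candidates = []
        for c in range(k):
            members = np.where(km.labels_ == c)[0]
            dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
            candidates.append(features[members[np.argmin(dists)]])
        return np.vstack(candidates)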
In one possible implementation, when a local search algorithm is used to select a plurality of candidate flow features from all flow features, the following steps may be used to select a plurality of candidate flow features:
step S21, generating N initial random solutions, wherein each initial random solution comprises K numerical solutions corresponding to K flow characteristics, and K is the total number of the flow characteristics. For example, assuming that 100 flow characteristics are obtained in step 202, K is 100. For each of the initial random solutions, the value solution may be a first value (e.g., 0) or a second value (e.g., 1), where the first value may indicate that the initial random solution does not select the corresponding flow characteristic, and the second value may indicate that the initial random solution selects the corresponding flow characteristic.
For example, the initial random solution may include 00001110101 … in turn, each of the numerical solutions in the initial random solution may be randomly generated, the first numerical solution (0) indicating that the initial random solution does not select the first flow characteristic, the fifth numerical solution (1) indicating that the initial random solution selects the fifth flow characteristic, and so on.
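Generating the N initial random solutions amounts to sampling N random 0/1 vectors of length K; a sketch assuming NumPy:

    import numpy as np

    def init_random_solutions(n: int, k: int, rng=None) -> np.ndarray:
        """n initial random solutions, each a length-k vector of 0/1 numerical solutions:
        0 = the corresponding flow feature is not selected, 1 = it is selected."""
        rng = np.random.default_rng() if rng is None else rng
        return rng.integers(0, 2, size=(n, k))

    solutions = init_random_solutions(n=20, k=100)   # e.g. 100 flow features from step 202
    print(solutions[0])                              # something like [0 0 0 0 1 1 1 0 1 ...]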
Step S22, for each initial random solution, calculate the fitness value corresponding to that initial random solution.
For example, the fitness value corresponding to an initial random solution may be calculated using a pre-defined fitness formula. Of course, the formula is merely an example, and the calculation manner of the fitness value is not limited.
In the formula, N represents the total number of initial random solutions and K represents the total number of flow characteristics. The weights α, β and γ are pre-configured empirical values, e.g. α = 0.06, β = 0.48, γ = 0.48; the values of α, β and γ are not limited, for example their sum is 1, and α, β and γ represent the weights of the terms of the formula.
FPR represents the false positive rate corresponding to the initial random solution, and TPR represents the true positive rate corresponding to the initial random solution. Based on a plurality of FPR values and a plurality of TPR values corresponding to the initial random solution, a plurality of sets of input data can be generated (each set of input data includes one FPR value and one TPR value); substituting each set of input data into the formula gives one fitness value, and based on the plurality of fitness values obtained, the minimum fitness value is taken as the fitness value corresponding to the initial random solution.
To obtain the plurality of FPR values and the plurality of TPR values corresponding to an initial random solution, the initial random solution can be input into a network model to obtain the corresponding FPR and TPR values, or other algorithms can be used; there is no limitation in this regard.
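The exact fitness formula is not preserved in the text; the sketch below assumes a common form consistent with the quantities named above, namely a weighted sum of the selected-feature ratio, the false positive rate and 1 − TPR, minimised over the measured (FPR, TPR) pairs. The role of α as the weight of the selected-feature ratio is an assumption.

    import numpy as np

    def fitness(solution: np.ndarray, fpr_tpr_pairs: list[tuple[float, float]],
                alpha: float = 0.06, beta: float = 0.48, gamma: float = 0.48) -> float:
        """solution: 0/1 vector over the K flow features.
        fpr_tpr_pairs: (FPR, TPR) pairs measured for this solution.
        Returns the smallest fitness value over all input pairs (lower is better)."""
        ratio = solution.sum() / len(solution)          # share of selected features (assumed term)
        values = [alpha * ratio + beta * fpr + gamma * (1.0 - tpr)
                  for fpr, tpr in fpr_tpr_pairs]
        return min(values)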
Step S23, based on the fitness value corresponding to each initial random solution, select the initial random solution with the smallest fitness value from the N initial random solutions as the current optimal random solution.
After the current optimal random solution is obtained, the update process for feature selection can be divided into two phases: the first phase is the phase far from the globally optimal solution (the rounds before a preset iteration number is reached), and the second phase is the phase near the globally optimal solution (the rounds after that). Illustratively, in the phase far from the globally optimal solution, each initial random solution can be optimized based on the current optimal random solution to obtain N candidate random solutions corresponding to the N initial random solutions. In the phase near the globally optimal solution, a target random solution can be generated based on the N candidate random solutions. The phase far from the globally optimal solution and the phase near the globally optimal solution are described below.
Step S24, for each initial random solution, determine the intermediate random solution corresponding to that initial random solution.
Illustratively, the current solution variation can be determined based on the initial random solution, the current optimal random solution, a uniform random number and the initial solution variation corresponding to the initial random solution; then, the intermediate random solution is determined based on the initial random solution and the current solution variation. For example, the current solution variation and the intermediate random solution can be determined using a pre-defined update formula, which is, of course, merely an example, and the determination manner is not limited.
In the update formula, the initial solution variation works as follows: in the first round of iteration it is a fixed value that can be configured empirically; in the second round of iteration it is the current solution variation obtained in the first round; in the third round of iteration it is the current solution variation obtained in the second round, and so on. The formula also uses a constant that can be configured empirically, such as 0.2, and a uniform random number in the range [0,1], i.e. a value randomly selected from [0,1]. The current optimal random solution is the solution selected in step S23 (and later updated in step S27). The initial random solution is the solution being updated: in the second round of iteration it is the intermediate random solution obtained in the first round (i.e. the intermediate random solution of the first round serves as the initial random solution of the second round); in the third round of iteration it is the intermediate random solution of the second round, and so on. The current solution variation is the amount by which the solution is updated, and the result of the update is the intermediate random solution corresponding to the initial random solution.
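The sketch below assumes one plausible reading of the update rule: the new solution variation is the previous variation plus a random, constant-scaled pull toward the current optimal random solution, and the intermediate random solution is the initial random solution plus that variation. The rounding back to a 0/1 vector is also an assumption, since the text does not say how the update is mapped to binary numerical solutions.

    import numpy as np

    def update_solution(x: np.ndarray, v_prev: np.ndarray, x_best: np.ndarray,
                        c: float = 0.2, rng=None):
        """One round of step S24 for a single solution.
        x: current (initial) random solution, v_prev: its previous solution variation,
        x_best: current optimal random solution. Returns (intermediate solution, new variation)."""
        rng = np.random.default_rng() if rng is None else rng
        r = rng.uniform(0.0, 1.0)                       # uniform random number in [0, 1]
        v = v_prev + c * r * (x_best - x)               # current solution variation (assumed form)
        x_new = x + v                                   # intermediate random solution
        x_new = (x_new >= 0.5).astype(int)              # assumed binarisation back to 0/1
        return x_new, v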
Step S25, judging whether the solution update end condition is satisfied.
Illustratively, each execution of the operation of determining the intermediate random solution corresponding to each initial random solution (step S24) can be understood as one round of iteration. The number of iterations can be counted, and it can be judged whether the number of iterations has reached a preset iteration threshold; the value of the threshold can be configured empirically and is not limited here. If yes, the solution update end condition is satisfied; if not, the solution update end condition is not satisfied. If the solution update end condition is not satisfied, step S26 can be performed; if the solution update end condition is satisfied, step S29 can be performed.
And step S26, based on the fitness value corresponding to each intermediate random solution, taking the intermediate random solution with the smallest fitness value as the current optimal random solution. For example, after obtaining the intermediate random solutions corresponding to each initial random solution, the fitness value corresponding to each intermediate random solution can be calculated by the above procedure, and then the intermediate random solution with the smallest fitness value is used as the current optimal random solution
And step S27, carrying out local search on the current optimal random solution to obtain an updated current optimal random solution.
Exemplarily, based on the current optimal random solution X*, a local search can be performed by a hill-climbing method or a tabu search method to obtain an updated current optimal random solution. For example, the current optimal random solution X* is mutated bit by bit, the fitness value of the mutated random solution is calculated, and if the fitness value of the mutated random solution is lower than that of the current optimal random solution X*, the current optimal random solution X* is updated to the mutated random solution; otherwise, the old current optimal random solution X* is retained.
For example, the 1st numerical solution of the current optimal random solution X* is first mutated (e.g., 0 is mutated to 1 or 1 is mutated to 0) to obtain a mutated random solution P1. If the fitness value of the mutated random solution P1 is smaller than the fitness value of the current optimal random solution X*, the current optimal random solution X* is updated to the mutated random solution P1; if the fitness value of the mutated random solution P1 is greater than or equal to the fitness value of the current optimal random solution X*, the current optimal random solution X* is kept unchanged. Then, the 2nd numerical solution of the current optimal random solution X* is mutated to obtain a mutated random solution P2. If the fitness value of the mutated random solution P2 is smaller than the fitness value of the current optimal random solution X*, the current optimal random solution X* is updated to the mutated random solution P2; if the fitness value of the mutated random solution P2 is greater than or equal to the fitness value of the current optimal random solution X*, the current optimal random solution X* is kept unchanged. And so on, until all the numerical solutions have been mutated, whereupon the updated current optimal random solution is obtained.
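A minimal sketch of this bit-by-bit local search (step S27) is given below; the hill-climbing variant is shown, and the fitness callable (smaller is better) is an assumed interface.

```python
import numpy as np

def hill_climb(x_best, fitness):
    """Bit-by-bit local search over a 0/1 solution vector.

    x_best  : 1-D 0/1 array, the current optimal random solution
    fitness : callable mapping a 0/1 array to its fitness value (smaller is better)
    """
    x = x_best.copy()
    best_fit = fitness(x)
    for k in range(len(x)):
        candidate = x.copy()
        candidate[k] = 1 - candidate[k]      # mutate the k-th numerical solution (0 <-> 1)
        cand_fit = fitness(candidate)
        if cand_fit < best_fit:              # keep the mutation only if fitness strictly improves
            x, best_fit = candidate, cand_fit
    return x
```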
Step S28, for each initial random solution X_i, the intermediate random solution corresponding to the initial random solution X_i is updated to be the initial random solution X_i, the updated current optimal random solution is taken as the current optimal random solution X*, and the process then returns to step S24 to determine the intermediate random solution corresponding to the initial random solution X_i.
Step S29, for each initial random solution X_i, the intermediate random solution corresponding to the initial random solution X_i is taken as the candidate random solution corresponding to the initial random solution X_i, and step S30 is then executed.
And step S30, removing at least one candidate random solution with a large fitness value based on the fitness value corresponding to each candidate random solution. For example, all candidate random solutions are ranked based on the fitness value corresponding to each candidate random solution, and the half of the candidate random solutions with the largest fitness values are removed. Of course, 1/3, 1/4 or 2/3 of the candidate random solutions with large fitness values may also be removed; it is sufficient to remove at least one of them, without limitation.
And S31, determining a locally preferred solution based on the remaining candidate random solutions.
Illustratively, after going through the phase far from the global optimal solution (steps S24-S29), each candidate random solution approaches the global optimal solution, and the update of each candidate random solution depends on the surrounding solutions, so a locally preferred solution X_local that approximates the optimal solution within the range needs to be found. Therefore, the locally preferred solution X_local can be calculated based on the remaining candidate random solutions. For example, the locally preferred solution may be determined using a formula of the following form; of course, this is merely an example, and the calculation method is not limited thereto:

X_local = w_1*Y_1 + w_2*Y_2 + ... + w_M*Y_M, where the weight w_i of the i-th candidate random solution is determined from its fitness value f(Y_i).

Wherein, X_local represents the locally preferred solution, Y_i represents the i-th candidate random solution, f(Y_i) represents the fitness value of the i-th candidate random solution, and M represents the number of remaining candidate random solutions.
And step S32, optimizing each remaining candidate random solution based on the local preferred solution to obtain an optimized intermediate random solution corresponding to the candidate random solution.
For example, the candidate random solution may be optimized to obtain an intermediate random solution using the following formula, which is likewise merely an example:

Y_i' = Y_i + r * (X_local - Y_i)

In the above formula, Y_i represents the candidate random solution, r represents a uniform random number in the range [0, 1], X_local represents the locally preferred solution, and Y_i' represents the optimized intermediate random solution.
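The following Python sketch walks through one round of the second phase (steps S30-S32) under the assumed forms above; the inverse-fitness weighting for the locally preferred solution and the 0.5 re-binarization threshold are assumptions of this sketch, not prescriptions of the embodiment.

```python
import numpy as np

def phase2_round(candidates, fitness_values, rng=None):
    """One round of the "near global optimum" phase.

    candidates     : (M, K) 0/1 matrix of candidate random solutions
    fitness_values : (M,) fitness value of each candidate (smaller is better)
    """
    rng = rng or np.random.default_rng()
    # step S30: drop the half with the largest fitness values
    keep = np.argsort(fitness_values)[: max(1, len(fitness_values) // 2)]
    kept = candidates[keep].astype(float)
    kept_fit = np.asarray(fitness_values)[keep]
    # step S31: locally preferred solution from the remaining candidates
    weights = 1.0 / (kept_fit + 1e-12)            # assumption: smaller fitness -> larger weight
    x_local = (weights[:, None] * kept).sum(axis=0) / weights.sum()
    # step S32: move every remaining candidate toward the locally preferred solution
    r = rng.uniform(0.0, 1.0, size=kept.shape)
    moved = kept + r * (x_local - kept)
    return (moved >= 0.5).astype(int), keep       # re-binarized intermediate solutions
```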
Step S33, judging whether the output condition is satisfied.
Exemplarily, each time the candidate random solutions are optimized based on the locally preferred solution (step S32), this can be understood as one round of iteration. The number of iterations can be counted, and it is judged whether the number of iterations reaches T2; the value of T2 can be configured empirically and is not limited here. If yes, the output condition is satisfied; if not, the output condition is not satisfied. And/or, the number of remaining candidate random solutions may be counted: if the number of candidate random solutions is less than or equal to a preset value (such as 1), the output condition is satisfied, and if the number of candidate random solutions is greater than the preset value, the output condition is not satisfied. If the output condition is not satisfied, step S34 may be executed; if the output condition is satisfied, step S35 may be executed.
Step S34, for each candidate random solution, updating the optimized intermediate random solution corresponding to the candidate random solution into the candidate random solution, and returning to execute step S30, and removing at least one candidate random solution with a large fitness value based on the fitness value corresponding to each candidate random solution, for example, removing half of the candidate random solutions.
And S35, selecting an optimal random solution from the intermediate random solutions, and carrying out local search on the optimal random solution to obtain a target random solution. For example, if the optimized intermediate random solution is one, the intermediate random solution is used as an optimal random solution, and if the optimized intermediate random solution is at least two, the intermediate random solution with the minimum fitness value is used as the optimal random solution based on the fitness value corresponding to each intermediate random solution.
After the optimal random solution is obtained, a local search can be performed by a hill-climbing method or a tabu search method to obtain an updated optimal random solution, and the updated optimal random solution is the target random solution. For example, the optimal random solution is mutated bit by bit, the fitness value of the mutated random solution is calculated, and if the fitness value of the mutated random solution is lower than that of the optimal random solution, the optimal random solution is updated to the mutated random solution; otherwise, the old optimal random solution is retained. After this local search, the updated optimal random solution, i.e., the target random solution, is obtained.
Step S36, after the target random solution is obtained, a plurality of candidate flow characteristics are selected from all flow characteristics based on the target random solution; the flow characteristics corresponding to the numerical solutions taking the second value in the target random solution can be used as the candidate flow characteristics. For example, if the target random solution includes 00001110101 …, the first numerical solution (0) indicates that the first flow characteristic is not a candidate flow characteristic, the fifth numerical solution (1) indicates that the fifth flow characteristic is a candidate flow characteristic, and so on.
Obviously, the flow characteristics corresponding to all the numerical solutions taking the value 1 can be used as candidate flow characteristics. Each candidate flow characteristic may include a plurality of flow characteristic values.
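A minimal sketch of this selection step is shown below, assuming the flow characteristics are stacked row-wise into a matrix; the names are illustrative only.

```python
import numpy as np

def select_candidate_features(all_features, target_solution):
    """Keep the flow features whose numerical solution takes the second value (1).

    all_features    : (K, d) matrix, one flow feature (d flow feature values) per row
    target_solution : (K,) 0/1 array, the target random solution
    """
    mask = np.asarray(target_solution, dtype=bool)
    return all_features[mask]          # the candidate flow features

# e.g. a target random solution 0 0 0 0 1 1 1 0 1 0 1 keeps features 5, 6, 7, 9 and 11
```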
In one possible implementation manner, when a clustering mode and a local search algorithm are adopted to select a plurality of candidate flow features from all flow features, a clustering mode may be adopted to select part of flow features from all flow features as intermediate flow features, and then a local search algorithm is adopted to select a plurality of intermediate flow features from all intermediate flow features as candidate flow features. Or, a local search algorithm may be first adopted to select a part of flow features from all the flow features as intermediate flow features, and then a clustering mode is adopted to select a plurality of intermediate flow features from all the intermediate flow features as candidate flow features.
In one possible implementation, when the internet of things device supports multiple service functions, each service function corresponds to one original data set, e.g., service function 1 corresponds to original data set 1, service function 2 corresponds to original data set 2, and so on. Based on this, steps 201-203 may be performed on the original data set 1 to obtain a plurality of candidate traffic characteristics corresponding to the service function 1, steps 201-203 may be performed on the original data set 2 to obtain a plurality of candidate traffic characteristics corresponding to the service function 2, and so on. After obtaining candidate traffic characteristics corresponding to each service function, executing subsequent steps based on the candidate traffic characteristics.
For example, taking the example that the internet of things device is a camera, the service functions may include, but are not limited to: the video stream extraction service function, namely that the network flow carries the video stream; the equipment information extraction service function, namely the network flow carries equipment information; the user information extraction service function, namely that the network flow carries user information; the operation data extraction service function, namely, the network traffic carries the operation data. Of course, the foregoing are only examples of service functions, which are not limited in this respect, and are related to network traffic supported by the internet of things device for other types of internet of things devices.
In summary, by performing feature selection on all the flow characteristics corresponding to the original data set, a group of similar flow characteristics can be replaced by one candidate flow characteristic, that is, the original data set corresponds to only a small number of candidate flow characteristics. This reduces the number of flow characteristics, while the small number of candidate flow characteristics can still represent all the network flows of the original data set, so the processing speed of the classification model is improved and the training complexity of the classification model is reduced.
Step 204, for each candidate flow feature, the candidate flow feature may include a plurality of flow feature values, the internet of things device performs dimension reduction on the plurality of flow feature values of the candidate flow feature to obtain a target flow feature, and the target flow feature may include a part of feature values of the candidate flow feature.
For example, in order to reduce the training complexity of the classification model, shorten the training duration of the classification model, reduce the resource overhead of the classification model, and reduce the number of features used for training the classification model, feature selection can be performed on each candidate flow characteristic, that is, a dimension-reduction operation can be performed on the plurality of flow characteristic values of the candidate flow characteristic. By means of feature selection, irrelevant features or attributes can be removed and the features that obviously influence data classification are retained; feature selection can thus improve classification accuracy and accelerate the learning of the classification model.
In this embodiment, since the candidate flow feature may include a plurality of flow feature values, in order to implement feature selection, the plurality of flow feature values of the candidate flow feature may be reduced in dimension to obtain the target flow feature, and the target flow feature may include a part of feature values of the candidate flow feature. For example, assuming that the candidate flow feature 1 includes a flow feature value 11, a flow feature value 12, a flow feature value 13, and a flow feature value 14, the target flow feature 1 may be obtained by reducing the dimensions of the flow feature values, and the target flow feature 1 may include the flow feature value 11 and the flow feature value 12. Similarly, the target flow characteristic 2 corresponding to the candidate flow characteristic 2 and the target flow characteristic 3 corresponding to the candidate flow characteristic 3 can be obtained.
For example, when performing the dimension-reduction operation on the plurality of flow characteristic values of a candidate flow characteristic, at least one of de-duplication dimension reduction, de-correlation dimension reduction and de-specified-type-feature dimension reduction may be performed on the plurality of flow characteristic values of the candidate flow characteristic to obtain the target flow characteristic. Of course, de-duplication, de-correlation and de-specified-type-feature dimension reduction are just a few examples of dimension-reduction operations, and the dimension-reduction operations are not limited thereto.
De-duplication dimension reduction refers to removing repeated flow characteristic values: if the similarity between a first flow characteristic value and a second flow characteristic value is greater than a similarity threshold, the second flow characteristic value is removed from the candidate flow characteristic. For example, the candidate flow characteristic 1 includes a flow characteristic value 11, a flow characteristic value 12, a flow characteristic value 13 and a flow characteristic value 14. The similarity (such as distance similarity or cosine similarity) between the flow characteristic value 12 and the flow characteristic value 11 can be calculated, and if the similarity is greater than the similarity threshold, the flow characteristic value 12 (or the flow characteristic value 11) is removed; otherwise both are retained. Then the similarity between the flow characteristic value 13 and the flow characteristic value 11 is calculated, and if that similarity is greater than the similarity threshold, the flow characteristic value 13 (or the flow characteristic value 11) is removed, and so on.
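A minimal sketch of de-duplication dimension reduction is given below, assuming cosine similarity and an illustrative 0.95 threshold; both choices are assumptions of this sketch rather than requirements of the embodiment.

```python
import numpy as np

def dedup_feature_values(values, sim_threshold=0.95):
    """Drop flow feature values that are too similar to an already-kept value.

    values : (n, d) matrix, one flow feature value (vector) per row
    """
    kept = []
    for i, v in enumerate(values):
        duplicate = False
        for j in kept:
            u = values[j]
            sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
            if sim > sim_threshold:      # too similar to a kept value -> treat as duplicate
                duplicate = True
                break
        if not duplicate:
            kept.append(i)
    return values[kept]
```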
De-correlation dimension reduction refers to removing correlated flow characteristic values: if the correlation between a third flow characteristic value and a fourth flow characteristic value is greater than a correlation threshold (which can be configured empirically), the fourth flow characteristic value is removed from the candidate flow characteristic. For example, after de-duplication dimension reduction, the candidate flow characteristic 1 includes a flow characteristic value 11, a flow characteristic value 12 and a flow characteristic value 13. The correlation between the flow characteristic value 12 and the flow characteristic value 11 may be calculated, and if the correlation is greater than the correlation threshold, the flow characteristic value 12 (or the flow characteristic value 11) is removed; otherwise both are retained. Then the correlation between the flow characteristic value 13 and the flow characteristic value 11 is calculated, and if that correlation is greater than the correlation threshold, the flow characteristic value 13 (or the flow characteristic value 11) is removed, and so on.
For example, when determining the correlation between the third flow characteristic value and the fourth flow characteristic value, a maximal information coefficient method may be used: the third flow characteristic value and the fourth flow characteristic value are gridded, the maximal mutual information value between them is calculated with a maximal information coefficient algorithm, and the maximal mutual information value is then normalized and used as the correlation between the third flow characteristic value and the fourth flow characteristic value.
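The sketch below approximates this gridding-based correlation in the spirit of the maximal information coefficient; the equal-width binning, the bin range of 2 to 16 and the normalization by the logarithm of the bin count are assumptions of this sketch, and the embodiment's exact gridding scheme is not specified here.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def grid_correlation(x, y, max_bins=16):
    """Normalized grid-based mutual information between two flow feature value series."""
    best = 0.0
    for bins in range(2, max_bins + 1):
        xg = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])   # grid x
        yg = np.digitize(y, np.histogram_bin_edges(y, bins=bins)[1:-1])   # grid y
        mi = mutual_info_score(xg, yg)                                    # mutual information
        best = max(best, mi / np.log(bins))                               # normalize by log(#bins)
    return best   # values close to 1 indicate strongly correlated feature values
```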
De-specified-type-feature dimension reduction refers to removing flow characteristic values of a specified type, where the specified type is a feature type for which no privacy information exists, i.e., features without privacy information are removed. If the feature type of a fifth flow characteristic value is in a non-privacy list, the fifth flow characteristic value is removed from the candidate flow characteristic; the specified type is the feature type for which no privacy information exists, and the non-privacy list is used for recording feature types for which no privacy information exists.
For example, a non-privacy list may be preconfigured to record the feature types for which no privacy information exists. Assuming that the data/features of feature type A carry no privacy information (i.e., as long as the feature type of the data/features is feature type A, the data/features carry no privacy information), the non-privacy list may include feature type A, indicating that data/features of feature type A are not used to train the classification model.
Based on this, after the decorrelation dimension reduction, assuming that the candidate flow feature 1 includes the flow feature value 11 and the flow feature value 12, if the feature type of the flow feature value 12 is in the non-privacy list, the flow feature value 12 may be removed from the candidate flow feature, so that the target flow feature 1 corresponding to the candidate flow feature 1 may be obtained, and the target flow feature 1 may include the flow feature value 11.
Step 205, the internet of things device sends the plurality of target traffic characteristics to the management device.
Step 206, the management device trains at least two classification models based on the plurality of target traffic characteristics.
Step 207, the management device sends at least two classification models to the internet of things device.
For example, after obtaining the plurality of target flow characteristics, the internet of things device may send the plurality of target flow characteristics to the management device, train at least two classification models by the management device based on the plurality of target flow characteristics, and send the at least two classification models to the internet of things device. Or, after obtaining the plurality of target flow characteristics, the internet of things device may train at least two classification models based on the plurality of target flow characteristics.
Illustratively, the at least two classification models may include, but are not limited to, at least two of: an IF (Isolation Forest) classification model (also called an IF classifier), an OC-SVM (One-Class Support Vector Machine) classification model (also called an OC-SVM classifier), and an LOF (Local Outlier Factor) classification model (also called an LOF classifier). Of course, the above are only examples of the classification models, and no limitation is imposed thereon, as long as the classification function can be implemented.
Illustratively, for each classification model (e.g., IF classification model, OC-SVM classification model, LOF classification model), the management device trains the classification model based on a plurality of target traffic characteristics, which may include:
and S11, acquiring a model to be trained corresponding to the classification model. For example, in order to train the IF classification model, an IF to-be-trained model corresponding to the IF classification model may be obtained, in order to train the OC-SVM classification model, an OC-SVM to-be-trained model corresponding to the OC-SVM classification model may be obtained, and in order to train the LOF classification model, an LOF to-be-trained model corresponding to the LOF classification model may be obtained.
Step S12, inputting the target flow characteristics (namely, each target flow characteristic, and taking one target flow characteristic as an example) into the model to be trained, and obtaining a prediction classification result output by the model to be trained.
For example, since the model to be trained is used for implementing the classification function, after the target flow characteristics are input into the model to be trained, the model to be trained can process the target flow characteristics to obtain the prediction classification result corresponding to the target flow characteristics, and the processing process is not limited.
And S13, taking the predicted classification result of the target flow characteristic and the real classification result of the target flow characteristic as input parameters of the loss function to obtain the loss value corresponding to the target flow characteristic.
For example, after the predicted classification result of the target traffic feature is obtained, the loss value corresponding to the target traffic feature may be calculated based on the predicted classification result of the target traffic feature and the real classification result of the target traffic feature (i.e., the label value of the target traffic feature; assuming training is performed based on a plurality of network flows without privacy disclosure, the real classification result indicates that there is no privacy disclosure, i.e., the real classification result is the positive label).
For example, a loss function whose input is a predicted classification result and a true classification result and whose output is a loss value may be previously configured, and the loss function is not limited and may have the above-described input-output relationship. Based on the above, the predicted classification result of the target flow characteristic and the real classification result of the target flow characteristic can be substituted into the loss function to obtain the loss value corresponding to the target flow characteristic.
And step S14, adjusting the network parameters of the model to be trained based on the loss value to obtain an adjusted model.
For example, the network parameters of the model to be trained can be adjusted based on the loss values, and the network parameters are adjusted to make the loss values smaller and smaller, for example, the network parameters of the model to be trained are adjusted by adopting a gradient descent method, and the adjustment process is not limited, so that the adjustment targets can be met.
And S15, judging whether the adjusted model is converged.
For example, if the loss value is smaller than the preset threshold, the adjusted model is converged, and if the loss value is not smaller than the preset threshold, the adjusted model is not converged. For another example, if the number of iterations of the model to be trained reaches the number threshold, the adjusted model is converged, and if the number of iterations of the model to be trained does not reach the number threshold, the adjusted model is not converged. For another example, if the iteration duration of the model to be trained reaches the duration threshold, the adjusted model is converged, and if the iteration duration of the model to be trained does not reach the duration threshold, the adjusted model is not converged. Of course, the above are just a few examples of determining whether the adjusted model has converged, and this is not a limitation.
If the adjusted model has converged, the adjusted model may be used as the classification model. For example, IF the model to be trained is an IF model to be trained, the adjusted model corresponding to the IF model to be trained may be used as the IF classification model. If the model to be trained is an OC-SVM model to be trained, the adjusted model corresponding to the OC-SVM model to be trained can be used as an OC-SVM classification model. If the model to be trained is an LOF model to be trained, the adjusted model corresponding to the LOF model to be trained can be used as an LOF classification model.
If the adjusted model is not converged, the adjusted model can be used as a model to be trained, the step S12 is returned, the target flow characteristics are input into the model to be trained, and the prediction classification result output by the model to be trained is obtained.
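The following Python sketch mirrors steps S11-S15 above as a generic training loop. The predict/adjust hooks and the convergence settings are illustrative stand-ins for the "model to be trained" and its parameter adjustment; practical IF, OC-SVM and LOF implementations are usually fitted in a single pass rather than by an explicit loss loop, so this is only a schematic rendering of the described procedure.

```python
def train_classifier(model, target_features, true_labels, loss_fn,
                     loss_threshold=1e-3, max_rounds=100):
    """Generic sketch of training steps S11-S15 for one classification model."""
    for _ in range(max_rounds):                               # iteration-count convergence check
        total_loss = 0.0
        for feature, label in zip(target_features, true_labels):
            predicted = model.predict(feature)                # step S12: predicted classification result
            loss = loss_fn(predicted, label)                  # step S13: loss value
            model.adjust(loss)                                # step S14: adjust network parameters
            total_loss += loss
        if total_loss / max(1, len(target_features)) < loss_threshold:
            break                                             # step S15: loss below threshold -> converged
    return model
```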
Step 208, the internet of things device receives the at least two classification models, and detects, based on the at least two classification models, whether privacy leakage exists in the internet of things device.
The method for detecting whether privacy leakage exists in the internet of things device based on at least two classification models can also be called ensemble learning, wherein ensemble learning is a method for solving an intelligent problem by combining a plurality of classification models, and prediction performance of the models can be improved through ensemble learning. For example, the at least two classification models may include, but are not limited to, an IF classification model, an OC-SVM classification model, and an LOF classification model, from which the privacy detection integration model may be constructed. In this way, the internet of things device can detect whether privacy leakage exists in the internet of things device based on the privacy detection integrated model (such as the IF classification model, the OC-SVM classification model, the LOF classification model and the like).
In one possible embodiment, the at least two classification models may be an odd number of classification models, such as 3 classification models, 5 classification models, 7 classification models, etc. Based on the above, based on at least two classification models, the internet of things device can detect whether privacy leakage exists in the internet of things device in the following manner:
the method comprises the steps that the Internet of things equipment obtains network traffic to be detected, the network traffic to be detected is mirror image network traffic when the Internet of things equipment sends the network traffic to external equipment, namely the mirror image network traffic is called as the network traffic to be detected, and the network traffic to be detected is used for detecting whether privacy leakage exists in the Internet of things equipment.
The internet of things device can input the network traffic to be detected into each classification model to obtain a classification result output by each classification model, wherein the classification result can be a first value or a second value, the first value indicates that privacy leakage exists, and the second value indicates that privacy leakage does not exist. For example, the internet of things device may input the network traffic to be detected to the IF classification model, and the IF classification model processes the network traffic to be detected to obtain a classification result 1 corresponding to the network traffic to be detected, where the classification result 1 may be a first value or a second value. The internet of things equipment can input the network flow to be detected into an OC-SVM classification model, the OC-SVM classification model processes the network flow to be detected to obtain a classification result 2 corresponding to the network flow to be detected, and the classification result 2 can be a first value or a second value. The internet of things equipment can input the network traffic to be detected into the LOF classification model, the LOF classification model processes the network traffic to be detected to obtain a classification result 3 corresponding to the network traffic to be detected, and the classification result 3 can be a first value or a second value.
The internet of things device can count the number of first values and the number of second values, the first value indicating that privacy leakage exists and the second value indicating that privacy leakage does not exist. If the number of first values is greater than the number of second values, it can be determined that privacy leakage exists in the internet of things device; if the number of first values is smaller than the number of second values, it can be determined that privacy leakage does not exist in the internet of things device.
So far, whether privacy leakage exists in the internet of things equipment can be detected through the plurality of classification models, and the mode can also be called a voting detection mode, namely, voting is carried out on output results of the plurality of classification models.
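A minimal sketch of the voting detection mode follows; each classifier is assumed to expose a predict(flow) call returning the first or second value, which is an assumed interface rather than a specific library API.

```python
def vote_detect(classifiers, flow, leak_value=1):
    """Voting detection mode with an odd number of classification models."""
    results = [clf.predict(flow) for clf in classifiers]
    leaks = sum(1 for r in results if r == leak_value)      # number of first values
    return leaks > len(results) - leaks                      # True -> privacy leakage exists
```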
In one possible embodiment, the at least two classification models may be an odd number of classification models, or an even number of classification models, such as 2, 3, 4, 5, 6 classification models, etc. Based on the above, based on at least two classification models, the internet of things device detects whether privacy leakage exists in the internet of things device in the following manner:
the method comprises the steps that the Internet of things equipment obtains network traffic to be detected, wherein the network traffic to be detected is mirrored network traffic when the Internet of things equipment sends the network traffic to external equipment. The internet of things equipment inputs the network traffic to be detected into each classification model to obtain a classification result output by each classification model, wherein the classification result can be a first value or a second value, the first value indicates that privacy leakage exists, and the second value indicates that privacy leakage does not exist.
The first score value is determined based on the weight coefficient value of each classification model outputting the first value, for example, the sum of the weight coefficient values of each classification model outputting the first value is taken as the first score value. A second score value is determined based on the weight coefficient value of each classification model outputting the second value, for example, the sum of the weight coefficient values of each classification model outputting the second value is taken as the second score value. Assuming that the classification results output by the IF classification model and the OC-SVM classification model are first values, the first score value may be a sum of a weight coefficient value of the IF classification model and a weight coefficient value of the OC-SVM classification model. Assuming that the classification result output by the LOF classification model is a second value, the second score value may be a weight coefficient value of the LOF classification model.
Because the first value indicates that privacy leakage exists and the second value indicates that privacy leakage does not exist, the first score value is the sum of the weight coefficient values of the classification models outputting the first value, and the second score value is the sum of the weight coefficient values of the classification models outputting the second value. Therefore, if the first score value is greater than the second score value, the internet of things device can determine that privacy leakage exists; if the first score value is less than the second score value, the internet of things device can determine that privacy leakage does not exist; and if the first score value is equal to the second score value, the internet of things device can determine either that privacy leakage exists or that privacy leakage does not exist.
So far, whether privacy leakage exists in the internet of things equipment can be detected through the plurality of classification models, and the mode can also be called a weighted detection mode, namely, the output results of the plurality of classification models are weighted.
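The weighted detection mode can be sketched as follows; the predict(flow) interface and the configurable handling of a tie (the text allows either outcome when the scores are equal) are assumptions of this sketch.

```python
def weighted_detect(classifiers, weights, flow, leak_value=1, tie_is_leak=True):
    """Weighted detection mode: compare the summed weight coefficient values of the two votes."""
    first_score = second_score = 0.0
    for clf, w in zip(classifiers, weights):
        if clf.predict(flow) == leak_value:
            first_score += w          # model voted for the first value (leakage exists)
        else:
            second_score += w         # model voted for the second value (no leakage)
    if first_score == second_score:
        return tie_is_leak            # equality may be treated either way
    return first_score > second_score  # True -> privacy leakage exists
```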
For example, for the weight coefficient value of the classification model, the weight coefficient value may be empirically configured, and there is no limitation on the weight coefficient value. For example, taking at least two classification models including an IF classification model, an LOF classification model, and an OC-SVM classification model as an example, then: IF the network traffic in the original data set supports the first classification mode, the weight coefficient value of the IF classification model is greater than the weight coefficient value of the LOF classification model, and the weight coefficient value of the LOF classification model is greater than the weight coefficient value of the OC-SVM classification model. IF the network traffic in the original data set supports the second classification mode, the weight coefficient value of the OC-SVM classification model is greater than the weight coefficient value of the IF classification model, and the weight coefficient value of the IF classification model is greater than the weight coefficient value of the LOF classification model. The first classification mode may indicate that the number of classification labels of the network traffic is greater than a number threshold and the number of times of circulation of the network traffic is greater than a number threshold; the second classification mode may indicate that the number of classification labels of the network traffic is not greater than a number threshold and the number of transitions of the network traffic is not greater than a number threshold.
For example, the IF classification model can support multi-classification scenarios (i.e., the more classification labels, the better), in which case the reliability of the IF classification model is better, and the input features are allowed to correspond to privacy data that circulates frequently. The OC-SVM classification model can support single-classification scenarios (i.e., the fewer classification labels, the better), and the input features are allowed to correspond to privacy data that circulates less frequently. The LOF classification model lies between the IF classification model and the OC-SVM classification model, i.e., the number of classification labels lies between those of the IF classification model and the OC-SVM classification model, and the number of times the privacy data of the input features circulates is allowed to lie between those of the IF classification model and the OC-SVM classification model.
For application scenes such as vehicle entering and exiting management, privacy data in the application scenes are frequently circulated, IF an internet of things device (such as a camera) is deployed in the application scenes, network traffic in an original data set acquired by the internet of things device can support a first classification mode (such as the number of classification labels is greater than a number threshold and the number of circulation times is greater than a number threshold), that is, for the internet of things device deployed in the application scenes, the weight coefficient value of an IF classification model is greater than the weight coefficient value of an LOF classification model, and the weight coefficient value of the LOF classification model is greater than the weight coefficient value of an OC-SVM classification model.
In addition, for application scenes such as power monitoring, privacy data flows in the application scenes are less, IF the internet of things equipment (such as a camera) is deployed in the application scenes, network flows in an original data set acquired by the internet of things equipment can support a second classification mode (such as the number of classification labels is not greater than a number threshold and the number of times of flow is not greater than a number threshold), that is, for the internet of things equipment deployed in the application scenes, the weight coefficient value of the OC-SVM classification model can be greater than the weight coefficient value of the IF classification model, and the weight coefficient value of the IF classification model can be greater than the weight coefficient value of the LOF classification model.
Of course, the above is merely an example, and the magnitude relation of the weight coefficient value is not limited in this embodiment.
In summary, whether privacy leakage exists in the internet of things device can be detected based on the at least two classification models. When privacy leakage exists in the internet of things device, the internet of things device can be subjected to security processing, such as performing a security upgrade on the internet of things device or taking the device offline; the security processing process is not limited. By performing security processing on the internet of things device, the data security of the internet of things device can be protected, the leakage of sensitive information or privacy information of the internet of things device is reduced, and the security of the data is ensured.
As can be seen from the above technical solutions, in the embodiments of the present application, since the final detection result is based on weighted voting over the output results of the IF classification model, the OC-SVM classification model and the LOF classification model, only a small amount of data is needed to train the IF classification model, the OC-SVM classification model and the LOF classification model, so the training complexity can be greatly reduced; and by fusing the ensemble learning method of multiple classification models, the detection precision can be further improved while the consumption of computing resources is reduced. Based on the clustered small-sample data set, performing feature selection with a local search algorithm can further reduce the training time.
Based on the same application conception as the method, in the embodiment of the present application, an internet of things privacy leakage detection device based on small sample ensemble learning is provided, and the device is applied to internet of things equipment, as shown in fig. 3, and is a schematic structural diagram of the internet of things privacy leakage detection device based on small sample ensemble learning, where the device includes:
an obtaining module 31, configured to obtain an original data set, where the original data set includes a plurality of network traffic, where the plurality of network traffic is mirrored network traffic when the internet of things device sends the network traffic to an external device; a processing module 32, configured to pre-process a plurality of network flows in the original data set to obtain a plurality of flow characteristics; selecting a plurality of candidate flow characteristics from all flow characteristics, wherein each candidate flow characteristic comprises a plurality of flow characteristic values; for each candidate flow characteristic, performing dimension reduction on a plurality of flow characteristic values of the candidate flow characteristic to obtain a target flow characteristic, wherein the target flow characteristic comprises part of characteristic values of the candidate flow characteristic; a sending module 33, configured to send a plurality of target traffic characteristics corresponding to the plurality of candidate traffic characteristics to a management device, where the management device trains at least two classification models based on the plurality of target traffic characteristics; the at least two classification models are used for detecting whether privacy leakage exists in the internet of things equipment.
Illustratively, the processing module 32 is specifically configured to, when preprocessing a plurality of network flows in the original data set to obtain a plurality of flow characteristics: deleting abnormal network traffic in the original data set to obtain a first data set; converting the non-numerical data in the first data set into numerical data to obtain a second data set; normalizing the numerical data in the second data set to obtain a third data set, wherein the third data set comprises a plurality of network flows, and each network flow comprises normalized numerical data; and selecting a plurality of network traffic without privacy disclosure from all network traffic in the third data set, and determining a plurality of traffic characteristics corresponding to the plurality of network traffic.
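As an illustrative aid, the preprocessing performed by the processing module can be sketched as follows in pandas; the tabular flow representation, the has_leak label column encoded as 0 for "no privacy leakage", the factorize encoding of non-numerical data and the min-max normalization are all assumptions of this sketch.

```python
import pandas as pd

def preprocess(raw_df, anomaly_mask, label_col="has_leak"):
    """raw_df: DataFrame with one network flow per row; anomaly_mask: boolean Series
    marking abnormal network traffic to delete."""
    first = raw_df.loc[~anomaly_mask]                                   # first data set
    second = first.copy()
    for col in second.columns:
        if second[col].dtype == object:                                 # non-numerical data
            second[col] = pd.factorize(second[col])[0]                  # -> numerical codes
    third = second.copy()
    feature_cols = [c for c in third.columns if c != label_col]
    mins, maxs = third[feature_cols].min(), third[feature_cols].max()
    third[feature_cols] = (third[feature_cols] - mins) / (maxs - mins).replace(0, 1)
    # keep only flows without privacy leakage (assumption: label 0 means no leakage)
    return third.loc[third[label_col] == 0, feature_cols]
```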
Illustratively, the processing module 32 is specifically configured to, when selecting a plurality of candidate flow characteristics from all flow characteristics: clustering all flow characteristics through a clustering algorithm to obtain a plurality of clusters; for each cluster, the cluster includes a portion of the flow features of all flow features, the centroid of the cluster is an average of all flow features within the cluster, and the distance between each flow feature within the cluster and the centroid of the cluster is less than or equal to a distance threshold; for each cluster, determining the mass center of the cluster as a candidate flow characteristic, or determining the flow characteristic with the smallest distance from the mass center in the cluster as a candidate flow characteristic.
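A minimal sketch of the clustering-based candidate selection follows, assuming k-means as the clustering algorithm and an empirically chosen cluster count; the distance-threshold constraint on cluster members is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_candidates(features, n_clusters=8, use_centroid=True):
    """features: (K, d) matrix, one flow feature per row; returns one candidate per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    candidates = []
    for c in range(n_clusters):
        members = features[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        if use_centroid:
            candidates.append(centroid)                 # take the centroid itself
        else:                                           # or the member closest to the centroid
            nearest = members[np.argmin(np.linalg.norm(members - centroid, axis=1))]
            candidates.append(nearest)
    return np.asarray(candidates)
```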
Illustratively, the processing module 32 is specifically configured to, when selecting a plurality of candidate flow characteristics from all flow characteristics: generating N initial random solutions, wherein each initial random solution comprises K numerical solutions corresponding to K flow characteristics, K is the total number of the flow characteristics, and N and K are positive integers; for each numerical solution, the numerical solution is a first value or a second value, the first value indicates that the initial random solution does not select the corresponding flow characteristic, and the second value indicates that the initial random solution selects the corresponding flow characteristic; selecting an initial random solution with the minimum fitness value from N initial random solutions as a current optimal random solution based on the fitness value corresponding to each initial random solution; optimizing each initial random solution based on the current optimal random solution to obtain N candidate random solutions corresponding to the N initial random solutions; generating a target random solution based on the N candidate random solutions, and selecting a plurality of candidate flow characteristics from all flow characteristics based on the target random solution; and taking the flow characteristic corresponding to the numerical solution of the second value in the target random solution as the candidate flow characteristic.
For example, the processing module 32 optimizes each initial random solution based on the current optimal random solution, and is specifically configured to: determining a current solution variation amount based on an initial random solution, the current optimal random solution, a uniform random number and an initial solution variation amount corresponding to the initial random solution for each initial random solution; determining an intermediate random solution based on the initial random solution and the current solution variation; if the solution update ending condition is met, the intermediate random solution is used as a candidate random solution corresponding to the initial random solution; if the solution updating end condition is not met, based on the fitness value corresponding to each intermediate random solution, taking the intermediate random solution with the smallest fitness value as the current optimal random solution, and carrying out local search on the current optimal random solution to obtain an updated current optimal random solution; and updating the intermediate random solution into an initial random solution, and returning to execute the operation of determining the current solution variation based on the initial random solution, the current optimal random solution, the uniform random number and the initial solution variation corresponding to the initial random solution. The processing module 32 is specifically configured to, when generating a target random solution based on the N candidate random solutions: removing at least one candidate random solution with a large fitness value based on the fitness value corresponding to each candidate random solution; determining a local preferred solution based on the remaining candidate random solutions; optimizing the rest candidate random solutions based on the local preferred solutions to obtain optimized intermediate random solutions; if the output condition is met, selecting an optimal random solution from the intermediate random solutions, and carrying out local search on the optimal random solution to obtain the target random solution; and if the output condition is not met, taking the intermediate random solution as a candidate random solution, and returning to execute the operation of removing at least one candidate random solution with a large fitness value based on the fitness value corresponding to each candidate random solution.
Illustratively, the processing module 32 is specifically configured to, when performing dimension reduction on the plurality of flow feature values of the candidate flow feature to obtain the target flow feature: performing at least one of de-duplication dimension reduction, de-correlation dimension reduction and de-specification type feature dimension reduction on a plurality of flow feature values of the candidate flow feature to obtain a target flow feature; the steps of removing weight and reducing dimension are as follows: if the similarity between the first flow characteristic value and the second flow characteristic value is greater than a similarity threshold value, removing the second flow characteristic value from the candidate flow characteristic; decorrelation dimension reduction refers to: if the correlation between the third flow characteristic value and the fourth flow characteristic value is greater than a correlation threshold value, removing the fourth flow characteristic value from the candidate flow characteristic; de-specification type feature dimension reduction refers to: and if the feature type of the fifth flow feature value is in the non-privacy list, removing the fifth flow feature value from the candidate flow feature, wherein the designated type is the feature type without the privacy information, and the non-privacy list is used for recording the feature type without the privacy information.
Illustratively, the apparatus further comprises: the detection module is used for receiving at least two classification models sent by the management equipment and detecting whether privacy leakage exists in the Internet of things equipment or not based on the at least two classification models; the detection module is specifically used for detecting whether privacy leakage exists in the Internet of things equipment or not based on at least two classification models: inputting the network traffic to be detected of the Internet of things equipment to each classification model to obtain a classification result output by each classification model, wherein the classification result is a first value or a second value, the first value indicates that privacy leakage exists, and the second value indicates that privacy leakage does not exist; if the number of the first values is larger than the number of the second values, determining that privacy leakage exists in the Internet of things equipment, and if the number of the first values is smaller than the number of the second values, determining that privacy leakage does not exist in the Internet of things equipment; the at least two classification models are an odd number of classification models; or, determining a first score value based on the weight coefficient value of each classification model outputting the first value, and determining a second score value based on the weight coefficient value of each classification model outputting the second value; if the first score value is larger than the second score value, the fact that privacy leakage exists in the Internet of things equipment is determined, and if the first score value is smaller than the second score value, the fact that privacy leakage does not exist in the Internet of things equipment is determined.
Illustratively, the at least two classification models include an IF classification model, an OC-SVM classification model, and an LOF classification model; IF the network flow in the original data set supports a first classification mode, the weight coefficient value of the IF classification model is larger than the weight coefficient value of the LOF classification model, and the weight coefficient value of the LOF classification model is larger than the weight coefficient value of the OC-SVM classification model; IF the network flow in the original data set supports a second classification mode, the weight coefficient value of the OC-SVM classification model is larger than the weight coefficient value of the IF classification model, and the weight coefficient value of the IF classification model is larger than the weight coefficient value of the LOF classification model; the first classification mode indicates that the number of classification labels of the network traffic is larger than a number threshold value, and the circulation times of the network traffic is larger than a times threshold value; the second classification mode indicates that the number of classification labels of the network traffic is not greater than a number threshold and the number of times of forwarding the network traffic is not greater than a number threshold.
Based on the same application concept as the above method, an electronic device (such as an internet of things device) is provided in an embodiment of the present application, and referring to fig. 4, the electronic device includes a processor 41 and a machine-readable storage medium 42, where the machine-readable storage medium 42 stores machine-executable instructions that can be executed by the processor 41; the processor 41 is configured to execute machine executable instructions to implement the internet of things privacy disclosure detection method based on small sample ensemble learning.
Based on the same application concept as the method, the embodiment of the application also provides a machine-readable storage medium, wherein a plurality of computer instructions are stored on the machine-readable storage medium, and when the computer instructions are executed by a processor, the method for detecting privacy leakage of the internet of things based on the small sample ensemble learning can be realized.
Wherein, the machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, or the like. For example, the machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk or DVD), or a similar storage medium, or a combination thereof.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer entity or by an article of manufacture having some functionality. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Moreover, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (10)

1. An Internet of things privacy leakage detection method based on small sample ensemble learning, characterized in that the method is applied to Internet of things equipment and comprises the following steps:
acquiring an original data set, wherein the original data set comprises a plurality of network flows, and the network flows are mirror image network flows when the internet of things equipment sends the network flows to external equipment;
preprocessing a plurality of network flows in the original data set to obtain a plurality of flow characteristics; selecting a plurality of candidate flow characteristics from all flow characteristics, wherein each candidate flow characteristic comprises a plurality of flow characteristic values;
for each candidate flow characteristic, performing dimension reduction on a plurality of flow characteristic values of the candidate flow characteristic to obtain a target flow characteristic, wherein the target flow characteristic comprises part of characteristic values of the candidate flow characteristic;
transmitting a plurality of target flow characteristics corresponding to the plurality of candidate flow characteristics to a management device, and training at least two classification models by the management device based on the plurality of target flow characteristics;
the at least two classification models are used for detecting whether privacy leakage exists in the internet of things equipment.
2. The method of claim 1, wherein the preprocessing the plurality of network traffic in the raw dataset to obtain a plurality of traffic characteristics comprises:
deleting abnormal network traffic in the original data set to obtain a first data set;
converting the non-numerical data in the first data set into numerical data to obtain a second data set;
normalizing the numerical data in the second data set to obtain a third data set, wherein the third data set comprises a plurality of network flows, and each network flow comprises normalized numerical data;
and selecting a plurality of network traffic without privacy disclosure from all network traffic in the third data set, and determining a plurality of traffic characteristics corresponding to the plurality of network traffic.
3. The method of claim 1, wherein
the selecting a plurality of candidate flow characteristics from all flow characteristics comprises the following steps:
clustering all flow characteristics through a clustering algorithm to obtain a plurality of clustering clusters; wherein, for each cluster, the cluster includes a portion of all flow features, the centroid of the cluster is an average of all flow features within the cluster, and the distance between each flow feature within the cluster and the centroid of the cluster is less than or equal to a distance threshold;
for each cluster, determining the mass center of the cluster as a candidate flow characteristic, or determining the flow characteristic with the smallest distance from the mass center in the cluster as a candidate flow characteristic.
4. The method of claim 1, wherein
the selecting a plurality of candidate flow characteristics from all flow characteristics comprises the following steps:
generating N initial random solutions, wherein each initial random solution comprises K numerical solutions corresponding to K flow characteristics, K is the total number of the flow characteristics, and N and K are positive integers; for each numerical solution, the numerical solution is a first value or a second value, the first value indicates that the initial random solution does not select the corresponding flow characteristic, and the second value indicates that the initial random solution selects the corresponding flow characteristic;
selecting an initial random solution with the minimum fitness value from N initial random solutions as a current optimal random solution based on the fitness value corresponding to each initial random solution; optimizing each initial random solution based on the current optimal random solution to obtain N candidate random solutions corresponding to the N initial random solutions;
generating a target random solution based on the N candidate random solutions, and selecting a plurality of candidate flow characteristics from all flow characteristics based on the target random solution; and taking the flow characteristic corresponding to the numerical solution of the second value in the target random solution as the candidate flow characteristic.
5. The method of claim 4, wherein
optimizing each initial random solution based on the current optimal random solution to obtain N candidate random solutions corresponding to the N initial random solutions, wherein the optimizing comprises the steps of: determining a current solution variation amount according to the initial random solution, the current optimal random solution, the uniform random number and the initial solution variation amount corresponding to the initial random solution; determining an intermediate random solution based on the initial random solution and the current solution variation; if the solution update ending condition is met, the intermediate random solution is used as a candidate random solution corresponding to the initial random solution; if the solution updating end condition is not met, based on the fitness value corresponding to each intermediate random solution, taking the intermediate random solution with the smallest fitness value as the current optimal random solution, and carrying out local search on the current optimal random solution to obtain an updated current optimal random solution; updating the intermediate random solution into an initial random solution, and returning to execute the operation of determining the current solution variation based on the initial random solution, the current optimal random solution, the uniform random number and the initial solution variation corresponding to the initial random solution;
The generating a target random solution based on the N candidate random solutions includes: removing at least one candidate random solution with a large fitness value based on the fitness value corresponding to each candidate random solution; determining a local preferred solution based on the remaining candidate random solutions; optimizing the rest candidate random solutions based on the local preferred solutions to obtain optimized intermediate random solutions; if the output condition is met, selecting an optimal random solution from the intermediate random solutions, and carrying out local search on the optimal random solution to obtain the target random solution; and if the output condition is not met, taking the intermediate random solution as a candidate random solution, and returning to execute the operation of removing at least one candidate random solution with a large fitness value based on the fitness value corresponding to each candidate random solution.
6. The method of claim 1, wherein the dimension-reducing the plurality of flow characteristic values of the candidate flow characteristic to obtain the target flow characteristic comprises:
performing at least one of de-duplication dimension reduction, de-correlation dimension reduction and de-specification type feature dimension reduction on a plurality of flow feature values of the candidate flow feature to obtain a target flow feature;
wherein de-duplication dimension reduction refers to: if the similarity between the first flow characteristic value and the second flow characteristic value is greater than a similarity threshold value, removing the second flow characteristic value from the candidate flow characteristic;
de-correlation dimension reduction refers to: if the correlation between the third flow characteristic value and the fourth flow characteristic value is greater than a correlation threshold value, removing the fourth flow characteristic value from the candidate flow characteristic;
de-specification type feature dimension reduction refers to: if the feature type of the fifth flow characteristic value is in the non-privacy list, removing the fifth flow characteristic value from the candidate flow characteristic, wherein the specified type is a feature type that contains no privacy information, and the non-privacy list is used for recording the feature types that contain no privacy information.
7. The method of claim 1, wherein after the sending the plurality of target traffic characteristics corresponding to the plurality of candidate traffic characteristics to the management device, the method further comprises:
receiving the at least two classification models sent by the management equipment, and detecting whether privacy leakage exists in the Internet of things equipment or not based on the at least two classification models; the detecting whether privacy leakage exists in the internet of things device based on the at least two classification models comprises the following steps:
inputting the network traffic to be detected of the Internet of things equipment to each classification model to obtain a classification result output by each classification model, wherein the classification result is a first value or a second value, the first value indicates that privacy leakage exists, and the second value indicates that privacy leakage does not exist;
if the number of the first values is larger than the number of the second values, determining that privacy leakage exists in the Internet of things equipment, and if the number of the first values is smaller than the number of the second values, determining that privacy leakage does not exist in the Internet of things equipment; wherein the at least two classification models are an odd number of classification models; or alternatively,
determining a first score value based on the weight coefficient value of each classification model outputting the first value, and determining a second score value based on the weight coefficient value of each classification model outputting the second value; if the first score value is larger than the second score value, the fact that privacy leakage exists in the Internet of things equipment is determined, and if the first score value is smaller than the second score value, the fact that privacy leakage does not exist in the Internet of things equipment is determined.
8. The method of claim 7, wherein the at least two classification models include an IF classification model, an OC-SVM classification model, and an LOF classification model;
if the network traffic in the original data set supports a first classification mode, the weight coefficient value of the IF classification model is greater than the weight coefficient value of the LOF classification model, and the weight coefficient value of the LOF classification model is greater than the weight coefficient value of the OC-SVM classification model;
if the network traffic in the original data set supports a second classification mode, the weight coefficient value of the OC-SVM classification model is greater than the weight coefficient value of the IF classification model, and the weight coefficient value of the IF classification model is greater than the weight coefficient value of the LOF classification model;
the first classification mode indicates that the number of classification labels of the network traffic is greater than a quantity threshold and the number of times the network traffic is forwarded is greater than a times threshold; the second classification mode indicates that the number of classification labels of the network traffic is not greater than the quantity threshold and the number of times the network traffic is forwarded is not greater than the times threshold.
9. An Internet of things privacy leakage detection apparatus based on small sample ensemble learning, characterized in that the apparatus is applied to Internet of things equipment, and the apparatus comprises:
an acquisition module, a processing module and a sending module, wherein the acquisition module is used for acquiring an original data set, the original data set comprises a plurality of network flows, and each network flow is a mirror image of the network traffic sent by the Internet of things equipment to external equipment;
the processing module is used for preprocessing a plurality of network flows in the original data set to obtain a plurality of flow characteristics; selecting a plurality of candidate flow characteristics from all flow characteristics, wherein each candidate flow characteristic comprises a plurality of flow characteristic values; for each candidate flow characteristic, performing dimension reduction on a plurality of flow characteristic values of the candidate flow characteristic to obtain a target flow characteristic, wherein the target flow characteristic comprises part of characteristic values of the candidate flow characteristic;
The sending module is used for sending a plurality of target flow characteristics corresponding to the candidate flow characteristics to the management equipment, and the management equipment trains at least two classification models based on the target flow characteristics;
the at least two classification models are used for detecting whether privacy leakage exists in the internet of things equipment.
10. An electronic device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine executable instructions to implement the method of any of claims 1-8.
CN202410064005.4A 2024-01-16 2024-01-16 Internet of things privacy leakage detection method and device based on small sample ensemble learning Active CN117579397B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410064005.4A CN117579397B (en) 2024-01-16 2024-01-16 Internet of things privacy leakage detection method and device based on small sample ensemble learning

Publications (2)

Publication Number Publication Date
CN117579397A true CN117579397A (en) 2024-02-20
CN117579397B CN117579397B (en) 2024-03-26

Family

ID=89886706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410064005.4A Active CN117579397B (en) 2024-01-16 2024-01-16 Internet of things privacy leakage detection method and device based on small sample ensemble learning

Country Status (1)

Country Link
CN (1) CN117579397B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709439A (en) * 2020-05-06 2020-09-25 西安理工大学 Feature selection method based on word frequency deviation rate factor
CN112217787A (en) * 2020-08-31 2021-01-12 北京工业大学 Method and system for generating mock domain name training data based on ED-GAN
CN113904841A (en) * 2021-09-30 2022-01-07 中国信息通信研究院 Network attack detection method applied to IPv6 network environment
US20220086064A1 (en) * 2018-12-14 2022-03-17 Newsouth Innovations Pty Limited Apparatus and process for detecting network security attacks on iot devices
US20220382864A1 (en) * 2020-07-17 2022-12-01 Hunan University Method and system for detecting intrusion in parallel based on unbalanced data deep belief network
CN116662817A (en) * 2023-07-31 2023-08-29 北京天防安全科技有限公司 Asset identification method and system of Internet of things equipment
CN116684877A (en) * 2023-05-12 2023-09-01 中国科学院计算技术研究所 GYAC-LSTM-based 5G network traffic anomaly detection method and system
CN116708003A (en) * 2023-07-14 2023-09-05 国家计算机网络与信息安全管理中心 Malicious encryption traffic detection method


Also Published As

Publication number Publication date
CN117579397B (en) 2024-03-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant