CN113852629B - Network connection abnormity identification method based on natural neighbor self-adaptive weighted kernel density and computer storage medium - Google Patents

Publication number
CN113852629B
Authority
CN
China
Prior art keywords
data
data object
outlier
natural
kof
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111121169.9A
Other languages
Chinese (zh)
Other versions
CN113852629A (en
Inventor
隆华
熊忠阳
张玉芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202111121169.9A
Publication of CN113852629A
Application granted
Publication of CN113852629B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a network connection anomaly identification method based on natural neighbor adaptive weighted kernel density, and a computer storage medium. The method comprises the following steps: preprocessing the data; obtaining the natural neighbor set of each preprocessed data by adaptive iteration; computing the adaptive bandwidth coefficient and the weight of each data from its natural neighbor set; calculating, from the adaptive bandwidth coefficient and the weight, the adaptive weighted kernel density, the degree of outlier and the outlier threshold of each data, or calculating the upper bound of the degree of outlier of each data; and marking as abnormal data either the n data with the largest degrees of outlier or all data whose degree of outlier exceeds the outlier threshold in the network connection record parameters, thereby completing the network connection anomaly identification, where n is a positive integer. The method can inform abnormal data detection on large-scale data, and it can extract abnormal data without parameters when the number of abnormal data is uncertain.

Description

Network connection abnormity identification method based on natural neighbor self-adaptive weighted kernel density and computer storage medium
Technical Field
The invention relates to the field of data mining, in particular to a network connection abnormity identification method based on natural neighbor self-adaptive weighted kernel density and a computer storage medium.
Background
With the rapid development of technologies in the field of data mining, people attend not only to the overall trend of data objects but also to data objects whose behavior differs from that of most data objects, which is the task of anomaly detection. Anomaly detection is one of the most important tasks in data mining and is widely applied: in fraud detection, log data are analyzed to detect misuse or suspicious fraudulent behavior; in medicine, it is used to identify abnormal cells or tumors; and it is also applied to data leakage prevention, discovery of abnormal energy consumption, detection of counterfeit documents, and similar scenarios.
The spread of internet technology across industries has brought great convenience to people's lives. As internet security problems grow, abnormal network connections of various kinds are becoming increasingly common. They can cause serious information security problems such as abnormal web page redirection, slow page loading, and even leakage of personal privacy, so identifying abnormal network connections is very important.
The existing anomaly detection algorithms can be mainly classified into the following categories:
Based on the distribution model: distribution-based methods typically assume that a data set follows a certain distribution and then build a model based on that distribution to detect anomalous objects. This type of approach performs well when the data are sufficient and the data distribution is known. However, the data sets produced by most applications often do not follow an ideal mathematical distribution, and the distribution of high-dimensional data is difficult to estimate. Therefore, the distribution-based approach is only applicable when the data distribution is known or the data dimensionality is low.
Based on clustering: a cluster-based anomaly detection algorithm divides data into clusters according to similarities between the data and then defines an anomalous object as a data object that is not in any cluster or is far from the center of the nearest cluster. However, the performance of such methods depends mainly on the clustering algorithm used, and the outlier data is often just a byproduct of the clustering. Such methods may be ineffective if the anomalous data is assigned to a large cluster by the clustering algorithm.
Based on the neighbors: the neighbor-based method allows test data to determine its properties through a set of found neighbors, which may be "global" or "local". Neighbor-based techniques can be divided into two categories, distance-based and density-based, where distance-based methods use the distance between data as a measure of anomaly detection, without requiring the data itself to satisfy a particular distribution; density-based methods typically work to find the density of the data, and then combine the neighboring set to find the degree of outlier of the data, which is often of a "local" nature. Both distance-based and density-based methods face the problem of selecting the nearest neighbor number k, the selection of k will affect the performance of the algorithm, and meanwhile, the definition of density in the density-based method directly affects the accuracy of the algorithm.
Disclosure of Invention
In order to overcome the defects in the prior art, the present invention provides a method for identifying network connection anomalies based on a natural neighbor adaptive weighted kernel density and a computer storage medium.
In order to achieve the above object, the present invention provides a network connection anomaly identification method based on natural neighbor adaptive weighted kernel density, which includes the following steps:
carrying out data preprocessing on the network connection recording parameters;
self-adaptive iteration is carried out to obtain a natural neighbor set of each preprocessed data;
solving the self-adaptive bandwidth coefficient and weight of each data according to the natural neighbor set of each data;
calculating the self-adaptive weighted kernel density, the outlier and the outlier threshold of each data according to the self-adaptive bandwidth coefficient and the weight, or calculating the upper bound of the outlier of each data;
and marking the data with the maximum n outliers or all the data larger than the threshold value of the outliers in the network connection record parameters as abnormal data to finish the network connection abnormality identification, wherein n is a positive integer.
The network connection abnormity identification method adopts an adaptive bandwidth coefficient and an adaptive weight, which make the density estimation of the data more accurate and robust; pruning data quickly by means of the outlier upper bound can inform abnormal data detection on large-scale data; and by adopting the adaptive weighted kernel density, the degree of outlier and the outlier threshold, abnormal data can be extracted without parameters when the number of abnormal data is uncertain.
The preferred scheme of the network connection abnormity identification method comprises the following steps: the generation steps of the natural adjacent set of each data are as follows:
(1) Constructing a KD tree for the preprocessed data set;
(2) Traversing the data set in the KD tree, searching k neighbors of each data and putting the k neighbors into a corresponding neighbor set NN, and updating an inverse neighbor set RNN of the data regarded as the k neighbors, wherein k is a positive integer with an initial value of 1;
(3) If some data in the data set has an empty reverse neighbor set and the number of data whose reverse neighbor set is empty changed between two adjacent iterations, adding 1 to the k value and executing the step (2);
if each data in the data set has at least one reverse neighbor or the number of data with the reverse neighbor set being empty in two adjacent iterations is not changed, the state of the data set can be considered to be stable at the moment, the k value is not increased, and then the step (4) is executed;
(4) Taking the intersection of the neighbor set NN and the reverse neighbor set RNN of each data, so as to obtain the natural neighbor set NaN of each data.
The natural neighbor set of each data is solved by adopting an iterative mode, and compared with k neighbor, a neighbor parameter k is not required to be given, so that the defect that the performance difference of the algorithm is large due to different k values is avoided, and the algorithm has stability.
The preferable scheme of the network connection abnormity identification method comprises the following steps: the adaptive bandwidth coefficient of the data object p is calculated as h_p = h × dist(p, q), where h is a fixed bandwidth coefficient, dist is a distance function, and data object q is the neighbor in the natural neighbor set of data object p that is farthest from data object p.
The method for calculating the self-adaptive weight of the data object p comprises the following steps: computing the cost cost(p, x) for data object p and data x to reach each other, cost(p, x) = min{ r | x ∈ NaN_r(p) ∧ p ∈ NaN_r(x) }, where data x is any data in the natural neighbor set NaN(p) of data object p, NaN_r(p) refers to the data in the natural neighbor set of data object p that is r-th nearest to data object p, and NaN_r(x) refers to the data in the natural neighbor set of data x that is r-th nearest to data x;
and calculating the average cost of the data object p and all the data in the natural neighbor set NaN (p) which can reach each other, thereby obtaining the self-adaptive weight (p) of the data object p.
The adaptive bandwidth coefficient and the adaptive weight are adopted, so that the density estimation of the data is more accurate and robust.
The preferable scheme of the network connection abnormity identification method comprises the following steps: the adaptive weighted kernel density AKDE (p) for data object p is calculated as:
Figure BDA0003277228610000041
wherein weight (p) is the self-adaptive weight of the data object p, KDE (p) is the kernel density estimation of the data object, and the calculation formula is as follows:
Figure BDA0003277228610000042
where |NaN(p)| is the number of data in the natural neighbor set of data object p, d is the dimensionality of data object p, h_p is the adaptive bandwidth coefficient of data object p, dist is a distance function, and data object q is the neighbor in the natural neighbor set of data object p that is farthest from data object p.
The formula for the degree of outlier KOF (p) of data object p is:
Figure BDA0003277228610000051
where |NaN(p)| is the number of data in the natural neighbor set of data object p, AKDE(p) is the adaptive weighted kernel density of data object p, and AKDE(q) is the adaptive weighted kernel density of data object q.
The outlier threshold calculation steps are as follows:
firstly, the calculated degrees of outlier are sorted in non-decreasing order, and the change rate KOF_var(i,j) of the degree of outlier is calculated:
Figure BDA0003277228610000052
where i and j are the subscripts of two adjacent data objects;
calculating the outlier threshold KOF_threshold from the calculated change rates by the formula KOF_threshold = mean(KOF_var) + ω × std(KOF_var), where mean(KOF_var) is the mean of the change rates of the degree of outlier, std(KOF_var) is the standard deviation of the change rates of the degree of outlier, and ω is an adjustment coefficient.
The calculation step of the upper limit of the outlier of the data object p comprises the following steps:
computing the adaptive weighted kernel density upper bound AKDE_max(p) of data object p:
Figure BDA0003277228610000053
where data object o is the data closest to data object p in the natural neighbor set of data object p;
computing the adaptive weighted kernel density lower bound AKDE_min(p) of data object p:
Figure BDA0003277228610000054
where data object q is the data farthest from p in the natural neighbor set of data object p;
calculating the outlier upper bound UBKOF(p) of data object p:
Figure BDA0003277228610000061
where NaN(p) is the natural neighbor set of data object p, |NaN(p)| is the number of data in that set, AKDE_min(p) is the lower bound of the adaptive weighted kernel density of data object p, AKDE_max(x) is the upper bound of the adaptive weighted kernel density of data x in the natural neighbor set of data object p, and KOF(p) is the degree of outlier of data object p.
The preferable scheme of the network connection abnormity identification method comprises the following steps: the step of selecting the n data with the maximum degree of outlier in the network connection recording parameters is as follows:
(1) Randomly selecting n data, and constructing a minimum heap based on the degrees of outlier of the n data, with the degree of outlier at the top of the heap denoted KOF(top);
(2) Traversing the remaining data in the dataset:
for a data object p, if the upper bound of the degree of outlier UBKOF (p) of the data object p is less than the top of heap degree of outlier KOF (top), continuing to perform step (2); otherwise, executing the step (3); after the data traversal is finished, executing the step (5);
(3) Calculating an outlier KOF (p) of the data object p, and if KOF (p) is less than KOF (top), performing step (2); otherwise, executing step (4).
(4) Popping the heap top element, putting the value of KOF (p) into the heap, and updating the minimum value of the degree of outlier in the heap to be used as the KOF (top);
(5) And outputting data corresponding to the n outliers in the heap.
The calculation of the top-n problem is accelerated, and the data with the maximum n outliers in the network connection recording parameters can be quickly selected.
The application also provides a computer storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction enables a processor to execute the operation corresponding to the network connection abnormity identification method based on the natural neighbor self-adaptive weighted kernel density.
The invention has the following beneficial effects: adaptive weights make the density estimation of the data more accurate; adjusting the adaptive bandwidth coefficient in the kernel density estimation yields a density estimate that is more robust than that of the LOF algorithm, and abnormal data in sparser regions obtain a larger degree of outlier (relative density) than under the LOF algorithm; in addition, the calculation of the top-n problem is accelerated, and a statistical method allows abnormal data to be identified when the number of abnormal data is uncertain.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a data set diagram of network connection record parameters in an embodiment;
FIG. 3 is a graph of data set outliers and outlier thresholds in an embodiment;
FIG. 4 is a diagram of anomaly data extracted for the top-n problem.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention and are not to be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and limited, it is to be noted that the terms "mounted," "connected," and "connected" are to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection via an intermediate medium, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.
As shown in fig. 1, the present invention provides an embodiment of a network connection anomaly identification method based on a natural neighbor adaptive weighted kernel density, which is described in detail below.
The network connection record parameters are obtained first, as shown in fig. 2. They fall into four main categories: basic connection features, connection content features, time-based network traffic statistical features, and host-based network traffic statistical features, 41 items in total. Sample data are shown in Table 1:
TABLE 1
Figure BDA0003277228610000081
Then the acquired data set of network connection record parameters is preprocessed. In this embodiment, preprocessing comprises removing repeated network connection records, deleting network connection records with an illegal format, and selecting the four attributes {service, duration, src_bytes, dst_bytes} as basic attributes, with service used as the label; text values are then replaced with numerical values, the numerical values are normalized, and the labels are one-hot encoded.
Data parameter examples after data preprocessing:
duration src_bytes dst_bytes labels
-2.302585092994046 10.906691489914584 9.025708147644988 1
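The preprocessing steps above can be sketched as follows. This is a minimal illustration and not the embodiment's actual pipeline: the function name `preprocess`, the dictionary input format, deduplication on the four selected attributes, and min-max normalization are all assumptions (the text does not state which normalization is used, and the sample row above suggests a different one).

```python
def preprocess(records):
    """Clean raw connection records (hypothetical dict format, string values).

    Returns (points, labels, services): normalized numeric features,
    one-hot encoded service labels, and the label vocabulary.
    """
    rows, seen = [], set()
    for rec in records:
        try:
            row = (rec["service"], float(rec["duration"]),
                   float(rec["src_bytes"]), float(rec["dst_bytes"]))
        except (KeyError, TypeError, ValueError):
            continue                      # drop records with an illegal format
        if row not in seen:               # drop repeated records
            seen.add(row)
            rows.append(row)

    services = sorted({r[0] for r in rows})
    columns = []
    for col in (1, 2, 3):
        values = [r[col] for r in rows]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0           # constant column: avoid dividing by 0
        # Assumption: min-max scaling stands in for the unspecified normalization.
        columns.append([(v - lo) / span for v in values])

    points = list(zip(*columns))
    labels = [[int(r[0] == s) for s in services] for r in rows]
    return points, labels, services
```

The sketch expects at least one valid record; an empty input would need a guard before the `min`/`max` calls.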
After data preprocessing, the natural neighbor set of each data is obtained by adaptive iteration.
NaN(x) is defined as the natural neighbor set of data x; RNN(x) is the reverse neighbor set of data x, which contains the data that have x as a neighbor; NN(x) is the neighbor set of data x.
In this embodiment, the step of generating the natural neighbor set is as follows:
(1) Initializing parameters, and constructing a KD tree for the data set;
(2) Traversing the data set in the KD tree, searching k neighbors of each data, putting the k neighbors into a corresponding neighbor set NN, and updating an inverse neighbor set RNN of the data regarded as the k neighbors, wherein k is a positive integer with an initial value of 1;
(3) If some data in the data set has an empty reverse neighbor set and the number of data whose reverse neighbor set is empty changed between two adjacent iterations, adding 1 to the k value and executing the step (2);
if each data in the data set has at least one reverse neighbor or the number of data with the reverse neighbor set being empty in two adjacent iterations is not changed, the state of the data set can be considered to be stable at the moment, the k value is not increased, and then the step (4) is executed;
(4) Taking the intersection of the neighbor set NN and the reverse neighbor set RNN of each data, so as to obtain the natural neighbor set NaN of each data.
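Steps (1) to (4) can be sketched as below. This is an illustrative reading rather than the embodiment's implementation: the function name `natural_neighbors` is hypothetical, and a brute-force distance ranking replaces the KD tree so that the example needs only the standard library.

```python
import math

def natural_neighbors(points):
    """Return (nan_sets, k): nan_sets[i] lists the natural neighbors of
    point i ordered from nearest to farthest, k is the final neighbor count."""
    n = len(points)
    # Brute-force neighbor ranking; this stands in for the KD-tree queries.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: math.dist(points[i], points[j]))
        for i in range(n)
    ]
    nn = [[] for _ in range(n)]        # NN: the k nearest neighbors found so far
    rnn = [set() for _ in range(n)]    # RNN: points that list i as a neighbor
    prev_empty = None
    k = 0
    while k < n - 1:
        k += 1
        for i in range(n):
            j = order[i][k - 1]        # the k-th nearest neighbor of i
            nn[i].append(j)
            rnn[j].add(i)
        empty = sum(1 for s in rnn if not s)
        if empty == 0 or empty == prev_empty:
            break                      # stable state: stop increasing k
        prev_empty = empty
    # NaN(i) = NN(i) ∩ RNN(i), kept in nearest-first order.
    nan_sets = [[j for j in nn[i] if j in rnn[i]] for i in range(n)]
    return nan_sets, k
```

For four points on a unit square plus one distant point, the distant point never becomes anyone's neighbor, so the iteration stops once the count of points without reverse neighbors stops changing, and that point ends with an empty natural neighbor set.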
After the natural neighbor set of each data is obtained, the adaptive bandwidth coefficient and the weight of each data are calculated from it. The specific steps are as follows:
For a data object p whose natural neighbor set is NaN(p), the adaptive bandwidth coefficient h_p of p is calculated as h_p = h × dist(p, q), where h is a fixed bandwidth coefficient and dist is a distance function; this embodiment preferably, but not exclusively, uses the Euclidean distance. q is the neighbor farthest from p in the natural neighbor set of p and can be read directly from the computed natural neighbor set. By the definition of kernel density, the denser the region where data object p is located, the smaller dist(p, q) is, the smaller the resulting adaptive bandwidth coefficient is, and the larger the kernel density estimate is, and vice versa.
The adaptive weight (p) of data object p is calculated as
Figure BDA0003277228610000101
where |NaN(p)| is the number of natural neighbors of data object p and cost(p, x) is the cost for data p and data x to reach each other; that is, the adaptive weight of data object p is the average cost for p and the data in its natural neighbor set to reach each other. The cost function is cost(p, x) = min{ r | x ∈ NaN_r(p) ∧ p ∈ NaN_r(x) }, where NaN_r(p) refers to the data in the natural neighbor set of data object p that is r-th nearest to data object p, and NaN_r(x) refers to the data in the natural neighbor set of data x that is r-th nearest to data x.
As can be seen from the calculation of the adaptive weights, if a data object p is in a sparse region, the cost that p and the data in its natural neighbor set can reach each other is large, and vice versa.
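The bandwidth and weight computations can be sketched as follows. The function names are hypothetical, and the sketch assumes each natural neighbor list is ordered from nearest to farthest. It also assumes one reading of the cost definition: NaN_r(p) is taken as the r nearest natural neighbors of p, so cost(p, x) is the larger of the two ranks that p and x hold in each other's lists.

```python
import math

def adaptive_bandwidth(points, nan_sets, p, h=1.0):
    """h_p = h * dist(p, q), with q the farthest natural neighbor of p."""
    q = nan_sets[p][-1]                # lists are ordered nearest -> farthest
    return h * math.dist(points[p], points[q])

def cost(nan_sets, p, x):
    """Smallest r such that x is within the r nearest natural neighbors of p
    and p is within the r nearest natural neighbors of x (assumed reading)."""
    return max(nan_sets[p].index(x), nan_sets[x].index(p)) + 1

def adaptive_weight(nan_sets, p):
    """Average mutual-reachability cost over the natural neighbors of p."""
    neighbors = nan_sets[p]
    return sum(cost(nan_sets, p, x) for x in neighbors) / len(neighbors)
```

Natural neighborhood is symmetric (x is in NaN(p) exactly when p is in NaN(x)), so the two `index` lookups always succeed. A data object whose natural neighbor set is empty has no defined bandwidth or weight here and would need separate handling.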
And after the self-adaptive bandwidth coefficient and the weight of each data are obtained, calculating the self-adaptive weighted kernel density, the outlier, the upper bound of the outlier and/or the threshold of the outlier of each data according to the self-adaptive bandwidth coefficient and the weight in different application scenes.
For a data object p, the adaptive weighted kernel density AKDE (p) is calculated by the formula:
Figure BDA0003277228610000102
wherein weight (p) is the adaptive weight of the data object p, and the larger the value of weight (p) is, the smaller the value of the adaptive weighted kernel density is; KDE (p) is the kernel density estimation of the data object, and the calculation formula is as follows:
Figure BDA0003277228610000103
where |NaN(p)| is the number of data in the natural neighbor set of data object p; d is the dimensionality of data object p, that is, the number of attributes of the data, which is determined by the data in the acquired data set; and h_p is the adaptive bandwidth coefficient of data object p. Data object q is the neighbor in the natural neighbor set of data object p that is farthest from data object p; that is, q belongs to the natural neighbor set of p, and its distance to p is the largest among all data in that set.
The formula for the degree of outlier KOF(p) of data object p is:
Figure BDA0003277228610000111
wherein | NaN (p) | is the number of data in the natural neighbor set of the data object p, and AKDE (p) is the adaptive weighted kernel density of the data object, and it can be known from the calculation formula that if the data object p is an abnormal object, its KOF value is larger.
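Since the KOF formula is given only as a figure, the following sketch rests on an assumption: that KOF is a relative density in the style of LOF, namely the mean adaptive weighted kernel density of the natural neighbors of p divided by the density of p itself. The text describes the degree of outlier as a relative density that grows for abnormal objects, which is consistent with this form but does not confirm it. The function name `kof` and the precomputed `akde` list are hypothetical.

```python
def kof(akde, nan_sets, p):
    """Assumed relative-density form of the degree of outlier:
    mean AKDE over the natural neighbors of p, divided by AKDE(p)."""
    neighbors = nan_sets[p]
    mean_neighbor_density = sum(akde[q] for q in neighbors) / len(neighbors)
    return mean_neighbor_density / akde[p]
```

Under this reading, an object whose density matches that of its neighbors scores about 1, while an object in a region much sparser than its neighbors scores well above 1.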
The upper bound of outliers for data object p is calculated as follows:
First, the upper and lower bounds of the adaptive weighted kernel density of data object p are calculated from the nearest and farthest neighbors in its natural neighbor set. Because the natural neighbor set is built in order of increasing distance, the nearest and farthest neighbors of data p can be obtained in O(1) time;
the upper bound AKDE_max(p) of the adaptive weighted kernel density is:
Figure BDA0003277228610000112
where data object o is the data closest to p in the natural neighbor set of data object p.
The lower bound AKDE_min(p) of the adaptive weighted kernel density is:
Figure BDA0003277228610000113
where data object q is the data in the natural neighbor set of data object p that is farthest from p.
The upper outlier bound UBKOF (p) of the data object p can be calculated from the upper and lower bounds of the adaptive weighted kernel density of the data object p in the following manner:
Figure BDA0003277228610000121
where |NaN(p)| is the number of data in the natural neighbor set of data object p, AKDE_min(p) is the lower bound of the adaptive weighted kernel density of data object p, and AKDE_max(x) is the upper bound of the adaptive weighted kernel density of data x in the natural neighbor set of data object p.
The outlier threshold is calculated as follows:
First, the calculated degrees of outlier are sorted in non-decreasing order, and the change rate KOF_var(i,j) of the degree of outlier is calculated as follows:
Figure BDA0003277228610000122
where i and j are the subscripts of two adjacent data objects. The outlier threshold KOF_threshold is calculated from the calculated change rates by the formula KOF_threshold = mean(KOF_var) + ω × std(KOF_var), where mean(KOF_var) is the mean of the change rates of the degree of outlier, std(KOF_var) is the standard deviation of the change rates of the degree of outlier, and ω is an adjustment coefficient with value range [0, 3]; ω = 2.5 is preferred in this embodiment.
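The threshold formula can be sketched as below. The change-rate formula itself appears only as a figure, so this example assumes that KOF_var is the plain difference between adjacent sorted degrees of outlier, and that std is the population standard deviation; both are assumptions, as is the function name `kof_threshold`.

```python
import statistics

def kof_threshold(kof_values, omega=2.5):
    """KOF_threshold = mean(KOF_var) + omega * std(KOF_var)."""
    ordered = sorted(kof_values)                   # non-decreasing order
    # Assumption: change rate = difference between adjacent sorted values.
    var = [b - a for a, b in zip(ordered, ordered[1:])]
    return statistics.mean(var) + omega * statistics.pstdev(var)
```

The function needs at least two degrees of outlier so that one change rate exists. For example, `kof_threshold([1, 2, 4], omega=1.0)` has change rates 1 and 2, whose mean is 1.5 and whose population standard deviation is 0.5.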
As can be seen from fig. 3, the obtained outlier threshold can accurately distinguish between normal data and abnormal data in the data set.
And finally, outputting the n data with the maximum degree of outlier or all the data larger than the threshold value of the degree of outlier, thereby extracting the outlier.
The following description will take specific application scenarios as examples.
The top-n problem: the n data with the largest degrees of outlier need to be acquired. Because the scenario fixes n in advance, these n data may include both normal data and abnormal data. The outlier upper bound is used to prune data quickly.
The algorithm is as follows:
(1) Randomly selecting n data, calculating their degrees of outlier, and constructing a minimum heap from these n degrees of outlier; the degree of outlier at the top of the heap, denoted KOF(top), is the minimum in the heap.
(2) Traversing the remaining data in the dataset:
for a data object p, the outlier upper bound UBKOF(p) of p is calculated from its adaptive bandwidth coefficient h_p, its adaptive weight weight(p), and the nearest and farthest neighbors in its natural neighbor set NaN(p); if UBKOF(p) is smaller than KOF(top), continuing to execute the step (2); otherwise, executing the step (3); after the data traversal is finished, executing the step (5);
(3) Calculating the degree of outlier KOF (p) of p, if KOF (p) is smaller than KOF (top), performing step (2); otherwise, executing step (4).
(4) Popping the heap top element, putting the value of KOF (p) into the heap, and updating KOF (top);
(5) And outputting data corresponding to the n outliers in the heap.
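Steps (1) to (5) can be sketched with a min-heap as follows. The scoring functions are passed in as callables, so the sketch does not depend on the exact KOF and UBKOF formulas; it only requires that `upper_bound(p)` is never smaller than `kof(p)`. The function name `top_n_outliers` is hypothetical, and the first n items seed the heap instead of a random selection so that the example is deterministic.

```python
import heapq

def top_n_outliers(items, n, kof, upper_bound):
    """Return the n items with the largest kof(item), largest first,
    skipping the exact score whenever the cheap upper bound rules an item out."""
    items = list(items)
    # Step (1): seed the min-heap (the text selects these n items at random).
    # The index i breaks ties so that the items themselves are never compared.
    heap = [(kof(p), i, p) for i, p in enumerate(items[:n])]
    heapq.heapify(heap)
    for i, p in enumerate(items[n:], start=n):     # step (2)
        if upper_bound(p) < heap[0][0]:
            continue                               # pruned: KOF(p) not computed
        score = kof(p)                             # step (3)
        if score < heap[0][0]:
            continue
        heapq.heapreplace(heap, (score, i, p))     # step (4): pop top, push p
    return [p for _, _, p in sorted(heap, reverse=True)]   # step (5)
```

The benefit depends on how tight the bound is: each pruned item saves one full degree-of-outlier computation, and the result is the same as a brute-force ranking as long as the bound is valid.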
Fig. 4 shows the 43 data with the largest degrees of outlier output for the top-n problem on the data set of this embodiment. Comparing fig. 2 and fig. 4 shows that the abnormal data can be obtained accurately and quickly by using the outlier upper bound.
The automatic abnormal data extraction problem: in this application scenario the abnormal data need to be identified automatically. The algorithm is as follows:
(1) Traverse all data in the dataset:
for a data object p, calculating the adaptive weighted kernel density AKDE(p) from the adaptive bandwidth coefficient h_p, the adaptive weight weight(p), and all data objects in the natural neighbor set NaN(p), and then calculating the degree of outlier KOF(p) according to NaN(p);
(2) Calculating the outlier threshold KOF_threshold, traversing all the data in the data set again, and marking the data whose degree of outlier is larger than the outlier threshold as abnormal data.
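Step (2) can be sketched in Python as follows, using the threshold formula given in the claims, KOF_threshold = mean(KOF_var) + ω · std(KOF_var). The sketch takes the rates of change KOF_var as an input, because their formula is only available as a figure here. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which variant is meant.

```python
import statistics


def outlier_threshold(kof_var, omega):
    """KOF_threshold = mean(KOF_var) + omega * std(KOF_var)."""
    # Assumption: population standard deviation; the source does not specify.
    return statistics.mean(kof_var) + omega * statistics.pstdev(kof_var)


def flag_abnormal(kof, threshold):
    """Return the indices of data whose degree of outlier exceeds the threshold."""
    return [i for i, value in enumerate(kof) if value > threshold]
```

A larger adjustment coefficient ω raises the threshold and therefore marks fewer data as abnormal.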
Fig. 3 shows the degree of outlier of all data in the entire exemplary data set and the degree of outlier threshold obtained by the statistical learning method, and it can be seen from fig. 3 that the obtained degree of outlier threshold can accurately distinguish between normal data and abnormal data in the data set.
For the top-n problem, the invention applies an upper bound of the degree of outlier that can be obtained in O(1) time, thereby accelerating the calculation. For the automatic extraction problem, it uses a statistical method, so that the abnormal data can be obtained without specifying their number in advance.
The present application further provides an embodiment of a computer storage medium, where at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to perform an operation corresponding to the above network connection anomaly identification method based on natural neighbor adaptive weighted kernel density.
In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (3)

1. A network connection abnormity identification method based on natural neighbor self-adaptive weighted kernel density is characterized by comprising the following steps:
carrying out data preprocessing on the network connection recording parameters;
self-adaptive iteration is carried out to obtain a natural neighbor set of each preprocessed data;
the generation steps of the natural neighbor set of each data are as follows:
(1) Constructing a KD tree for the preprocessed data set;
(2) Traversing the data set with the KD tree, searching for the k nearest neighbors of each data, putting them into the corresponding neighbor set NN, and updating the reverse neighbor set RNN of each data found as a k nearest neighbor, wherein k is a positive integer with an initial value of 1;
(3) If some data in the data set has an empty reverse neighbor set, or the number of data whose reverse neighbor set is empty has changed between two adjacent iterations, adding 1 to the k value and executing the step (2);
if every data in the data set has at least one reverse neighbor, or the number of data whose reverse neighbor set is empty has not changed between two adjacent iterations, the data set is considered stable, the k value is no longer increased, and the step (4) is executed;
(4) Solving the intersection of each data neighbor set NN and the reverse neighbor set RNN, wherein the intersection is a natural neighbor set NaN of each data;
solving the self-adaptive bandwidth coefficient and weight of each data according to the natural neighbor set of each data;
specifically, the adaptive bandwidth coefficient of the data object p is calculated as h_p = h · dist(p, q), where h is a fixed bandwidth coefficient, dist is a distance function, and data object q is the neighbor in the natural neighbor set of data object p that is farthest from data object p;
the method for calculating the self-adaptive weight of the data object p comprises the following steps: calculating the cost cost(p, x) for the data object p and the data x to reach each other, wherein cost(p, x) = min{ r | x ∈ NaN_r(p) ∧ p ∈ NaN_r(x) }, where data x is any data in the natural neighbor set NaN(p) of data object p, NaN_r(p) refers to the data in the natural neighbor set of data object p that is r-th nearest to data object p, and NaN_r(x) refers to the data in the natural neighbor set of data object x that is r-th nearest to data object x;
calculating the average cost of the data object p and all the data in the natural neighbor set NaN (p) which can reach each other to obtain the self-adaptive weight (p) of the data object p;
calculating the self-adaptive weighted kernel density, the outlier and the outlier threshold of each data according to the self-adaptive bandwidth coefficient and the weight, or calculating the upper bound of the outlier of each data;
specifically, the adaptive weighted kernel density AKDE (p) of the data object p is calculated by the formula:
Figure FDA0003815279040000021
where weight(p) is the adaptive weight of data object p and KDE(p) is the kernel density estimate of data object p, calculated by the formula:
Figure FDA0003815279040000022
where |NaN(p)| is the number of data in the natural neighbor set of data object p, d is the dimensionality of data object p, h_p is the adaptive bandwidth coefficient of data object p, dist is a distance function, and data object q is the neighbor farthest from data object p in the natural neighbor set of data object p;
the formula for the calculation of the degree of outlier KOF (p) of the data object p is:
Figure FDA0003815279040000023
wherein |NaN(p)| is the number of data in the natural neighbor set of the data object p, AKDE(p) is the adaptive weighted kernel density of data object p, and AKDE(q) is the adaptive weighted kernel density of data object q;
the outlier threshold calculation steps are as follows:
firstly, the calculated degrees of outlier are sorted in non-decreasing order, and the rate of change of the degree of outlier KOF_var(i,j) is calculated:
Figure FDA0003815279040000031
where i and j are the subscripts of two adjacent data objects;
calculating the outlier threshold KOF_threshold from the calculated rates of change by the formula KOF_threshold = mean(KOF_var) + ω · std(KOF_var), where mean(KOF_var) is the mean of the rates of change of the degree of outlier, std(KOF_var) is the standard deviation of the rates of change of the degree of outlier, and ω is the adjustment coefficient;
the calculation step of the upper limit of the outlier of the data object p comprises the following steps:
calculating the adaptive weighted kernel density upper bound AKDE_max(p) of data object p:
Figure FDA0003815279040000032
where data object o is the data closest to data object p in the natural neighbor set of data object p;
calculating the adaptive weighted kernel density lower bound AKDE_min(p) of data object p:
Figure FDA0003815279040000033
where data object q is the data farthest from data object p in the natural neighbor set of data object p;
calculating the upper bound of the degree of outlier UBKOF(p) of data object p:
Figure FDA0003815279040000034
where NaN(p) is the natural neighbor set of data object p, |NaN(p)| is the number of data in the natural neighbor set of data object p, AKDE_min(p) is the lower bound of the adaptive weighted kernel density of data object p, AKDE_max(x) is the upper bound of the adaptive weighted kernel density of data x in the natural neighbor set of data object p, and KOF(p) is the degree of outlier of data object p;
and marking the n data with the largest degrees of outlier, or all the data whose degree of outlier is larger than the outlier threshold, in the network connection record parameters as abnormal data to finish the network connection abnormality identification, wherein n is a positive integer.
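The natural neighbor set generation in steps (1) to (4) of claim 1 can be illustrated with the following Python sketch. It is not the claimed implementation: it replaces the KD tree with a brute-force distance ranking to stay dependency-free, the function name `natural_neighbors` is hypothetical, the stop condition is given precedence whenever both branches of step (3) could apply, and the guard `k >= n - 1` is an added assumption that keeps the loop finite on very small inputs.

```python
import math


def natural_neighbors(points):
    """Return (list of NaN index sets, final k) for a list of coordinate tuples."""
    n = len(points)
    # Brute-force stand-in for the KD tree: neighbors of each point sorted
    # by distance, with ties broken by index.
    ranked = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: (math.dist(points[i], points[j]), j))
        for i in range(n)
    ]
    prev_empty = None
    k = 1
    while True:
        # Step (2): k nearest neighbors NN and reverse neighbors RNN.
        nn = [set(ranked[i][:k]) for i in range(n)]
        rnn = [set() for _ in range(n)]
        for i in range(n):
            for j in nn[i]:
                rnn[j].add(i)
        # Step (3): stop when no RNN is empty or the empty count is unchanged.
        empty = sum(1 for s in rnn if not s)
        if empty == 0 or empty == prev_empty or k >= n - 1:
            break
        prev_empty = empty
        k += 1
    # Step (4): the natural neighbor set is the intersection of NN and RNN.
    return [nn[i] & rnn[i] for i in range(n)], k
```

For the one-dimensional points 0, 1, 2 and 10, the search stops at k = 2: the isolated point 10 never becomes anyone's neighbor, so its natural neighbor set stays empty, while the three clustered points are natural neighbors of one another.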
2. The method for identifying network connection abnormality based on natural neighbor adaptive weighted kernel density as claimed in claim 1, wherein the step of selecting n pieces of data with the largest degree of outlier among the network connection recording parameters comprises:
(1) Randomly selecting n data, and constructing a minimum heap according to the outliers of the n data, wherein the top outlier of the heap is KOF (top);
(2) Traversing the remaining data in the dataset:
for a data object p, if the upper bound of the degree of outlier UBKOF (p) of the data object p is less than the top of heap degree of outlier KOF (top), continuing to perform step (2); otherwise, executing the step (3); after the data traversal is finished, executing the step (5);
(3) Calculating an outlier KOF (p) of the data object p, and if KOF (p) is less than KOF (top), performing step (2); otherwise, executing the step (4);
(4) Popping the heap top element, putting the value of KOF (p) into the heap, and updating the minimum value of the degree of outlier in the heap to be used as the KOF (top);
(5) And outputting data corresponding to the n outliers in the heap.
3. A computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the method for identifying network connectivity anomalies based on adaptive weighted kernel density of natural neighbors of any one of claims 1-2.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111121169.9A CN113852629B (en) 2021-09-24 2021-09-24 Network connection abnormity identification method based on natural neighbor self-adaptive weighted kernel density and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111121169.9A CN113852629B (en) 2021-09-24 2021-09-24 Network connection abnormity identification method based on natural neighbor self-adaptive weighted kernel density and computer storage medium

Publications (2)

Publication Number Publication Date
CN113852629A CN113852629A (en) 2021-12-28
CN113852629B true CN113852629B (en) 2022-10-28

Family

ID=78979718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111121169.9A Active CN113852629B (en) 2021-09-24 2021-09-24 Network connection abnormity identification method based on natural neighbor self-adaptive weighted kernel density and computer storage medium

Country Status (1)

Country Link
CN (1) CN113852629B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009910B (en) * 2023-10-08 2023-12-15 湖南工程学院 Intelligent monitoring method for abnormal change of ambient temperature

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649339A (en) * 2015-10-30 2017-05-10 北大方正集团有限公司 Method and device for mining outlier
CN112364887A (en) * 2020-10-16 2021-02-12 重庆大学 Minimum spanning tree clustering algorithm and system based on density core
CN112800115A (en) * 2021-04-07 2021-05-14 腾讯科技(深圳)有限公司 Data processing method and data processing device
CN113011888A (en) * 2021-03-11 2021-06-22 中南大学 Method, device, equipment and medium for detecting abnormal transaction behaviors of digital currency

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10073887B2 (en) * 2015-07-06 2018-09-11 Conduent Business Services, Llc System and method for performing k-nearest neighbor search based on minimax distance measure and efficient outlier detection
CN109067725B (en) * 2018-07-24 2021-05-14 成都亚信网络安全产业技术研究院有限公司 Network flow abnormity detection method and device
US11005872B2 (en) * 2019-05-31 2021-05-11 Gurucul Solutions, Llc Anomaly detection in cybersecurity and fraud applications

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A near-autonomous and incremental intrusion detection system through active learning of known and unknown attacks;Lynda Boukela等;《2021 International Conference on Security, Pattern Analusis, and Cybernetics(SPAC)》;20210620;全文 *
An Efficient Density-Based Local Outlier Detection Approach for Scatterred Data;Shubin Su等;《IEEE ACCESS》;20181211;全文 *
基于模糊C均值的文本迁移学习算法研究;田宏泽;《中国优秀硕士学位论文全文数据库(电子期刊)》;20180615;全文 *
基于离群点检测的网络异常检测算法研究;刘人毓;《中国优秀硕士学位论文全文数据库(电子期刊)》;20190415;全文 *

Also Published As

Publication number Publication date
CN113852629A (en) 2021-12-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant