CN114564814A - Dynamic threshold Gaussian kernel density estimation system and method for sparse data

Info

Publication number: CN114564814A
Application number: CN202210033900.0A
Authority: CN
Inventors: 杭菲璐; 张振红; 郭威; 陈何雄; 罗震宇; 毛正雄; 何映军; 谢林江; 周程昊; 占梦来; 张军
Original assignee: Information Center of Yunnan Power Grid Co Ltd
Current assignee: Information Center of Yunnan Power Grid Co Ltd
Priority date: 2022-01-12
Filing date: 2022-01-12
Publication date: 2022-05-31

Abstract

The invention relates to a dynamic threshold Gaussian kernel density estimation system and method for sparse data. The method comprises the following steps: preprocessing the raw pcap packet data and extracting the useful information into a csv file; selecting a suitable kernel function and bandwidth according to the data characteristics to complete kernel density estimation; dynamically establishing a baseline of the historical login state distribution from the kernel density estimation result, and optimizing a dynamic threshold for the sparse characteristic of login data so that it better fits the sparse part of the data; and finally, detecting login anomalies according to the established baseline and threshold. Compared with traditional algorithms that neither consider the data characteristics nor establish a dynamic threshold, the proposed estimation method greatly improves the anomaly detection rate on the applicant's own data set.

Description

Dynamic threshold Gaussian kernel density estimation system and method for sparse data

Technical Field

The invention relates to the field of baseline calculation, in particular to a dynamic threshold Gaussian kernel density estimation system and method for sparse data.

Background

Estimation methods can be broadly classified into parametric and non-parametric methods according to whether sufficient prior information is available. Parametric estimation estimates the unknown parameters of a population distribution from samples drawn from that population. In practice, one often needs to analyze or infer the intrinsic regularities reflected by the data at hand, that is, to decide which statistics to use to infer the distribution or numerical characteristics of the population from the sample data.

In parametric regression analysis, one assumes that the data follow a certain form, such as linear, linearizable, or exponential behavior, and then finds a specific solution within that family of objective functions, i.e., determines the unknown parameters of the regression model. In parametric discriminant analysis, it is assumed that the randomly valued data samples used as the basis for discrimination follow a specific distribution within each possible class. Experience and theory have shown that there is often a large gap between these basic assumptions of parametric models and the actual physical model, so these methods do not always achieve satisfactory results.

In non-parametric estimation, the form of the overall probability density function is unknown, requiring us to directly infer the probability density function itself. Some typical distribution forms common in statistics are not always able to fit the distribution in reality. Furthermore, multimodal distributions are often encountered in many practical problems, which forces us to use samples to infer the overall distribution. Non-parametric estimation, also known as parameterless density estimation, is a method that requires minimal a priori knowledge, relies entirely on training data for estimation, and can be used for arbitrary shape density estimation. Common non-parametric estimation methods are: histogram, kernel density estimation, K-nearest neighbor estimation.

Existing baseline calculation algorithms mainly include polynomial fitting, probabilistic algorithms, sorting algorithms, wavelet theory, neural network algorithms, and the like. They fall into two types: static baseline algorithms and dynamic baseline algorithms. A static baseline is set either manually or automatically. Manual setting can draw on the experience of the person setting it and is flexible and controllable to adjust, but when the index fluctuates over a large range it is easily affected by subjectivity and is not updated in time. Automatic setting derives the baseline from simple statistics over the fluctuation range of the data, but it is difficult to operate and maintain, is not flexible enough, and performs poorly on data whose indices change quickly. Existing dynamic baseline algorithms are roughly divided into two types: those based on a probabilistic method and those based on a sorting method. A common problem of all of the above algorithms is that they do not take the characteristics of the samples into account, and thus ignore the room for improvement offered by tying the threshold to those characteristics.

Disclosure of Invention

Aiming at the defects of the existing method, the invention provides a dynamic threshold Gaussian kernel density estimation system and method aiming at sparse data.

The technical scheme of the invention is as follows:

A dynamic threshold Gaussian kernel density estimation system for sparse data comprises a collector and a processor; the collector collects data;

the processor preprocesses the raw pcap packet data, extracts the useful information into a csv file, and then selects a suitable kernel function and bandwidth according to the data characteristics to complete kernel density estimation; it dynamically establishes a baseline of the historical login state distribution from the kernel density estimation result and optimizes a dynamic threshold for the sparse characteristic of login data so that it better fits the sparse part of the data; finally, it detects login anomalies according to the established baseline and threshold.

Further, the collector extracts the pcap traffic data packets.

Further, the preprocessing integrates the corresponding data packets into streams, arranges the required parts into csv format, and stores them for later use.

Further, a Gaussian kernel is selected for kernel density estimation, and the bandwidth is selected by minimizing the mean integrated squared error (MISE), which is calculated as follows:

$$\mathrm{MISE}(h) = E\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 \, dx\right]$$

wherein $\hat{f}_h(x)$ represents the result of kernel density estimation using bandwidth h, f(x) represents the true density, and E represents the expected value.

Further, the kernel density estimation is specifically as follows: let $x_1, x_2, \dots, x_n$ be n independent and identically distributed sample points drawn from a distribution F whose probability density function is f; the kernel density estimate is:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

wherein K(x) is the selected kernel function; h > 0 is a smoothing parameter, which is the calculated bandwidth; and $K_h(x) = K(x/h)/h$ is the kernel function after bandwidth scaling.

Further, before the baseline is established, the kernel density estimation result is sampled and exponentially restored, and a dynamic threshold is established for sparse data:

the model distribution at a point can be regarded as the weighted sum of the contributions of all samples, and the threshold Th is a critical value of the kernel density; the M sample points are first arranged in ascending order, and $x_{begin}$ and $x_{end}$ are then obtained as shown in the following formula:

$$x_{begin} = X_{min} - D \cdot h, \qquad x_{end} = X_{max} + D \cdot h$$

wherein D represents a constant, taken as about 2.5, with a suitable value selected according to actual needs; h represents the bandwidth; $X_{max}$ represents the maximum value in the ascending order and $X_{min}$ represents the minimum value;

substituting $x_{end}$ into the kernel density formula gives:

$$Th = \frac{1}{Mh}\sum_{i=1}^{M} K\!\left(\frac{x_{end} - x_i}{h}\right)$$

after the overall dynamic threshold is obtained, the threshold needs to be added to the dynamic probability baseline, and the threshold added at the point is adjusted according to the normalized probability density at different points so as to achieve the purpose of optimizing the threshold at the data sparse point.

Further, login anomaly detection is performed as follows: based on the acquired historical login state baseline and threshold of the user, kernel density estimation is performed on the user's current-day data with the selected kernel function and bandwidth, and the result is compared with the historical baseline threshold, which completes the detection of login anomalies.

The invention also relates to a dynamic threshold Gaussian kernel density estimation method for sparse data, which comprises the following steps:

preprocessing the raw pcap packet data and extracting the useful information into a csv file, and then selecting a suitable kernel function and bandwidth according to the data characteristics to complete kernel density estimation; dynamically establishing a baseline of the historical login state distribution from the kernel density estimation result, and optimizing a dynamic threshold for the sparse characteristic of login data so that it better fits the sparse part of the data; and finally, detecting login anomalies according to the established baseline and threshold.

Compared with the prior art, the invention has the following beneficial effects:

The dynamic threshold Gaussian kernel density estimation method for sparse data provided by the invention was evaluated on the applicant's own data set. Experiments show that, compared with traditional algorithms that neither consider the data characteristics nor establish a dynamic threshold, the proposed method greatly improves the anomaly detection rate on that data set.

The method fully considers the special position of sparse data within the whole sample set. Although the sparse part of the data contributes very little to the overall probability distribution estimate, these regions have an extremely low tolerance for threshold error, so the influence of the sparse part must be taken into account when the dynamic threshold is established.

Finally, verification shows that on the login anomaly detection task the dynamic threshold Gaussian kernel density estimation method for sparse data achieves a better anomaly detection effect than traditional methods of estimating and establishing a baseline threshold, makes full use of the characteristics of the data, and adapts better to the task.

Drawings

FIG. 1 is a system block diagram of the present embodiment;

FIG. 2 is a diagram showing the effect of different kernel functions;

FIG. 3 is a diagram showing the effect of different bandwidths;

FIG. 4 is a dynamic probability baseline graph;

FIG. 5 is a diagram showing the construction of the dynamic threshold.

Detailed Description

The technical solutions in the embodiments will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the examples without making any creative effort, shall fall within the protection scope of the present application.

Unless otherwise defined, technical or scientific terms used in the embodiments of the present application should have the ordinary meaning as understood by those having ordinary skill in the art. The use of "first," "second," and similar terms in the present embodiments does not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. "mounted," "connected," and "coupled" are to be construed broadly and may, for example, be fixedly coupled, detachably coupled, or integrally coupled; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. "Upper," "lower," "left," "right," "lateral," "vertical," and the like are used solely in relation to the orientation of the components in the figures, and these directional terms are relative terms that are used for descriptive and clarity purposes and that can vary accordingly depending on the orientation in which the components in the figures are placed.

As shown in fig. 1, the dynamic threshold gaussian kernel density estimation system for sparse data according to the present embodiment includes a collector 101, a processor 102 and a display 103; the collector 101 collects data.

The processor 102 preprocesses the raw pcap packet data, extracts the useful information into a csv file, and then selects a suitable kernel function and bandwidth according to the data characteristics to complete kernel density estimation; it dynamically establishes a baseline of the historical login state distribution from the kernel density estimation result and optimizes a dynamic threshold for the sparse characteristic of login data so that it better fits the sparse part of the data; finally, it detects login anomalies according to the established baseline and threshold. The display 103 presents the final result.

It should be noted that the division of the above apparatus into modules is only a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software called by a processing element, or all in hardware; alternatively, some modules may be implemented as software called by a processing element and others in hardware.

The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.

For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when some of the above modules are implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can call the program code. As another example, these modules may be integrated together and implemented in the form of a System-on-a-Chip (SoC).

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a readable storage medium or transmitted from one readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.

The method uses the ability of kernel density estimation to estimate a target distribution: it estimates the target distribution and establishes a baseline, while optimizing a dynamic threshold according to the sparse characteristic of the sample data, and thereby completes the task of detecting user login anomalies. Specifically, the raw pcap packet data is first preprocessed to extract the useful information into a csv file, and a suitable kernel function and bandwidth are then selected according to the data characteristics to complete kernel density estimation. Next, a baseline of the historical login state distribution is dynamically established from the kernel density estimation result, and a dynamic threshold is optimized for the sparse characteristic of login data so that it better fits the sparse part of the data. Finally, login anomalies are detected according to the established baseline and threshold.

Specifically, the method for estimating the density of the dynamic threshold gaussian kernel for sparse data of the present embodiment includes the following steps:

Step 1: Preprocess the data.

The raw data of this task differs from common log files: it must be extracted from pcap traffic capture files. The pcap file is therefore processed first, the corresponding data packets in it are integrated into streams, and the required parts are arranged into csv format and stored for later use.
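As a rough illustration of this preprocessing step, the sketch below groups already-decoded packet records into bidirectional streams and serializes one row per stream as CSV. The record fields (`ts`, `src`, `sport`, `dst`, `dport`, `proto`, `length`), the sample values, and the helper names are all hypothetical: the source does not say which fields are kept or how the pcap file is parsed, and decoding the pcap file itself is not shown.

```python
import csv
import io
from collections import defaultdict

# Hypothetical decoded packet records (assumption: a pcap parser produced these).
packets = [
    {"ts": 1.00, "src": "10.0.0.5", "sport": 50412, "dst": "10.0.0.9", "dport": 22, "proto": "TCP", "length": 74},
    {"ts": 1.02, "src": "10.0.0.9", "sport": 22, "dst": "10.0.0.5", "dport": 50412, "proto": "TCP", "length": 66},
    {"ts": 1.05, "src": "10.0.0.5", "sport": 50412, "dst": "10.0.0.9", "dport": 22, "proto": "TCP", "length": 90},
    {"ts": 2.50, "src": "10.0.0.7", "sport": 40100, "dst": "10.0.0.9", "dport": 22, "proto": "TCP", "length": 74},
]


def flow_key(pkt):
    """Direction-independent key so both directions of a session share one stream."""
    endpoints = sorted([(pkt["src"], pkt["sport"]), (pkt["dst"], pkt["dport"])])
    return (pkt["proto"], endpoints[0], endpoints[1])


def packets_to_flows(pkts):
    """Group packets by stream and summarize each stream as one row."""
    grouped = defaultdict(list)
    for pkt in pkts:
        grouped[flow_key(pkt)].append(pkt)
    rows = []
    for members in grouped.values():
        members.sort(key=lambda p: p["ts"])
        first = members[0]  # the earliest packet defines the initiating side
        rows.append({
            "start_ts": first["ts"],
            "end_ts": members[-1]["ts"],
            "src": first["src"],
            "dst": first["dst"],
            "dport": first["dport"],
            "proto": first["proto"],
            "packets": len(members),
            "bytes": sum(p["length"] for p in members),
        })
    rows.sort(key=lambda r: r["start_ts"])
    return rows


def flows_to_csv(rows):
    """Serialize the stream rows as CSV text (written to memory here, not to disk)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


flow_rows = packets_to_flows(packets)
csv_text = flows_to_csv(flow_rows)
```

Sorting the two endpoints inside `flow_key` is one simple way to merge request and response packets into the same stream; a real pipeline might instead rely on TCP session tracking.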

Step 2: Select a kernel function.

The kernel function is the probability density function used in kernel density estimation. There are many kinds of kernel functions, each with its own strengths; FIG. 2 shows the effect of performing kernel density estimation on the same sample data with different kernel functions.

As FIG. 2 shows, even the same data yields visibly different results under different kernel functions, where FIG. 2(1) is a histogram, FIG. 2(3) is a non-smooth kernel, and FIG. 2(2, 4, 5, 6) are smooth kernels. Smooth kernels suit most usage scenarios. This embodiment selects the Gaussian kernel for kernel density estimation because its statistical properties are well defined and its overall transition is smooth.

Step 3: Select a suitable bandwidth.

Although different kernel functions lead to consistent conclusions, in that the overall trend and the regularity of the density distribution are essentially the same, the kernel is not the only factor that shapes the estimate. Besides the choice of kernel, the bandwidth also affects the density estimate: a bandwidth that is too large or too small degrades the estimation result, as shown in FIG. 3. A smaller bandwidth means higher sensitivity, but also a higher false alarm rate.
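The effect of the bandwidth can be made concrete by counting the local maxima of a Gaussian kernel density estimate at several bandwidths. The sketch below is only an illustration; the sample values, the evaluation grid, and the three bandwidths are assumptions and do not come from the source.

```python
import math


def gaussian_kde(samples, h, x):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)


def count_modes(samples, h, lo, hi, steps):
    """Count strict local maxima of the estimate on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    dens = [gaussian_kde(samples, h, x) for x in grid]
    return sum(1 for i in range(1, steps) if dens[i - 1] < dens[i] > dens[i + 1])


# Two tight clusters of hypothetical observations.
samples = [0.9, 1.0, 1.1, 1.2, 4.9, 5.0, 5.1, 5.2]

# Very small h: every observation becomes its own spike (sensitive but noisy).
# Very large h: the two clusters are blurred into a single bump.
modes_by_h = {h: count_modes(samples, h, -2.0, 8.0, 1000) for h in (0.02, 0.5, 5.0)}
```

A spiky estimate leaves near-zero density between observations, so ordinary behavior that falls in those gaps can look anomalous, which is one way to read the higher false alarm rate associated with a small bandwidth.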

The same sample data and kernel function give very different results under different bandwidths. In this embodiment, the bandwidth is selected by minimizing the mean integrated squared error (MISE), which is calculated as follows:

$$\mathrm{MISE}(h) = E\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 \, dx\right]$$

where $\hat{f}_h(x)$ represents the kernel density estimate calculated using bandwidth h, f(x) represents the true density, and E represents the expected value.
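Because the true density f(x) is unknown in practice, the mean integrated squared error cannot be evaluated directly. The source does not say how the minimization is carried out; one common data-driven surrogate is least-squares cross-validation, which estimates the bandwidth-dependent part of the error from the samples alone. The sketch below uses that swapped-in technique for a Gaussian kernel. The login hours and the candidate bandwidth grid are hypothetical.

```python
import math


def normal_pdf(d, sigma):
    """Density of a zero-mean normal with standard deviation sigma, evaluated at d."""
    return math.exp(-0.5 * (d / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def integral_of_squared_kde(samples, h):
    """Closed form of the integral of the squared Gaussian KDE.

    The product of two Gaussian kernels integrates to a normal density with
    standard deviation h * sqrt(2) evaluated at the pairwise difference.
    """
    n = len(samples)
    total = sum(normal_pdf(xi - xj, h * math.sqrt(2.0)) for xi in samples for xj in samples)
    return total / (n * n)


def leave_one_out_mean(samples, h):
    """Average of the leave-one-out density estimates at the sample points."""
    n = len(samples)
    total = sum(
        normal_pdf(xi - xj, h)
        for i, xi in enumerate(samples)
        for j, xj in enumerate(samples)
        if i != j
    )
    return total / (n * (n - 1))


def lscv_score(samples, h):
    """Least-squares cross-validation score; lower is better."""
    return integral_of_squared_kde(samples, h) - 2.0 * leave_one_out_mean(samples, h)


# Hypothetical login times, in hours of the day.
login_hours = [8.5, 8.7, 9.0, 9.1, 9.3, 9.4, 13.0, 13.2, 13.5, 14.0, 18.0, 18.2]
candidates = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2]  # assumed search grid
best_h = min(candidates, key=lambda h: lscv_score(login_hours, h))
```

A coarse grid search is used only to keep the sketch short; this criterion can also degenerate toward very small bandwidths when the data contain many exact duplicates, so the selected value is worth checking against a plot such as FIG. 3.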

Step 4: Perform kernel density estimation.

After a suitable kernel function and bandwidth have been selected, kernel density estimation is performed to calculate the probability distribution. Kernel density estimation fits the observed data points with a smooth peaked function (the kernel function) so as to approximate the true probability distribution curve.

Kernel density estimation is a non-parametric method for estimating a probability density function. Let $x_1, x_2, \dots, x_n$ be n independent and identically distributed sample points drawn from a distribution F whose probability density function is f. The kernel density estimate is:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where K(x) is the kernel function selected in step 2, h > 0 is a smoothing parameter, i.e. the bandwidth calculated in step 3, and $K_h(x) = K(x/h)/h$ is the kernel function after bandwidth scaling.
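The estimator described here can be written out directly. The following is a minimal pure-Python sketch of a Gaussian kernel density estimate, with function names chosen for illustration only: `gaussian_kernel` plays the role of K, `scaled_kernel` the role of the bandwidth-scaled kernel, and `kde` the role of the estimate itself.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def scaled_kernel(u, h):
    """Bandwidth-scaled kernel: K(u / h) / h."""
    return gaussian_kernel(u / h) / h


def kde(x, samples, h):
    """Kernel density estimate at x: the average of the scaled kernels centered on each sample."""
    return sum(scaled_kernel(x - xi, h) for xi in samples) / len(samples)


# With a single sample at 0 and h = 1 the estimate is the standard normal density.
peak_value = kde(0.0, [0.0], 1.0)
```

Because each scaled kernel integrates to one, the average of n of them also integrates to one, so the result is a valid probability density for any h > 0.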

Step 5: Establish a dynamic baseline based on the estimated distribution.

Before the baseline is established, the result of the kernel density estimation is first sampled and exponentially restored. The number of sampling points is chosen according to the task; since the horizontal axis in this embodiment spans one day, the number of sampling points is set to 2401.

Since the kernel density estimation in step 4 is performed with the sklearn library in Python, its result is the logarithm of the probability density at each sampling point. This embodiment therefore applies exponential restoration to the result of step 4, and finally plots the dynamic probability baseline shown in FIG. 4.
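The sampling and exponential step can be sketched without any third-party dependency. The code below computes the log-density on a 2401-point grid and exponentiates it; the log values play the role of what sklearn's `KernelDensity.score_samples` returns. Reading 2401 points as a 0.01-hour step from 0 to 24 hours is an assumption, as are the login hours and the bandwidth.

```python
import math


def log_density(x, samples, h):
    """Log of the Gaussian KDE at x, computed with log-sum-exp for numerical stability."""
    log_terms = [-0.5 * ((x - xi) / h) ** 2 for xi in samples]
    m = max(log_terms)
    log_sum = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_sum - math.log(len(samples) * h * math.sqrt(2.0 * math.pi))


# Hypothetical historical login times, in hours of the day.
login_hours = [8.5, 8.7, 9.0, 9.1, 9.3, 9.4, 13.0, 13.2, 13.5, 14.0, 18.0, 18.2]
h = 0.5  # assumed bandwidth

# Assumed reading of the 2401 sampling points: 0.00, 0.01, ..., 24.00 hours.
grid = [i * 0.01 for i in range(2401)]

log_baseline = [log_density(x, login_hours, h) for x in grid]  # log-probability values
baseline = [math.exp(v) for v in log_baseline]                 # exponential restoration
```

Working in log space avoids underflow far from the data, where the density is tiny; exponentiating afterwards recovers the probability density that is plotted as the baseline.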

Step 6: a dynamic threshold is established for sparse data.

The traditional fixed-threshold strategy is very simple but very insensitive to sparse data. In this embodiment, the points where data is sparse are precisely the points where anomalies are more likely to occur, yet a traditional fixed threshold cannot detect such anomalies well, so a dynamic threshold construction method is used.

Assume that a sample set contains M sample points. Each sample $x_i$, centered on itself, contributes to the overall distribution over a certain width; the farther a point is from the center $x_i$, the smaller the contribution to the overall distribution. The contribution function can therefore be regarded as a function that is large in the middle and small on both sides.

Since the neighboring sample points on the sample axis usually originate from the same local distribution, and since the gaussian kernel function is used in step 2, we consider this local distribution to follow the gaussian distribution, and the contribution function can be regarded as a gaussian density function.
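Under this Gaussian view, the constant D of about 2.5 and the figure 0.0062 quoted in this step fit together: 0.0062 matches the one-sided tail mass of a standard normal distribution beyond 2.5 standard deviations. The short check below computes that tail with the complementary error function; interpreting the quoted figure as this tail mass is an inference and is not stated in the source.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


D = 2.5
tail_mass = upper_tail(D)  # approximately 0.0062
```

In other words, a sample's Gaussian contribution has less than about 0.62 % of its mass more than D bandwidths away on one side, which is why points that far out can be treated as essentially unsupported by that sample.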

The model distribution at a point can be regarded as the weighted sum of the contributions of all samples, and the threshold Th is in fact a critical value of the kernel density. The M sample points are first arranged in ascending order, and $x_{begin}$ and $x_{end}$ are then obtained as follows:

$$x_{begin} = X_{min} - D \cdot h, \qquad x_{end} = X_{max} + D \cdot h$$

where D represents a constant, taken as about 2.5, with a suitable value selected according to actual needs; h represents the bandwidth; $X_{max}$ denotes the maximum value in the ascending order and $X_{min}$ denotes the minimum value. The reasoning is that the contribution of each sample is treated as a Gaussian function: when the j-th sample point lies too far from the i-th sample point, its contribution to the overall probability density is very small, already less than 0.0062. The dynamic threshold Th can therefore be obtained by substituting $x_{begin}$ or $x_{end}$ into the kernel density formula. In practice, the probability densities at $x_{begin}$ and $x_{end}$ are likely to be unequal because of the overall distribution, but they do not differ much. Substituting $x_{end}$ into the kernel density formula gives:

$$Th = \frac{1}{Mh}\sum_{i=1}^{M} K\!\left(\frac{x_{end} - x_i}{h}\right)$$

After the overall dynamic threshold is obtained, it must be added to the dynamic probability baseline. Adding the same threshold at every point is clearly not optimal, because the threshold would then be too high at the data-sparse points. The threshold added at each point is therefore adjusted according to the normalized probability density at that point, so as to optimize the threshold at the data-sparse points. The final effect is shown in FIG. 5. This yields a complete dynamic probability historical baseline that can be used for anomaly detection, together with a threshold optimized for sparse data.
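The threshold construction in this step can be sketched as follows. Two parts of the sketch are assumptions rather than statements from the source: the end points are taken D bandwidths beyond the smallest and largest samples, and the per-point adjustment multiplies the overall threshold by the baseline density divided by its maximum. The source only says that the added threshold is adjusted according to the normalized probability density, so `scaled_threshold_curve` is one hypothetical way to do that. The login hours and bandwidth are also illustrative.

```python
import math


def kde(x, samples, h):
    """Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)


def global_threshold(samples, h, D=2.5):
    """Return (x_begin, x_end, Th), with Th the kernel density evaluated at x_end."""
    ordered = sorted(samples)
    x_begin = ordered[0] - D * h   # assumed form: D bandwidths below the minimum
    x_end = ordered[-1] + D * h    # assumed form: D bandwidths above the maximum
    return x_begin, x_end, kde(x_end, ordered, h)


def scaled_threshold_curve(baseline, th):
    """Hypothetical adjustment: add th scaled by the density normalized to its peak."""
    peak = max(baseline)
    return [b + th * (b / peak) for b in baseline]


login_hours = [8.5, 8.7, 9.0, 9.1, 9.3, 9.4, 13.0, 13.2, 13.5, 14.0, 18.0, 18.2]
h = 0.5
x_begin, x_end, th = global_threshold(login_hours, h)

grid = [i * 0.01 for i in range(2401)]
baseline = [kde(x, login_hours, h) for x in grid]
threshold_curve = scaled_threshold_curve(baseline, th)
```

With this scaling the full margin Th is added only where the baseline peaks, and the added margin shrinks toward zero where the data are sparse, which follows the stated goal of not letting the threshold become too high at sparse points.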

Step 7: Detect login anomalies.

After the user's historical login state baseline and threshold have been obtained through the first six steps, login anomaly detection only requires performing kernel density estimation on the user's current-day data with the kernel function and bandwidth selected in steps 2 and 3, and comparing the result with the historical baseline threshold.

That is, the parts where the current-day kernel density estimate exceeds the historical baseline dynamic threshold are located; if a login record falls within such a part, that login is determined to be abnormal login behavior.
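The comparison described above can be sketched as follows. For brevity, the historical threshold curve in this sketch is the historical density plus a constant margin, which is a simplification standing in for the dynamic threshold of step 6. The login times, the bandwidth, the margin, and the 0.01-hour grid are all hypothetical.

```python
import math


def kde(x, samples, h):
    """Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)


def detect_anomalies(today, threshold_curve, h, step=0.01):
    """Flag logins that fall where today's density exceeds the historical threshold."""
    current = [kde(i * step, today, h) for i in range(len(threshold_curve))]
    flagged = []
    for t in today:
        # Nearest grid index for this login time, clamped to the grid.
        idx = min(len(threshold_curve) - 1, max(0, round(t / step)))
        if current[idx] > threshold_curve[idx]:
            flagged.append(t)
    return flagged


history = [8.5, 8.7, 9.0, 9.1, 9.3, 13.0, 13.2, 13.5, 18.0, 18.2]  # past login hours
today = [8.6, 9.0, 9.2, 13.1, 13.4, 18.1, 3.0]                      # includes a 03:00 login
h = 0.5
margin = 0.05  # hypothetical constant margin (simplification of the dynamic threshold)

threshold_curve = [kde(i * 0.01, history, h) + margin for i in range(2401)]
flagged = detect_anomalies(today, threshold_curve, h)
```

In this example only the 03:00 login is flagged, since the historical density there is close to zero. Note that a day with very few logins concentrates its estimated density on those few points, so in practice the margin has to leave room for that effect.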

The problem solved by this embodiment is therefore the detection of user login anomalies, which requires estimating the distribution of the experimental data and establishing a reasonable baseline. According to the data characteristics and the task objective, this embodiment uses kernel density estimation, a non-parametric estimation method, to estimate the data distribution and establish the baseline, and sets the threshold on that basis.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A dynamic threshold gaussian kernel density estimation system for sparse data, characterized by: comprises a collector and a processor; a collector collects data;

2. The system of claim 1, wherein: the collector extracts the pcap traffic data packets.

3. The system of claim 1, wherein: the preprocessing integrates the corresponding data packets into streams, arranges the required parts into csv format, and stores them for later use.

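4. The system of claim 1, wherein: a Gaussian kernel is selected for kernel density estimation, and the bandwidth is selected by minimizing the mean integrated squared error (MISE), calculated according to the following formula:

$$\mathrm{MISE}(h) = E\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 \, dx\right]$$

wherein $\hat{f}_h(x)$ represents the result of kernel density estimation using bandwidth h, f(x) represents the true density, and E represents the expected value.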

5. The system of claim 1, wherein: the kernel density estimation is specifically as follows: let $x_1, x_2, \dots, x_n$ be n independent and identically distributed sample points drawn from a distribution F whose probability density function is f; the kernel density estimate is:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

wherein K(x) is a selected kernel function; h > 0 is a smoothing parameter, which is the calculated bandwidth; and $K_h(x) = K(x/h)/h$ is the kernel function after bandwidth scaling.

6. The system of claim 1, wherein: before the baseline is established, the kernel density estimation result is sampled and exponentially restored, and a dynamic threshold is established for sparse data:

substituting $x_{end}$ into the kernel density formula gives:

$$Th = \frac{1}{Mh}\sum_{i=1}^{M} K\!\left(\frac{x_{end} - x_i}{h}\right)$$

7. The system of claim 1, wherein: login anomaly detection is performed by, based on the acquired historical login state baseline and threshold of the user, performing kernel density estimation on the user's current-day data with the selected kernel function and bandwidth and comparing the result with the historical baseline threshold, thereby completing the detection of login anomalies.

8. A dynamic threshold Gaussian kernel density estimation method for sparse data, characterized by comprising the following steps:

preprocessing the raw pcap packet data and extracting the useful information into a csv file, and then selecting a suitable kernel function and bandwidth according to the data characteristics to complete kernel density estimation; dynamically establishing a baseline of the historical login state distribution from the kernel density estimation result, and optimizing a dynamic threshold for the sparse characteristic of login data so that it better fits the sparse part of the data; and finally, detecting login anomalies according to the established baseline and threshold.