WO2019237492A1 - A method for detecting abnormal electricity users based on semi-supervised learning

A method for detecting abnormal electricity users based on semi-supervised learning

Info

Publication number
WO2019237492A1
Authority
WO
WIPO (PCT)
Prior art keywords
users
cluster
level
graylist
detection
Prior art date
Application number
PCT/CN2018/100379
Inventor
纪淑娟
周金萍
张纯金
李凯旋
Original Assignee
山东科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东科技大学
Publication of WO2019237492A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Definitions

  • The invention belongs to the field of detection technology, and particularly relates to a method for detecting abnormal power users based on semi-supervised learning.
  • Non-technical losses refer to operating losses caused by a series of false power consumption actions such as power theft and fraud by power users at the distribution network side.
  • The amount of power load data held by power companies has increased, which has made it increasingly difficult to detect abnormal power users.
  • The present invention proposes a method for detecting abnormal power users based on semi-supervised learning, which is reasonable in design, overcomes the shortcomings of the prior art, and has good effects.
  • A method for detecting abnormal power users based on semi-supervised learning includes the following steps:
  • Step 1 Data preprocessing
  • Step 2 First-level greylist generation based on cluster analysis
  • Cluster analysis is performed on user feature sequences to find clusters with few members, that is, users whose electricity consumption behavior differs from that of most users. Users are clustered using an algorithm based on a Gaussian mixture model. Finally, the users in the outlying clusters are marked as suspicious, and the outlier users selected by cluster analysis form the first-level graylist.
  • Step 3 Generation of secondary gray list based on outlier calculation
  • Step 4 Three-level gray list generation based on behavior similarity calculation
  • a three-level graylist generation algorithm based on behavior similarity calculation is used to match the abnormal behavior of users in the blacklist database, detect suspicious users with similar behavior characteristics between various types of blacklisted users, and form a three-level graylist.
  • Step 2 specifically includes the following steps:
  • Step 2.1 Divide the users into n clusters using a clustering algorithm based on a Gaussian mixture model.
  • Step 2.2 Determine whether the number of members of each cluster is less than the threshold k for separating outlier clusters;
  • if the number of members of a cluster is less than k, the users in that cluster are added to the first-level graylist;
  • if the number of members of a cluster is greater than or equal to k, its users are added to the non-graylist users.
  • Step 3 specifically includes the following steps:
  • Step 3.1 Calculate the outlier factor value of each user in the first-level graylist using the local outlier factor algorithm.
  • Step 3.2 Sort the first-level graylist users by outlier factor in descending order and add them to the second-level graylist.
  • Step 4 specifically includes the following steps:
  • Step 4.1 Taking the cluster as the unit, use the DTW algorithm to calculate the DTW value, which measures behavior similarity, between the users in the non-graylist and the users in the blacklist database.
  • Step 4.2 Calculate the average DTW value of the members of each cluster in the non-graylist, select the users whose DTW value is lower than their cluster's average, and add them to the third-level graylist;
  • Step 4.3 Sort the users in the third-level graylist by DTW value in ascending order.
  • the invention proposes an abnormal power user detection model based on semi-supervised learning, which aims to form an ordered list of user suspiciousness, provide a key detection list for on-site manual detection, and improve the accuracy and efficiency of on-site detection.
  • FIG. 1 is a framework diagram of a method for detecting abnormal power users based on semi-supervised learning.
  • Figure 2 is a diagram of local outlier screening.
  • Figure 3 is a schematic diagram of user DTW value selection.
  • FIG. 4 is a schematic diagram of a correlation matrix of a feature set.
  • FIG. 5 is a two-dimensional feature data distribution diagram.
  • FIG. 6 is a three-dimensional feature data distribution diagram.
  • FIG. 7 is a schematic diagram of a feature set correlation matrix after normalization.
  • FIG. 8 is a schematic diagram showing the relationship between the area under the receiver operating characteristic (ROC) curve (AUC) and the parameter n.
  • FIG. 9 is a schematic diagram showing the relationship between the area AUC under the ROC curve and the parameter a.
  • FIG. 10 is a schematic diagram of a cumulative recall rate of an unsupervised learning anomaly detection model algorithm.
  • FIG. 11 is a graph of accuracy rates of the unsupervised learning anomaly detection model and the semi-supervised learning anomaly detection model.
  • the implementation of the method of the present invention mainly includes the following steps:
  • The outlier degree (LOF value) of each user is calculated, the user's degree of suspiciousness is judged from it, and a second-level graylist ranked by suspiciousness is formed.
  • In the third step, on-site inspection based on the second-level graylist collects evidence of fraud by the outlier users; the confirmed users form a blacklist, which is stored in the blacklist database.
  • The fourth step addresses the problem that some users may collude, so that a large number of abnormal users behave consistently and therefore do not appear as outliers.
  • This application further processes the clusters obtained in the first clustering operation.
  • Specifically, the blacklist obtained from the on-site inspection in the third step is combined with the clusters obtained in the first clustering step, and a third-level graylist generation algorithm based on behavior similarity calculation is proposed.
  • This algorithm uses the abnormal behavior of users in the blacklist database to detect, within each cluster, suspicious users whose behavior is similar to that of blacklisted users, forming a third-level graylist.
  • In the fifth step, evidence of collusion or conspiracy to commit fraud is collected on site based on the third-level graylist; the confirmed users are added to the blacklist and stored in the blacklist database.
  • the framework of the whole method is shown in Figure 1.
  • The framework is implemented in two parts: detection of individual abnormal power users based on unsupervised learning (producing the first-level and second-level graylists), and detection of colluding abnormal power users based on semi-supervised learning (using the first-level, second-level, and third-level graylists together with the blacklist users).
  • The unsupervised detection of individual abnormal power users in Figure 1 is divided into three modules.
  • The core algorithms involved are the data preprocessing method, the first-level graylist generation algorithm based on cluster analysis, and the second-level graylist generation algorithm based on outlier degree calculation.
  • The semi-supervised detection model for abnormal power users in FIG. 1 involves one further core algorithm: the third-level graylist generation algorithm based on behavior similarity calculation. The process of each module is described in detail below.
  • Before user detection is performed, the data must be preprocessed. This stage mainly performs data cleaning and collation. In practice, power consumption data is collected in real time, and time series acquisition is irreversible. During collection, however, non-human errors often produce dirty data: null values, erroneous values, or isolated outliers that deviate from expectations. So as not to affect the experimental results, outliers and missing values in the data set need to be interpolated before the experiment begins.
  • DoNothing method: it treats a missing value as information in its own right. All information is retained, and the missing entry is kept as a null value.
  • Linear interpolation: it uses a first-order polynomial to interpolate and complete time series data, which reduces the noise caused by missing information. This method is mainly used with CNN and RNN networks.
  • Mean/median/mode interpolation: it fills each missing value with the sequence mean, median, or mode.
  • Moving average interpolation: if the data at the i-th position of the time series is missing, the average of the data in the windows before and after it is used as the interpolated value.
  • Analysis of the data set used in this application shows that most users' time series have few missing values, and that sequences with long runs of consecutive missing values are very rare. Based on these factors, this application uses moving average interpolation to process missing values.
  • The time window size is set to 7 days, i.e. one week.
  • This application preprocesses the data set and uses the moving average interpolation method to process the dirty data in the data set, which is the basis of the model detection work.
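  • The moving average interpolation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's own implementation: the function name `moving_average_fill`, the use of `None` to mark a missing reading, and the sample values are all hypothetical, and "the windows before and after" is read here as the observed values within 7 positions on either side of the gap.

```python
from statistics import mean


def moving_average_fill(series, window=7):
    """Fill missing entries (None) with the mean of the observed values
    found within `window` positions before and after the gap."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        lo = max(0, i - window)
        hi = min(len(series), i + window + 1)
        # Use only originally observed neighbours, never already-filled ones.
        neighbours = [v for v in series[lo:hi] if v is not None]
        if neighbours:  # leave the gap as None if nothing is observed nearby
            filled[i] = mean(neighbours)
    return filled


# Hypothetical daily consumption readings with one missing day.
daily_kwh = [10.0, 12.0, None, 11.0, 13.0]
print(moving_average_fill(daily_kwh))
```

  • Filling from the original observations only (rather than from previously filled values) keeps one interpolated point from influencing the next, which matters little here because long runs of missing values are reported to be rare.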
  • The core idea of the first-level graylist generation algorithm based on cluster analysis is to perform cluster analysis on user feature sequences to find clusters with few members, that is, users whose electricity consumption behavior differs from that of most users.
  • In this application, an algorithm based on a Gaussian mixture model is used to cluster users, and the users in the outlying clusters are then marked as suspicious.
  • There are two important parameters in this algorithm: the number of clusters n and the threshold k for separating outlier clusters.
  • the calculation efficiency and accuracy of the algorithm depends on the setting of these two parameters.
  • the setting of the number of clusters and the selection of the threshold value for clustering to separate cluster points will affect the final calculation.
  • the parameters n and k are dynamically solved according to the scale of the actual data set.
  • the optimal solution of the parameters n and k is as follows.
  • The number of clusters in cluster analysis normally has to be set manually. In reality, the number of electricity users to be inspected differs from area to area, so fixing a single optimal number of clusters lacks flexibility. Therefore, in this application the parameter is chosen as a proportion, and an optimal proportion is selected for cluster analysis.
  • The number of clusters is chosen as a percentage of the number of users, and the optimal value is selected through multiple sets of experiments.
  • With the threshold for separating outlier clusters held fixed, the number of clusters is set to 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, and 10% of the number of users.
  • The data set is randomly divided into four data sets of different orders of magnitude, and unsupervised detection of abnormal power consumption behavior is run on each of them, with the number of clusters n set at proportions from 1% to 10%.
  • the judging criterion relies on the parameter k that divides the cluster points.
  • the parameter k determines whether the cluster is an outlier. If the number of members in the cluster is less than k, the members in the cluster are considered to be outliers. Users in the cluster are set as outliers. Also in reality, different numbers of clusters are used for clustering, which correspond to different outlier partition thresholds.
  • This application sets the parameter k based on the optimal value n, and the calculation formula is:
  • k is the threshold for clustering and separating cluster points
  • p is the total number of users detected
  • n is the number of cluster categories
  • a is a natural number of 1-10.
  • the data set is used for clustering and thresholding experiments to separate cluster points.
  • the experiment set the number of clustering clusters n to 4.5% of the total number of corresponding data sets, and the parameter a takes a natural number of 1-10.
  • Unsupervised power consumption abnormality detection was performed on four sets of data sets of different orders of magnitude.
  • Algorithm 1 gives a first-level greylist generation algorithm based on cluster analysis.
  • the main execution process of the algorithm is as follows: First, the user is divided into n clusters according to the Gaussian cluster analysis method (steps (2)-(7) in Algorithm 1), where the formula for calculating the Gaussian probability is shown in Equation 2.
  • the purpose of clustering is to screen outliers.
  • outlier screening is performed and the outliers are added to the first-level gray list (steps (10)-(11) in Algorithm 1).
  • the non-outlier objects are added to the non-gray list (steps (12)-(13) in Algorithm 1).
  • a first-level greylisted user list list1 is generated, and a non-graylisted user set M is generated.
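  • The outlier screening step of Algorithm 1 (clusters with fewer than k members go to list1, the rest to the non-graylist set M) can be sketched as below. The sketch assumes the cluster labels have already been produced by a Gaussian mixture model, for example with scikit-learn's `GaussianMixture(n_components=n).fit_predict(X)`; that choice of library, the function name `split_by_cluster_size`, and the sample labels are assumptions for illustration only.

```python
from collections import Counter


def split_by_cluster_size(labels, k):
    """Split users into the first-level graylist and the non-graylist set.

    labels: mapping user id -> cluster label (assumed to come from a GMM).
    k:      a cluster with fewer than k members is treated as an outlier cluster.
    """
    sizes = Counter(labels.values())
    list1 = []      # first-level graylist
    non_gray = {}   # cluster label -> users kept outside the graylist (set M)
    for user, cluster in labels.items():
        if sizes[cluster] < k:
            list1.append(user)
        else:
            non_gray.setdefault(cluster, []).append(user)
    return list1, non_gray


# Hypothetical labels: cluster 1 has a single member and is screened out for k = 2.
labels = {"u1": 0, "u2": 0, "u3": 0, "u4": 1, "u5": 2, "u6": 2}
list1, M = split_by_cluster_size(labels, k=2)
print(list1)
print(M)
```

  • Keeping the non-graylist users grouped by cluster is deliberate: the later behavior similarity step works on these users cluster by cluster.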
  • The above model yields the first-level graylist of suspicious power users. Field inspection showed, however, that although it filters out a large number of abnormal power users, on large-scale data sets the first-level graylist itself often still contains a large number of users, so inspection is not targeted and detection efficiency is low. Therefore, building on the first-level graylist generated by Algorithm 1, a second-level graylist generation algorithm based on outlier calculation is proposed.
  • The LOF (Local Outlier Factor) algorithm has a time complexity of O(n^2). The greater a user's LOF value, the higher the degree of suspiciousness.
  • Because the second-level graylist generation algorithm computes LOF values only for users in the first-level graylist, it avoids the long running time of computing every user's LOF value directly on a large-scale data set.
  • The second-level graylist generated by the outlier algorithm is a list of users ranked by suspiciousness, which solves the problem of untargeted first-level graylist inspection and can improve the accuracy and efficiency of field detection.
  • Algorithm 2 gives the process of the secondary gray list generation algorithm based on the outlier calculation.
  • the main execution process of the algorithm is as follows: Enter the first-level graylist user list, use formula 2 to calculate the user's LOF value in the first-level graylist, and sort the user's LOF value in descending order and write it into the second-level graylist. (Steps (2)-(5) in Algorithm 2).
  • The purpose is to calculate the degree of suspiciousness of each outlier.
  • The local outlier factor is defined as: $LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|}$
  • Here MinPts is the number of neighbors, $N_{MinPts}(p)$ is the MinPts-neighborhood of p, and lrd is the local reachability density. If $lrd_{MinPts}(p)$ is small while $lrd_{MinPts}(o)$ of the objects o in the neighborhood of p is large, the LOF value of p is large. Conversely, if p is a non-outlier object, the lrd values of p and of the objects in its neighborhood differ little, so the LOF value of p is close to 1. The higher the LOF value, the greater the outlier degree.
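  • To make the LOF ranking concrete, the following brute-force Python sketch computes the standard local outlier factor and sorts users in descending order of LOF, as the second-level graylist requires. It is an illustrative sketch only: the function names, the Euclidean distance on feature vectors, and the sample points are assumptions, and it presumes more than MinPts users with no duplicate feature vectors.

```python
import math


def lof_scores(points, min_pts):
    """Local outlier factor of every point (brute force, O(n^2)).

    Assumes len(points) > min_pts and no duplicate points.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k-distance and k-distance neighbourhood of each point.
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[min_pts - 1][0]
        k_dist.append(kd)
        neigh.append([j for d, j in others if d <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[o], dist[i][o]) for o in neigh[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbours' lrd to the point's own lrd.
    return [sum(lrd[o] for o in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]


def second_level_graylist(users, features, min_pts):
    """Rank first-level graylist users by LOF, most suspicious first."""
    scores = lof_scores(features, min_pts)
    return sorted(zip(users, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical two-dimensional features: four users on a unit square, one far away.
users = ["u1", "u2", "u3", "u4", "u5"]
features = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(second_level_graylist(users, features, min_pts=2))
```

  • In this toy input the four square corners have LOF values of 1 (their density matches their neighbors'), while the isolated point has a much larger value and is ranked first.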
  • The set C of users outside the second-level graylist is processed cluster by cluster, and each cluster calculates in parallel the DTW value measuring the behavior similarity between its users and the users in the blacklist database.
  • The whole process involves one core algorithm: the third-level graylist generation algorithm based on behavior similarity calculation.
  • This algorithm uses the DTW (Dynamic Time Warping) algorithm to calculate user similarity. The main consideration is that the time series of the users under inspection are mostly of unequal length, whereas most similarity calculations at present use the Euclidean distance, which cannot measure the similarity between two sequences of unequal length.
  • The advantage of the DTW algorithm is that it can stretch and compress two sequences of unequal length to calculate the distance between them, and then judge their similarity.
  • The basic idea of the third-level graylist generation algorithm is that the fraudulent methods used for abnormal power consumption, such as electricity theft, are limited in number. Through multiple rounds of anomaly detection, the blacklist database gradually accumulates and updates abnormal user behaviors. The users under inspection are therefore compared with the blacklist database by behavior similarity calculation, and users found to be highly similar to users in the blacklist database are taken to have power usage behaviors similar to those of blacklisted users.
  • The algorithm performs the calculation against the blacklist database in parallel, cluster by cluster, which greatly reduces calculation time. Because the blacklist database has many members, each user under inspection produces one similarity value, that is, one DTW value, for every member of the blacklist database.
  • The DTW algorithm measures the similarity between two time series by the sum of the distances between their matched points (called the warp path distance).
  • The DTW value is calculated as follows. Given two time series X and Y with lengths |X| and |Y|, a warp path $W = w_1, w_2, \dots, w_K$ is formed, where the k-th element $w_k$ is $(i, j)$, i being a coordinate in X and j a coordinate in Y.
  • The i and j of $w(i, j)$ in W increase monotonically, so that the lines connecting matched points of the two time series do not cross.
  • The monotonic increase mentioned here is: $w_k = (i, j),\ w_{k+1} = (i', j')$ with $i \le i' \le i + 1$ and $j \le j' \le j + 1$;
  • $D(i, j) = \mathrm{Dist}(i, j) + \min[D(i-1, j), D(i, j-1), D(i-1, j-1)]$ (5);
  • The resulting warp path distance is $D(|X|, |Y|)$.
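  • The recurrence in Equation 5 translates directly into a dynamic programming table. The Python sketch below is a minimal illustration; the function name `dtw_distance` and the choice of the absolute difference as the point distance Dist(i, j) are assumptions, since the point distance is not specified in the text.

```python
def dtw_distance(x, y):
    """Warp path distance between two numeric sequences of any lengths."""
    inf = float("inf")
    n, m = len(x), len(y)
    # D[i][j] is the cost of the best warp path aligning x[:i] with y[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Dist(i, j): assumed here to be the absolute difference.
            dist = abs(x[i - 1] - y[j - 1])
            D[i][j] = dist + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# Sequences of unequal length with the same shape align at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([1, 2, 3], [2, 3, 4]))
```

  • The first call shows why DTW suits sequences of unequal length: the repeated value in the second series is absorbed by the warp path instead of being penalized as a misalignment.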
  • the method for setting the DTW value in this application is shown in FIG. 3.
  • For example, user a has three DTW values, 100, 200, and 300. User a takes the smallest of these as its own DTW value, so the value finally selected is 100. Since the algorithm aims to find users that are highly similar to users in the blacklist database, the minimum of a user's DTW values is selected, that is, the distance between the user and the blacklisted user whose behavior is closest.
  • Algorithm 3 provides a three-level gray list generation algorithm based on behavior similarity calculation.
  • The main execution process of the algorithm is as follows. The algorithm first calculates, in parallel for each cluster of non-graylisted users, the behavior similarity between those users and the users in the blacklist database (steps (1)-(4) in Algorithm 3). The purpose of this step is to find the shortest distance between each user under inspection and the blacklisted users, that is, the maximum similarity. Then the average DTW value of each cluster is calculated, and users below the average are added to the third-level graylist list3 (steps (5)-(6) in Algorithm 3). The other users are added to the normal user list (steps (7)-(8) in Algorithm 3). Finally, list3 is sorted in ascending order to form the final third-level graylist (step (10) in Algorithm 3).
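  • The selection logic of Algorithm 3 can be sketched as follows, assuming the DTW values between each user and every blacklisted user have already been computed. The function name `third_level_graylist`, the nested-dictionary input layout, and the sample numbers are hypothetical; the sketch processes clusters sequentially rather than in parallel.

```python
def third_level_graylist(dtw_by_cluster):
    """Build the third-level graylist from precomputed DTW values.

    dtw_by_cluster: {cluster: {user: [DTW value to each blacklisted user]}}
    Returns (list3, normal): list3 holds (user, DTW) pairs sorted ascending.
    """
    list3, normal = [], []
    for members in dtw_by_cluster.values():
        # Each user keeps the smallest DTW value (closest blacklisted user).
        best = {user: min(values) for user, values in members.items()}
        average = sum(best.values()) / len(best)
        for user, value in best.items():
            if value < average:
                list3.append((user, value))
            else:
                normal.append(user)
    list3.sort(key=lambda item: item[1])  # ascending: most similar first
    return list3, normal


# Hypothetical DTW values for two clusters of non-graylisted users.
clusters = {
    "c1": {"a": [100, 200, 300], "b": [400, 500, 600], "c": [700, 800, 900]},
    "c2": {"d": [50, 60], "e": [150, 90]},
}
list3, normal = third_level_graylist(clusters)
print(list3)
print(normal)
```

  • Comparing each user with its own cluster's average, rather than a global one, means a user is flagged for being unusually close to the blacklist relative to peers with similar consumption behavior.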
  • the data set uses the data of consumer power consumption published by a local power company.
  • the time span is from January 2016 to January 2017.
  • The data set contains 3,000 honest users and 400 electricity-theft users.
  • the data set user type distribution is shown in Table 4.
  • the user's power consumption mode is represented by its average daily power consumption. Based on the data set of the present application, the feature quantity of the user's power consumption mode can be further extracted. The details of the data set attributes are shown in Table 5.
  • This application proposes the characteristics of 18 user power load sequences, and analyzes and normalizes the characteristics through experiments to reduce the dimension, so as to facilitate the calculation of the characteristics of different units of different magnitudes.
  • the two parameters in this application are assigned through two sets of experiments.
  • Section 3.2.1 and 3.2.2 are the experimental feature settings, and 3.2.3 and 3.2.4 are the experimental parameter settings.
  • Section 3.2.5 compares and analyzes the detection results under unsupervised learning (first-level graylist, second-level graylist) and semi-supervised learning (third-level graylist + blacklist database). In the experimental feature settings, Section 3.2.1 analyzes the relationships among the 18 features proposed in this application when applied to its data set and reduces their dimensionality; Section 3.2.2 normalizes the power load sequence features to facilitate calculation across features of different units and magnitudes. In the experimental parameter settings, the optimal values of the two parameters are solved experimentally in Sections 3.2.3 and 3.2.4, respectively.
  • This application extracts a total of 18 features in the time domain and frequency domain features of the user power time series.
  • the specific characteristics are as follows:
  • Time domain features refer to the time-dependent attribute characteristics of a sequence as it changes over time.
  • the time-domain characteristics of the time series proposed in this application are as follows: mean, variance, standard deviation, maximum, minimum, difference between maximum and minimum, and mode.
  • Let n denote the size of a time window (that is, the number of rows of data in the window) and i denote the i-th row of data. The calculation of the features is briefly described below:
  • the most frequently occurring number in the time series is the mode of the series.
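  • The seven time-domain features listed above can be computed with the Python standard library, as in the sketch below. The function name, the dictionary keys, and the sample window are assumptions, and because the original formulas are not reproduced in the text, the population forms of variance and standard deviation (dividing by n) are assumed.

```python
import statistics


def time_domain_features(window):
    """Time-domain features of one window of a power load series."""
    return {
        "mean": statistics.mean(window),
        # Population variance / standard deviation (divide by n): an assumption.
        "variance": statistics.pvariance(window),
        "std": statistics.pstdev(window),
        "max": max(window),
        "min": min(window),
        "range": max(window) - min(window),
        "mode": statistics.mode(window),
    }


# Hypothetical window of daily consumption values.
print(time_domain_features([2.0, 4.0, 4.0, 6.0]))
```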
  • Frequency domain features can be used to find the periodic information of a sequence.
  • Frequency domain analysis mainly uses fast Fourier transform.
  • The frequency domain features of the time series proposed in this application are as follows: the DC component; the mean, variance, standard deviation, skewness, and kurtosis of the shape; and the mean, variance, standard deviation, skewness, and kurtosis of the amplitude. The calculation of the features is briefly introduced below:
  • the direct current (DC) is the first component after Fourier transform. It is the average value of these signals, which is generally much larger than other numbers.
  • Correlation analysis of the features uses the Pearson correlation coefficient. Its value range is [-1, 1]. The larger the absolute value, the stronger the positive or negative correlation; a value of 0 indicates no linear correlation. Correlation analysis is performed on all the extracted features using this method, and the resulting correlation matrix is shown in Figure 4.
  • FIG. 4 shows the correlations among the 18 features extracted from the data set of the present application (the diagonal is the correlation of each feature with itself; since the data are identical, the value is 1 and carries no meaning).
  • The principle of PCA (Principal Component Analysis) dimensionality reduction is to analyze the eigenvalues of the covariance matrix and thereby obtain the principal components of the data.
  • PCA is used to eliminate the information overlap between the original features and enhance the effectiveness of the features.
  • the PCA calculation method is shown in Equation 20.
  • where $F_1, F_2, \dots, F_m$ denote the m principal components of the variables $X_1, X_2, \dots, X_S$, i.e.
  • Figures 5 and 6 are the renderings of reducing features to two and three dimensions, respectively.
  • Each dot in the figure represents a user, where the green dots represent normal users, and the red "+" dots represent abnormal users.
  • the points corresponding to anomalous users are mostly distributed in areas with low density.
  • the purpose of this application based on outlier detection is to find more outliers according to user density.
  • the point distribution corresponding to the abnormal user in FIG. 6 that is, the three-dimensional feature map
  • FIG. 5 that is, the two-dimensional feature map
  • Data standardization (normalization) processing is the basic work of data analysis. In order to eliminate the impact of different dimensions between features, the data needs to be standardized first. Data standardization is to scale the data proportionally so that the data falls into a smaller specific interval and make it into dimensionless pure numerical data. Through the processed data, the characteristics of different orders and different units can be calculated and compared for comprehensive evaluation.
  • This processing method (z-score normalization) rescales the data to zero mean and unit standard deviation, and its processing function is Equation 5.2: $x^{*} = \frac{x - \mu}{\sigma}$
  • where $\mu$ is the data mean and $\sigma$ is the sample standard deviation.
  • the z-score normalization method is applicable to the case where the data set contains outlier data beyond the value range.
  • the power load data belongs to real-time collection data, and sometimes there are abnormally large collection errors. There will be a large error when using the 0-1 normalization method.
  • the z-score normalization method is more suitable for the data set of this application.
  • The correlation matrix obtained after standardizing the features is shown in FIG. 7. Comparing FIG. 7 with FIG. 4 shows that the correlation matrix of the feature set is unchanged, so feature standardization does not alter the linear relationships between features and introduces no error into the experiment.
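  • The observation that standardization leaves the correlation matrix unchanged follows from the Pearson coefficient being invariant under shifting and positive rescaling. The short Python sketch below checks this numerically for one pair of features; the function names and the sample feature values are hypothetical.

```python
import math
import statistics


def z_score(values):
    """Standardize values using the mean and the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical feature columns on different scales.
mean_kwh = [3.1, 4.0, 5.2, 2.8, 6.5]
max_kwh = [9.0, 12.5, 13.0, 8.1, 20.4]
before = pearson(mean_kwh, max_kwh)
after = pearson(z_score(mean_kwh), z_score(max_kwh))
print(before, after, math.isclose(before, after))
```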
  • Figure 8 is a line chart of the change in AUC obtained according to the solution method in Section 2.2.1.
  • the abscissa is the ratio of the number of clusters to the total number of users, and the ordinate is the AUC value of classification effect.
  • The AUC value varies with the percentage, and the change is not monotonic. There is therefore an optimal value at which the AUC is largest and the algorithm is most efficient. Through multiple sets of experimental verification and comparative analysis, this application finds that the AUC reaches its optimum when the parameter n is chosen as 4%-5% of the total number of users in the data set. Therefore, the following conclusions can be drawn:
  • Figure 9 is a line chart of the change in AUC obtained by running the four data sets under different values of parameter a, according to the solution method in Section 2.2.1.
  • The abscissa is the value of parameter a, and the ordinate is the AUC value of the classification effect.
  • the first set of experiments is to use the unsupervised learning anomaly detection model to test the data set of this application.
  • The purpose is to compare the efficiency of field detection using the first-level graylist with that using the second-level graylist, and to show that the second-level graylist improves detection.
  • the second set of experiments is to compare the difference of the detection effect between the anomaly detection model based on unsupervised learning and the anomaly detection model based on semi-supervised learning, which proves that the detection effect of anomaly detection model based on semi-supervised learning is better.
  • This application uses an unsupervised learning-based anomaly detection model to detect whether there is an abnormal power consumption behavior such as power theft by a power user in a certain place without a blacklist.
  • the model detection results are now briefly analyzed.
  • a first-level gray list and a second-level gray list are generated.
  • the first-level gray list is generated by density-based Gaussian mixture model cluster analysis.
  • the second-level gray list is a list with suspiciousness formed by calculating local outliers on the basis of the first-level gray list.
  • The experimental data set used in this chapter is formed by randomly dividing the total data set into three groups, named data set one, data set two, and data set three, each matched with corresponding blacklist users (the blacklist users do not overlap with the users in the corresponding data set).
  • Figure 10 shows the cumulative recall rate curves of the first-level and second-level graylists generated by the three sets of data sets.
  • The abscissa is the detection rate, that is, the proportion of graylist users inspected, and the ordinate is the cumulative recall rate (a detection rate of 10% in this experiment means that the top 10% of the second-level graylist is inspected: those users are predicted to be abnormal and all other users are predicted to be normal; this is not repeated hereafter).
  • Panels a, b, and c in the figure each contain two lines.
  • The lower line, marked with large dots, is the cumulative recall curve of the first-level graylist at different detection rates; the upper line, marked with small triangles, is the cumulative recall curve of the second-level graylist.
  • The cumulative recall curve of the second-level graylist is always above that of the first-level graylist.
  • The cumulative recall rate of the first-level graylist rises steadily as the detection rate increases: each 10% increase in the detection rate raises the recall rate by roughly 10%. This indicates that abnormal power users are randomly scattered through the first-level graylist.
  • the second-level graylist is more targeted than the first-level graylist. Using the second-level graylist for on-site detection has higher detection efficiency.
  • The previous section analyzed experiments with the unsupervised anomaly detection model, in the absence of a large training set.
  • The unsupervised detection model has the advantage that it can be used for a first round of detection: it finds the outliers of the data set, i.e. users whose consumption behavior is highly suspicious, and so improves the efficiency of field inspection by power supply companies.
  • In practice, power supply companies carry out field surveys very frequently, and each round of surveys yields blacklisted users.
  • This application uses the behavior information of users in the blacklist library to pick out abnormal power users among the non-outlier users, further improving the recall and accuracy of detection over the previous section.
  • The DTW algorithm is used to compute the similarity between non-outlier users and users in the blacklist library.
  • The semi-supervised detection model first detects the outliers of the data set with the unsupervised detection model, and then computes behavior similarity for the remaining users, whom the system regards as non-outliers.
  • Figure 11 shows the classification accuracy, at different detection rates, of the second-level graylist generated by the unsupervised detection model and of the graylist generated by the semi-supervised detection model.
  • The abscissa represents the detection rate, i.e. the number of graylist users inspected, and the ordinate represents the accuracy of detection.
  • Panels a, b and c of Figure 11 each contain two lines.
  • The lower line, marked with small triangles, is the classification accuracy curve of the second-level graylist generated by the unsupervised detection model at different detection rates.
  • The upper line, marked with crosses, is the classification accuracy curve of the graylist generated by the semi-supervised detection model at different detection rates.
  • The curves follow largely the same trend on the three data sets. The line marked with crosses stays above the line marked with small triangles as the detection rate increases; that is, at the same detection rate, the semi-supervised detection model is always more accurate than the unsupervised model used alone.
  • The unsupervised detection model suits the initial stage of detection, when no blacklist library exists. Once a blacklist library is available, the semi-supervised detection model performs better.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

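The invention discloses a semi-supervised-learning-based method for detecting users with abnormal electricity consumption, belonging to the field of detection technology. The method comprises the following steps: data preprocessing; first-level graylist generation based on cluster analysis; second-level graylist generation based on outlier-degree calculation; and third-level graylist generation based on behavior-similarity calculation. The proposed detection model aims to produce a list of users ranked by suspiciousness, giving on-site manual inspection a prioritized inspection list and improving the accuracy and efficiency of on-site inspection.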

Description

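A Semi-Supervised-Learning-Based Method for Detecting Users with Abnormal Electricity Consumption
Technical Field
The invention belongs to the field of detection technology, and specifically relates to a semi-supervised-learning-based method for detecting users with abnormal electricity consumption.
Background
Studies show that operating losses caused by non-technical problems in China's power system amount to as much as ten billion US dollars every year. Non-technical losses are operating losses caused by fraudulent electricity use by consumers on the distribution-network side, such as electricity theft and fraud. With the continuing roll-out of the smart grid and the rapid development of sensing and data-acquisition technology, the volume of load data held by power companies has grown massively, which makes detecting abnormal users increasingly difficult.
In recent years, intelligent detection algorithms have been proposed to overcome the drawbacks of purely manual inspection, such as its blindness and low precision, to raise the hit rate of on-site inspection, and to reduce operating costs. Most current intelligent detection algorithms rely on supervised learning and presuppose a large labeled training set. In practice, however, no large training set is available for model training at the initial stage of data analysis and detection.
Summary of the Invention
In view of the above technical problems in the prior art, the invention proposes a semi-supervised-learning-based method for detecting users with abnormal electricity consumption. The method is reasonably designed, overcomes the shortcomings of the prior art, and achieves good results.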
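To achieve the above object, the invention adopts the following technical solution:
A semi-supervised-learning-based method for detecting users with abnormal electricity consumption, comprising the following steps:
Step 1: Data preprocessing
Preprocess the data set using moving-average interpolation;
Step 2: First-level graylist generation based on cluster analysis
Assume that most users are normal and that normal and abnormal users behave differently. Perform cluster analysis on the user feature sequences and find the points that fall in clusters with few members, i.e. users whose electricity-consumption behavior differs from that of most users. Cluster the users with an algorithm based on a Gaussian mixture model and mark the outlying users as suspicious; the outlier users selected by the cluster analysis form the first-level graylist;
Step 3: Second-level graylist generation based on outlier-degree calculation
Based on the first-level graylist, compute each user's outlier degree, judge the user's suspiciousness from it, and form a second-level graylist ranked by suspiciousness;
Step 4: Third-level graylist generation based on behavior-similarity calculation
Apply the third-level graylist generation algorithm based on behavior-similarity calculation: match against the abnormal behavior of users in the blacklist library, detect in each cluster the suspicious users whose behavior resembles that of blacklisted users, and form the third-level graylist.
Preferably, step 2 specifically comprises the following steps:
Step 2.1: Partition the users into n clusters using the clustering algorithm based on a Gaussian mixture model;
Step 2.2: Determine whether the number of members of each cluster is smaller than the outlier threshold k;
if the number of members of a cluster is smaller than the outlier threshold k, add the users in that cluster to the first-level graylist;
otherwise, if the number of members is greater than or equal to k, add them to the non-graylist users.
Preferably, step 3 specifically comprises the following steps:
Step 3.1: Compute the outlier factor value of each user in the first-level graylist using the local outlier factor algorithm;
Step 3.2: Add the first-level graylist users to the second-level graylist in descending order of their outlier factor values.
Preferably, step 4 specifically comprises the following steps:
Step 4.1: Taking the non-graylist users cluster by cluster, use the DTW algorithm to compute the behavior-similarity DTW value between each non-graylist user and the users in the blacklist library;
Step 4.2: Compute the mean DTW value of the members of each non-graylist cluster, and add the users in each cluster whose DTW value is below the cluster mean to the third-level graylist;
Step 4.3: Sort the users in the third-level graylist in ascending order of DTW value.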
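Beneficial technical effects of the invention:
The invention proposes a semi-supervised-learning-based model for detecting users with abnormal electricity consumption. It aims to produce a list of users ranked by suspiciousness, giving on-site manual inspection a prioritized inspection list and improving the accuracy and efficiency of on-site inspection.
Brief Description of the Drawings
Figure 1 is a framework diagram of the semi-supervised-learning-based method for detecting users with abnormal electricity consumption.
Figure 2 illustrates local outlier screening.
Figure 3 is a schematic diagram of how a user's DTW value is selected.
Figure 4 is a schematic diagram of the correlation matrix of the feature set.
Figure 5 shows the distribution of the data with two-dimensional features.
Figure 6 shows the distribution of the data with three-dimensional features.
Figure 7 is a schematic diagram of the correlation matrix of the feature set after normalization.
Figure 8 shows the relationship between the area under the ROC (receiver operating characteristic) curve, AUC, and the parameter n.
Figure 9 shows the relationship between the area under the ROC curve, AUC, and the parameter a.
Figure 10 shows the cumulative recall curves of the unsupervised anomaly detection model.
Figure 11 shows the accuracy of the unsupervised and the semi-supervised anomaly detection models.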
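Detailed Description
The invention is described in further detail below with reference to the drawings and specific embodiments:
1. Model steps and framework
The method is carried out in the following main steps:
First, assuming that most users are normal and that normal and abnormal (electricity-stealing) users behave differently, cluster analysis is used to select outlier users, which gives the first-level graylist.
Second, based on the first-level graylist, each user's outlier degree (LOF value) is computed and used to judge the user's suspiciousness, forming a second-level graylist ranked by suspiciousness.
Third, based on the second-level graylist, evidence of fraud is collected on site from the outlier users; the confirmed users form a blacklist, which is stored in the blacklist library.
Fourth, because some users may collude, so that a large number of abnormal users behave consistently, this application further processes the clusters obtained by the clustering in the first step. Specifically, the blacklist obtained from the on-site inspection in the third step is combined with the clusters obtained in the first step, and a third-level graylist generation algorithm based on behavior-similarity calculation is proposed. Using the abnormal behavior of users in the blacklist library, the algorithm detects in each cluster the suspicious users whose behavior resembles that of blacklisted users and forms the third-level graylist.
Fifth, based on the third-level graylist, evidence of collaborative or collusive fraud is collected on site; the confirmed users form a blacklist, which is stored in the blacklist library.
The framework of the whole method is shown in Figure 1. It is implemented in two main parts: detection of individual abnormal users based on unsupervised learning (the first-level and second-level graylist users), and detection of collaborating abnormal users based on semi-supervised learning (the first-level, second-level and third-level graylist users and the blacklist users).
2. Core algorithms of the model
In Figure 1, the unsupervised detection of individual abnormal users consists of three modules, whose core algorithms are: the data preprocessing method, the first-level graylist generation algorithm based on cluster analysis, and the second-level graylist generation algorithm based on outlier-degree calculation. The semi-supervised detection model in Figure 1 involves, in addition to these three, one further core algorithm: the third-level graylist generation algorithm based on behavior-similarity calculation. The processing of each module is described in detail below.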
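2.1 Data preprocessing method
Before the model is applied, the data must be preprocessed; this stage mainly covers data cleaning and organization. In practice, electricity-consumption data are collected in real time, and time-series acquisition is irreversible. Unintentional errors during acquisition often produce dirty data, i.e. null values, erroneous values, or isolated values that deviate from expectation. So that these do not affect the experimental results, outliers and missing values in the data set are interpolated before the experiments begin.
At present there are five mainstream ways of handling them: do nothing, filling with 0 or -1, linear interpolation, mean/median/mode interpolation, and moving-average interpolation.
(1) Do nothing: missingness is itself treated as information; all information is kept and missing entries are represented by null values.
(2) Filling with 0 or -1: one of the most common ways of handling missing values; it introduces the least subjective information and so avoids prediction bias caused by subjective choices.
(3) Linear interpolation: interpolates with a first-degree polynomial to fill in time-series data, and reduces the noise caused by lost information fairly well. It is mainly used with CNN and RNN networks.
(4) Mean/median/mode interpolation: the mean, median or mode of the sequence is inserted in place of missing values.
(5) Moving-average interpolation: if the value at position i of the time series is missing, the average of the data in a window before and after it is used as the imputed value.
Having considered these approaches, this application analyzed the data set used and found that most users' time series have few missing values, and that long runs of consecutive missing values are very rare in the series that do have gaps. For these reasons, missing values are handled by moving-average interpolation with a window size of 7, corresponding to the 7 days of a week.
This application preprocesses the data set by applying moving-average interpolation to the dirty data; this is the basis of the subsequent model detection.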
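As an illustration of the moving-average interpolation described above, the following is a minimal sketch rather than the patent's own implementation. The function name `moving_average_impute`, the use of `None` for a missing reading, the reading of "window" as 7 positions on each side of the gap, and the fallback to the overall mean are all assumptions made for the example.

```python
from statistics import mean
from typing import List, Optional


def moving_average_impute(series: List[Optional[float]], window: int = 7) -> List[float]:
    """Fill missing values (None) with the mean of nearby observed values.

    Assumption: for a gap at position i, the observed values in
    series[i - window : i + window + 1] are averaged. If that window holds
    no observed value, the mean of all observed values is used instead.
    """
    observed = [v for v in series if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    overall = float(mean(observed))

    filled: List[float] = []
    for i, value in enumerate(series):
        if value is not None:
            filled.append(float(value))
            continue
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        neighbours = [v for v in series[lo:hi] if v is not None]
        filled.append(float(mean(neighbours)) if neighbours else overall)
    return filled
```

With a daily series, `moving_average_impute(readings, window=7)` fills each gap from roughly one week of data on either side; a one-sided or 7-point-total window would be an equally plausible reading of the text.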
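2.2 First-level graylist generation algorithm based on cluster analysis
The core idea of the first-level graylist generation algorithm is to cluster the user feature sequences and find the points in clusters with few members, i.e. users whose electricity-consumption behavior differs from that of most users. This application clusters the users with an algorithm based on a Gaussian mixture model and marks the outlying users as suspicious.
The algorithm has two important parameters: the number of clusters n and the outlier threshold k. Its computational efficiency and accuracy depend on how they are set; a number of clusters or an outlier threshold that is too large or too small affects the final result. This application determines n and k dynamically from the size of the actual data set, as described below.
2.2.1 Determining the parameters n and k
(1) Determining the number of clusters n
The number of clusters in a cluster analysis has to be set manually. In practice, the number of electricity users to be inspected differs from region to region, so arbitrarily fixing a single optimal number of clusters lacks flexibility. This application therefore chooses the parameter as a proportion: the number of clusters is set as a percentage of the number of users, and the best value is selected through several groups of experiments.
With the outlier threshold held fixed, the experiments set the number of clusters to 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9% and 10% of the total number of users. The data set is randomly divided into four subsets of different orders of magnitude, and unsupervised detection of abnormal electricity-consumption behavior is run on each, with the number of clusters n set to 1%-10% of the size of the corresponding subset.
(2) Determining the outlier threshold k
Once clustering has been carried out with the optimal n, how is it decided which clusters are outlier clusters? The criterion is the parameter k: if a cluster has fewer than k members, its members are regarded as outlier objects, i.e. the users in clusters smaller than the threshold k are marked as outlier users. In practice, clustering with different numbers of clusters calls for different outlier thresholds. This application sets k on the basis of the optimal n as follows:
$$k = \frac{p}{n} + (a-1)\cdot 10, \qquad a \in \{1, 2, \ldots, 10\} \tag{1}$$
where k is the outlier threshold, p is the total number of users inspected, n is the number of clusters, and a is a natural number from 1 to 10.
The threshold experiments are run on the data set with the number of clusters n set to 4.5% of the size of the corresponding data set and a taking the natural numbers 1 to 10. Unsupervised detection of abnormal electricity-consumption behavior is carried out on the four subsets of different orders of magnitude.
2.2.2 First-level graylist generation algorithm based on cluster analysis
Algorithm 1 gives the first-level graylist generation procedure. Its main steps are as follows. First, the users are divided into n clusters by Gaussian cluster analysis (steps (2)-(7) of Algorithm 1), where the Gaussian probability is computed as in Equation (2); the purpose of the clustering is to pick out outlying points. Next, outliers are selected and added to the first-level graylist (steps (10)-(11) of Algorithm 1), and non-outlier objects are added to the non-graylist (steps (12)-(13) of Algorithm 1). The result is the first-level graylist user list list1 and the set M of non-graylist users.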
Figure PCTCN2018100379-appb-000001
Table 1: First-level graylist generation algorithm based on cluster analysis
Figure PCTCN2018100379-appb-000002
Figure PCTCN2018100379-appb-000003
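The cluster-size test at the heart of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the algorithm table: it takes cluster labels as given (they could come, for example, from a Gaussian mixture model fitted with any library), and the names `first_level_graylist`, `labels` and `non_gray` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def first_level_graylist(
    labels: Dict[Hashable, int], k: float
) -> Tuple[List[Hashable], Dict[int, List[Hashable]]]:
    """Split users into a first-level graylist and non-graylist clusters.

    labels maps each user id to its cluster index. Users in clusters with
    fewer than k members are treated as outliers (graylist); all other
    users are kept grouped by cluster for the later similarity step.
    """
    sizes = Counter(labels.values())
    graylist: List[Hashable] = []
    non_gray: Dict[int, List[Hashable]] = {}
    for user, cluster in labels.items():
        if sizes[cluster] < k:
            graylist.append(user)
        else:
            non_gray.setdefault(cluster, []).append(user)
    return graylist, non_gray
```

Keeping the non-graylist users grouped by cluster mirrors the later third-level step, which processes those clusters one at a time.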
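2.3 Second-level graylist generation algorithm based on outlier-degree calculation
The model above yields a first-level graylist of suspicious users. Field inspection showed, however, that although the first-level graylist captures many abnormal users, on large data sets it still contains a large number of users, so inspection is untargeted and inefficient. A second-level graylist generation algorithm based on outlier-degree calculation is therefore proposed on top of the first-level graylist produced by Algorithm 1.
The core idea is illustrated in Figure 2. The points of set C1 are fairly uniform in spacing, density and dispersion and can be regarded as one cluster; the points of set C2 can likewise be regarded as one cluster. Points O1 and O2 are relatively isolated and are regarded as anomalous or outlying points, whose outlier degree can then be computed within the whole set.
To compute the outlier degree of first-level graylist users, the LOF (Local Outlier Factor) algorithm is applied to the users in the first-level graylist; each user's outlier value is obtained and a list ranked by suspiciousness is generated. The time complexity is $O(n^2)$. The larger a user's LOF value, the more suspicious the user. Because the algorithm operates only on the first-level graylist, it avoids the very long running time of computing an LOF value for every user of a large data set. Under the assumption that abnormal users are far fewer than normal users, i.e. that most people are honest, outlier objects usually make up only a small part of the data set, and computing the LOF value of every object just to find a few outliers would be very inefficient and time-consuming. The second-level graylist produced by the outlier-degree algorithm is a list of users ranked by suspiciousness; it solves the lack of targeting of first-level graylist inspection and can improve the accuracy and efficiency of on-site inspection.
Based on these ideas, Algorithm 2 gives the second-level graylist generation procedure. Its main steps are as follows: take the first-level graylist as input, compute the LOF value of each user in it according to Definition 1, sort the users in descending order of LOF value, and write them to the second-level graylist (steps (2)-(5) of Algorithm 2). The aim is to compute each outlier user's outlier degree, i.e. degree of suspiciousness.
Definition 1. The local outlier factor is defined as:
Figure PCTCN2018100379-appb-000004
where lrd is the local reachability density function and MinPts is the number of nearest neighbors. If $lrd_{MinPts}(p)$ is small while the values $lrd_{MinPts}(o)$ of the neighbors o of p are large, the LOF value of p is large. Conversely, if p is not an outlier, the lrd values of p and of the objects in its neighborhood differ little, and the LOF value of p is close to 1. The higher the LOF value, the greater the outlier degree.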
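The equation of Definition 1 survives here only as an image reference. For orientation, the standard definition of the local outlier factor, written with the same symbols lrd and MinPts, is given below. That the patent's figure matches it term by term is an assumption; $N_{MinPts}(p)$, denoting the MinPts-nearest-neighbor set of p, is notation introduced for this note.

```latex
LOF_{MinPts}(p) \;=\; \frac{1}{\lvert N_{MinPts}(p) \rvert}
  \sum_{o \,\in\, N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}
```

Read this way, the LOF of p is the average ratio of its neighbors' local reachability density to its own, which is why a value near 1 indicates a point as dense as its surroundings.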
Table 2: Second-level graylist generation algorithm based on outlier-degree calculation
Figure PCTCN2018100379-appb-000005
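The ranking step of Algorithm 2 can be sketched with a brute-force LOF computation. This is a minimal illustration assuming the standard LOF definition and Euclidean distance on feature vectors; the function names `lof_scores` and `second_level_graylist` are hypothetical, and treating exact duplicates (infinite density) as LOF 1 is a simplification made for the example.

```python
import math
from typing import Hashable, List, Sequence, Tuple


def lof_scores(points: Sequence[Sequence[float]], min_pts: int) -> List[float]:
    """Return the local outlier factor of every point (brute force, O(n^2))."""
    n = len(points)
    if not 1 <= min_pts < n:
        raise ValueError("min_pts must satisfy 1 <= min_pts < len(points)")
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist: List[float] = []          # distance to the min_pts-th nearest neighbour
    neighbours: List[List[int]] = []  # all points within that distance
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[min_pts - 1]]
        k_dist.append(kd)
        neighbours.append([j for j in order if dist[i][j] <= kd])

    lrd: List[float] = []             # local reachability density
    for i in range(n):
        reach = [max(k_dist[o], dist[i][o]) for o in neighbours[i]]
        total = sum(reach)
        lrd.append(len(reach) / total if total > 0 else math.inf)

    scores: List[float] = []
    for i in range(n):
        if math.isinf(lrd[i]):        # duplicates only: treat as an inlier
            scores.append(1.0)
            continue
        ratios = [lrd[o] / lrd[i] for o in neighbours[i]]
        scores.append(sum(ratios) / len(ratios))
    return scores


def second_level_graylist(
    user_ids: Sequence[Hashable], points: Sequence[Sequence[float]], min_pts: int
) -> List[Tuple[Hashable, float]]:
    """Rank first-level graylist users by LOF, most suspicious first."""
    scored = zip(user_ids, lof_scores(points, min_pts))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the score is computed only over the first-level graylist, the quadratic cost applies to that smaller set, which is the efficiency argument made in the text.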
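2.4 Third-level graylist generation algorithm based on behavior-similarity calculation
As Figure 1 shows, the semi-supervised detection model proceeds in the following steps:
1) Take the set C of users not in the second-level graylist cluster by cluster and, for the clusters in parallel, compute the behavior-similarity DTW value between each user and the users in the blacklist library.
2) Check whether each user's DTW value is smaller than the mean DTW value of its cluster; if so, add the user to the third-level graylist.
3) Sort the third-level graylist in ascending order of DTW value, so that the users most similar to the blacklist come first.
4) Output the third-level graylist and end the detection.
The whole process involves one core algorithm, the third-level graylist generation algorithm based on behavior-similarity calculation. It uses the DTW (Dynamic Time Warping) algorithm to compute user similarity, mainly because the time series of the inspected users are mostly of unequal length. Most current similarity calculations use the Euclidean distance, which cannot compare two sequences of unequal length. DTW can stretch and compress two time series of unequal length to compute the distance between them and hence judge their similarity.
The basic idea of the algorithm is that the fraudulent techniques used in abnormal consumption such as electricity theft are limited in number, and that over several rounds of detection the blacklist library gradually accumulates and updates users' abnormal behavior. The users to be inspected are therefore compared with the blacklist library by behavior similarity, to find users highly similar to blacklisted users, i.e. users whose consumption behavior resembles that of users in the blacklist.
The algorithm computes against the blacklist library cluster by cluster in parallel, which greatly shortens the computation time. Because the blacklist library has many members, each user to be inspected obtains one similarity (DTW) value for every member of the library. DTW measures the similarity of two time series by the sum of the distances between their matched points, known as the warp path distance.
The DTW value is computed as follows. Let X and Y be two time series of lengths |X| and |Y|. The warp path is $W = w_1, w_2, \ldots, w_K$ with $\max(|X|,|Y|) \le K \le |X|+|Y|$. Each element is $w_k = (i, j)$, where i is an index into X and j is an index into Y. The path starts at $w_1 = (1,1)$ and ends at $w_K = (|X|,|Y|)$, so that every index of X and Y appears in W. In addition, i and j increase monotonically along W, so that the lines connecting matched points of the two series do not cross. Monotonic increase here means:
$$w_k = (i,j),\quad w_{k+1} = (i',j'), \qquad i \le i' \le i+1,\quad j \le j' \le j+1 \tag{4}$$
The cumulative distance along the resulting warp path satisfies the recurrence
$$D(i,j) = \mathrm{Dist}(i,j) + \min\left[D(i-1,j),\ D(i,j-1),\ D(i-1,j-1)\right] \tag{5}$$
The warp path distance sought is $D(|X|,|Y|)$, which is computed by dynamic programming.
How a user's DTW value is set in this application is shown in Figure 3. For example, if user a has the three DTW values 100, 200 and 300, the smallest is taken as the user's own DTW value, so user a's DTW value is 100. Because the algorithm aims to find users highly similar to the blacklist library, the minimum of a user's DTW values is chosen: it is the distance between the user and the blacklisted user whose behavior is closest.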
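The recurrence (5) and the minimum-over-blacklist rule can be sketched as below. This is an illustrative sketch only: the source does not state the point-wise distance Dist(i, j), so the absolute difference is assumed, and the names `dtw_distance` and `user_dtw_value` are hypothetical.

```python
from typing import Callable, Iterable, Sequence


def dtw_distance(
    x: Sequence[float],
    y: Sequence[float],
    dist: Callable[[float, float], float] = lambda a, b: abs(a - b),
) -> float:
    """Warp path distance D(|X|, |Y|) computed by dynamic programming.

    Implements D(i, j) = Dist(i, j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1))
    using two rolling rows; the series may have different lengths.
    """
    if not x or not y:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    prev = [inf] * (len(y) + 1)
    prev[0] = 0.0  # D(0, 0) = 0 anchors the path at (1, 1)
    for i in range(1, len(x) + 1):
        curr = [inf] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            cost = dist(x[i - 1], y[j - 1])
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(y)]


def user_dtw_value(series: Sequence[float], blacklist: Iterable[Sequence[float]]) -> float:
    """A user's DTW value: the smallest distance to any blacklisted series."""
    return min(dtw_distance(series, b) for b in blacklist)
```

Taking the minimum means one close match in the blacklist library is enough to give a user a low (suspicious) DTW value, as in the 100/200/300 example above.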
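Algorithm 3 gives the third-level graylist generation procedure. Its main steps are as follows. First, the behavior similarity between the users in the blacklist library and the clusters of non-graylist users is computed in parallel (steps (1)-(4) of Algorithm 3); this yields each inspected user's shortest distance to the blacklisted users, i.e. its greatest similarity. Then the mean DTW value of each cluster is computed, and the users below the mean are added to the third-level graylist list3 (steps (5)-(6) of Algorithm 3); the other users are added to the list of normal users (steps (7)-(8) of Algorithm 3). Finally, list3 is sorted in ascending order to give the final third-level graylist (step (10) of Algorithm 3).
Table 3: Third-level graylist generation algorithm based on behavior-similarity calculation
Figure PCTCN2018100379-appb-000006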
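The selection and sorting steps of Algorithm 3 can be sketched as follows, assuming each non-graylist user's DTW value (minimum distance to the blacklist library) has already been computed. The function name `third_level_graylist` and the nested-dictionary input layout are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, Hashable, List, Tuple


def third_level_graylist(
    cluster_dtw: Dict[int, Dict[Hashable, float]]
) -> List[Tuple[Hashable, float]]:
    """Select users whose DTW value is below their cluster's mean.

    cluster_dtw maps a cluster index to {user id: DTW value}. The result
    is sorted in ascending order of DTW value, so the users most similar
    to blacklisted behaviour come first.
    """
    graylist: List[Tuple[Hashable, float]] = []
    for members in cluster_dtw.values():
        if not members:
            continue
        threshold = mean(members.values())
        graylist.extend((u, v) for u, v in members.items() if v < threshold)
    return sorted(graylist, key=lambda item: item[1])
```

Using a per-cluster mean rather than a global one keeps the threshold relative to each behavior group, which matches the cluster-by-cluster processing described in the text.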
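3. Experimental verification
3.1 Data set description
The data set consists of user electricity-consumption data published by a regional power company, covering January 2016 to January 2017. It contains 3,000 honest users and 400 electricity-stealing users. The distribution of user types is shown in Table 4.
Table 4: Distribution of user types
User type                          Number of users
Honest users (0)                   3000
Electricity-stealing users (1)     400
A user's consumption pattern is represented by the user's average daily consumption. Further features of the consumption pattern can be extracted from the data set of this application; the attributes of the data set are detailed in Table 5.
Table 5: Data set attributes
Figure PCTCN2018100379-appb-000007
3.2 Experimental setup
This application proposes 18 features of the user load sequence and, through experiments, analyzes, normalizes and reduces the dimensionality of the features so that features of different units and magnitudes can be used together. Two groups of experiments assign values to the two parameters of this application.
This section describes the experimental setup. Sections 3.2.1 to 3.2.3 set up the features, and Sections 3.2.4 and 3.2.5 set the parameters; Section 3.3 then compares and analyzes the detection results under unsupervised learning (first-level and second-level graylists) and semi-supervised learning (third-level graylist plus blacklist library). For the features, Section 3.2.1 presents the 18 proposed features, Section 3.2.2 analyzes the relationships among them when applied to the data set of this application and reduces their dimensionality, and Section 3.2.3 normalizes the load sequence features so that features of different units and magnitudes can be used together. For the parameters, Sections 3.2.4 and 3.2.5 determine the optimal values of the two parameters experimentally.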
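3.2.1 Load sequence features
This application extracts 18 features in total from the time domain and the frequency domain of each user's consumption time series, as follows:
(1) Time-domain features
Time-domain features are the time-related attributes of a sequence as it varies over time. The time-domain features proposed are: mean, variance, standard deviation, maximum, minimum, the difference between maximum and minimum, and mode. Let n be the size of a time window (the number of data rows in the window) and let i index the i-th row; the features are computed as follows:
a. Mean:
Figure PCTCN2018100379-appb-000008
b. Variance:
Figure PCTCN2018100379-appb-000009
c. Standard deviation:
Figure PCTCN2018100379-appb-000010
d. Maximum:
$$\max = \max(a_i),\quad i \in \{1, 2, \ldots, n\} \tag{9}$$
e. Minimum:
$$\min = \min(a_i),\quad i \in \{1, 2, \ldots, n\} \tag{10}$$
f. Difference between maximum and minimum:
$$\max - \min \tag{11}$$
g. Mode:
The mode of a time series is the value that occurs most often in it.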
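The seven time-domain features can be computed as in the sketch below. Because the variance and standard-deviation equations survive only as image references, the population forms (dividing by n) are an assumption here; the function name `time_domain_features` and the choice of the first-encountered value when several modes tie are also assumptions.

```python
from statistics import mean, multimode, pstdev, pvariance
from typing import Dict, Sequence


def time_domain_features(window: Sequence[float]) -> Dict[str, float]:
    """Compute the seven time-domain features of one window of readings."""
    if not window:
        raise ValueError("window must not be empty")
    highest, lowest = max(window), min(window)
    return {
        "mean": mean(window),
        "variance": pvariance(window),   # population variance (assumed)
        "std": pstdev(window),           # population standard deviation (assumed)
        "max": highest,
        "min": lowest,
        "range": highest - lowest,
        "mode": multimode(window)[0],    # first mode if several values tie
    }
```

For real-valued meter readings the mode is only informative after rounding or binning, since exact repeats are otherwise rare.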
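(2) Frequency-domain features
Frequency-domain features reveal periodic information in a sequence; frequency-domain analysis mainly uses the fast Fourier transform. The frequency-domain features proposed are: the DC component; the mean, variance, standard deviation, skewness and kurtosis of the shape; and the mean, variance, standard deviation, skewness and kurtosis of the amplitude. They are computed as follows:
DC component
The DC (direct current) component is the first component after the Fourier transform; it is the mean of the signal and is generally much larger than the other components.
Statistical features of the shape
Let C(i) be the frequency amplitude value of the i-th window and N the number of windows,
Figure PCTCN2018100379-appb-000011
then the shape statistics are computed as follows:
a. Mean:
Figure PCTCN2018100379-appb-000012
b. Standard deviation:
Figure PCTCN2018100379-appb-000013
c. Skewness:
Figure PCTCN2018100379-appb-000014
d. Kurtosis:
Figure PCTCN2018100379-appb-000015
Statistical features of the amplitude
Let C(i) be the frequency amplitude value of the i-th window and N the number of windows; the amplitude statistics are computed as follows:
a. Mean:
Figure PCTCN2018100379-appb-000016
b. Standard deviation:
Figure PCTCN2018100379-appb-000017
c. Skewness:
Figure PCTCN2018100379-appb-000018
d. Kurtosis:
Figure PCTCN2018100379-appb-000019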
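3.2.2 Load sequence feature reduction
To use the features efficiently, a correlation analysis is carried out on all extracted features using the Pearson correlation coefficient. Its value lies in [-1, 1]: the larger its absolute value, the stronger the positive or negative correlation, and a value of 0 indicates no linear correlation. Applying it to all extracted features gives the correlation matrix shown in Figure 4.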
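A correlation matrix of this kind can be built from pairwise Pearson coefficients, as in the following sketch. The function names `pearson` and `correlation_matrix` and the dictionary-of-columns layout are assumptions made for the example; the coefficient itself is the standard one.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def correlation_matrix(
    features: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Pairwise Pearson coefficients for named feature columns."""
    names = list(features)
    return {(a, b): pearson(features[a], features[b]) for a in names for b in names}
```

Entries on the diagonal are always 1, which is why the text disregards them when reading Figure 4.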
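Figure 4 shows the correlations among the 18 features extracted from the data set of this application. The larger the absolute value, the stronger the correlation. Figure 4 shows that some features are strongly correlated; yellow and purple indicate very high correlation (ignoring the diagonal, where each feature is compared with itself and the value is trivially 1). To remove the correlation between features, their dimensionality must be reduced. This application uses principal component analysis (PCA) to reconstruct the features into new, mutually uncorrelated variables, eliminating the effect of overlapping information among the original features.
PCA works by analyzing the eigenvalues of the covariance matrix to obtain the principal components of the data. Here PCA is used to remove the information overlap between the original features and make the features more effective. The computation is given in Equation (20).
Let $F_1, F_2, \ldots, F_m$ denote the m principal components of the original variables $X_1, X_2, \ldots, X_S$, i.e.
Figure PCTCN2018100379-appb-000020
To visualize the separation after dimensionality reduction, the features are reduced to two and to three dimensions; Figures 5 and 6 show the respective results. Each point in the figures is a user: green dots are normal users and red "+" marks are abnormal users. The points of abnormal users mostly lie in regions of low density. The aim of outlier-based anomaly detection in this application is to find more outlier objects from the user density. The figures show that clearly more abnormal users lie in low-density regions in Figure 6 (three-dimensional features) than in Figure 5 (two-dimensional features). This gives the following conclusion.
Conclusion 1: Reducing the user behavior features to three dimensions allows abnormal users to be detected effectively.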
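3.2.3 Load sequence feature standardization
Data standardization (normalization) is a basic step of data analysis. To remove the effect of different units among features, the data are first standardized: they are scaled proportionally into a small fixed interval and become dimensionless pure numbers. After this processing, features of different units and magnitudes can be computed with, compared and evaluated together.
The features extracted in this application are used in the cluster analysis, which relies on Euclidean distance, so the effect of units must be removed so that every feature has equal standing, i.e. the same weight. Two standardization methods are commonly used:
1) Z-score normalization
This method rescales the data to zero mean and unit standard deviation; the transformation is Equation (21):
$$x^{*} = \frac{x - \mu}{\sigma} \tag{21}$$
where μ is the mean of the data and σ is the sample standard deviation.
2) Min-max (0-1) normalization
This method applies a linear transformation that maps the data into the interval [0, 1]; the transformation is Equation (22):
Figure PCTCN2018100379-appb-000021
where max is the maximum of the data and min is the minimum.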
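The two normalization methods can be sketched as follows. The z-score follows Equation (21) with the sample standard deviation. The min-max formula survives only as an image reference, so the usual form (x - min) / (max - min) is assumed. The function names `z_score` and `min_max` are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def z_score(values: Sequence[float]) -> List[float]:
    """Rescale to zero mean and unit sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for constant data")
    return [(v - mu) / sigma for v in values]


def min_max(values: Sequence[float]) -> List[float]:
    """Map values linearly onto [0, 1] (assumed form of the min-max formula)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("min-max scaling is undefined for constant data")
    return [(v - lowest) / (highest - lowest) for v in values]
```

A single very large erroneous reading compresses all other min-max values toward 0, whereas the z-score only shifts and rescales them; this is the sensitivity the text cites when choosing between the two.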
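This application chooses z-score standardization, which is suitable when the data set contains outlying values beyond the expected range. Load data are collected in real time and occasionally contain abnormally large erroneous readings, in which case min-max normalization introduces large errors. z-score standardization is therefore better suited to the data set of this application. The correlation matrix of the standardized features is shown in Figure 7. Comparing Figure 7 with Figure 4 shows that the correlation matrix of the feature set is unchanged, so standardization does not alter the linear relationships between the features and introduces no error into the experiments.
3.2.4 Determining the optimal number of clusters n
Figure 8 is a line chart of the AUC obtained with the procedure of Section 2.2.1; the horizontal axis is the number of clusters as a proportion of the total number of users, and the vertical axis is the AUC of the classification.
Figure 8 shows that the AUC varies with the percentage, and not monotonically, so there is an optimal value that maximizes the AUC and hence the effectiveness of the algorithm. Several groups of comparative experiments show that the AUC is best when n is set to 4%-5% of the size of the data set. This gives the following conclusion:
Conclusion 2: Classification is best when the number of clusters n is set to 4.5% of the size of the data set.
3.2.5 Determining the optimal outlier threshold k
Figure 9 is a line chart of the AUC obtained on the four data subsets with the procedure of Section 2.2.1 for different values of the parameter a; the horizontal axis is the value of a and the vertical axis is the AUC of the classification.
Figure 9 shows that the AUC varies with a, but not monotonically: it first rises and later falls, with an optimum in between. Several groups of experiments show that the result is best when a is 3. This gives the following conclusion:
Conclusion 3: The optimal outlier threshold k is obtained with a = 3.
For example, if the data set contains 800 normal and abnormal users in total, the previous section gives n = p × 4.5% = 800 × 4.5% = 36 (i.e. 36 clusters), and this section gives a = 3, so k = p/n + (a-1)·10 = 800 ÷ 36 + (3-1) × 10 ≈ 42 (i.e. the outlier threshold k is 42).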
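The worked example above can be reproduced with a few lines. This is a sketch: the function names `cluster_count` and `outlier_threshold` are hypothetical, and since the source does not say how fractional values are rounded, the cluster count is rounded to the nearest integer and k is returned unrounded.

```python
def cluster_count(p: int, ratio: float = 0.045) -> int:
    """Number of clusters n as a share of the p users (4.5% by default)."""
    return max(1, round(p * ratio))


def outlier_threshold(p: int, n: int, a: int = 3) -> float:
    """Outlier threshold k = p / n + (a - 1) * 10, for a in 1..10."""
    if not 1 <= a <= 10:
        raise ValueError("a must be a natural number from 1 to 10")
    return p / n + (a - 1) * 10


n = cluster_count(800)            # 36 clusters
k = outlier_threshold(800, n, 3)  # about 42.2
```

Clusters with fewer than k members (here, 42 or fewer) would then be treated as outlier clusters.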
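3.3 Experimental results and analysis
Two groups of experiments are carried out on the unsupervised and the semi-supervised anomaly detection models proposed in this application. The first applies the unsupervised model to the data set of this application, to compare the field-detection efficiency obtained with the first-level graylist against that obtained with the second-level graylist, and to show the positive effect of the second-level graylist on actual detection. The second compares the detection performance of the unsupervised and the semi-supervised models on the data set and shows that the semi-supervised model performs better.
3.3.1 Results of the unsupervised anomaly detection model
Without any blacklist, this application uses the unsupervised anomaly detection model to detect whether electricity users in a certain region exhibit abnormal consumption such as electricity theft. The detection results are briefly analyzed below.
The detection experiments produce a first-level and a second-level graylist. The first-level graylist is produced by density-based Gaussian mixture model cluster analysis. The second-level graylist is a list ranked by suspiciousness, formed by computing local outlier factors on the first-level graylist. The experimental data sets used here were formed by randomly splitting the total data set into three equal groups, named data set one, data set two and data set three, and each was matched with corresponding blacklist users (who do not overlap with the users in the corresponding data set). Figure 10 shows the cumulative recall curves of the first-level and second-level graylists for the three data sets. The horizontal axis is the detection rate, i.e. the number of graylist users inspected, and the vertical axis is the cumulative recall of detection (here a detection rate of 10% means that the top 10% of the second-level graylist is inspected: those users are predicted to be abnormal and all other users are predicted to be normal; this is not repeated below).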
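A cumulative recall curve of the kind plotted in Figure 10 can be computed from a ranked graylist as sketched below. The function name `cumulative_recall`, the rounding of the inspected share to a whole number of users, and the use of the supplied set of abnormal users as the recall denominator are assumptions made for the example.

```python
from typing import Hashable, List, Sequence, Set


def cumulative_recall(
    ranked_users: Sequence[Hashable],
    abnormal: Set[Hashable],
    rates: Sequence[float],
) -> List[float]:
    """Recall achieved when only the top share of a ranked list is inspected.

    For each detection rate r, the first round(r * len(ranked_users)) users
    are predicted abnormal and everyone else normal; recall is the fraction
    of the known abnormal users found among them.
    """
    if not abnormal:
        raise ValueError("abnormal must contain at least one user")
    curve: List[float] = []
    for rate in rates:
        cutoff = round(rate * len(ranked_users))
        hits = sum(1 for user in ranked_users[:cutoff] if user in abnormal)
        curve.append(hits / len(abnormal))
    return curve
```

For an unordered list the curve rises roughly in proportion to the detection rate, while a well-ranked list rises steeply at first; this contrast is what the comparison of the two graylists relies on.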
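In Figure 10, panels a, b and c each contain two lines. The lower line, marked with large dots, is the cumulative recall curve of the first-level graylist at different detection rates, and the upper line, marked with small triangles, is the cumulative recall curve of the second-level graylist at different detection rates. In all three experiments the curve of the second-level graylist lies above that of the first-level graylist throughout. The cumulative recall of the first-level graylist grows steadily as the detection rate rises: raising the detection rate by 10% raises recall by roughly 10%. This indicates that abnormal users are scattered through the first-level graylist without any pattern.
Figure 10 clearly shows two growth phases in the cumulative recall curve of the second-level graylist, a rapid one and a steady one. Below a detection rate of 0.3 the curve rises very fast; above 0.3 it rises markedly more slowly. The two phases differ in meaning and importance for anomaly detection. The rapid phase shows that inspecting the first 30% of users finds about 70% of the abnormal users; the later phase shows that inspecting the remaining 70% of users finds only the other 30% of abnormal users. In other words, inspecting the small portion of users at the head of the list finds most of the abnormal users. This shows that abnormal users are not scattered at random in the second-level graylist, in clear contrast to the first-level graylist. In summary, the following conclusions can be drawn:
Conclusion 4: The second-level graylist is more targeted than the first-level graylist; using it for field inspection gives higher detection efficiency.
Conclusion 5: When the second-level graylist is used for field inspection, inspecting only the first 30% of users already gives a high recall of abnormal users, i.e. most abnormal users are found by inspecting the small portion of users at the head of the list.
These experiments show that the unsupervised detection model combining cluster analysis and local outlier calculation can detect abnormal users efficiently.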
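3.3.2 Results of the semi-supervised detection model
The previous section analyzed experiments with the unsupervised anomaly detection model, in the absence of a large training set. The unsupervised model has the advantage that it can be used for a first round of detection: it finds the outliers of the data set, i.e. users whose consumption behavior is highly suspicious, and so improves the efficiency of field inspection by power supply companies. In practice, power supply companies carry out field surveys very frequently, and each round of surveys yields blacklisted users. To guard against group offending by non-outlier users, which the unsupervised model alone would miss, this application uses the behavior information of users in the blacklist library to pick out abnormal users among the non-outlier users, further improving recall and accuracy over the previous section. The DTW algorithm computes the similarity between non-outlier users and users in the blacklist library: the lower the DTW value, the higher the similarity and the more likely the user is abnormal. The semi-supervised detection model first detects the outliers of the data set with the unsupervised model and then computes behavior similarity for the remaining users, whom the system regards as non-outliers.
Figure 11 shows the classification accuracy, at different detection rates, of the second-level graylist generated by the unsupervised detection model and of the graylist generated by the semi-supervised detection model. The horizontal axis is the detection rate, i.e. the number of graylist users inspected, and the vertical axis is the accuracy of detection.
In Figure 11, panels a, b and c each contain two lines. The lower line, marked with small triangles, is the classification accuracy curve of the second-level graylist generated by the unsupervised detection model at different detection rates; the upper line, marked with crosses, is the classification accuracy curve of the graylist generated by the semi-supervised detection model. The curves follow largely the same trend on the three data sets, and the line marked with crosses stays above the line marked with small triangles as the detection rate increases. That is, at the same detection rate, the semi-supervised detection model is always more accurate than the unsupervised model used alone. The figure also shows that the accuracy of the semi-supervised model peaks at a detection rate of about 30%-40%, where it exceeds 85%, which is of significant value for on-site inspection. The analysis gives the following conclusions:
Conclusion 6: The unsupervised detection model suits the initial stage of detection, when no blacklist library exists. Once a blacklist library is available, the semi-supervised detection model performs better.
Conclusion 7: The accuracy of the semi-supervised detection model exceeds 85% at a detection rate of about 30%-40%, which is of significant value for on-site inspection.
In practice, smart electricity-theft devices are becoming more advanced and group offending is very likely. Adding the semi-supervised detection model makes it possible to detect some group offending efficiently, improving detection efficiency and saving manpower, materials and money.
Of course, the above description does not limit the invention, and the invention is not restricted to the above examples. Changes, modifications, additions or substitutions made by those skilled in the art within the essential scope of the invention also fall within its scope of protection.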

Claims (4)

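  1. A semi-supervised-learning-based method for detecting users with abnormal electricity consumption, characterized by comprising the following steps:
    Step 1: Data preprocessing
    Preprocess the data set using moving-average interpolation;
    Step 2: First-level graylist generation based on cluster analysis
    Assume that most users are normal and that normal and abnormal users behave differently; perform cluster analysis on the user feature sequences and find the points that fall in clusters with few members, i.e. users whose electricity-consumption behavior differs from that of most users; cluster the users with an algorithm based on a Gaussian mixture model and mark the outlying users as suspicious, the outlier users selected by the cluster analysis forming the first-level graylist;
    Step 3: Second-level graylist generation based on outlier-degree calculation
    Based on the first-level graylist, compute each user's outlier degree, judge the user's suspiciousness from it, and form a second-level graylist ranked by suspiciousness;
    Step 4: Third-level graylist generation based on behavior-similarity calculation
    Using the third-level graylist generation algorithm based on behavior-similarity calculation, match against the abnormal behavior of users in the blacklist library, detect in each cluster the suspicious users whose behavior resembles that of blacklisted users, and form the third-level graylist.
  2. The semi-supervised-learning-based method for detecting users with abnormal electricity consumption according to claim 1, characterized in that step 2 specifically comprises the following steps:
    Step 2.1: Partition the users into n clusters using the clustering algorithm based on a Gaussian mixture model;
    Step 2.2: Determine whether the number of members of each cluster is smaller than the outlier threshold k;
    if the number of members of a cluster is smaller than the outlier threshold k, add the users in that cluster to the first-level graylist;
    otherwise, if the number of members is greater than or equal to k, add them to the non-graylist users.
  3. The semi-supervised-learning-based method for detecting users with abnormal electricity consumption according to claim 1, characterized in that step 3 specifically comprises the following steps:
    Step 3.1: Compute the outlier factor value of each user in the first-level graylist using the local outlier factor algorithm;
    Step 3.2: Add the first-level graylist users to the second-level graylist in descending order of their outlier factor values.
  4. The semi-supervised-learning-based method for detecting users with abnormal electricity consumption according to claim 1, characterized in that step 4 specifically comprises the following steps:
    Step 4.1: Taking the non-graylist users cluster by cluster, use the DTW algorithm to compute the behavior-similarity DTW value between each non-graylist user and the users in the blacklist library;
    Step 4.2: Compute the mean DTW value of the members of each non-graylist cluster, and add the users in each cluster whose DTW value is below the cluster mean to the third-level graylist;
    Step 4.3: Sort the users in the third-level graylist in ascending order of DTW value.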
PCT/CN2018/100379 2018-06-13 2018-08-14 一种基于半监督学习的异常用电用户检测方法 WO2019237492A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810604295.1 2018-06-13
CN201810604295.1A CN108805747A (zh) 2018-06-13 2018-06-13 一种基于半监督学习的异常用电用户检测方法

Publications (1)

Publication Number Publication Date
WO2019237492A1 true WO2019237492A1 (zh) 2019-12-19

Family

ID=64085381

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100379 WO2019237492A1 (zh) 2018-06-13 2018-08-14 一种基于半监督学习的异常用电用户检测方法

Country Status (2)

Country Link
CN (1) CN108805747A (zh)
WO (1) WO2019237492A1 (zh)

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242701A (zh) * 2020-02-27 2020-06-05 国网北京市电力公司 一种电压异常时追补电费的方法
CN111401460A (zh) * 2020-03-24 2020-07-10 南京师范大学镇江创新发展研究院 一种基于限值学习的异常电量数据辨识方法
CN111612037A (zh) * 2020-04-24 2020-09-01 平安直通咨询有限公司上海分公司 异常用户检测方法、装置、介质及电子设备
CN111783875A (zh) * 2020-06-29 2020-10-16 中国平安财产保险股份有限公司 基于聚类分析的异常用户检测方法、装置、设备及介质
CN111784093A (zh) * 2020-03-27 2020-10-16 国网浙江省电力有限公司 一种基于电力大数据分析的企业复工辅助判断方法
CN111915211A (zh) * 2020-08-11 2020-11-10 广东电网有限责任公司广州供电局 一种电力资源调度方法、装置和电子设备
CN112365164A (zh) * 2020-11-13 2021-02-12 国网江苏省电力有限公司扬州供电分公司 基于改进密度峰值快速搜索聚类算法的中大型能源用户用能特性画像方法
CN112488236A (zh) * 2020-12-07 2021-03-12 北京工业大学 一种集成的无监督学生行为聚类方法
CN112560940A (zh) * 2020-12-14 2021-03-26 广东电网有限责任公司广州供电局 一种用电异常检测方法、装置、设备和存储介质
CN112836747A (zh) * 2021-02-02 2021-05-25 首都师范大学 眼动数据的离群处理方法及装置、计算机设备、存储介质
CN112861989A (zh) * 2021-03-04 2021-05-28 水利部信息中心 一种基于密度筛选的深度神经网络回归模型
CN113469428A (zh) * 2021-06-24 2021-10-01 珠海卓邦科技有限公司 用水性质异常识别方法及装置、计算机装置及存储介质
CN113486971A (zh) * 2021-07-19 2021-10-08 国网山东省电力公司日照供电公司 基于主成分分析和神经网络的用户状态识别方法、系统
CN113591400A (zh) * 2021-08-23 2021-11-02 北京邮电大学 一种基于特征相关性分区回归的电力调度监控数据异常检测方法
CN113592533A (zh) * 2021-06-30 2021-11-02 国网上海市电力公司 一种基于无监督学习的异常用电检测方法及系统
CN113673579A (zh) * 2021-07-27 2021-11-19 国网湖北省电力有限公司营销服务中心(计量中心) 一种基于小样本的用电负荷分类算法
CN113780402A (zh) * 2021-09-07 2021-12-10 福州大学 一种基于改进式生成对抗网络的用户窃电检测方法
CN113822343A (zh) * 2021-09-03 2021-12-21 国网江苏省电力有限公司营销服务中心 一种基于细粒度用能数据的群租房识别方法
CN114067093A (zh) * 2021-09-23 2022-02-18 济南大学 基于时序与图像的散乱污用户精准捕获方法及系统
CN114089006A (zh) * 2021-11-19 2022-02-25 国网冀北电力有限公司唐山供电公司 一种低压窃电分析仪及使用方法
CN114553565A (zh) * 2022-02-25 2022-05-27 国网山东省电力公司临沂供电公司 一种基于请求频率的安全态势感知方法和系统
CN115147203A (zh) * 2022-06-08 2022-10-04 南京金威诚融科技开发有限公司 基于大数据的金融风险智能分析方法
CN115456097A (zh) * 2022-09-22 2022-12-09 国网四川省电力公司自贡供电公司 一种适用于高供低计专变用户的用电检测方法及检测终端
CN115508511A (zh) * 2022-09-19 2022-12-23 中节能天融科技有限公司 基于网格化设备全参数特征分析的传感器自适应校准方法
CN116051985A (zh) * 2022-12-20 2023-05-02 中国科学院空天信息创新研究院 一种基于多模型互馈学习的半监督遥感目标检测方法
CN116541731A (zh) * 2023-05-26 2023-08-04 北京百度网讯科技有限公司 网络行为数据的处理方法、装置和设备
CN116628529A (zh) * 2023-07-21 2023-08-22 山东科华电力技术有限公司 一种用于用户侧智能负荷控制系统的数据异常检测方法
CN116777124A (zh) * 2023-08-24 2023-09-19 国网山东省电力公司临沂供电公司 一种基于用户用电行为的窃电监测方法
CN116862081A (zh) * 2023-09-05 2023-10-10 北京建工环境修复股份有限公司 一种污染治理设备运维方法及系统
CN116976707A (zh) * 2023-09-22 2023-10-31 安徽融兆智能有限公司 基于用电信息采集的用户用电数据异常分析方法及系统
CN117009910A (zh) * 2023-10-08 2023-11-07 湖南工程学院 一种环境温度异常变化智能监测方法
CN117113248A (zh) * 2023-08-10 2023-11-24 深圳市华翌科技有限公司 基于数据驱动的燃气气量数据异常检测方法
CN117272198A (zh) * 2023-09-08 2023-12-22 广东美亚商旅科技有限公司 一种基于商旅行程业务数据的异常用户生成内容识别方法
CN117591971A (zh) * 2023-07-10 2024-02-23 国网四川省电力公司营销服务中心 一种基于多粒度模糊相对差的无监督窃电检测方法
CN117648647A (zh) * 2024-01-29 2024-03-05 国网山东省电力公司经济技术研究院 一种多能源配电网用户数据优化分类方法
TWI837819B (zh) 2022-09-12 2024-04-01 財團法人資訊工業策進會 用電行為分析裝置及用電行為分析方法

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805747A (zh) * 2018-06-13 2018-11-13 山东科技大学 一种基于半监督学习的异常用电用户检测方法
CN110046796A (zh) * 2019-01-04 2019-07-23 国网浙江省电力有限公司 一种基于机器学习模型的电力风险客户筛选方法
CN109727446B (zh) * 2019-01-15 2021-03-05 华北电力大学(保定) 一种用电数据异常值的识别与处理方法
CN109978358B (zh) * 2019-03-18 2021-08-13 中国科学院自动化研究所 基于半监督学习的销售风险点检测系统、装置
CN111708813A (zh) * 2019-03-18 2020-09-25 顺丰科技有限公司 一种用户日常行为异常检测方法和装置
CN111723118A (zh) * 2019-03-18 2020-09-29 顺丰科技有限公司 一种运单查询异常行为检测方法和装置
CN111723825A (zh) * 2019-03-18 2020-09-29 顺丰科技有限公司 一种客户信息查询异常行为检测方法和装置
CN110288383B (zh) * 2019-05-31 2024-02-02 国网上海市电力公司 基于用户属性标签的群体行为配电网用电异常检测方法
CN112017324A (zh) * 2019-05-31 2020-12-01 上海凌晗电子科技有限公司 一种驾驶信息实时交互系统及方法
CN110363510B (zh) * 2019-06-05 2022-09-06 西安电子科技大学 一种基于区块链的加密货币用户特征挖掘、异常用户检测方法
CN110736888A (zh) * 2019-10-24 2020-01-31 国网上海市电力公司 一种用户用电行为异常的监测方法
CN110929800B (zh) * 2019-11-29 2022-10-21 四川万益能源科技有限公司 一种基于sax算法的商业体异常用电检测方法
CN111428780B (zh) * 2020-03-20 2023-04-07 上海理工大学 基于数据驱动的电网异常运行状态识别方法
CN111504366B (zh) * 2020-03-23 2022-01-25 李方 基于人工智能的流体输送系统精确计量方法及计量装置
CN111539843B (zh) * 2020-04-17 2022-07-12 国网新疆电力有限公司营销服务中心(资金集约中心、计量中心) 基于数据驱动的反窃电智能预警方法
CN111785014B (zh) * 2020-05-26 2021-10-29 浙江工业大学 一种基于dtw-rgcn的路网交通数据修复的方法
CN111612650B (zh) * 2020-05-27 2022-06-17 福州大学 一种基于dtw距离的电力用户分群方法及系统
CN111738308A (zh) * 2020-06-03 2020-10-02 浙江中烟工业有限责任公司 基于聚类及半监督学习的监控指标动态阈值检测方法
CN111797143B (zh) * 2020-07-07 2023-12-15 长沙理工大学 基于用电量统计分布偏度系数的水产养殖业窃电检测方法
CN112541016A (zh) * 2020-11-26 2021-03-23 南方电网数字电网研究院有限公司 用电异常检测方法、装置、计算机设备和存储介质
CN112633427B (zh) * 2021-03-15 2021-05-28 四川大学 一种基于离群点检测的超高次谐波发射信号检测方法
CN113052398A (zh) * 2021-04-21 2021-06-29 广州高谱技术有限公司 一种基于变分模态分解的用电量预测方法及其系统
CN113344589B (zh) * 2021-05-12 2022-10-21 兰州理工大学 一种基于vaegmm模型的发电企业串谋行为的智能识别方法
CN113723497A (zh) * 2021-08-26 2021-11-30 广西大学 基于混合特征提取及Stacking模型的异常用电检测方法、装置、设备及存储介质
CN117556108B (zh) * 2024-01-12 2024-03-26 泰安金冠宏食品科技有限公司 一种基于数据分析的油渣分离效率异常检测方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103839197A (zh) * 2014-03-19 2014-06-04 国家电网公司 一种基于eemd方法的用户用电行为异常的判定方法
CN105141604A (zh) * 2015-08-19 2015-12-09 国家电网公司 一种基于可信业务流的网络安全威胁检测方法及系统
CN106850346A (zh) * 2017-01-23 2017-06-13 北京京东金融科技控股有限公司 用于监控节点变化及辅助识别黑名单的方法、装置及电子设备
CN108805747A (zh) * 2018-06-13 2018-11-13 山东科技大学 一种基于半监督学习的异常用电用户检测方法

Cited By (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242701A (zh) * 2020-02-27 2020-06-05 国网北京市电力公司 一种电压异常时追补电费的方法
CN111401460A (zh) * 2020-03-24 2020-07-10 南京师范大学镇江创新发展研究院 一种基于限值学习的异常电量数据辨识方法
CN111784093A (zh) * 2020-03-27 2020-10-16 国网浙江省电力有限公司 一种基于电力大数据分析的企业复工辅助判断方法
CN111784093B (zh) * 2020-03-27 2023-07-11 国网浙江省电力有限公司 一种基于电力大数据分析的企业复工辅助判断方法
CN111612037A (zh) * 2020-04-24 2020-09-01 平安直通咨询有限公司上海分公司 异常用户检测方法、装置、介质及电子设备
CN111612037B (zh) * 2020-04-24 2024-06-21 平安直通咨询有限公司上海分公司 异常用户检测方法、装置、介质及电子设备
CN111783875A (zh) * 2020-06-29 2020-10-16 中国平安财产保险股份有限公司 基于聚类分析的异常用户检测方法、装置、设备及介质
CN111783875B (zh) * 2020-06-29 2024-04-30 中国平安财产保险股份有限公司 基于聚类分析的异常用户检测方法、装置、设备及介质
CN111915211A (zh) * 2020-08-11 2020-11-10 广东电网有限责任公司广州供电局 一种电力资源调度方法、装置和电子设备
CN112365164A (zh) * 2020-11-13 2021-02-12 国网江苏省电力有限公司扬州供电分公司 基于改进密度峰值快速搜索聚类算法的中大型能源用户用能特性画像方法
CN112365164B (zh) * 2020-11-13 2023-09-12 国网江苏省电力有限公司扬州供电分公司 基于改进密度峰值快速搜索聚类算法的中大型能源用户用能特性画像方法
CN112488236A (zh) * 2020-12-07 2021-03-12 北京工业大学 一种集成的无监督学生行为聚类方法
CN112488236B (zh) * 2020-12-07 2024-05-28 北京工业大学 一种集成的无监督学生行为聚类方法
CN112560940A (zh) * 2020-12-14 2021-03-26 广东电网有限责任公司广州供电局 一种用电异常检测方法、装置、设备和存储介质
CN112836747A (zh) * 2021-02-02 2021-05-25 首都师范大学 眼动数据的离群处理方法及装置、计算机设备、存储介质
CN112861989A (zh) * 2021-03-04 2021-05-28 水利部信息中心 一种基于密度筛选的深度神经网络回归模型
CN113469428A (zh) * 2021-06-24 2021-10-01 珠海卓邦科技有限公司 用水性质异常识别方法及装置、计算机装置及存储介质
CN113592533A (zh) * 2021-06-30 2021-11-02 国网上海市电力公司 一种基于无监督学习的异常用电检测方法及系统
CN113592533B (zh) * 2021-06-30 2023-09-12 国网上海市电力公司 一种基于无监督学习的异常用电检测方法及系统
CN113486971B (zh) * 2021-07-19 2023-10-27 国网山东省电力公司日照供电公司 基于主成分分析和神经网络的用户状态识别方法、系统
CN113486971A (zh) * 2021-07-19 2021-10-08 国网山东省电力公司日照供电公司 基于主成分分析和神经网络的用户状态识别方法、系统
CN113673579B (zh) * 2021-07-27 2024-05-28 国网湖北省电力有限公司营销服务中心(计量中心) 一种基于小样本的用电负荷分类算法
CN113673579A (zh) * 2021-07-27 2021-11-19 国网湖北省电力有限公司营销服务中心(计量中心) 一种基于小样本的用电负荷分类算法
CN113591400B (zh) * 2021-08-23 2023-06-27 北京邮电大学 一种基于特征相关性分区回归的电力调度监控数据异常检测方法
CN113591400A (zh) * 2021-08-23 2021-11-02 北京邮电大学 一种基于特征相关性分区回归的电力调度监控数据异常检测方法
CN113822343A (zh) * 2021-09-03 2021-12-21 国网江苏省电力有限公司营销服务中心 一种基于细粒度用能数据的群租房识别方法
CN113822343B (zh) * 2021-09-03 2023-08-25 国网江苏省电力有限公司营销服务中心 一种基于细粒度用能数据的群租房识别方法
CN113780402A (zh) * 2021-09-07 2021-12-10 福州大学 一种基于改进式生成对抗网络的用户窃电检测方法
CN114067093A (zh) * 2021-09-23 2022-02-18 济南大学 基于时序与图像的散乱污用户精准捕获方法及系统
CN114089006A (zh) * 2021-11-19 2022-02-25 国网冀北电力有限公司唐山供电公司 一种低压窃电分析仪及使用方法
CN114089006B (zh) * 2021-11-19 2023-12-05 国网冀北电力有限公司唐山供电公司 一种低压窃电分析仪及使用方法
CN114553565A (zh) * 2022-02-25 2022-05-27 国网山东省电力公司临沂供电公司 一种基于请求频率的安全态势感知方法和系统
CN114553565B (zh) * 2022-02-25 2024-02-02 国网山东省电力公司临沂供电公司 一种基于请求频率的安全态势感知方法和系统
CN115147203A (zh) * 2022-06-08 2022-10-04 南京金威诚融科技开发有限公司 基于大数据的金融风险智能分析方法
CN115147203B (zh) * 2022-06-08 2024-03-15 阿尔法时刻科技(深圳)有限公司 基于大数据的金融风险分析方法
TWI837819B (zh) 2022-09-12 2024-04-01 財團法人資訊工業策進會 用電行為分析裝置及用電行為分析方法
CN115508511A (zh) * 2022-09-19 2022-12-23 中节能天融科技有限公司 基于网格化设备全参数特征分析的传感器自适应校准方法
CN115508511B (zh) * 2022-09-19 2023-05-26 中节能天融科技有限公司 基于网格化设备全参数特征分析的传感器自适应校准方法
CN115456097A (zh) * 2022-09-22 2022-12-09 国网四川省电力公司自贡供电公司 一种适用于高供低计专变用户的用电检测方法及检测终端
CN116051985B (zh) * 2022-12-20 2023-06-23 中国科学院空天信息创新研究院 一种基于多模型互馈学习的半监督遥感目标检测方法
CN116051985A (zh) * 2022-12-20 2023-05-02 中国科学院空天信息创新研究院 一种基于多模型互馈学习的半监督遥感目标检测方法
CN116541731A (zh) * 2023-05-26 2023-08-04 北京百度网讯科技有限公司 网络行为数据的处理方法、装置和设备
CN117591971A (zh) * 2023-07-10 2024-02-23 国网四川省电力公司营销服务中心 一种基于多粒度模糊相对差的无监督窃电检测方法
CN116628529B (zh) * 2023-07-21 2023-10-20 山东科华电力技术有限公司 一种用于用户侧智能负荷控制系统的数据异常检测方法
CN116628529A (zh) * 2023-07-21 2023-08-22 山东科华电力技术有限公司 一种用于用户侧智能负荷控制系统的数据异常检测方法
CN117113248B (zh) * 2023-08-10 2024-06-11 深圳市华翌科技有限公司 基于数据驱动的燃气气量数据异常检测方法
CN117113248A (zh) * 2023-08-10 2023-11-24 深圳市华翌科技有限公司 基于数据驱动的燃气气量数据异常检测方法
CN116777124A (zh) * 2023-08-24 2023-09-19 国网山东省电力公司临沂供电公司 一种基于用户用电行为的窃电监测方法
CN116777124B (zh) * 2023-08-24 2023-11-07 国网山东省电力公司临沂供电公司 一种基于用户用电行为的窃电监测方法
CN116862081B (zh) * 2023-09-05 2023-11-21 北京建工环境修复股份有限公司 一种污染治理设备运维方法及系统
CN116862081A (zh) * 2023-09-05 2023-10-10 北京建工环境修复股份有限公司 一种污染治理设备运维方法及系统
CN117272198B (zh) * 2023-09-08 2024-05-28 广东美亚商旅科技有限公司 一种基于商旅行程业务数据的异常用户生成内容识别方法
CN117272198A (zh) * 2023-09-08 2023-12-22 广东美亚商旅科技有限公司 一种基于商旅行程业务数据的异常用户生成内容识别方法
CN116976707A (zh) * 2023-09-22 2023-10-31 安徽融兆智能有限公司 基于用电信息采集的用户用电数据异常分析方法及系统
CN116976707B (zh) * 2023-09-22 2023-12-26 安徽融兆智能有限公司 基于用电信息采集的用户用电数据异常分析方法及系统
CN117009910A (zh) * 2023-10-08 2023-11-07 湖南工程学院 一种环境温度异常变化智能监测方法
CN117009910B (zh) * 2023-10-08 2023-12-15 湖南工程学院 一种环境温度异常变化智能监测方法
CN117648647B (zh) * 2024-01-29 2024-04-23 国网山东省电力公司经济技术研究院 一种多能源配电网用户数据优化分类方法
CN117648647A (zh) * 2024-01-29 2024-03-05 国网山东省电力公司经济技术研究院 一种多能源配电网用户数据优化分类方法

Also Published As

Publication number Publication date
CN108805747A (zh) 2018-11-13

Similar Documents

Publication Publication Date Title
WO2019237492A1 (zh) 一种基于半监督学习的异常用电用户检测方法
Rajabi et al. A comparative study of clustering techniques for electrical load pattern segmentation
Himeur et al. Robust event-based non-intrusive appliance recognition using multi-scale wavelet packet tree and ensemble bagging tree
Qu et al. A combined genetic optimization with AdaBoost ensemble model for anomaly detection in buildings electricity consumption
CN103323749B (zh) 多分类器信息融合的局部放电诊断方法
CN109657547A (zh) 一种基于伴随模型的异常轨迹分析方法
Yeckle et al. Detection of electricity theft in customer consumption using outlier detection algorithms
Keyan et al. An improved support-vector network model for anti-money laundering
CN109902564B (zh) 一种基于结构相似性稀疏自编码网络的异常事件检测方法
CN110942099A (zh) 一种基于核心点保留的dbscan的异常数据识别检测方法
CN111783845A (zh) 一种基于局部线性嵌入和极限学习机的隐匿虚假数据注入攻击检测方法
WO2019200739A1 (zh) 数据欺诈识别方法、装置、计算机设备和存储介质
CN113542241A (zh) 一种基于CNN-BiGRU混合模型的入侵检测方法及装置
Kong et al. Anomaly detection based on joint spatio-temporal learning for building electricity consumption
CN114580934A (zh) 基于无监督异常检测的食品检测数据风险的早预警方法
CN114169998A (zh) 一种金融大数据分析与挖掘算法
CN114037001A (zh) 基于wgan-gp-c和度量学习的机械泵小样本故障诊断方法
CN116365519B (zh) 一种电力负荷预测方法、系统、存储介质及设备
CN113343123A (zh) 一种生成对抗多关系图网络的训练方法和检测方法
CN117493953A (zh) 一种基于缺陷数据挖掘的避雷器状态评估方法
CN117197591A (zh) 一种基于机器学习的数据分类方法
CN107454084B (zh) 基于杂交带的最近邻入侵检测算法
CN115545342A (zh) 一种企业电费回收的风险预测方法与系统
CN115017988A (zh) 一种用于状态异常诊断的竞争聚类方法
Jiang et al. Classification of surface defects based on improved Gabor filter

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18922812

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18922812

Country of ref document: EP

Kind code of ref document: A1