CN111898637A - Feature selection algorithm based on ReliefF-DDC

Feature selection algorithm based on ReliefF-DDC

Info

Publication number
CN111898637A
Authority
CN
China
Prior art keywords
features
feature
sample
equal
algorithm
Prior art date
Legal status
Granted
Application number
CN202010597594.4A
Other languages
Chinese (zh)
Other versions
CN111898637B (en)
Inventor
邵琪
包永强
贾成宇
张旭旭
陆志文
Current Assignee
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date
Filing date
Publication date
Application filed by Nanjing Institute of Technology
Priority to CN202010597594.4A
Publication of CN111898637A
Application granted
Publication of CN111898637B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S20/00Management or operation of end-user stationary applications or the last stages of power distribution; Controlling, monitoring or operating thereof
    • Y04S20/20End-user application control systems
    • Y04S20/242Home appliances

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a feature selection algorithm based on ReliefF-DDC, comprising: S1, obtaining training set samples and determining the parameter values of the algorithm; S2, resetting all feature weights to 0 and initializing F and S as empty sets; S3, selecting samples from the training set, updating the weights of all feature dimensions they contain, using ReliefF to compute the correlation between each feature and each class in order to determine the 'important features', and eliminating irrelevant features; S4, adding the features whose weights exceed the threshold to the set in descending order of weight; S5, using the DDC algorithm to analyze the correlation between features and the decision variable and remove redundant features; and S6, obtaining the optimal feature subset and using the selected features for non-intrusive load identification. The method effectively reduces the feature dimension, improves the load recognition rate and shortens the running time of the algorithm.

Description

Feature selection algorithm based on ReliefF-DDC
Technical Field
The invention relates to a non-intrusive load feature selection algorithm, and in particular to a feature selection algorithm based on ReliefF-DDC.
Background
Non-intrusive Load Monitoring (NILM) provides data support for interaction between the smart grid and its users. A sensor is installed at the service entrance to collect electrical data such as the voltage and current of the total load, which are then analyzed to identify the category and operating state of the household appliances. Compared with intrusive load monitoring (ILM), NILM offers low cost, high user acceptance and convenient maintenance, but it places high demands on the load disaggregation algorithm. Feature extraction and load identification are two key technologies in NILM and provide strong technical support for its development.
Most current research effort is devoted to feature selection and load identification methods for electrical loads, and a series of results have been obtained in related fields. By contrast, studies on the selection of load features remain somewhat lacking. Feature selection chooses, according to some evaluation criterion, an optimal feature subset from the original high-dimensional features for the subsequent task; a small number of representative features can speed up model training and improve the model's generalization ability. Feature selection is widely used in image processing, data mining, machine learning and related fields. When high-dimensional data containing a large number of features is processed, the features inevitably include noise, irrelevant features and redundant features, so it is necessary to extract the most valuable, information-rich features.
Disclosure of Invention
1. The technical problem to be solved is as follows:
To address this problem, the invention provides a feature selection algorithm based on ReliefF-DDC. First, feature weights are computed and sorted in descending order, and features with larger weights are selected so that irrelevant features are removed. Second, the mutual information between each feature and the decision variable is computed, and decision correlation analysis is used to delete redundant features, giving the final feature subset. Finally, a twin support vector machine (TWSVM) is used for load identification. The method effectively reduces the feature dimension, improves the load recognition rate and shortens the running time of the algorithm.
2. The technical scheme is as follows:
A feature selection algorithm based on ReliefF-DDC, characterized by comprising the following steps:
Step one: acquire training set samples and determine the parameter values involved in the algorithm; specifically:
S11, let the training set to be processed be D, with sample X_l = {x_l1, x_l2, …, x_ld}, where x_ld is the d-th dimension feature of the l-th sample in the training set D.
S12, determine the number of iterations m, with m ≥ 1; the feature weight threshold τ, with 0 ≤ τ ≤ 1; the number k of nearest-neighbor samples, an integer with k ≥ 1; and the evaluation criterion threshold, which lies between 0 and 1.
Step two: reset all feature weights in the samples to 0 and initialize F and S as empty sets.
Step three: select a sample X_l from the training set D and update the weights of all feature dimensions it contains; use the correlation between each feature and each class computed by ReliefF to determine the 'important features' and exclude irrelevant features; specifically:
S31, in each of the m iterations, randomly select a sample X_l from the training set D; X_l belongs to class C. Among the samples of the same class as X_l, find its k nearest-neighbor samples H_j, j = 1, 2, …, k; among the samples of the other classes, find its k nearest-neighbor samples M_j.
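A minimal sketch of this neighbor search follows, assuming Euclidean distance over the feature vectors; the function name nearest_hits_misses, the distance choice and the per-class handling of misses are illustrative assumptions, not requirements taken from the patent.

    import numpy as np

    def nearest_hits_misses(X, y, l, k):
        # For sample X[l]: k nearest same-class neighbors (hits H_j) and,
        # for every other class C, k nearest neighbors of that class (misses M_j(C)).
        dists = np.linalg.norm(X - X[l], axis=1)               # Euclidean distance to every sample
        same = np.where((y == y[l]) & (np.arange(len(y)) != l))[0]
        hits = same[np.argsort(dists[same])[:k]]                # H_j, j = 1..k
        misses = {}
        for c in np.unique(y):
            if c == y[l]:
                continue
            other = np.where(y == c)[0]
            misses[c] = other[np.argsort(dists[other])[:k]]     # M_j(C), j = 1..k
        return hits, misses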
S32, for 1 ≤ r ≤ d, update the weight W(r) of the r-th dimension feature:
[Formula (1), shown only as an image in the original: the ReliefF weight update for W(r)]
[Formula (2), shown only as an image in the original: the definition of diff]
In formulas (1) and (2), P(C) denotes the prior probability of class C in the data set, and M_j(C) denotes the j-th nearest-neighbor sample of class C; diff(d, A, B) denotes the degree of difference between sample A and sample B on the d-th dimension feature value.
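Formulas (1) and (2) appear only as images in the original document. For reference, the standard ReliefF update that the surrounding description matches is written out below in LaTeX; this is a reconstruction from the text, not a verbatim copy of the patent's equations.

    W(r) \leftarrow W(r)
      - \sum_{j=1}^{k} \frac{\mathrm{diff}(r, X_l, H_j)}{m k}
      + \sum_{C \neq \mathrm{class}(X_l)} \frac{P(C)}{1 - P(\mathrm{class}(X_l))}
        \sum_{j=1}^{k} \frac{\mathrm{diff}(r, X_l, M_j(C))}{m k}

    \mathrm{diff}(r, A, B) = \frac{\lvert A_r - B_r \rvert}{\max(r) - \min(r)}
    \quad \text{for a numerical feature } r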
Step four: traverse all values of r from S32; for every feature whose weight W(r) exceeds the threshold τ, add the corresponding feature to the set F in descending order of weight, giving F = {f_1, f_2, …, f_n}, n < d.
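Step four amounts to sorting the weights and keeping those above τ. A short sketch, assuming the weight vector W from S32 is held in a NumPy array (the function name and example values are illustrative):

    import numpy as np

    def select_relevant_features(W, tau):
        # Indices of features whose weight exceeds tau, in descending order of weight (the set F).
        order = np.argsort(W)[::-1]                 # all d features, heaviest first
        return [int(r) for r in order if W[r] > tau]

    # Example: W = np.array([0.40, 0.02, 0.31]) and tau = 0.1 give F = [0, 2], so n = 2 < d = 3.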
Step five: use the DDC algorithm to analyze the correlation between features and the decision variable and remove redundant features; specifically:
S51, when the evaluation criterion e(S) and the candidate feature f_j satisfy the following conditions:
[Formulas (3) and (4), shown only as images in the original: the DDC selection conditions on I(C; f) and Q_C(f, s)]
let F ← F − {f_j}, S ← S + {f_j};
where e(S) is the feature subset evaluation criterion constructed jointly from I(C; f) and Q_C(f, s), defined as follows:
[Formula (5), shown only as an image in the original: the feature subset evaluation criterion e(S)]
In formula (5), the entropy H(C) is a measure of the uncertainty of the random variable C, and the mutual information I(C; f_j) is the information shared between the random variables C and f_j, defined as:
I(C, f_j) = H(C) − H(C|f_j)    (6)
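For completeness, the entropies in formula (6) expand as below; these are the standard definitions, stated here for reference rather than quoted from the patent.

    H(C) = -\sum_{c} P(c)\,\log_2 P(c), \qquad
    H(C \mid f_j) = -\sum_{v} P(f_j = v) \sum_{c} P(c \mid f_j = v)\,\log_2 P(c \mid f_j = v)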
In formulas (4) and (5), f and s denote two features, and the correlation measure Q_C(f, s) is given by
[Formulas (7) and (8), shown only as images in the original: the correlation measure Q_C(f, s)]
S52, judge whether F is an empty set: if F ≠ ∅, return to step four; if F is empty, continue to step six.
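Formulas (3)-(5), (7) and (8) appear only as images in the original, so the exact DDC criterion cannot be reproduced here. The sketch below therefore covers only what the text itself defines: the mutual information I(C; f_j) = H(C) − H(C|f_j) computed from discretized features, together with a greedy loop that moves features from F to S subject to a caller-supplied evaluation function standing in for e(S). The function names, the discretization assumption and the loop structure are illustrative, not the patent's definition of DDC.

    import numpy as np

    def entropy(labels):
        # H(C) of a discrete label array, in bits.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def mutual_information(c, f):
        # I(C; f) = H(C) - H(C|f) for discrete arrays c and f (formula (6)).
        h_cond = sum((f == v).mean() * entropy(c[f == v]) for v in np.unique(f))
        return entropy(c) - h_cond

    def greedy_redundancy_filter(Xd, y, F, evaluate, delta):
        # Rank candidates by I(C; f_j); move a feature from F to S only while the
        # stand-in criterion evaluate(S) stays below delta; otherwise treat it as redundant.
        S = []
        for fj in sorted(F, key=lambda j: mutual_information(y, Xd[:, j]), reverse=True):
            if evaluate(S + [fj]) < delta:
                S.append(fj)
        return S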
Step six: obtain the optimal feature subset S and use the selected features for non-intrusive load identification.
3. Advantageous effects:
(1) Feature extraction and load identification are two key technologies in non-intrusive load monitoring (NILM) and provide strong technical support for its development. In a non-intrusive load monitoring system, poor selection of data features leads to low load identification accuracy, so extracting features of the load operating data for load identification plays an important role in improving the accuracy of non-intrusive load identification. Feature selection is widely used in image processing, data mining, machine learning and related fields; when high-dimensional data containing a large number of features is processed, the features inevitably include noise, irrelevant features and redundant features, and it is then necessary to extract the most valuable, information-rich features. The ReliefF algorithm solves the problem that the original Relief algorithm cannot perform feature selection on multi-class data and handles incomplete and noisy data well, but it cannot delete redundant features. The correlation measure proposed by the DDC algorithm fully considers the correlation and dependence between features and the decision variable, and compares the feature subset evaluation measure with a set threshold in order to screen out redundant features.
(2) The invention provides a feature selection algorithm based on ReliefF-DDC. First, feature weights are computed and sorted in descending order, and features with larger weights are selected so that irrelevant features are removed. Second, the mutual information between each feature and the decision variable is computed, and decision correlation analysis is used to delete redundant features, giving the final feature subset. Finally, a twin support vector machine (TWSVM) is used for load identification. The method effectively reduces the feature dimension, improves the load recognition rate and shortens the running time of the algorithm.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in the flow chart of FIG. 1, a feature selection algorithm based on ReliefF-DDC includes the following steps:
Step one: acquire training set samples and determine the parameter values involved in the algorithm; specifically:
S11, let the training set to be processed be D, with sample X_l = {x_l1, x_l2, …, x_ld}, where x_ld is the d-th dimension feature of the l-th sample in the training set D.
S12, determine the number of iterations m, with m ≥ 1; the feature weight threshold τ, with 0 ≤ τ ≤ 1; the number k of nearest-neighbor samples, an integer with k ≥ 1; and the evaluation criterion threshold, which lies between 0 and 1.
Step two: reset all feature weights in the samples to 0 and initialize F and S as empty sets.
Step three: select a sample X_l from the training set D and update the weights of all feature dimensions it contains; use the correlation between each feature and each class computed by ReliefF to determine the 'important features' and exclude irrelevant features; specifically:
S31, in each of the m iterations, randomly select a sample X_l from the training set D; X_l belongs to class C. Among the samples of the same class as X_l, find its k nearest-neighbor samples H_j, j = 1, 2, …, k; among the samples of the other classes, find its k nearest-neighbor samples M_j.
S32, for 1 ≤ r ≤ d, update the weight W(r) of the r-th dimension feature:
[Formula (1), shown only as an image in the original: the ReliefF weight update for W(r)]
[Formula (2), shown only as an image in the original: the definition of diff]
In formulas (1) and (2), P(C) denotes the prior probability of class C in the data set, and M_j(C) denotes the j-th nearest-neighbor sample of class C; diff(d, A, B) denotes the degree of difference between sample A and sample B on the d-th dimension feature value.
Step four: traverse all values of r from S32; for every feature whose weight W(r) exceeds the threshold τ, add the corresponding feature to the set F in descending order of weight, giving F = {f_1, f_2, …, f_n}, n < d.
Step five: use the DDC algorithm to analyze the correlation between features and the decision variable and remove redundant features; specifically:
S51, when the evaluation criterion e(S) and the candidate feature f_j satisfy the following conditions:
[Formulas (3) and (4), shown only as images in the original: the DDC selection conditions on I(C; f) and Q_C(f, s)]
let F ← F − {f_j}, S ← S + {f_j};
where e(S) is the feature subset evaluation criterion constructed jointly from I(C; f) and Q_C(f, s), defined as follows:
[Formula (5), shown only as an image in the original: the feature subset evaluation criterion e(S)]
In formula (5), the entropy H(C) is a measure of the uncertainty of the random variable C, and the mutual information I(C; f_j) is the information shared between the random variables C and f_j, defined as:
I(C, f_j) = H(C) − H(C|f_j)    (6)
In formulas (4) and (5), f and s denote two features, and the correlation measure Q_C(f, s) is given by
[Formulas (7) and (8), shown only as images in the original: the correlation measure Q_C(f, s)]
S52, judge whether F is an empty set: if F ≠ ∅, return to step four; if F is empty, continue to step six.
Step six: obtain the optimal feature subset S and use the selected features for non-intrusive load identification.
Taking non-intrusive load monitoring as the research background, the invention provides a feature selection algorithm based on ReliefF-DDC in order to select load features effectively and improve the accuracy of non-intrusive load identification. The algorithm first extracts the features of each electrical load, uses the ReliefF algorithm to compute, for each feature of a selected sample, the distances to its nearest-neighbor samples of the same class and of different classes to obtain the feature weights, sorts the weights in descending order, and removes irrelevant features according to a set weight threshold. Second, the DDC algorithm analyzes the degree of dependence between features and classes by computing mutual information, and the decision criterion is quantified as a comparison of the feature subset evaluation measure against a set threshold so as to delete redundant features. Finally, a twin support vector machine is used for identification and classification. Experimental results show that the method effectively reduces the feature dimension, improves the load recognition rate and shortens the running time of the algorithm.
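As a rough sketch of the final identification stage, the snippet below feeds the selected feature columns to a classifier. scikit-learn's SVC is used purely as a stand-in because a twin support vector machine (TWSVM) implementation is not assumed to be available, and the data, labels and selected column indices are placeholders for illustration only.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))        # placeholder load-feature matrix (one row per window)
    y = rng.integers(0, 4, size=200)      # placeholder labels for four appliance classes
    selected = [0, 3, 7]                  # e.g. the indices returned as the subset S

    X_tr, X_te, y_tr, y_te = train_test_split(X[:, selected], y, test_size=0.3, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    clf = SVC(kernel="rbf").fit(scaler.transform(X_tr), y_tr)   # SVC as a TWSVM stand-in
    print("identification accuracy:", clf.score(scaler.transform(X_te), y_te))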
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (1)

1. A feature selection algorithm based on ReliefF-DDC, characterized by comprising the following steps:
Step one: acquire training set samples and determine the parameter values involved in the algorithm; specifically:
S11, let the training set to be processed be D, with sample X_l = {x_l1, x_l2, …, x_ld}, where x_ld is the d-th dimension feature of the l-th sample in the training set D;
S12, determine the number of iterations m, with m ≥ 1; the feature weight threshold τ, with 0 ≤ τ ≤ 1; the number k of nearest-neighbor samples, an integer with k ≥ 1; and the evaluation criterion threshold, which lies between 0 and 1;
Step two: reset all feature weights in the samples to 0 and initialize F and S as empty sets.
Step three: select a sample X_l from the training set D and update the weights of all feature dimensions it contains; use the correlation between each feature and each class computed by ReliefF to determine the 'important features' and exclude irrelevant features; specifically:
S31, in each of the m iterations, randomly select a sample X_l from the training set D; X_l belongs to class C. Among the samples of the same class as X_l, find its k nearest-neighbor samples H_j, j = 1, 2, …, k; among the samples of the other classes, find its k nearest-neighbor samples M_j;
S32, for 1 ≤ r ≤ d, update the weight W(r) of the r-th dimension feature:
[Formula (1), shown only as an image in the original: the ReliefF weight update for W(r)]
[Formula (2), shown only as an image in the original: the definition of diff]
In formulas (1) and (2), P(C) denotes the prior probability of class C in the data set, and M_j(C) denotes the j-th nearest-neighbor sample of class C; diff(d, A, B) denotes the degree of difference between sample A and sample B on the d-th dimension feature value;
Step four: traverse all values of r from S32; for every feature whose weight W(r) exceeds the threshold τ, add the corresponding feature to the set F in descending order of weight, giving F = {f_1, f_2, …, f_n}, n < d;
Step five: use the DDC algorithm to analyze the correlation between features and the decision variable and remove redundant features; specifically:
S51, when the evaluation criterion e(S) and the candidate feature f_j satisfy the following conditions:
[Formulas (3) and (4), shown only as images in the original: the DDC selection conditions on I(C; f) and Q_C(f, s)]
let F ← F − {f_j}, S ← S + {f_j};
where e(S) is the feature subset evaluation criterion constructed jointly from I(C; f) and Q_C(f, s), defined as follows:
[Formula (5), shown only as an image in the original: the feature subset evaluation criterion e(S)]
In formula (5), the entropy H(C) is a measure of the uncertainty of the random variable C, and the mutual information I(C; f_j) is the information shared between the random variables C and f_j, defined as:
I(C, f_j) = H(C) − H(C|f_j)    (6)
In formulas (4) and (5), f and s denote two features, and the correlation measure Q_C(f, s) is given by
[Formulas (7) and (8), shown only as images in the original: the correlation measure Q_C(f, s)]
S52, judge whether F is an empty set: if F ≠ ∅, return to step four; if F is empty, continue to step six;
Step six: obtain the optimal feature subset S and use the selected features for non-intrusive load identification.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010597594.4A CN111898637B (en) 2020-06-28 2020-06-28 Feature selection algorithm based on Relieff-DDC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010597594.4A CN111898637B (en) 2020-06-28 2020-06-28 Feature selection algorithm based on Relieff-DDC

Publications (2)

Publication Number Publication Date
CN111898637A 2020-11-06
CN111898637B (en) 2022-09-02

Family

ID=73207098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010597594.4A Active CN111898637B (en) 2020-06-28 2020-06-28 Feature selection algorithm based on Relieff-DDC

Country Status (1)

Country Link
CN (1) CN111898637B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786111A (en) * 2021-01-18 2021-05-11 上海理工大学 Characteristic gene selection method based on Relieff and ant colony
CN113837276A (en) * 2021-09-24 2021-12-24 中国电子科技集团公司信息科学研究院 Feature selection method and target identification method based on electromagnetism and infrared
CN114325081A (en) * 2021-12-29 2022-04-12 润建股份有限公司 Non-invasive load identification method based on multi-modal characteristics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250442A (en) * 2016-07-26 2016-12-21 新疆大学 The feature selection approach of a kind of network security data and system
CN108875795A (en) * 2018-05-28 2018-11-23 哈尔滨工程大学 A kind of feature selecting algorithm based on Relief and mutual information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250442A (en) * 2016-07-26 2016-12-21 新疆大学 The feature selection approach of a kind of network security data and system
CN108875795A (en) * 2018-05-28 2018-11-23 哈尔滨工程大学 A kind of feature selecting algorithm based on Relief and mutual information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANGZHI QU et al.: "A New Dependency and Correlation Analysis for Features", IEEE Transactions on Knowledge and Data Engineering *
YANG Zhiwei et al.: "Intrusion feature selection method based on ReliefF", Journal of Jilin University *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786111A (en) * 2021-01-18 2021-05-11 上海理工大学 Characteristic gene selection method based on Relieff and ant colony
CN113837276A (en) * 2021-09-24 2021-12-24 中国电子科技集团公司信息科学研究院 Feature selection method and target identification method based on electromagnetism and infrared
CN114325081A (en) * 2021-12-29 2022-04-12 润建股份有限公司 Non-invasive load identification method based on multi-modal characteristics

Also Published As

Publication number Publication date
CN111898637B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN111898637B (en) Feature selection algorithm based on Relieff-DDC
CN108615177B (en) Electronic terminal personalized recommendation method based on weighting extraction interestingness
JP5322111B2 (en) Similar image search device
CN110852906B (en) Method and system for identifying electricity stealing suspicion based on high-dimensional random matrix
CN115186012A (en) Power consumption data detection method, device, equipment and storage medium
CN115983087A (en) Method for detecting time sequence data abnormity by combining attention mechanism and LSTM and terminal
CN111861667A (en) Vehicle recommendation method and device, electronic equipment and storage medium
Vieira et al. An Enhanced Seasonal-Hybrid ESD technique for robust anomaly detection on time series
CN116720145B (en) Wireless charging remaining time prediction method based on data processing
US20200279148A1 (en) Material structure analysis method and material structure analyzer
CN111027771A (en) Scenic spot passenger flow volume estimation method, system and device and storable medium
CN111144424A (en) Personnel feature detection and analysis method based on clustering algorithm
CN116561569A (en) Industrial power load identification method based on EO feature selection and AdaBoost algorithm
CN109408498A (en) The identification of time series feature and decomposition method based on eigenmatrix decision tree
CN112487991B (en) High-precision load identification method and system based on characteristic self-learning
CN114141316A (en) Method and system for predicting biological toxicity of organic matters based on spectrogram analysis
CN114528909A (en) Unsupervised anomaly detection method based on flow log feature extraction
CN113221995A (en) Data classification method, equipment and device based on semi-supervised deep classification algorithm
CN113535527A (en) Load shedding method and system for real-time flow data predictive analysis
CN113177078A (en) Efficient approximate query processing algorithm based on condition generation model
CN112735532A (en) Metabolite identification system based on molecular fingerprint prediction and application method thereof
CN112258235A (en) Method and system for discovering new service of electric power marketing audit
CN111178180A (en) Hyperspectral image feature selection method and device based on improved ant colony algorithm
JP2007305048A (en) Influencing factor estimation device and influencing factor estimation program
CN115953584B (en) End-to-end target detection method and system with learning sparsity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant