CN111898637A - Feature selection algorithm based on ReliefF-DDC

Feature selection algorithm based on ReliefF-DDC

Info

Publication number
CN111898637A
Authority
CN
China
Prior art keywords
features
feature
sample
equal
algorithm
Prior art date
Legal status
Granted
Application number
CN202010597594.4A
Other languages
Chinese (zh)
Other versions
CN111898637B (en)
Inventor
邵琪
包永强
贾成宇
张旭旭
陆志文
Current Assignee
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date
Filing date
Publication date
Application filed by Nanjing Institute of Technology
Priority to CN202010597594.4A
Publication of CN111898637A
Application granted
Publication of CN111898637B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S20/00Management or operation of end-user stationary applications or the last stages of power distribution; Controlling, monitoring or operating thereof
    • Y04S20/20End-user application control systems
    • Y04S20/242Home appliances

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a feature selection algorithm based on ReliefF-DDC, comprising: S1, obtaining training set samples and determining the parameter values of the algorithm; S2, resetting all feature weights to 0 and initializing F and S as empty sets; S3, selecting samples from the training set, updating the weights of all feature dimensions they contain, using ReliefF to compute the correlation between each feature and each class in order to determine the 'important features', and eliminating irrelevant features; S4, adding the features whose weights exceed the threshold to the set in descending order of weight; S5, using the DDC algorithm to analyze the correlation between features and the decision variable and remove redundant features; and S6, obtaining the optimal feature subset and using the selected features for non-intrusive load identification. The method effectively reduces the feature dimension, improves the load recognition rate and shortens the running time of the algorithm.

Description

Feature selection algorithm based on ReliefF-DDC
Technical Field
The invention relates to a non-intrusive load feature selection algorithm, and in particular to a feature selection algorithm based on ReliefF-DDC.
Background
Non-intrusive Load Monitoring (NILM) provides data support for interaction between the smart grid and its users. A sensor is installed at the service entrance to collect electrical data such as the voltage and current of the total load, which are then analyzed to identify the category and operating state of the household appliances. Compared with intrusive load monitoring (ILM), NILM offers low cost, high user acceptance and convenient maintenance, but it places high demands on the load disaggregation algorithm. Feature extraction and load identification are two key technologies in NILM and provide strong technical support for its development.
Most current research effort is devoted to feature selection and load identification methods for electrical loads, and a series of results have been obtained in related fields. By contrast, studies on the selection of load features remain somewhat lacking. Feature selection chooses, according to some evaluation criterion, an optimal feature subset from the original high-dimensional features for the subsequent task; a small number of representative features can speed up model training and improve the model's generalization ability. Feature selection is widely used in image processing, data mining, machine learning and related fields. When high-dimensional data containing a large number of features is processed, the features inevitably include noise, irrelevant features and redundant features, so it is necessary to extract the most valuable, information-rich features.
Disclosure of Invention
1. The technical problem to be solved is as follows:
To address this problem, the invention provides a feature selection algorithm based on ReliefF-DDC. First, feature weights are computed and sorted in descending order, and features with larger weights are selected so that irrelevant features are removed. Second, the mutual information between each feature and the decision variable is computed, and decision correlation analysis is used to delete redundant features, giving the final feature subset. Finally, a twin support vector machine (TWSVM) is used for load identification. The method effectively reduces the feature dimension, improves the load recognition rate and shortens the running time of the algorithm.
2. The technical scheme is as follows:
A feature selection algorithm based on ReliefF-DDC, characterized by comprising the following steps:
Step one: acquire training set samples and determine the parameter values involved in the algorithm; specifically:
S11, let the training set to be processed be D, with sample X_l = {x_l1, x_l2, …, x_ld}, where x_ld is the d-th dimension feature of the l-th sample in the training set D.
S12, determine the number of iterations m, with m ≥ 1; the feature weight threshold τ, with 0 ≤ τ ≤ 1; the number k of nearest-neighbor samples, an integer with k ≥ 1; and the evaluation criterion threshold, which lies between 0 and 1.
Step two: reset all feature weights in the samples to 0 and initialize F and S as empty sets.
Step three: select a sample X_l from the training set D and update the weights of all feature dimensions it contains; use the correlation between each feature and each class computed by ReliefF to determine the 'important features' and exclude irrelevant features; specifically:
S31, in each of the m iterations, randomly select a sample X_l from the training set D; X_l belongs to class C. Among the samples of the same class as X_l, find its k nearest-neighbor samples H_j, j = 1, 2, …, k; among the samples of the other classes, find its k nearest-neighbor samples M_j.
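A minimal sketch of this neighbor search follows, assuming Euclidean distance over the feature vectors; the function name nearest_hits_misses, the distance choice and the per-class handling of misses are illustrative assumptions, not requirements taken from the patent.

    import numpy as np

    def nearest_hits_misses(X, y, l, k):
        # For sample X[l]: k nearest same-class neighbors (hits H_j) and,
        # for every other class C, k nearest neighbors of that class (misses M_j(C)).
        dists = np.linalg.norm(X - X[l], axis=1)               # Euclidean distance to every sample
        same = np.where((y == y[l]) & (np.arange(len(y)) != l))[0]
        hits = same[np.argsort(dists[same])[:k]]                # H_j, j = 1..k
        misses = {}
        for c in np.unique(y):
            if c == y[l]:
                continue
            other = np.where(y == c)[0]
            misses[c] = other[np.argsort(dists[other])[:k]]     # M_j(C), j = 1..k
        return hits, misses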
S32, for 1 ≤ r ≤ d, update the weight W(r) of the r-th dimension feature:
[Formula (1), shown only as an image in the original: the ReliefF weight update for W(r)]
[Formula (2), shown only as an image in the original: the definition of diff]
In formulas (1) and (2), P(C) denotes the prior probability of class C in the data set, and M_j(C) denotes the j-th nearest-neighbor sample of class C; diff(d, A, B) denotes the degree of difference between sample A and sample B on the d-th dimension feature value.
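Formulas (1) and (2) appear only as images in the original document. For reference, the standard ReliefF update that the surrounding description matches is written out below in LaTeX; this is a reconstruction from the text, not a verbatim copy of the patent's equations.

    W(r) \leftarrow W(r)
      - \sum_{j=1}^{k} \frac{\mathrm{diff}(r, X_l, H_j)}{m k}
      + \sum_{C \neq \mathrm{class}(X_l)} \frac{P(C)}{1 - P(\mathrm{class}(X_l))}
        \sum_{j=1}^{k} \frac{\mathrm{diff}(r, X_l, M_j(C))}{m k}

    \mathrm{diff}(r, A, B) = \frac{\lvert A_r - B_r \rvert}{\max(r) - \min(r)}
    \quad \text{for a numerical feature } r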
Step four: traverse all values of r from S32; for every feature whose weight W(r) exceeds the threshold τ, add the corresponding feature to the set F in descending order of weight, giving F = {f_1, f_2, …, f_n}, n < d.
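Step four amounts to sorting the weights and keeping those above τ. A short sketch, assuming the weight vector W from S32 is held in a NumPy array (the function name and example values are illustrative):

    import numpy as np

    def select_relevant_features(W, tau):
        # Indices of features whose weight exceeds tau, in descending order of weight (the set F).
        order = np.argsort(W)[::-1]                 # all d features, heaviest first
        return [int(r) for r in order if W[r] > tau]

    # Example: W = np.array([0.40, 0.02, 0.31]) and tau = 0.1 give F = [0, 2], so n = 2 < d = 3.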
Step five: use the DDC algorithm to analyze the correlation between features and the decision variable and remove redundant features; specifically:
S51, when the evaluation criterion e(S) and the candidate feature f_j satisfy the following conditions:
[Formulas (3) and (4), shown only as images in the original: the DDC selection conditions on I(C; f) and Q_C(f, s)]
let F ← F − {f_j}, S ← S + {f_j};
where e(S) is the feature subset evaluation criterion constructed jointly from I(C; f) and Q_C(f, s), defined as follows:
[Formula (5), shown only as an image in the original: the feature subset evaluation criterion e(S)]
In formula (5), the entropy H(C) is a measure of the uncertainty of the random variable C, and the mutual information I(C; f_j) is the information shared between the random variables C and f_j, defined as:
I(C, f_j) = H(C) − H(C|f_j)    (6)
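For completeness, the entropies in formula (6) expand as below; these are the standard definitions, stated here for reference rather than quoted from the patent.

    H(C) = -\sum_{c} P(c)\,\log_2 P(c), \qquad
    H(C \mid f_j) = -\sum_{v} P(f_j = v) \sum_{c} P(c \mid f_j = v)\,\log_2 P(c \mid f_j = v)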
In formulas (4) and (5), f and s denote two features, and the correlation measure Q_C(f, s) is given by
[Formulas (7) and (8), shown only as images in the original: the correlation measure Q_C(f, s)]
S52, judge whether F is an empty set: if F ≠ ∅, return to step four; if F is empty, continue to step six.
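Formulas (3)-(5), (7) and (8) appear only as images in the original, so the exact DDC criterion cannot be reproduced here. The sketch below therefore covers only what the text itself defines: the mutual information I(C; f_j) = H(C) − H(C|f_j) computed from discretized features, together with a greedy loop that moves features from F to S subject to a caller-supplied evaluation function standing in for e(S). The function names, the discretization assumption and the loop structure are illustrative, not the patent's definition of DDC.

    import numpy as np

    def entropy(labels):
        # H(C) of a discrete label array, in bits.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def mutual_information(c, f):
        # I(C; f) = H(C) - H(C|f) for discrete arrays c and f (formula (6)).
        h_cond = sum((f == v).mean() * entropy(c[f == v]) for v in np.unique(f))
        return entropy(c) - h_cond

    def greedy_redundancy_filter(Xd, y, F, evaluate, delta):
        # Rank candidates by I(C; f_j); move a feature from F to S only while the
        # stand-in criterion evaluate(S) stays below delta; otherwise treat it as redundant.
        S = []
        for fj in sorted(F, key=lambda j: mutual_information(y, Xd[:, j]), reverse=True):
            if evaluate(S + [fj]) < delta:
                S.append(fj)
        return S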
Step six: obtain the optimal feature subset S and use the selected features for non-intrusive load identification.
3. Advantageous effects:
(1) Feature extraction and load identification are two key technologies in non-intrusive load monitoring (NILM) and provide strong technical support for its development. In a non-intrusive load monitoring system, poor selection of data features leads to low load identification accuracy, so extracting features of the load operating data for load identification plays an important role in improving the accuracy of non-intrusive load identification. Feature selection is widely used in image processing, data mining, machine learning and related fields; when high-dimensional data containing a large number of features is processed, the features inevitably include noise, irrelevant features and redundant features, and it is then necessary to extract the most valuable, information-rich features. The ReliefF algorithm solves the problem that the original Relief algorithm cannot perform feature selection on multi-class data and handles incomplete and noisy data well, but it cannot delete redundant features. The correlation measure proposed by the DDC algorithm fully considers the correlation and dependence between features and the decision variable, and compares the feature subset evaluation measure with a set threshold in order to screen out redundant features.
(2) The invention provides a feature selection algorithm based on ReliefF-DDC. First, feature weights are computed and sorted in descending order, and features with larger weights are selected so that irrelevant features are removed. Second, the mutual information between each feature and the decision variable is computed, and decision correlation analysis is used to delete redundant features, giving the final feature subset. Finally, a twin support vector machine (TWSVM) is used for load identification. The method effectively reduces the feature dimension, improves the load recognition rate and shortens the running time of the algorithm.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in the flow chart of FIG. 1, a feature selection algorithm based on ReliefF-DDC includes the following steps:
Step one: acquire training set samples and determine the parameter values involved in the algorithm; specifically:
S11, let the training set to be processed be D, with sample X_l = {x_l1, x_l2, …, x_ld}, where x_ld is the d-th dimension feature of the l-th sample in the training set D.
S12, determine the number of iterations m, with m ≥ 1; the feature weight threshold τ, with 0 ≤ τ ≤ 1; the number k of nearest-neighbor samples, an integer with k ≥ 1; and the evaluation criterion threshold, which lies between 0 and 1.
Step two: reset all feature weights in the samples to 0 and initialize F and S as empty sets.
Step three: select a sample X_l from the training set D and update the weights of all feature dimensions it contains; use the correlation between each feature and each class computed by ReliefF to determine the 'important features' and exclude irrelevant features; specifically:
S31, in each of the m iterations, randomly select a sample X_l from the training set D; X_l belongs to class C. Among the samples of the same class as X_l, find its k nearest-neighbor samples H_j, j = 1, 2, …, k; among the samples of the other classes, find its k nearest-neighbor samples M_j.
S32, for 1 ≤ r ≤ d, update the weight W(r) of the r-th dimension feature:
[Formula (1), shown only as an image in the original: the ReliefF weight update for W(r)]
[Formula (2), shown only as an image in the original: the definition of diff]
In formulas (1) and (2), P(C) denotes the prior probability of class C in the data set, and M_j(C) denotes the j-th nearest-neighbor sample of class C; diff(d, A, B) denotes the degree of difference between sample A and sample B on the d-th dimension feature value.
Step four: traverse all values of r from S32; for every feature whose weight W(r) exceeds the threshold τ, add the corresponding feature to the set F in descending order of weight, giving F = {f_1, f_2, …, f_n}, n < d.
Step five: use the DDC algorithm to analyze the correlation between features and the decision variable and remove redundant features; specifically:
S51, when the evaluation criterion e(S) and the candidate feature f_j satisfy the following conditions:
[Formulas (3) and (4), shown only as images in the original: the DDC selection conditions on I(C; f) and Q_C(f, s)]
let F ← F − {f_j}, S ← S + {f_j};
where e(S) is the feature subset evaluation criterion constructed jointly from I(C; f) and Q_C(f, s), defined as follows:
[Formula (5), shown only as an image in the original: the feature subset evaluation criterion e(S)]
In formula (5), the entropy H(C) is a measure of the uncertainty of the random variable C, and the mutual information I(C; f_j) is the information shared between the random variables C and f_j, defined as:
I(C, f_j) = H(C) − H(C|f_j)    (6)
In formulas (4) and (5), f and s denote two features, and the correlation measure Q_C(f, s) is given by
[Formulas (7) and (8), shown only as images in the original: the correlation measure Q_C(f, s)]
S52, judge whether F is an empty set: if F ≠ ∅, return to step four; if F is empty, continue to step six.
Step six: obtain the optimal feature subset S and use the selected features for non-intrusive load identification.
Taking non-intrusive load monitoring as the research background, the invention provides a feature selection algorithm based on ReliefF-DDC in order to select load features effectively and improve the accuracy of non-intrusive load identification. The algorithm first extracts the features of each electrical load, uses the ReliefF algorithm to compute, for each feature of a selected sample, the distances to its nearest-neighbor samples of the same class and of different classes to obtain the feature weights, sorts the weights in descending order, and removes irrelevant features according to a set weight threshold. Second, the DDC algorithm analyzes the degree of dependence between features and classes by computing mutual information, and the decision criterion is quantified as a comparison of the feature subset evaluation measure against a set threshold so as to delete redundant features. Finally, a twin support vector machine is used for identification and classification. Experimental results show that the method effectively reduces the feature dimension, improves the load recognition rate and shortens the running time of the algorithm.
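As a rough sketch of the final identification stage, the snippet below feeds the selected feature columns to a classifier. scikit-learn's SVC is used purely as a stand-in because a twin support vector machine (TWSVM) implementation is not assumed to be available, and the data, labels and selected column indices are placeholders for illustration only.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))        # placeholder load-feature matrix (one row per window)
    y = rng.integers(0, 4, size=200)      # placeholder labels for four appliance classes
    selected = [0, 3, 7]                  # e.g. the indices returned as the subset S

    X_tr, X_te, y_tr, y_te = train_test_split(X[:, selected], y, test_size=0.3, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    clf = SVC(kernel="rbf").fit(scaler.transform(X_tr), y_tr)   # SVC as a TWSVM stand-in
    print("identification accuracy:", clf.score(scaler.transform(X_te), y_te))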
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (1)

1. A feature selection algorithm based on ReliefF-DDC, characterized by comprising the following steps:
Step one: acquire training set samples and determine the parameter values involved in the algorithm; specifically:
S11, let the training set to be processed be D, with sample X_l = {x_l1, x_l2, …, x_ld}, where x_ld is the d-th dimension feature of the l-th sample in the training set D;
S12, determine the number of iterations m, with m ≥ 1; the feature weight threshold τ, with 0 ≤ τ ≤ 1; the number k of nearest-neighbor samples, an integer with k ≥ 1; and the evaluation criterion threshold, which lies between 0 and 1;
Step two: reset all feature weights in the samples to 0 and initialize F and S as empty sets.
Step three: select a sample X_l from the training set D and update the weights of all feature dimensions it contains; use the correlation between each feature and each class computed by ReliefF to determine the 'important features' and exclude irrelevant features; specifically:
S31, in each of the m iterations, randomly select a sample X_l from the training set D; X_l belongs to class C. Among the samples of the same class as X_l, find its k nearest-neighbor samples H_j, j = 1, 2, …, k; among the samples of the other classes, find its k nearest-neighbor samples M_j;
S32, for 1 ≤ r ≤ d, update the weight W(r) of the r-th dimension feature:
[Formula (1), shown only as an image in the original: the ReliefF weight update for W(r)]
[Formula (2), shown only as an image in the original: the definition of diff]
In formulas (1) and (2), P(C) denotes the prior probability of class C in the data set, and M_j(C) denotes the j-th nearest-neighbor sample of class C; diff(d, A, B) denotes the degree of difference between sample A and sample B on the d-th dimension feature value;
Step four: traverse all values of r from S32; for every feature whose weight W(r) exceeds the threshold τ, add the corresponding feature to the set F in descending order of weight, giving F = {f_1, f_2, …, f_n}, n < d;
Step five: use the DDC algorithm to analyze the correlation between features and the decision variable and remove redundant features; specifically:
S51, when the evaluation criterion e(S) and the candidate feature f_j satisfy the following conditions:
[Formulas (3) and (4), shown only as images in the original: the DDC selection conditions on I(C; f) and Q_C(f, s)]
let F ← F − {f_j}, S ← S + {f_j};
where e(S) is the feature subset evaluation criterion constructed jointly from I(C; f) and Q_C(f, s), defined as follows:
[Formula (5), shown only as an image in the original: the feature subset evaluation criterion e(S)]
In formula (5), the entropy H(C) is a measure of the uncertainty of the random variable C, and the mutual information I(C; f_j) is the information shared between the random variables C and f_j, defined as:
I(C, f_j) = H(C) − H(C|f_j)    (6)
In formulas (4) and (5), f and s denote two features, and the correlation measure Q_C(f, s) is given by
[Formulas (7) and (8), shown only as images in the original: the correlation measure Q_C(f, s)]
S52, judge whether F is an empty set: if F ≠ ∅, return to step four; if F is empty, continue to step six;
Step six: obtain the optimal feature subset S and use the selected features for non-intrusive load identification.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010597594.4A CN111898637B (en) 2020-06-28 2020-06-28 Feature selection algorithm based on Relieff-DDC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010597594.4A CN111898637B (en) 2020-06-28 2020-06-28 Feature selection algorithm based on Relieff-DDC

Publications (2)

Publication Number Publication Date
CN111898637A 2020-11-06
CN111898637B (en) 2022-09-02

Family

ID=73207098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010597594.4A Active CN111898637B (en) 2020-06-28 2020-06-28 Feature selection algorithm based on Relieff-DDC

Country Status (1)

Country Link
CN (1) CN111898637B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786111A (en) * 2021-01-18 2021-05-11 上海理工大学 Characteristic gene selection method based on Relieff and ant colony
CN113837276A (en) * 2021-09-24 2021-12-24 中国电子科技集团公司信息科学研究院 Feature selection method and target identification method based on electromagnetism and infrared
CN114325081A (en) * 2021-12-29 2022-04-12 润建股份有限公司 Non-invasive load identification method based on multi-modal characteristics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250442A (en) * 2016-07-26 2016-12-21 新疆大学 The feature selection approach of a kind of network security data and system
CN108875795A (en) * 2018-05-28 2018-11-23 哈尔滨工程大学 A kind of feature selecting algorithm based on Relief and mutual information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250442A (en) * 2016-07-26 2016-12-21 新疆大学 The feature selection approach of a kind of network security data and system
CN108875795A (en) * 2018-05-28 2018-11-23 哈尔滨工程大学 A kind of feature selecting algorithm based on Relief and mutual information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANGZHI QU et al.: "A New Dependency and Correlation Analysis for Features", IEEE Transactions on Knowledge and Data Engineering *
YANG Zhiwei et al.: "Intrusion feature selection method based on ReliefF", Journal of Jilin University *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786111A (en) * 2021-01-18 2021-05-11 上海理工大学 Characteristic gene selection method based on Relieff and ant colony
CN113837276A (en) * 2021-09-24 2021-12-24 中国电子科技集团公司信息科学研究院 Feature selection method and target identification method based on electromagnetism and infrared
CN114325081A (en) * 2021-12-29 2022-04-12 润建股份有限公司 Non-invasive load identification method based on multi-modal characteristics

Also Published As

Publication number Publication date
CN111898637B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN111898637B (en) Feature selection algorithm based on Relieff-DDC
CN108615177B (en) Electronic terminal personalized recommendation method based on weighting extraction interestingness
JP5322111B2 (en) Similar image search device
CN110852906B (en) Method and system for identifying electricity stealing suspicion based on high-dimensional random matrix
CN115186012A (en) Power consumption data detection method, device, equipment and storage medium
CN115983087A (en) Method for detecting time sequence data abnormity by combining attention mechanism and LSTM and terminal
CN111861667A (en) Vehicle recommendation method and device, electronic equipment and storage medium
Vieira et al. An Enhanced Seasonal-Hybrid ESD technique for robust anomaly detection on time series
CN116720145B (en) Wireless charging remaining time prediction method based on data processing
US20200279148A1 (en) Material structure analysis method and material structure analyzer
CN111027771A (en) Scenic spot passenger flow volume estimation method, system and device and storable medium
CN111144424A (en) Personnel feature detection and analysis method based on clustering algorithm
CN116561569A (en) Industrial power load identification method based on EO feature selection and AdaBoost algorithm
CN109408498A (en) The identification of time series feature and decomposition method based on eigenmatrix decision tree
CN112487991B (en) High-precision load identification method and system based on characteristic self-learning
CN114141316A (en) Method and system for predicting biological toxicity of organic matters based on spectrogram analysis
CN114528909A (en) Unsupervised anomaly detection method based on flow log feature extraction
CN113221995A (en) Data classification method, equipment and device based on semi-supervised deep classification algorithm
CN113535527A (en) Load shedding method and system for real-time flow data predictive analysis
CN113177078A (en) Efficient approximate query processing algorithm based on condition generation model
CN112735532A (en) Metabolite identification system based on molecular fingerprint prediction and application method thereof
CN112258235A (en) Method and system for discovering new service of electric power marketing audit
CN111178180A (en) Hyperspectral image feature selection method and device based on improved ant colony algorithm
JP2007305048A (en) Influencing factor estimation device and influencing factor estimation program
CN115953584B (en) End-to-end target detection method and system with learning sparsity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant