CN106897413A - Hybrid feature selection method based on harmony search - Google Patents

Hybrid feature selection method based on harmony search

Info

Publication number
CN106897413A
Authority
CN
China
Prior art keywords
harmony
feature
samples
subset
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710090165.6A
Other languages
Chinese (zh)
Inventor
徐光侠
张钰柔
刘榕
刘俊
解绍词
代皓
唐志京
郑爽
蒋鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201710090165.6A
Publication of CN106897413A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hybrid feature selection method based on harmony search, which combines the advantages of filters with harmony search to form a hybrid system. The system can select an optimal feature subset from a large set of original user features, and this feature subset is then used for the classification and detection of spam users in social networks. At present, the performance of a classifier depends on the selected feature subset, and the feature selection problem can be regarded as an optimization problem whose goal is to select an optimal or near-optimal feature subset. The proposed method uses the computational simplicity and speed of filtering methods together with the harmony algorithm to select an optimal feature subset, and overcomes the drawbacks that filter methods ignore dependencies between features and that wrappers are computationally expensive.

Description

Hybrid feature selection method based on harmony search
Technical Field
The invention relates to the field of social network data mining and security, and in particular to a hybrid feature selection method based on harmony search.
Background
With the continuous development of Internet technology, social networks are driving new growth in the Internet industry, and online social networks have become one of the essential modes of communication in modern life. Platforms such as Twitter and Facebook, and Sina Weibo and Tencent Weibo in China, have seen their user bases grow rapidly. Owing to the operational mode of social networks, users generate and obtain a great deal of information through them every day. In principle, the features available in a social network are nearly endless; not all user features are critical, and only a small fraction are decisive factors, so solving the user feature selection problem is the key to accurately mining knowledge from social network data. Feature selection has applications in many areas of data mining, machine learning, and pattern recognition; its main objective is to find a minimal subset of features from a problem domain that maintains reasonably high accuracy and can represent the original data. In real-world problems, feature selection usually discards noisy, irrelevant, or misleading features; by eliminating these features, the accuracy and efficiency of classification problems (e.g., text and Web content classification) can be greatly improved.
Currently, feature selection methods can be broadly divided into two categories: filters and wrappers. On the one hand, filter-based methods are applied directly to a data set, typically considering only the intrinsic properties of the data and assigning each feature a relevance score; the high-scoring features are then used as input to the classification algorithm. The main drawback of this approach is that dependencies between features are ignored, which leads to redundancy among the selected features. Wrapper-based approaches, on the other hand, use a learning algorithm to evaluate feature subsets, with the learning algorithm's performance metric guiding the feature subset search. This method takes dependencies between features into account, but is computationally intensive and therefore has a higher computational cost. Given the shortcomings of both approaches, neither method alone achieves good results on the huge data sets of social network user features.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a hybrid feature selection method based on harmony search. The method combines filters and harmony search to form a hybrid system for feature selection. Harmony search is a meta-heuristic algorithm that mimics the improvisation process of musicians and has low complexity. It has been successfully applied to various optimization problems and, compared with traditional optimization methods, has the advantages of simple computation and easy understanding. Incorporating filters greatly reduces the search space of the harmony search.
In order to achieve this purpose, the invention adopts the following technical scheme. A hybrid feature selection method based on harmony search comprises the following steps:
S1: Normalize and discretize the original feature set data of the social network users.
S2: Input the data processed in step S1 into several filters, each of which produces its own feature subset; then select a smaller number of superior feature subsets from these subsets by a majority voting algorithm.
S3: Initialize the harmony memory (Harmony Memory), the number of musicians (Musicians), the maximum number of iterations, the harmony memory considering rate (Harmony Memory Considering Rate), and the pitch adjusting rate (Pitch Adjusting Rate); store the superior feature subsets obtained in S2 in the harmony memory as random harmonies.
S4: Each musician randomly selects a note from the original feature set to form a new harmony; the quality of a harmony is judged by its dependency degree (Dependency Degree). If the new harmony is better than the worst harmony in the harmony memory, the new harmony is stored in the harmony memory and the worst harmony is removed; otherwise the new harmony is discarded.
S5: Iterate step S4 until the maximum number of iterations is reached, and output the harmony obtained at that point as the optimal harmony (a minimal sketch of this loop is given below).
The normalization in step S1 gives each feature approximately the same scale, with each feature value falling within [0,1]. Specifically:
f′ = (f − min_f) / (max_f − min_f)
where min_f and max_f respectively denote the minimum and maximum values of the feature, f is the original feature value, and f′ denotes the normalized feature value, whose range falls within [0,1].
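As an illustration only, this min-max normalization could be implemented as follows; the per-column layout of the data is an assumption of the sketch.

def min_max_normalize(column):
    # f' = (f - min_f) / (max_f - min_f), mapping raw values into [0, 1].
    lo, hi = min(column), max(column)
    if hi == lo:              # constant feature: map every value to 0
        return [0.0 for _ in column]
    return [(f - lo) / (hi - lo) for f in column]

For example, min_max_normalize([3, 5, 9]) returns [0.0, 0.333…, 1.0].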
The filters in step S2 comprise three filters: information gain, the Relief algorithm, and chi-square statistics.
Information Gain (IG) is a ranking-based feature selection method: features with higher information gain are ranked higher. It is calculated as follows.
The information gain of feature A can be expressed as:
Gain(A) = H(S) − H(S|A)
where H(S) is the entropy of classifying a tuple into S and H(S|A) is the entropy of classifying a tuple into S given feature A; S represents the classes in the classification system, of which there are m in total: C_1, C_2, C_3, …, C_m.
H(S) is calculated by the following formula:
H(S) = −Σ_{x=1}^{m} p(C_x) log2 p(C_x)
where p(C_x) is the probability that class C_x occurs and m is the number of classes in the classification system.
The formula for H(S|A) is:
H(S|A) = P(a)H(S|a) + P(a′)H(S|a′)
where P(a) denotes the probability that feature A occurs and P(a′) the probability that it does not.
H(S|a), the conditional entropy of classification into S given feature value a, is:
H(S|a) = −Σ_{x=1}^{m} p(C_x|a) log2 p(C_x|a)
H(S|a′), the conditional entropy of classification into S without feature value a, is:
H(S|a′) = −Σ_{x=1}^{m} p(C_x|a′) log2 p(C_x|a′)
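A minimal Python sketch of the information-gain computation above, assuming a binary feature (present/absent in each sample) and discrete class labels; the function names are illustrative.

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum_x p(C_x) * log2 p(C_x) over the class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    # Gain(A) = H(S) - H(S|A), with H(S|A) weighted over the samples
    # in which feature A does and does not occur.
    n = len(labels)
    with_a = [y for has, y in zip(feature_present, labels) if has]
    without_a = [y for has, y in zip(feature_present, labels) if not has]
    h_cond = (len(with_a) / n) * entropy(with_a) \
           + (len(without_a) / n) * entropy(without_a)
    return entropy(labels) - h_cond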
the Relief algorithm is a feature weight algorithm, different weights are given to the features according to the relevance of each feature and each category, and the features with the weights smaller than a threshold value are removed; the correlation of the features and the categories is determined from the distance between samples. The weight of feature a can be calculated by the following equation:
diff (A, R, H) represents the difference between the characteristic values A of the samples R and H, Diff (A, R, M) represents the difference between the characteristic values A of the samples R and M, H and M are respectively the nearest neighbor samples in the same type of samples and the nearest neighbor samples in different types of samples of the samples R, and M is the sampling times.
The larger the weight of a feature is, the stronger the classification capability of the feature is, and conversely, the weaker the classification capability of the feature is.
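A sketch of the Relief update under stated assumptions: feature values already normalized into [0, 1] (so diff is a plain absolute difference), nearest hits and misses found by brute-force Manhattan distance, and each class containing at least two samples; the distance measure is an assumption, as the text does not fix one.

import random

def relief_weights(X, y, m=100):
    # W(A) = W(A) - (diff(A,R,H) - diff(A,R,M)) / m over m sampling rounds.
    n_feat = len(X[0])
    w = [0.0] * n_feat

    def dist(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(m):
        r = random.randrange(len(X))
        hits = [i for i in range(len(X)) if y[i] == y[r] and i != r]
        misses = [i for i in range(len(X)) if y[i] != y[r]]
        h = min(hits, key=lambda i: dist(X[i], X[r]))     # nearest hit H
        mi = min(misses, key=lambda i: dist(X[i], X[r]))  # nearest miss M
        for a in range(n_feat):
            w[a] -= (abs(X[r][a] - X[h][a]) - abs(X[r][a] - X[mi][a])) / m
    return w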
Chi-square statistics is a statistical feature evaluation method: each feature is evaluated by its chi-square value, calculated as follows:
χ² = Σ_{i=1}^{n} Σ_{j=1}^{c} (S_ij − F_ij)² / F_ij
where n is the number of intervals, c is the number of classes, S_ij denotes the number of samples in the i-th interval belonging to the j-th class, and F_ij is the expected frequency of S_ij, calculated as:
F_ij = K_i · C_j / N
where C_j denotes the number of samples in the j-th class, N is the total number of samples, and K_i denotes the number of samples in the i-th interval.
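The chi-square computation admits a short sketch; counts is assumed to be the interval-by-class contingency table holding the S_ij values.

def chi_square(counts):
    # chi2 = sum_ij (S_ij - F_ij)^2 / F_ij with F_ij = K_i * C_j / N.
    k = [sum(row) for row in counts]          # K_i: samples per interval
    c = [sum(col) for col in zip(*counts)]    # C_j: samples per class
    n = sum(k)                                # N: total number of samples
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, s_ij in enumerate(row):
            f_ij = k[i] * c[j] / n            # expected frequency
            if f_ij > 0:
                chi2 += (s_ij - f_ij) ** 2 / f_ij
    return chi2

For example, chi_square([[10, 2], [3, 9]]) scores a feature whose two discretized intervals separate the two classes well.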
The dependency degree of a harmony described in step S4 is calculated by the following formula:
k = γ_P(Q) = |POS_P(Q)| / |U|
where
POS_P(Q) = ∪_{X ∈ U/Q} P̲X
U is a non-empty finite set of objects and X is a subset of U; P and Q are subsets of A, a non-empty finite set of features; P̲X is the lower approximation of X, i.e., the set of objects that can be classified into X with certainty; POS_P(Q) is the largest set of objects that, according to the indiscernibility relation determined by P, certainly belong to a decision class of Q; and γ_P(Q) expresses the degree to which Q depends on P.
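A sketch of the rough-set dependency degree, assuming the data have already been discretized in step S1 so that indiscernibility classes can be formed by exact value matching; the function name dependency_degree is illustrative.

from collections import defaultdict

def dependency_degree(table, p_cols, q_cols):
    # gamma_P(Q) = |POS_P(Q)| / |U| for a decision table whose rows are
    # the universe U; p_cols index the condition features P, q_cols the
    # decision features Q.
    p_classes = defaultdict(list)
    for idx, row in enumerate(table):
        p_classes[tuple(row[i] for i in p_cols)].append(idx)

    # An object lies in POS_P(Q) iff its whole P-indiscernibility class
    # agrees on Q, i.e. the class sits inside a single block of U/Q.
    pos = 0
    for members in p_classes.values():
        q_values = {tuple(table[i][j] for j in q_cols) for i in members}
        if len(q_values) == 1:
            pos += len(members)
    return pos / len(table)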
The invention combines filters with harmony search to form a hybrid system for feature selection. Because the computational complexity of harmony search is low, and incorporating the filters greatly reduces its search space, the method overcomes, to a certain extent, the drawbacks that filtering methods ignore dependencies between features and that wrappers incur a high computational cost.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram of the overall flow structure of the present invention;
FIG. 2 is a schematic representation of the feature ordering of the information gain of the present invention;
FIG. 3 is a flow chart of the Relief algorithm of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements throughout. The embodiments described below with reference to the accompanying drawings are illustrative, intended only to explain the present invention, and are not to be construed as limiting it.
FIG. 1 is a schematic diagram of the overall flow of the invention. As shown, the invention provides a hybrid feature selection method based on harmony search. First, the original feature set of the social network users is preprocessed. Then a feature subset is selected by the three filters (information gain, the Relief algorithm, and chi-square statistics) combined with a majority voting algorithm, and the harmony memory is initialized with these feature subsets. Finally, the optimal feature subset is selected using harmony search and the rough-set dependency function. The specific steps are as follows:
s1: and carrying out normalization and discretization processing on all the characteristics of the data set.
S2: the processed data is used as input to several filters, each of which results in an optimal subset of features. And selecting a smaller and better feature subset from the obtained subsets through a majority voting algorithm.
S3: initialization and acoustic Memory (Harmony Memory), wherein the feature subsets obtained in S2 are stored as random subsets.
S4: the number of initial Musicians, the maximum number of iterations, and the acoustic memory value probability (Harmony memory consensus Rate) and the fine adjustment probability (Pitch adjustment Rate).
S5: a new note is randomly selected by each musician from their note domain and then improvised together to form a new harmony. A new harmony is a joint vote by all musicians. Notes within the entire set of original features are provided to each musician and allow multiple musicians to select the same feature or to choose not to select any feature. If the new harmony formed is better than the worst harmony in the harmony memorization (Dependency Degree), the new harmony is contained in the harmony memorization, and the existing worst harmony is removed, as determined by the harmony Degree.
S6: the algorithm is iterated continuously until the maximum iteration number is reached, and finally the optimal feature subset is obtained.
FIG. 2 is a schematic diagram of the feature ranking by information gain in the present invention. As shown in the figure, the probability of each class is computed first, and then the entropy of the whole classification:
H(S) = −Σ_{x=1}^{m} p(C_x) log2 p(C_x)
where S represents the classes in the classification system, of which there are m in total: C_1, C_2, C_3, …, C_m; p(C_x) is the probability that class C_x occurs, and m is the number of classes in the classification system.
The conditional entropy given the feature is then calculated:
H(S|A) = P(a)H(S|a) + P(a′)H(S|a′)
where P(a) denotes the probability that feature A occurs and P(a′) the probability that it does not.
H(S|a), the conditional entropy of classification into S given feature value a, is:
H(S|a) = −Σ_{x=1}^{m} p(C_x|a) log2 p(C_x|a)
H(S|a′), the conditional entropy of classification into S without feature value a, is:
H(S|a′) = −Σ_{x=1}^{m} p(C_x|a′) log2 p(C_x|a′)
The information gain is then computed as Gain(A) = H(S) − H(S|A), and finally the features are sorted by the magnitude of their information gain.
FIG. 3 is a flow chart of the Relief algorithm of the present invention. As shown in the figure, the weights of all features are initialized to 0 and the number of sampling rounds is m. In each round, a sample R is drawn at random from the training set D (each sample has N features); its nearest-neighbor sample H is found among the samples of the same class, and its nearest-neighbor sample M among the samples of a different class. The weight of each feature is updated as follows:
W(A) = W(A) − (diff(A, R, H) − diff(A, R, M)) / m
where diff(A, R, H) denotes the difference between the values of feature A for samples R and H, diff(A, R, M) denotes the difference between the values of feature A for samples R and M, H and M are respectively the nearest-neighbor sample of R in the same class and in a different class, and m is the number of sampling rounds.
If the distance between R and H on a feature is smaller than the distance between R and M, the weight of that feature is increased; if the distance between R and H is larger than that between R and M, the weight is decreased. Finally, the features are sorted by the resulting weights.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (7)

1. A hybrid feature selection method based on harmony search, comprising the following steps:
S1: normalizing and discretizing the original feature set data of the social network users;
S2: inputting the data processed in step S1 into a plurality of filters, each filter producing its own feature subset; and selecting superior feature subsets from the plurality of feature subsets by a majority voting algorithm;
S3: initializing the harmony memory, the number of musicians, the maximum number of iterations, the harmony memory considering rate, and the pitch adjusting rate; and storing the superior feature subsets obtained in S2 in the harmony memory as random harmonies;
S4: each musician randomly selecting a note from the original feature set to form a new harmony; judging the harmony quality by the dependency degree of the harmony; if the new harmony is better than the worst harmony in the harmony memory, storing the new harmony in the harmony memory and removing the worst harmony; otherwise, discarding the new harmony;
S5: iterating step S4 until the maximum number of iterations is reached, and outputting the harmony at that point as the optimal harmony.
2. The hybrid feature selection method based on harmony search according to claim 1, wherein the normalization in step S1 gives each feature approximately the same scale, with each feature value falling within [0,1], specifically by:
f′ = (f − min_f) / (max_f − min_f)
where min_f and max_f respectively denote the minimum and maximum values of the feature, f is the original feature value, and f′ denotes the normalized feature value, whose range falls within [0,1].
3. The hybrid feature selection method based on harmony search according to claim 1, wherein the filters of step S2 comprise information gain, the Relief algorithm, and chi-square statistics.
4. The hybrid feature selection method based on harmony search according to claim 3, wherein the information gain is calculated as follows:
the information gain of feature A is expressed as:
Gain(A) = H(S) − H(S|A)
where H(S) is the entropy of classifying a tuple into S and H(S|A) is the entropy of classifying a tuple into S given feature A; S represents the classes in the classification system, of which there are m in total: C_1, C_2, C_3, …, C_m;
H(S) is calculated by the following formula:
H(S) = −Σ_{x=1}^{m} p(C_x) log2 p(C_x)
where p(C_x) is the probability that class C_x occurs and m is the number of classes in the classification system;
the formula for H(S|A) is:
H(S|A) = P(a)H(S|a) + P(a′)H(S|a′)
where P(a) denotes the probability that feature A occurs and P(a′) the probability that it does not;
H(S|a), the conditional entropy of classification into S given feature value a, is:
H(S|a) = −Σ_{x=1}^{m} p(C_x|a) log2 p(C_x|a)
H(S|a′), the conditional entropy of classification into S without feature value a, is:
H(S|a′) = −Σ_{x=1}^{m} p(C_x|a′) log2 p(C_x|a′)
5. The hybrid feature selection method based on harmony search according to claim 3, wherein the Relief algorithm assigns different weights to the features according to the relevance of each feature to the classes, and features whose weight is below a threshold are removed; the relevance of a feature to the classes is determined from distances between samples; the weight of feature A is updated by the following formula:
W(A) = W(A) − (diff(A, R, H) − diff(A, R, M)) / m
where diff(A, R, H) denotes the difference between the values of feature A for samples R and H, diff(A, R, M) denotes the difference between the values of feature A for samples R and M, H and M are respectively the nearest-neighbor sample of R within the same class and within a different class, and m is the number of sampling rounds.
6. The hybrid feature selection method based on harmony search according to claim 3, wherein the chi-square value in the chi-square statistics is calculated by:
χ² = Σ_{i=1}^{n} Σ_{j=1}^{c} (S_ij − F_ij)² / F_ij
where n is the number of intervals, c is the number of classes, S_ij denotes the number of samples in the i-th interval belonging to the j-th class, and F_ij is the expected frequency of S_ij, calculated as:
F_ij = K_i · C_j / N
where C_j denotes the number of samples in the j-th class, N is the total number of samples, and K_i denotes the number of samples in the i-th interval.
7. The hybrid feature selection method based on harmony search according to any one of claims 1 to 6, wherein the dependency degree of a harmony in step S4 is calculated by the following formula:
k = γ_P(Q) = |POS_P(Q)| / |U|
where
POS_P(Q) = ∪_{X ∈ U/Q} P̲X
U is a non-empty finite set of objects and X is a subset of U; P and Q are subsets of A, a non-empty finite set of features; P̲X is the lower approximation of X, i.e., the set of objects that can be classified into X with certainty; POS_P(Q) is the largest set of objects that, according to the indiscernibility relation determined by P, certainly belong to a decision class of Q; and γ_P(Q) expresses the degree to which Q depends on P.
CN201710090165.6A 2017-02-20 2017-02-20 Hybrid feature selection method based on harmony search Pending CN106897413A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710090165.6A CN106897413A (en) 2017-02-20 2017-02-20 Hybrid feature selection method based on harmony search


Publications (1)

Publication Number Publication Date
CN106897413A true CN106897413A (en) 2017-06-27

Family

ID=59185640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710090165.6A Pending CN106897413A (en) Hybrid feature selection method based on harmony search

Country Status (1)

Country Link
CN (1) CN106897413A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930723A (en) * 2016-04-20 2016-09-07 福州大学 Intrusion detection method based on feature selection
CN106250442A (en) * 2016-07-26 2016-12-21 新疆大学 The feature selection approach of a kind of network security data and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
H. Hannah Inbarani, M. Bagyamathi, Ahmad Taher Azar: "A novel hybrid feature selection method based on rough set and improved harmony search", Neural Computing and Applications *
魏峻: "Feature gene selection method based on an improved harmony search algorithm", 《河南科学》 (Henan Science) *

Similar Documents

Publication Publication Date Title
CN106570178B (en) High-dimensional text data feature selection method based on graph clustering
CN105512311B (en) A kind of adaptive features select method based on chi-square statistics
CN108363810B (en) Text classification method and device
CN108898479B (en) Credit evaluation model construction method and device
CN107807987A (en) A kind of string sort method, system and a kind of string sort equipment
Hristakieva et al. The spread of propaganda by coordinated communities on social media
CN110135167B (en) Edge computing terminal security level evaluation method for random forest
CN112001788B (en) Credit card illegal fraud identification method based on RF-DBSCAN algorithm
CN109271517B (en) IG TF-IDF text feature vector generation and text classification method
US10387805B2 (en) System and method for ranking news feeds
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN104391835A (en) Method and device for selecting feature words in texts
CN107180093A (en) Information search method and device and ageing inquiry word recognition method and device
CN110689368B (en) Method for designing advertisement click rate prediction system in mobile application
CN105138653A (en) Exercise recommendation method and device based on typical degree and difficulty
CN109902823B (en) Model training method and device based on generation countermeasure network
CN112437053B (en) Intrusion detection method and device
Oktarina et al. Comparison of k-means clustering method and k-medoids on twitter data
CN106934410A (en) The sorting technique and system of data
CN102243641A (en) Method for efficiently clustering massive data
CN113568368B (en) Self-adaptive determination method for industrial control data characteristic reordering algorithm
CN110019563B (en) Portrait modeling method and device based on multi-dimensional data
Torres-Tramón et al. Topic detection in Twitter using topology data analysis
KR101064256B1 (en) Apparatus and Method for Selecting Optimal Database by Using The Maximal Concept Strength Recognition Techniques
Obadinma et al. Class-wise Calibration: A Case Study on COVID-19 Hate Speech.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170627