CN106897413A - Hybrid feature selection method based on harmony search - Google Patents

Hybrid feature selection method based on harmony search

Info

Publication number
CN106897413A
Authority
CN
China
Prior art keywords
harmony
feature
samples
subset
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710090165.6A
Other languages
Chinese (zh)
Inventor
徐光侠
张钰柔
刘榕
刘俊
解绍词
代皓
唐志京
郑爽
蒋鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201710090165.6A
Publication of CN106897413A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hybrid feature selection method based on harmony search, which combines the advantages of filters with harmony search to form a hybrid system. The system can select an optimal feature subset from a large set of original user features, and this feature subset is then used for the classification and detection of spam users in social networks. At present, the performance of a classifier depends on the selected feature subset, and the feature selection problem can be regarded as an optimization problem whose goal is to select an optimal or near-optimal feature subset. The proposed method uses the computational simplicity and speed of filtering methods together with the harmony algorithm to select an optimal feature subset, and overcomes the drawbacks that filter methods ignore dependencies between features and that wrappers are computationally expensive.

Description

Hybrid feature selection method based on harmony search
Technical Field
The invention relates to the field of social network data mining and security, and in particular to a hybrid feature selection method based on harmony search.
Background
With the continuous development of Internet technology, social networks are driving new growth in the Internet industry, and online social networks have become one of the essential modes of communication in modern life. Platforms such as Twitter and Facebook, and Sina Weibo and Tencent Weibo in China, have seen their user bases grow rapidly. Owing to the operational mode of social networks, users generate and obtain a great deal of information through them every day. In principle, the features available in a social network are nearly endless; not all user features are critical, and only a small fraction are decisive factors, so solving the user feature selection problem is the key to accurately mining knowledge from social network data. Feature selection has applications in many areas of data mining, machine learning, and pattern recognition; its main objective is to find a minimal subset of features from a problem domain that maintains reasonably high accuracy and can represent the original data. In real-world problems, feature selection usually discards noisy, irrelevant, or misleading features; by eliminating these features, the accuracy and efficiency of classification problems (e.g., text and Web content classification) can be greatly improved.
Currently, feature selection methods can be broadly divided into two categories: filters and wrappers. On the one hand, filter-based methods are applied directly to a data set, typically considering only the intrinsic properties of the data and assigning each feature a relevance score; the high-scoring features are then used as input to the classification algorithm. The main drawback of this approach is that dependencies between features are ignored, which leads to redundancy among the selected features. Wrapper-based approaches, on the other hand, use a learning algorithm to evaluate feature subsets, with the learning algorithm's performance metric guiding the feature subset search. This method takes dependencies between features into account, but is computationally intensive and therefore has a higher computational cost. Given the shortcomings of both approaches, neither method alone achieves good results on the huge data sets of social network user features.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a hybrid feature selection method based on harmony search. The method combines filters and harmony search to form a hybrid system for feature selection. Harmony search is a meta-heuristic algorithm that mimics the improvisation process of musicians and has low complexity. It has been successfully applied to various optimization problems and, compared with traditional optimization methods, has the advantages of simple computation and easy understanding. Incorporating filters greatly reduces the search space of the harmony search.
In order to achieve this purpose, the invention adopts the following technical scheme. A hybrid feature selection method based on harmony search comprises the following steps:
S1: Normalize and discretize the original feature set data of the social network users.
S2: Input the data processed in step S1 into several filters, each of which produces its own feature subset; then select a smaller number of superior feature subsets from these subsets by a majority voting algorithm.
S3: Initialize the harmony memory (Harmony Memory), the number of musicians (Musicians), the maximum number of iterations, the harmony memory considering rate (Harmony Memory Considering Rate), and the pitch adjusting rate (Pitch Adjusting Rate); store the superior feature subsets obtained in S2 in the harmony memory as random harmonies.
S4: Each musician randomly selects a note from the original feature set to form a new harmony; the quality of a harmony is judged by its dependency degree (Dependency Degree). If the new harmony is better than the worst harmony in the harmony memory, the new harmony is stored in the harmony memory and the worst harmony is removed; otherwise the new harmony is discarded.
S5: Iterate step S4 until the maximum number of iterations is reached, and output the harmony obtained at that point as the optimal harmony (a minimal sketch of this loop is given below).
The normalization in step S1 gives each feature approximately the same scale, with each feature value falling within [0,1]. Specifically:
f′ = (f − min_f) / (max_f − min_f)
where min_f and max_f respectively denote the minimum and maximum values of the feature, f is the original feature value, and f′ denotes the normalized feature value, whose range falls within [0,1].
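As an illustration only, this min-max normalization could be implemented as follows; the per-column layout of the data is an assumption of the sketch.

def min_max_normalize(column):
    # f' = (f - min_f) / (max_f - min_f), mapping raw values into [0, 1].
    lo, hi = min(column), max(column)
    if hi == lo:              # constant feature: map every value to 0
        return [0.0 for _ in column]
    return [(f - lo) / (hi - lo) for f in column]

For example, min_max_normalize([3, 5, 9]) returns [0.0, 0.333…, 1.0].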
The filters in step S2 comprise three filters: information gain, the Relief algorithm, and chi-square statistics.
Information Gain (IG) is a ranking-based feature selection method: features with higher information gain are ranked higher. It is calculated as follows.
The information gain of feature A can be expressed as:
Gain(A) = H(S) − H(S|A)
where H(S) is the entropy of classifying a tuple into S and H(S|A) is the entropy of classifying a tuple into S given feature A; S represents the classes in the classification system, of which there are m in total: C_1, C_2, C_3, …, C_m.
H(S) is calculated by the following formula:
H(S) = −Σ_{x=1}^{m} p(C_x) log2 p(C_x)
where p(C_x) is the probability that class C_x occurs and m is the number of classes in the classification system.
The formula for H(S|A) is:
H(S|A) = P(a)H(S|a) + P(a′)H(S|a′)
where P(a) denotes the probability that feature A occurs and P(a′) the probability that it does not.
H(S|a), the conditional entropy of classification into S given feature value a, is:
H(S|a) = −Σ_{x=1}^{m} p(C_x|a) log2 p(C_x|a)
H(S|a′), the conditional entropy of classification into S without feature value a, is:
H(S|a′) = −Σ_{x=1}^{m} p(C_x|a′) log2 p(C_x|a′)
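A minimal Python sketch of the information-gain computation above, assuming a binary feature (present/absent in each sample) and discrete class labels; the function names are illustrative.

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum_x p(C_x) * log2 p(C_x) over the class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    # Gain(A) = H(S) - H(S|A), with H(S|A) weighted over the samples
    # in which feature A does and does not occur.
    n = len(labels)
    with_a = [y for has, y in zip(feature_present, labels) if has]
    without_a = [y for has, y in zip(feature_present, labels) if not has]
    h_cond = (len(with_a) / n) * entropy(with_a) \
           + (len(without_a) / n) * entropy(without_a)
    return entropy(labels) - h_cond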
the Relief algorithm is a feature weight algorithm, different weights are given to the features according to the relevance of each feature and each category, and the features with the weights smaller than a threshold value are removed; the correlation of the features and the categories is determined from the distance between samples. The weight of feature a can be calculated by the following equation:
diff (A, R, H) represents the difference between the characteristic values A of the samples R and H, Diff (A, R, M) represents the difference between the characteristic values A of the samples R and M, H and M are respectively the nearest neighbor samples in the same type of samples and the nearest neighbor samples in different types of samples of the samples R, and M is the sampling times.
The larger the weight of a feature is, the stronger the classification capability of the feature is, and conversely, the weaker the classification capability of the feature is.
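A sketch of the Relief update under stated assumptions: feature values already normalized into [0, 1] (so diff is a plain absolute difference), nearest hits and misses found by brute-force Manhattan distance, and each class containing at least two samples; the distance measure is an assumption, as the text does not fix one.

import random

def relief_weights(X, y, m=100):
    # W(A) = W(A) - (diff(A,R,H) - diff(A,R,M)) / m over m sampling rounds.
    n_feat = len(X[0])
    w = [0.0] * n_feat

    def dist(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(m):
        r = random.randrange(len(X))
        hits = [i for i in range(len(X)) if y[i] == y[r] and i != r]
        misses = [i for i in range(len(X)) if y[i] != y[r]]
        h = min(hits, key=lambda i: dist(X[i], X[r]))     # nearest hit H
        mi = min(misses, key=lambda i: dist(X[i], X[r]))  # nearest miss M
        for a in range(n_feat):
            w[a] -= (abs(X[r][a] - X[h][a]) - abs(X[r][a] - X[mi][a])) / m
    return w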
Chi-square statistics is a statistical feature evaluation method: each feature is evaluated by its chi-square value, calculated as follows:
χ² = Σ_{i=1}^{n} Σ_{j=1}^{c} (S_ij − F_ij)² / F_ij
where n is the number of intervals, c is the number of classes, S_ij denotes the number of samples in the i-th interval belonging to the j-th class, and F_ij is the expected frequency of S_ij, calculated as:
F_ij = K_i · C_j / N
where C_j denotes the number of samples in the j-th class, N is the total number of samples, and K_i denotes the number of samples in the i-th interval.
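The chi-square computation admits a short sketch; counts is assumed to be the interval-by-class contingency table holding the S_ij values.

def chi_square(counts):
    # chi2 = sum_ij (S_ij - F_ij)^2 / F_ij with F_ij = K_i * C_j / N.
    k = [sum(row) for row in counts]          # K_i: samples per interval
    c = [sum(col) for col in zip(*counts)]    # C_j: samples per class
    n = sum(k)                                # N: total number of samples
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, s_ij in enumerate(row):
            f_ij = k[i] * c[j] / n            # expected frequency
            if f_ij > 0:
                chi2 += (s_ij - f_ij) ** 2 / f_ij
    return chi2

For example, chi_square([[10, 2], [3, 9]]) scores a feature whose two discretized intervals separate the two classes well.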
The dependency degree of a harmony described in step S4 is calculated by the following formula:
k = γ_P(Q) = |POS_P(Q)| / |U|
where
POS_P(Q) = ∪_{X ∈ U/Q} P̲X
U is a non-empty finite set of objects and X is a subset of U; P and Q are subsets of A, a non-empty finite set of features; P̲X is the lower approximation of X, i.e., the set of objects that can be classified into X with certainty; POS_P(Q) is the largest set of objects that, according to the indiscernibility relation determined by P, certainly belong to a decision class of Q; and γ_P(Q) expresses the degree to which Q depends on P.
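A sketch of the rough-set dependency degree, assuming the data have already been discretized in step S1 so that indiscernibility classes can be formed by exact value matching; the function name dependency_degree is illustrative.

from collections import defaultdict

def dependency_degree(table, p_cols, q_cols):
    # gamma_P(Q) = |POS_P(Q)| / |U| for a decision table whose rows are
    # the universe U; p_cols index the condition features P, q_cols the
    # decision features Q.
    p_classes = defaultdict(list)
    for idx, row in enumerate(table):
        p_classes[tuple(row[i] for i in p_cols)].append(idx)

    # An object lies in POS_P(Q) iff its whole P-indiscernibility class
    # agrees on Q, i.e. the class sits inside a single block of U/Q.
    pos = 0
    for members in p_classes.values():
        q_values = {tuple(table[i][j] for j in q_cols) for i in members}
        if len(q_values) == 1:
            pos += len(members)
    return pos / len(table)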
The invention combines filters with harmony search to form a hybrid system for feature selection. Because the computational complexity of harmony search is low, and incorporating the filters greatly reduces its search space, the method overcomes, to a certain extent, the drawbacks that filtering methods ignore dependencies between features and that wrappers incur a high computational cost.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram of the overall flow structure of the present invention;
FIG. 2 is a schematic representation of the feature ordering of the information gain of the present invention;
FIG. 3 is a flow chart of the Relief algorithm of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements throughout. The embodiments described below with reference to the accompanying drawings are illustrative, intended only to explain the present invention, and are not to be construed as limiting it.
FIG. 1 is a schematic diagram of the overall flow of the invention. As shown, the invention provides a hybrid feature selection method based on harmony search. First, the original feature set of the social network users is preprocessed. Then a feature subset is selected by the three filters (information gain, the Relief algorithm, and chi-square statistics) combined with a majority voting algorithm, and the harmony memory is initialized with these feature subsets. Finally, the optimal feature subset is selected using harmony search and the rough-set dependency function. The specific steps are as follows:
s1: and carrying out normalization and discretization processing on all the characteristics of the data set.
S2: the processed data is used as input to several filters, each of which results in an optimal subset of features. And selecting a smaller and better feature subset from the obtained subsets through a majority voting algorithm.
S3: initialization and acoustic Memory (Harmony Memory), wherein the feature subsets obtained in S2 are stored as random subsets.
S4: the number of initial Musicians, the maximum number of iterations, and the acoustic memory value probability (Harmony memory consensus Rate) and the fine adjustment probability (Pitch adjustment Rate).
S5: a new note is randomly selected by each musician from their note domain and then improvised together to form a new harmony. A new harmony is a joint vote by all musicians. Notes within the entire set of original features are provided to each musician and allow multiple musicians to select the same feature or to choose not to select any feature. If the new harmony formed is better than the worst harmony in the harmony memorization (Dependency Degree), the new harmony is contained in the harmony memorization, and the existing worst harmony is removed, as determined by the harmony Degree.
S6: the algorithm is iterated continuously until the maximum iteration number is reached, and finally the optimal feature subset is obtained.
FIG. 2 is a schematic diagram of the feature ranking by information gain in the present invention. As shown in the figure, the probability of each class is computed first, and then the entropy of the whole classification:
H(S) = −Σ_{x=1}^{m} p(C_x) log2 p(C_x)
where S represents the classes in the classification system, of which there are m in total: C_1, C_2, C_3, …, C_m; p(C_x) is the probability that class C_x occurs, and m is the number of classes in the classification system.
The conditional entropy given the feature is then calculated:
H(S|A) = P(a)H(S|a) + P(a′)H(S|a′)
where P(a) denotes the probability that feature A occurs and P(a′) the probability that it does not.
H(S|a), the conditional entropy of classification into S given feature value a, is:
H(S|a) = −Σ_{x=1}^{m} p(C_x|a) log2 p(C_x|a)
H(S|a′), the conditional entropy of classification into S without feature value a, is:
H(S|a′) = −Σ_{x=1}^{m} p(C_x|a′) log2 p(C_x|a′)
The information gain is then computed as Gain(A) = H(S) − H(S|A), and finally the features are sorted by the magnitude of their information gain.
FIG. 3 is a flow chart of the Relief algorithm of the present invention. As shown in the figure, the weights of all features are initialized to 0 and the number of sampling rounds is m. In each round, a sample R is drawn at random from the training set D (each sample has N features); its nearest-neighbor sample H is found among the samples of the same class, and its nearest-neighbor sample M among the samples of a different class. The weight of each feature is updated as follows:
W(A) = W(A) − (diff(A, R, H) − diff(A, R, M)) / m
where diff(A, R, H) denotes the difference between the values of feature A for samples R and H, diff(A, R, M) denotes the difference between the values of feature A for samples R and M, H and M are respectively the nearest-neighbor sample of R in the same class and in a different class, and m is the number of sampling rounds.
If the distance between R and H on a feature is smaller than the distance between R and M, the weight of that feature is increased; if the distance between R and H is larger than that between R and M, the weight is decreased. Finally, the features are sorted by the resulting weights.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (7)

1. A hybrid feature selection method based on harmony search, comprising the following steps:
S1: normalizing and discretizing the original feature set data of the social network users;
S2: inputting the data processed in step S1 into a plurality of filters, each filter producing its own feature subset; and selecting superior feature subsets from the plurality of feature subsets by a majority voting algorithm;
S3: initializing the harmony memory, the number of musicians, the maximum number of iterations, the harmony memory considering rate, and the pitch adjusting rate; and storing the superior feature subsets obtained in S2 in the harmony memory as random harmonies;
S4: each musician randomly selecting a note from the original feature set to form a new harmony; judging the harmony quality by the dependency degree of the harmony; if the new harmony is better than the worst harmony in the harmony memory, storing the new harmony in the harmony memory and removing the worst harmony; otherwise, discarding the new harmony;
S5: iterating step S4 until the maximum number of iterations is reached, and outputting the harmony at that point as the optimal harmony.
2. The hybrid feature selection method based on harmony search according to claim 1, wherein the normalization in step S1 gives each feature approximately the same scale, with each feature value falling within [0,1], specifically by:
f′ = (f − min_f) / (max_f − min_f)
where min_f and max_f respectively denote the minimum and maximum values of the feature, f is the original feature value, and f′ denotes the normalized feature value, whose range falls within [0,1].
3. The hybrid feature selection method based on harmony search according to claim 1, wherein the filters of step S2 comprise information gain, the Relief algorithm, and chi-square statistics.
4. The hybrid feature selection method based on harmony search according to claim 3, wherein the information gain is calculated as follows:
the information gain of feature A is expressed as:
Gain(A) = H(S) − H(S|A)
where H(S) is the entropy of classifying a tuple into S and H(S|A) is the entropy of classifying a tuple into S given feature A; S represents the classes in the classification system, of which there are m in total: C_1, C_2, C_3, …, C_m;
H(S) is calculated by the following formula:
H(S) = −Σ_{x=1}^{m} p(C_x) log2 p(C_x)
where p(C_x) is the probability that class C_x occurs and m is the number of classes in the classification system;
the formula for H(S|A) is:
H(S|A) = P(a)H(S|a) + P(a′)H(S|a′)
where P(a) denotes the probability that feature A occurs and P(a′) the probability that it does not;
H(S|a), the conditional entropy of classification into S given feature value a, is:
H(S|a) = −Σ_{x=1}^{m} p(C_x|a) log2 p(C_x|a)
H(S|a′), the conditional entropy of classification into S without feature value a, is:
H(S|a′) = −Σ_{x=1}^{m} p(C_x|a′) log2 p(C_x|a′)
5. The hybrid feature selection method based on harmony search according to claim 3, wherein the Relief algorithm assigns different weights to the features according to the relevance of each feature to the classes, and features whose weight is below a threshold are removed; the relevance of a feature to the classes is determined from distances between samples; the weight of feature A is updated by the following formula:
W(A) = W(A) − (diff(A, R, H) − diff(A, R, M)) / m
where diff(A, R, H) denotes the difference between the values of feature A for samples R and H, diff(A, R, M) denotes the difference between the values of feature A for samples R and M, H and M are respectively the nearest-neighbor sample of R within the same class and within a different class, and m is the number of sampling rounds.
6. The hybrid feature selection method based on harmony search according to claim 3, wherein the chi-square value in the chi-square statistics is calculated by:
χ² = Σ_{i=1}^{n} Σ_{j=1}^{c} (S_ij − F_ij)² / F_ij
where n is the number of intervals, c is the number of classes, S_ij denotes the number of samples in the i-th interval belonging to the j-th class, and F_ij is the expected frequency of S_ij, calculated as:
F_ij = K_i · C_j / N
where C_j denotes the number of samples in the j-th class, N is the total number of samples, and K_i denotes the number of samples in the i-th interval.
7. The hybrid feature selection method based on harmony search according to any one of claims 1 to 6, wherein the dependency degree of a harmony in step S4 is calculated by the following formula:
k = γ_P(Q) = |POS_P(Q)| / |U|
where
POS_P(Q) = ∪_{X ∈ U/Q} P̲X
U is a non-empty finite set of objects and X is a subset of U; P and Q are subsets of A, a non-empty finite set of features; P̲X is the lower approximation of X, i.e., the set of objects that can be classified into X with certainty; POS_P(Q) is the largest set of objects that, according to the indiscernibility relation determined by P, certainly belong to a decision class of Q; and γ_P(Q) expresses the degree to which Q depends on P.
CN201710090165.6A 2017-02-20 2017-02-20 Hybrid feature selection method based on harmony search Pending CN106897413A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710090165.6A CN106897413A (en) 2017-02-20 2017-02-20 Hybrid feature selection method based on harmony search


Publications (1)

Publication Number Publication Date
CN106897413A true CN106897413A (en) 2017-06-27

Family

ID=59185640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710090165.6A Pending CN106897413A (en) Hybrid feature selection method based on harmony search

Country Status (1)

Country Link
CN (1) CN106897413A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930723A (en) * 2016-04-20 2016-09-07 福州大学 Intrusion detection method based on feature selection
CN106250442A (en) * 2016-07-26 2016-12-21 新疆大学 The feature selection approach of a kind of network security data and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
H. Hannah Inbarani, M. Bagyamathi, Ahmad Taher Azar: "A novel hybrid feature selection method based on rough set and improved harmony search", Neural Computing and Applications *
魏峻: "Feature gene selection method based on an improved harmony search algorithm", 《河南科学》 (Henan Science) *

Similar Documents

Publication Publication Date Title
CN106570178B (en) High-dimensional text data feature selection method based on graph clustering
CN105512311B (en) A kind of adaptive features select method based on chi-square statistics
CN108363810B (en) Text classification method and device
CN108898479B (en) Credit evaluation model construction method and device
CN107807987A (en) A kind of string sort method, system and a kind of string sort equipment
Hristakieva et al. The spread of propaganda by coordinated communities on social media
CN110135167B (en) Edge computing terminal security level evaluation method for random forest
CN112001788B (en) Credit card illegal fraud identification method based on RF-DBSCAN algorithm
CN109271517B (en) IG TF-IDF text feature vector generation and text classification method
US10387805B2 (en) System and method for ranking news feeds
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN104391835A (en) Method and device for selecting feature words in texts
CN107180093A (en) Information search method and device and ageing inquiry word recognition method and device
CN110689368B (en) Method for designing advertisement click rate prediction system in mobile application
CN105138653A (en) Exercise recommendation method and device based on typical degree and difficulty
CN109902823B (en) Model training method and device based on generation countermeasure network
CN112437053B (en) Intrusion detection method and device
Oktarina et al. Comparison of k-means clustering method and k-medoids on twitter data
CN106934410A (en) The sorting technique and system of data
CN102243641A (en) Method for efficiently clustering massive data
CN113568368B (en) Self-adaptive determination method for industrial control data characteristic reordering algorithm
CN110019563B (en) Portrait modeling method and device based on multi-dimensional data
Torres-Tramón et al. Topic detection in Twitter using topology data analysis
KR101064256B1 (en) Apparatus and Method for Selecting Optimal Database by Using The Maximal Concept Strength Recognition Techniques
Obadinma et al. Class-wise Calibration: A Case Study on COVID-19 Hate Speech.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170627