CN116662859A - Intangible cultural heritage data feature selection method
- Publication number
- CN116662859A (application CN202310636101.7A)
- Authority
- CN
- China
- Prior art keywords
- firefly
- feature
- genetic
- fitness
- individual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
The invention discloses an intangible cultural heritage data feature selection method comprising the following steps: acquiring an intangible cultural heritage data set and constructing a feature selection model for it based on the firefly algorithm; calculating the fitness of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance; moving firefly individuals with lower fitness toward those with higher fitness, updating the firefly positions, and recalculating the fitness; and outputting the optimal feature subset of the data set, which corresponds to the globally optimal firefly individual. Compared with the original data, the data after feature selection has lower dimensionality and lower redundancy while retaining good information completeness when classified, which improves the classification of intangible cultural heritage levels, reduces data redundancy, and optimizes resource use.
Description
Technical Field
The invention belongs to the technical field of data mining methods, and in particular relates to an intangible cultural heritage data feature selection method.
Background
In recent years, intangible cultural heritage has been increasingly valued by the state and society. With the rapid development of information technology, its digitization has accelerated, and intangible cultural heritage information resources of many kinds continue to emerge. Classifying and analyzing intangible cultural heritage levels can give the relevant departments a more reasonable basis for grading future heritage items, so that the heritage is protected more effectively. However, existing intangible cultural heritage data is high-dimensional, which greatly increases the cost of classifying and analyzing it. In addition, the data carries a degree of uncertainty, and unimportant features both add redundancy and prevent predictions of heritage level from reaching the desired quality. Therefore, to analyze the data more effectively and reduce processing cost, feature selection is needed to lower the data dimension and eliminate unimportant features.
The firefly algorithm, a swarm intelligence algorithm inspired by the flashing behavior of fireflies, was proposed by Xin-She Yang in 2008. Compared with other intelligent algorithms it performs well, but the fitness function of the standard firefly algorithm generally cannot guarantee that the selected feature subset loses little information, and the algorithm suffers from low search precision and slow convergence during optimization. The neighborhood rough set introduces neighborhood granulation and a metric space, converting the equivalence relation of rough set theory into a covering relation of information granules in the neighborhood space, and can therefore measure the uncertainty of data effectively. It is thus necessary to combine the neighborhood rough set with the firefly algorithm, and to improve the algorithm's fitness function construction and search update strategy, in order to process high-dimensional, complex intangible cultural heritage data.
Disclosure of Invention
The invention aims to provide an intangible cultural heritage data feature selection method that can delete irrelevant or low-importance attributes without affecting the final classification result of the data.
The technical scheme adopted by the invention is as follows: the intangible cultural heritage data feature selection method comprises the following steps:
Step 1, acquiring an intangible cultural heritage data set, and constructing a feature selection model for the data set based on the firefly algorithm;
Step 2, calculating the fitness $Fit_{NGRE}$ of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance;
Step 3, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, updating the firefly positions, and recalculating the fitness of the firefly individuals;
Step 4, judging whether the current iteration has reached the maximum number of iterations $T_{max}$; if not, returning to step 3; otherwise, outputting the optimal feature subset of the data set, which corresponds to the globally optimal firefly individual.
The present invention is also characterized in that,
Step 1 specifically comprises: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired intangible cultural heritage data set, wherein the number $N$ of fireflies (that is, of candidate feature subsets) is 50 and the maximum number of iterations $T_{max}$ is 30; randomly initializing a firefly population $FAG = \{S_1, S_2, \ldots, S_N\}$ of size $N$, where each firefly has an initial position $S_i = \{S_{i1}, S_{i2}, \ldots, S_{id}\}$, $1 \le i \le N$, and $d$ is the number of features; setting the initial attractiveness $\beta_0$, the light absorption coefficient $\gamma$ of the propagation medium, the step-size perturbation factor $\alpha$, and the maximum number of iterations $T_{max}$; and, before calculating the fitness of each firefly individual, encoding each individual (each feature subset) with a sigmoid function that converts its values to 0/1 form, the sigmoid function being defined by formula (1).
In step 2, the neighborhood granularity rough entropy is calculated as:
$$NGRE(S) = NGK(D|S) \times NE_r(D|S) \tag{2}$$
In formula (2), $NGK(D|S)$ and $NE_r(D|S)$ are the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset $S$ relative to the decision attribute $D$; they are given by formulas (3) and (4).
In formulas (3) and (4), $\delta_S(x_i)$ is the neighborhood class of sample $x_i$ under the feature subset $S$, $\delta_{S \cup D}(x_i)$ is the neighborhood class of sample $x_i$ under the feature subset $S$ together with the decision attribute $D$, and $U$ is the sample space.
The fitness of each individual in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance by formula (5).
In formula (5), $\lambda_1$ and $\lambda_2$ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with $\lambda_1 + \lambda_2 = 1$; for any firefly, that is, any feature subset $S \in FAG$, $|S|$ is the number of features in $S$ and $N$ is the total number of features; $NGRE(S)$ is the neighborhood granularity rough entropy.
Step 3 compares the fitness of the firefly individuals, moves individuals with lower fitness toward individuals with higher fitness, calculates the mutual attraction between each firefly individual and the others from their spatial distance, updates the firefly positions, and recalculates the fitness. It specifically comprises:
Step 3.1, comparing the fitness of each firefly individual with that of every other individual in turn; determining, on the principle that individuals with low fitness are attracted by individuals with high fitness, which individuals in the population attract each firefly; and calculating the mutual attraction between each firefly and the others from their spatial distance using formula (6);
In formula (6), $\beta_0$ is the attractiveness at $r = 0$, $\gamma$ is the light absorption coefficient, and $r_{ij}$ is the distance between firefly individuals $x_i$ and $x_j$;
Step 3.2, for any two fireflies $S_i, S_j \in FAG$, if the fitness of $S_j$ is higher than that of $S_i$, firefly $S_i$ moves toward the position of $S_j$, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,(S_{jd}(t) - S_{id}(t)) + \alpha\,(rand - 1/2) \tag{7}$$
In formula (7), $d$ indexes the spatial dimension of the firefly individual, that is, the feature dimension; $\alpha \in [0,1]$ is the step factor; $\beta(r_{ij})$ is the attraction between fireflies $x_i$ and $x_j$; $(rand - 1/2)$ is a random number in the interval $[-0.5, 0.5]$; and $t$ is the iteration number;
Step 3.3, recalculating the fitness of firefly individual $S_i$ using formula (5), ranking all fireflies, and finding the firefly individual with the best fitness in the current iteration.
Step 4 further comprises dividing the output data set $R$, reduced to the optimal feature subset, into a training set $T$ and a test set $V$ in a 7:3 ratio, and classifying the divided data with a CART decision tree model; during classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set $T$, and $T$ is divided into several subsets; the Gini index of each feature $A$ in the training set $T$ is calculated by formula (8);
In formula (8), $|T|$ is the number of intangible cultural heritage records in the training set $T$, $|C_k|$ is the number of records of the $k$-th category in $T$, and $K$ is the number of intangible cultural heritage levels; if the value of feature $A$ divides $T$ into two parts $T_1$ and $T_2$, then $|T_1|$ and $|T_2|$ are the numbers of records in each part;
For each resulting subset, if all records in the subset belong to the same category, the subset is marked as that category; otherwise, the procedure returns to the Gini index calculation and applies the above steps recursively to the subset; this is repeated until the stopping condition is satisfied.
The beneficial effects of the invention are as follows: compared with the original data, intangible cultural heritage data processed by the feature selection method has lower dimensionality, and when the processed data is classified it has lower redundancy while retaining good information completeness, which improves the classification of intangible cultural heritage levels, reduces data redundancy, and optimizes resource use.
Drawings
FIG. 1 is a flow chart of the intangible cultural heritage data feature selection method of the present invention;
FIG. 2 shows the comparison in Example 3 between the method of the present invention and three comparison methods on the AUC, ACC, and F1 evaluation metrics;
FIG. 3 shows the comparison in Example 3 between the method of the present invention and three comparison methods on the feature subset size metric.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1
As shown in FIG. 1, the method comprises the following steps:
Step 1, acquiring an intangible cultural heritage data set, constructing a feature selection model for the data set based on the firefly algorithm, and initializing parameters such as the firefly population size (the number of candidate feature subsets), the light absorption coefficient, and the maximum number of iterations according to the data set. This step is implemented as follows:
The parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired data set. The number $N$ of fireflies (candidate feature subsets) is 50, and the maximum number of iterations $T_{max}$ is 30. A firefly population $FAG = \{S_1, S_2, \ldots, S_N\}$ of size $N$ is randomly initialized, where each firefly has an initial position $S_i = \{S_{i1}, S_{i2}, \ldots, S_{id}\}$, $1 \le i \le N$, and $d$ is the number of features. The initial attractiveness $\beta_0$, the light absorption coefficient $\gamma$ of the propagation medium, the step-size perturbation factor $\alpha$, and the maximum number of iterations $T_{max}$ are set. Before the fitness of each firefly individual (each feature subset) is calculated, each individual is encoded with a sigmoid function that converts its values to 0/1 form. The sigmoid function is defined by formula (1).
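The initialization and encoding described in step 1 can be sketched as follows. Formula (1) is not reproduced in this text, so the sketch assumes the standard logistic sigmoid 1/(1+e^(-x)); the 0.5 threshold, the initial range [-1, 1], and the function names are illustrative assumptions, not values stated in the patent.

```python
import math
import random

def sigmoid(x: float) -> float:
    """Standard logistic function (assumed form of formula (1))."""
    return 1.0 / (1.0 + math.exp(-x))

def init_population(n_fireflies: int, n_features: int, rng: random.Random):
    """Random real-valued positions, one list of d coordinates per firefly.

    The range [-1, 1] is an assumption; the source only says "randomly".
    """
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
            for _ in range(n_fireflies)]

def binarize(position, threshold: float = 0.5):
    """Map a real-valued position to a 0/1 feature mask.

    A coordinate becomes 1 (feature selected) when its sigmoid exceeds
    the threshold; the threshold rule is an assumption.
    """
    return [1 if sigmoid(v) > threshold else 0 for v in position]

rng = random.Random(0)
population = init_population(n_fireflies=50, n_features=8, rng=rng)  # N = 50 as in the text
masks = [binarize(p) for p in population]
```

Each mask is one candidate feature subset: a 1 in position k means feature k is kept.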
Step 2, calculating the fitness of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance. This step is implemented as follows:
The neighborhood granularity rough entropy is calculated as:
$$NGRE(S) = NGK(D|S) \times NE_r(D|S) \tag{2}$$
where $NGK(D|S)$ and $NE_r(D|S)$ are the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset $S$ relative to the decision attribute $D$; they are given by formulas (3) and (4),
where $\delta_S(x_i)$ is the neighborhood class of sample $x_i$ under the attribute subset $S$, $\delta_{S \cup D}(x_i)$ is the neighborhood class of sample $x_i$ under the attribute subset $S$ together with the decision attribute $D$, and $U$ is the sample space.
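The neighborhood classes that formulas (3) and (4) are built on can be illustrated with a small sketch. It assumes the usual neighborhood rough set convention: a sample's neighborhood under a feature subset is every sample within a given radius, and adding a nominal decision attribute keeps only neighbors with the same label. The Euclidean distance, the `radius` parameter, and the function names are assumptions; the patent text does not specify them.

```python
import math

def neighborhood(data, i, features, radius):
    """Indices of samples within `radius` of sample i, measured only on `features`."""
    xi = data[i]
    result = []
    for j, xj in enumerate(data):
        dist = math.sqrt(sum((xi[f] - xj[f]) ** 2 for f in features))
        if dist <= radius:
            result.append(j)
    return result

def neighborhood_with_decision(data, labels, i, features, radius):
    """Neighbors of sample i under `features` that also share its decision label."""
    return [j for j in neighborhood(data, i, features, radius)
            if labels[j] == labels[i]]

# Three toy samples with two features; labels stand for heritage levels.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
labels = ["national", "provincial", "national"]
print(neighborhood(data, 0, [0, 1], 0.2))                        # [0, 1]
print(neighborhood_with_decision(data, labels, 0, [0, 1], 0.2))  # [0]
```

When the two neighborhoods of a sample differ, as for sample 0 here, the chosen features cannot fully separate the decision classes around that sample.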
The fitness of each individual in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance by formula (5),
where $\lambda_1$ and $\lambda_2$ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with $\lambda_1 + \lambda_2 = 1$, and $Fit_{NGRE}$ is the fitness of an individual in the firefly population. For any firefly (feature subset) $S \in FAG$, $|S|$ is the number of features in $S$ and $N$ is the total number of features. $NGRE(S)$ is the neighborhood granularity rough entropy.
Step 3, comparing the fitness of the firefly individuals, moving individuals with lower fitness toward individuals with higher fitness, calculating the mutual attraction between each firefly individual and the others from their spatial distance, updating the firefly positions, and recalculating the fitness. This step is implemented as follows:
The fitness of each firefly individual is compared with that of every other individual in turn; on the principle that individuals with low fitness are attracted by individuals with high fitness, it is determined which individuals in the population attract each firefly; and the mutual attraction between each firefly and the others is calculated from their spatial distance using formula (6),
where $\beta_0$ is the attractiveness at $r = 0$, $\gamma$ is the light absorption coefficient, and $r_{ij}$ is the distance between firefly individuals $x_i$ and $x_j$.
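Formula (6) is not reproduced in this text. A minimal sketch, assuming the attractiveness of the standard firefly algorithm, beta(r) = beta_0 * exp(-gamma * r^2), with r taken as the Euclidean distance between two positions; the default parameter values are placeholders.

```python
import math

def distance(a, b):
    """Euclidean distance r_ij between two firefly positions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def attractiveness(r, beta0=1.0, gamma=1.0):
    """Standard firefly attractiveness (assumed form of formula (6)).

    Equals beta0 at r = 0 and decays as the distance grows.
    """
    return beta0 * math.exp(-gamma * r * r)

r = distance([0.0, 0.0], [3.0, 4.0])  # 5.0
beta = attractiveness(r, beta0=1.0, gamma=0.1)
```

A larger gamma makes attraction fall off faster, so distant fireflies barely influence each other.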
For any two fireflies $S_i, S_j \in FAG$, if the fitness of $S_j$ is higher than that of $S_i$, firefly $S_i$ moves toward the position of $S_j$, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,(S_{jd}(t) - S_{id}(t)) + \alpha\,(rand - 1/2) \tag{7}$$
where $d$ indexes the spatial dimension of the firefly individual (the feature dimension), $\alpha \in [0,1]$ is the step factor, $\beta(r_{ij})$ is the attraction between fireflies $x_i$ and $x_j$, $(rand - 1/2)$ is a random number in the interval $[-0.5, 0.5]$, and $t$ is the iteration number.
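The position update of formula (7) translates directly into code. In this sketch the function name and the choice to draw a fresh random number for every dimension are assumptions.

```python
import random

def move_towards(s_i, s_j, beta, alpha, rng):
    """Apply formula (7) to every dimension d of firefly i.

    S_id(t+1) = S_id(t) + beta * (S_jd(t) - S_id(t)) + alpha * (rand - 1/2)
    """
    return [x_i + beta * (x_j - x_i) + alpha * (rng.random() - 0.5)
            for x_i, x_j in zip(s_i, s_j)]

rng = random.Random(0)
# With alpha = 0 the move is deterministic: beta = 0.5 lands on the midpoint.
midpoint = move_towards([0.0, 0.0], [1.0, 2.0], beta=0.5, alpha=0.0, rng=rng)
# With alpha > 0 each coordinate is perturbed by at most alpha / 2.
noisy = move_towards([0.0, 0.0], [1.0, 2.0], beta=0.5, alpha=0.2, rng=rng)
```

The attraction term pulls firefly i toward the fitter firefly j, while the random term keeps the search from collapsing too early.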
The fitness of firefly individual $S_i$ is recalculated using formula (5), all fireflies are ranked, and the firefly individual with the best fitness in the current iteration is found.
Step 4, judging whether the current iteration has reached the maximum number of iterations $T_{max}$ (30 in the present invention); if not, returning to step 3; otherwise, outputting the feature subset corresponding to the globally optimal firefly individual, which is the optimal feature subset of the intangible cultural heritage data set obtained by the firefly algorithm.
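Steps 1 to 4 can be put together in a compact loop. This is a sketch under stated assumptions, not the patented implementation: the fitness is passed in as a callable because formulas (3) to (5) are not reproduced in this text, the sigmoid threshold of 0.5 and the standard attractiveness beta_0 * exp(-gamma * r^2) are assumed, and `toy_fitness` is a hypothetical stand-in used only to make the example runnable.

```python
import math
import random

def firefly_search(fitness, n_features, n_fireflies=50, t_max=30,
                   beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Return (best_mask, best_fitness) found by a basic binary firefly search."""
    rng = random.Random(seed)

    def to_mask(pos):
        # Sigmoid encoding with an assumed 0.5 threshold.
        return [1 if 1.0 / (1.0 + math.exp(-v)) > 0.5 else 0 for v in pos]

    # Step 1: random initial positions.
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
           for _ in range(n_fireflies)]
    # Step 2: initial fitness of every firefly.
    fit = [fitness(to_mask(p)) for p in pop]
    best = max(range(n_fireflies), key=fit.__getitem__)
    best_mask, best_fit = to_mask(pop[best]), fit[best]

    # Step 4: stop after t_max iterations.
    for _ in range(t_max):
        # Step 3: each firefly moves toward every fitter firefly.
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if fit[j] > fit[i]:
                    r2 = sum((a - b) ** 2 for a, b in zip(pop[i], pop[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    pop[i] = [a + beta * (b - a) + alpha * (rng.random() - 0.5)
                              for a, b in zip(pop[i], pop[j])]
                    fit[i] = fitness(to_mask(pop[i]))
                    if fit[i] > best_fit:
                        best_mask, best_fit = to_mask(pop[i]), fit[i]
    return best_mask, best_fit

# Hypothetical stand-in fitness: number of bits matching a fixed target mask.
TARGET = [1, 0, 1, 0, 0, 1]

def toy_fitness(mask):
    return sum(1 for m, t in zip(mask, TARGET) if m == t)

best_mask, best_fit = firefly_search(toy_fitness, n_features=6,
                                     n_fireflies=10, t_max=5)
```

Tracking the best mask separately from the population is a design choice of this sketch: it ensures the returned subset is the best one ever evaluated, even if that firefly later moves away.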
Example 2
To verify the effectiveness of the intangible cultural heritage data feature selection method, the CART decision tree algorithm is used to classify the processed data set, and the classification results are evaluated. This is implemented as follows:
Whether the optimization result meets the termination condition (the maximum number of iterations has been reached) is judged; if not, the procedure returns to step 3 for the next round of optimization; if so, the optimal feature subset obtained in step 3 is used to perform feature selection on the intangible cultural heritage data set. The data set $R$ after feature selection is divided into a training set $T$ and a test set $V$ in a 7:3 ratio, and the divided data is classified with a CART decision tree model. During classification, the initial root node is selected by calculating the Gini index of each feature in the training set $T$, and $T$ is divided into several subsets. The Gini index of each feature $A$ in $T$ is calculated by formula (8),
where $|T|$ is the number of intangible cultural heritage records in the training set $T$, $|C_k|$ is the number of records of the $k$-th category (national level or provincial level) in $T$, and $K$ is the number of intangible cultural heritage levels, which is 2 in the present invention. If the value of feature $A$ divides $T$ into two parts $T_1$ and $T_2$, then $|T_1|$ and $|T_2|$ are the numbers of records in each part.
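Formula (8) is not reproduced in this text. The sketch below assumes the standard CART definitions, which match the symbols described above: Gini(T) = 1 - sum over k of (|C_k|/|T|)^2, and for a binary split, Gini(T, A) = |T_1|/|T| * Gini(T_1) + |T_2|/|T| * Gini(T_2). The function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k (|C_k| / |T|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini index after a feature splits T into T_1 and T_2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
        + (len(right_labels) / n) * gini(right_labels)

# K = 2 levels, as in the text.
print(gini(["national", "provincial"]))                          # 0.5
print(gini_split(["national", "national"], ["provincial"] * 2))  # 0.0
```

A split with a lower weighted Gini index separates the levels more cleanly, so CART picks the feature and split point that minimise it.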
For each resulting subset, if all records in the subset belong to the same category (for example, the national level), the subset is marked as that category; otherwise, the procedure returns to the Gini index calculation and applies the above steps recursively to the subset. This is repeated until the stopping condition is satisfied. The constructed CART decision tree model can then classify the test set $V$, assigning each intangible cultural heritage record in it to one of the predefined categories.
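The recursive construction described above can be sketched in a few lines. The sketch assumes numeric features split by a threshold, the standard Gini impurity, and a `max_depth` stopping condition; the patent does not state its stopping condition, so that parameter and all function names are assumptions.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(rows, labels):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score, n = None, float("inf"), len(rows)
    for f in range(len(rows[0])):
        for thr in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= thr]
            right = [y for row, y in zip(rows, labels) if row[f] > thr]
            if not left or not right:
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best, best_score = (f, thr), score
    return best

def build_tree(rows, labels, depth=0, max_depth=5):
    """Leaves are class labels; internal nodes are (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority  # pure subset or stopping condition reached
    split = best_split(rows, labels)
    if split is None:
        return majority
    f, thr = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= thr]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > thr]
    return (f, thr,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree

rows = [[0.1], [0.2], [0.8], [0.9]]
labels = ["provincial", "provincial", "national", "national"]
tree = build_tree(rows, labels)
```

Here `predict(tree, [0.85])` follows the single split at 0.2 and returns the leaf label for the right branch.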
The classification results are evaluated using AUC, accuracy (hereinafter ACC), F1-score (hereinafter F1), and feature subset size. The AUC value is the area enclosed by the ROC curve and the coordinate axes, and directly reflects the classification quality of the classifier: the closer the AUC is to 1, the better the classification performance, while an AUC of 0.5 or below indicates classification no better than random guessing. ACC is the classification accuracy, that is, the ratio of the number of samples the classifier classifies correctly to the total number of samples. F1 is the harmonic mean of precision and recall, with a value range of $[0, 1]$, where 1 represents the best model output and 0 the worst.
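The three classification metrics follow directly from their definitions. In this sketch the AUC is computed through its rank interpretation, the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, which equals the area under the ROC curve; the function names and the toy labels are illustrative.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores, positive=1):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 0]          # 1 = national level, 0 = provincial level (toy data)
y_pred = [1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6]  # classifier scores for the positive class
print(accuracy(y_true, y_pred))  # 0.75
print(auc(y_true, scores))       # 0.75
```

For the same toy data the F1 value is 2/3, since precision is 1 and recall is 0.5.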
In this way, the method performs feature selection on the collected intangible cultural heritage data A and produces a feature subset encoded as a vector $[x_1, x_2, \ldots, x_n]$, where $n$ is the number of features in the data set and $x_i = 0$ or $1$ indicates whether feature $i$ is selected, so that key features are retained and redundant features are rejected. The invention can generate a set of such feature subsets; a decision maker can choose among them according to decision requirements and then generate new data B from the chosen feature subset and the original data A. Data B has a lower dimension than data A, yet retains good classification performance when classified, which saves computing resources.
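Generating data B from data A and a chosen 0/1 vector amounts to a column selection. A minimal sketch, with a hypothetical function name and toy values:

```python
def select_features(rows, mask):
    """Keep only the columns of `rows` whose mask entry is 1."""
    keep = [k for k, bit in enumerate(mask) if bit == 1]
    return [[row[k] for k in keep] for row in rows]

data_a = [[1, 2, 3],
          [4, 5, 6]]
mask = [1, 0, 1]                          # x_2 = 0: the second feature is rejected
data_b = select_features(data_a, mask)
print(data_b)                             # [[1, 3], [4, 6]]
```

Data B keeps every record of data A but only the selected feature columns.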
Example 3
As shown in FIG. 1, the method is implemented according to the following steps:
Step 1, acquiring an intangible cultural heritage data set, constructing a feature selection model for the data set based on the firefly algorithm, and initializing parameters such as the firefly population size (the number of candidate feature subsets), the light absorption coefficient, and the maximum number of iterations according to the data set. Specifically: the parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired data set. The number $N$ of fireflies (candidate feature subsets) is 50, and the maximum number of iterations $T_{max}$ is 30. A firefly population $FAG = \{S_1, S_2, \ldots, S_N\}$ of size $N$ is randomly initialized, where each firefly has an initial position $S_i = \{S_{i1}, S_{i2}, \ldots, S_{id}\}$, $1 \le i \le N$, and $d$ is the number of features. The initial attractiveness $\beta_0$, the light absorption coefficient $\gamma$ of the propagation medium, the step-size perturbation factor $\alpha$, and the maximum number of iterations $T_{max}$ are set. Before the fitness of each firefly individual (each feature subset) is calculated, each individual is encoded with a sigmoid function that converts its values to 0/1 form. The sigmoid function is defined by formula (1).
Step 2, calculating the fitness of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance. Specifically, the neighborhood granularity rough entropy is calculated as:
$$NGRE(S) = NGK(D|S) \times NE_r(D|S) \tag{2}$$
where $NGK(D|S)$ and $NE_r(D|S)$ are the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset $S$ relative to the decision attribute $D$; they are given by formulas (3) and (4),
where $\delta_S(x_i)$ is the neighborhood class of sample $x_i$ under the feature subset $S$, $\delta_{S \cup D}(x_i)$ is the neighborhood class of sample $x_i$ under the feature subset $S$ together with the decision attribute $D$, and $U$ is the sample space.
The fitness of each individual in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance by formula (5),
where $\lambda_1$ and $\lambda_2$ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with $\lambda_1 + \lambda_2 = 1$, and $Fit_{NGRE}$ is the fitness of an individual in the firefly population. For any firefly (feature subset) $S \in FAG$, $|S|$ is the number of features in $S$ and $N$ is the total number of features. $NGRE(S)$ is the neighborhood granularity rough entropy.
Step 3, comparing the fitness of the firefly individuals, moving individuals with lower fitness toward individuals with higher fitness, calculating the mutual attraction between each firefly individual and the others from their spatial distance, updating the firefly positions, and recalculating the fitness. Specifically: the fitness of each firefly individual is compared with that of every other individual in turn; on the principle that individuals with low fitness are attracted by individuals with high fitness, it is determined which individuals in the population attract each firefly; and the mutual attraction between each firefly and the others is calculated from their spatial distance using formula (6),
where $\beta_0$ is the attractiveness at $r = 0$, $\gamma$ is the light absorption coefficient, and $r_{ij}$ is the distance between firefly individuals $x_i$ and $x_j$.
For any two fireflies $S_i, S_j \in FAG$, if the fitness of $S_j$ is higher than that of $S_i$, firefly $S_i$ moves toward the position of $S_j$, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,(S_{jd}(t) - S_{id}(t)) + \alpha\,(rand - 1/2) \tag{7}$$
where $d$ indexes the spatial dimension of the firefly individual (the feature dimension), $\alpha \in [0,1]$ is the step factor, $\beta(r_{ij})$ is the attraction between fireflies $x_i$ and $x_j$, $(rand - 1/2)$ is a random number in the interval $[-0.5, 0.5]$, and $t$ is the iteration number.
The fitness of firefly individual $S_i$ is recalculated using formula (5), all fireflies are ranked, and the firefly individual with the best fitness in the current iteration is found.
Step 4, judging whether the current iteration reaches the maximum iteration number T max (maximum iteration count T in the present invention) max 30), if not, returning to the step 3, otherwise, outputting a feature subset corresponding to the overall optimal firefly individual, and finally obtaining the optimal feature subset of the non-genetic culture data set based on the firefly algorithm. In order to verify the effectiveness of the non-genetic cultural data feature selection method, the CART decision tree model is utilized to perform classification operation on the processed non-genetic cultural data set, and the classification result is evaluated. The method is implemented according to the following steps:
judging whether the optimizing result meets the ending condition (the maximum iteration times are reached), if not, turning to the step 3, and carrying out the next optimizing; and if the ending condition is met, using the optimal feature subset obtained in the step 3 in a feature selection process of the non-genetic culture data set. Dividing the non-genetic culture data set R after feature selection processing into a training set T and a testing set V according to the proportion of 7:3, and classifying and analyzing the divided data set by adopting a CART decision tree model. In the classification process, an initial root node is selected by calculating a base index for each feature in the training set T, and the training set T is divided into several subsets. The formula for calculating the base index of each feature A in the training set T is as follows:
where $|T|$ is the number of intangible cultural heritage data records in the training set T, $|C_k|$ is the number of records of the k-th category (i.e., national level or provincial level) in the training set T, and K is the number of intangible cultural heritage levels, which is 2 in the present invention. Suppose the value of feature A divides the training set T into two parts $T_1$ and $T_2$; then $|T_1|$ and $|T_2|$ are the numbers of intangible cultural heritage data records contained in each part.
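The quantities defined above can be made concrete with a short Python sketch. The formula itself is not reproduced in this text, so the code assumes the standard CART definitions, $Gini(T) = 1 - \sum_{k=1}^{K} (|C_k|/|T|)^2$ and $Gini(T, A) = \frac{|T_1|}{|T|} Gini(T_1) + \frac{|T_2|}{|T|} Gini(T_2)$, which match the symbols used here. The function names and the label strings are hypothetical, and the label lists are assumed to be non-empty.

```python
from collections import Counter


def gini(labels):
    """Gini(T) = 1 - sum_k (|C_k| / |T|)^2 over the classes present in labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_split(labels_1, labels_2):
    """Gini(T, A) = |T_1|/|T| * Gini(T_1) + |T_2|/|T| * Gini(T_2)."""
    total = len(labels_1) + len(labels_2)
    return (len(labels_1) * gini(labels_1)
            + len(labels_2) * gini(labels_2)) / total


mixed = ["national", "national", "provincial", "provincial"]
print(gini(mixed))  # 0.5
print(gini_split(["national", "national"], ["provincial", "provincial"]))  # 0.0
```

With K = 2 the impurity of a node ranges from 0 (all records in one category) to 0.5 (an even mix), so a split that separates the two levels perfectly scores 0.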
For each subset of the partition, if the intangible cultural heritage data in the subset all belong to the same category (e.g., national level), the subset is marked as that category; otherwise, return to the step of calculating the Gini index of the features and apply the above steps recursively to the subset. This process is repeated until the stopping condition is satisfied. The constructed CART decision tree model can then classify the test set V, assigning the intangible cultural heritage data in the test set to the predefined categories.
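The recursive construction described above can be sketched as follows. This is a simplified illustration rather than the patented procedure: it assumes numeric features, binary threshold splits of the form `value <= threshold`, the standard Gini impurity, and a stopping condition of "node is pure or no split lowers the impurity". All function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Standard Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) with the lowest weighted Gini,
    or None if no split improves on the impurity of the unsplit node."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        # Dropping the largest value keeps both sides of every split non-empty.
        for thr in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, thr), score
    return best


def build_tree(rows, labels):
    """Leaf = class label; internal node = (feature, threshold, left, right)."""
    if len(set(labels)) == 1:          # all records share one category
        return labels[0]
    split = best_split(rows, labels)
    if split is None:                  # stopping condition: no useful split
        return Counter(labels).most_common(1)[0][0]
    f, thr = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= thr]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > thr]
    return (f, thr,
            build_tree([r for r, _ in left], [y for _, y in left]),
            build_tree([r for r, _ in right], [y for _, y in right]))


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, thr, lo, hi = tree
        tree = lo if row[f] <= thr else hi
    return tree


# Toy training set T: two selected features per record, two heritage levels.
rows = [[0.1, 5], [0.2, 7], [0.8, 6], [0.9, 4]]
labels = ["national", "national", "provincial", "provincial"]
tree = build_tree(rows, labels)
```

On this toy data the first feature separates the two levels with a single threshold, so the tree has one internal node and two leaves.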
In this embodiment, AUC, ACC, F1, and feature subset size are used to evaluate the classification results. The AUC value is the area enclosed by the ROC curve and the coordinate axes, and it directly reflects the classification performance of the classifier. The closer the AUC value is to 1, the better the classification performance; an AUC value less than or equal to 0.5 indicates poor classification ability. ACC is the classification accuracy, i.e., the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of Precision and Recall. The value range of F1 is [0,1], where 1 represents the best model output and 0 the worst. The feature subset size is the number of features retained after feature selection; the smaller the feature subset, the better.
In this example, the present invention was compared with three existing feature selection methods on the four evaluation indicators: SSA (sparrow search algorithm), HHO (Harris hawks optimization), and RFE (recursive feature elimination). The comparison results are shown in Fig. 2 and Fig. 3. As can be seen from Fig. 2 and Fig. 3, the present invention achieves the best results, and all four evaluation indicators are significantly improved. The method can effectively obtain a feature subset of high importance and achieve better classification results.
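The three classification metrics can be computed directly from their definitions. The sketch below is illustrative only: the function names and the sample labels and scores are hypothetical, one class is assumed to be designated as positive, and AUC is computed through its equivalent rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half).

```python
def accuracy(y_true, y_pred):
    """ACC: correctly classified samples divided by all samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1: harmonic mean of Precision and Recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Assumes at least one positive and one negative sample."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = national level, 0 = provincial level).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(accuracy(y_true, y_pred))  # 0.75
print(auc(y_true, scores))       # 0.75
```

For these sample values Precision is 1.0 and Recall is 0.5, so F1 is 2/3.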
Claims (5)
1. An intangible cultural heritage data feature selection method, characterized by comprising the following steps:
step 1, acquiring an intangible cultural heritage data set, and constructing a feature selection model for the intangible cultural heritage data set based on a firefly algorithm;
step 2, calculating the fitness $Fit_{NGRE}$ of the individuals in the firefly population using the neighborhood granularity rough entropy and the attribute set importance;
step 3, moving the firefly individuals with low fitness toward the firefly individuals with high fitness, updating the positions of the fireflies, and recalculating the fitness of the firefly individuals;
step 4, determining whether the current iteration has reached the maximum iteration number $T_{max}$; if not, returning to step 3; otherwise, outputting the optimal feature subset of the intangible cultural heritage data set corresponding to the globally optimal firefly individual.
2. The intangible cultural heritage data feature selection method according to claim 1, wherein step 1 is specifically as follows: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired intangible cultural heritage data set; wherein the number of feature subsets of the intangible cultural heritage data set, i.e., the number of fireflies N, is 50, and the maximum iteration number $T_{max}$ is 30; a firefly population $FAG = \{S_1, S_2, \ldots, S_N\}$ of size N is randomly initialized, the initial position of each firefly being $S_i = \{S_{i1}, S_{i2}, \ldots, S_{id}\}$, $1 \le i \le N$, where d is the number of features; the initial attraction $\beta_0$, the light absorption coefficient $\gamma$ of the propagation medium, the step-size disturbance factor $\alpha$, and the maximum iteration number $T_{max}$ are set; before calculating the fitness of each firefly individual, i.e., each feature subset, each individual is encoded with a sigmoid function, defined as follows, to convert its values into 0/1 form:
3. The intangible cultural heritage data feature selection method according to claim 2, wherein the neighborhood granularity rough entropy in step 2 is calculated as follows:
$$NGRE(S) = NGK(D \mid S) \times NE_r(D \mid S) \tag{2}$$
in formula (2), $NGK(D \mid S)$ and $NE_r(D \mid S)$ are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, calculated as follows:
in formula (3) and formula (4), $\delta_S(x_i)$ is the neighborhood class of the samples under feature subset S, $|\delta_{S \cup D}(x_i)|$ is the neighborhood class of the samples under feature subset S and decision attribute D, and U is the sample space;
calculating the fitness of the individuals in the firefly population using the neighborhood granularity rough entropy and the attribute set importance, with the calculation formula as follows:
in formula (5), $\lambda_1$ and $\lambda_2$ adjust the degrees of influence of the neighborhood granularity rough entropy and the attribute set importance, with $\lambda_1 + \lambda_2 = 1$; for any firefly, i.e., feature subset $S \in FAG$, $|S|$ is the number of features in the feature subset S, and N is the total number of features; $NGRE(S)$ is the neighborhood granularity rough entropy.
4. The intangible cultural heritage data feature selection method according to claim 3, wherein step 3 comprises comparing the fitness of the firefly individuals, moving the firefly individuals with lower fitness toward the firefly individuals with higher fitness, calculating the mutual attraction between each firefly individual and the other firefly individuals according to their spatial distance, and then updating the positions of the fireflies and recalculating the fitness of the firefly individuals; specifically comprising the following steps:
step 3.1, comparing the fitness of each firefly individual in turn with the fitness of the other firefly individuals, determining which firefly individuals in the population attract each firefly individual according to the principle that firefly individuals with low fitness are attracted by firefly individuals with high fitness, and calculating the mutual attraction between each firefly individual and the other fireflies according to their spatial distance, with the attraction calculation formula as follows:
in formula (6), $\beta_0$ is the attraction at $r = 0$, $\gamma$ is the light absorption coefficient, and $r_{ij}$ is the distance between firefly individuals $x_i$ and $x_j$;
step 3.2, for any two fireflies $S_i, S_j \in FAG$, if the fitness of $S_j$ is higher than that of $S_i$, firefly $S_i$ moves toward the position of $S_j$, and the position update formula of the firefly individual is as follows:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,\bigl(S_{jd}(t) - S_{id}(t)\bigr) + \alpha\,(\mathrm{rand} - 1/2) \tag{7}$$
in formula (7), $d$ is the spatial dimension of a firefly individual, i.e., the feature dimension, $\alpha \in [0,1]$ is the step-size factor, $\beta(r_{ij})$ is the attraction between fireflies $x_i$ and $x_j$, $(\mathrm{rand} - 1/2)$ is a random number in the interval $[-0.5, 0.5]$, and $t$ is the iteration number;
step 3.3, updating the fitness of firefly individual $S_i$ using formula (5), ranking all fireflies, and finding the firefly individual with the best fitness in the current iteration.
5. The intangible cultural heritage data feature selection method according to claim 4, wherein step 4 further comprises: dividing the output optimal feature subset R of the intangible cultural heritage data set into a training set T and a test set V at a ratio of 7:3, and classifying the divided feature subsets with a CART decision tree model; during classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets; the Gini index of each feature A in the training set T is calculated as follows:
in formula (8), $|T|$ is the number of intangible cultural heritage data records in the training set T, $|C_k|$ is the number of records of the k-th category in the training set T, and K is the number of intangible cultural heritage levels; supposing the value of feature A divides the training set T into two parts $T_1$ and $T_2$, $|T_1|$ and $|T_2|$ are the numbers of intangible cultural heritage data records contained in each part;
for each subset of the partition, if the intangible cultural heritage data in the subset all belong to the same category, marking the subset as that category; otherwise, returning to the step of calculating the Gini index of the features and applying the above steps recursively to the subset; this process is repeated until the stopping condition is satisfied.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310636101.7A CN116662859B (en) | 2023-05-31 | 2023-05-31 | Non-cultural-heritage data feature selection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310636101.7A CN116662859B (en) | 2023-05-31 | 2023-05-31 | Non-cultural-heritage data feature selection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116662859A | 2023-08-29 |
CN116662859B | 2024-04-19 |
Family
ID=87720173
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310636101.7A Active CN116662859B (en) | 2023-05-31 | 2023-05-31 | Non-cultural-heritage data feature selection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116662859B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104076512A (en) * | 2013-03-25 | 2014-10-01 | 精工爱普生株式会社 | Head-mounted display device and method of controlling head-mounted display device |
CN105493173A (en) * | 2013-06-28 | 2016-04-13 | 诺基亚技术有限公司 | Supporting activation of function of device |
CN105824937A (en) * | 2016-03-17 | 2016-08-03 | 合肥工业大学 | Attribute selection method based on binary system firefly algorithm |
CN106779063A (en) * | 2016-11-15 | 2017-05-31 | 河南理工大学 | A kind of hoist braking system method for diagnosing faults based on RBF networks |
CN107230213A (en) * | 2017-05-15 | 2017-10-03 | 昆明理工大学 | A kind of colored mine belt zoning map of multi thresholds shaking table based on improvement glowworm swarm algorithm is as split plot design |
US20170364933A1 (en) * | 2014-12-09 | 2017-12-21 | Beijing Didi Infinity Technology And Development Co., Ltd. | User maintenance system and method |
CN108417171A (en) * | 2017-02-10 | 2018-08-17 | 宏碁股份有限公司 | Display device and its display parameters method of adjustment |
CN110162841A (en) * | 2019-04-26 | 2019-08-23 | 南京航空航天大学 | A kind of Milling Process multi-objective method introducing three-dimensional stability constraint |
CN110537165A (en) * | 2017-10-26 | 2019-12-03 | 华为技术有限公司 | A kind of display methods and device |
CN110867172A (en) * | 2019-11-19 | 2020-03-06 | 苹果公司 | Electronic device for dynamically controlling standard dynamic range and high dynamic range content |
Non-Patent Citations (1)
Title |
---|
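Peng Peng et al.: "Attribute reduction method based on an improved binary glowworm swarm optimization algorithm and neighborhood rough sets", Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》), vol. 33, no. 2, pages 95-105 * |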
Also Published As
Publication number | Publication date |
---|---|
CN116662859B (en) | 2024-04-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112784881B (en) | Network abnormal flow detection method, model and system | |
CN107622182B (en) | Method and system for predicting local structural features of protein | |
CN109344698B (en) | Hyperspectral band selection method based on separable convolution and hard threshold function | |
CN112101430B (en) | Anchor frame generation method for image target detection processing and lightweight target detection method | |
CN112687349A (en) | Construction method of model for reducing octane number loss | |
CN110674865B (en) | Rule learning classifier integration method oriented to software defect class distribution unbalance | |
CN111601358B (en) | Multi-stage hierarchical clustering spatial correlation temperature perception data redundancy removing method | |
CN114580281A (en) | Model quantization method, apparatus, device, storage medium, and program product | |
CN111309577B (en) | Spark-oriented batch application execution time prediction model construction method | |
CN113724195B (en) | Quantitative analysis model and establishment method of protein based on immunofluorescence image | |
CN118151020A (en) | Method and system for detecting safety performance of battery | |
CN114663770A (en) | Hyperspectral image classification method and system based on integrated clustering waveband selection | |
CN112651424A (en) | GIS insulation defect identification method and system based on LLE dimension reduction and chaos algorithm optimization | |
CN116662859B (en) | Non-cultural-heritage data feature selection method | |
CN111832645A (en) | Classification data feature selection method based on discrete crow difference collaborative search algorithm | |
CN113177078B (en) | Approximate query processing algorithm based on condition generation model | |
CN113033768A (en) | Missing feature re-representation method and system based on graph convolution network | |
CN114117876A (en) | Feature selection method based on improved Harris eagle algorithm | |
CN112287437A (en) | Multimodal extreme value solving method applied to vehicle load analysis | |
CN112308151A (en) | Weighting-based classification method for hyperspectral images of rotating forest | |
CN111461199A (en) | Security attribute selection method based on distributed junk mail classified data | |
CN110782950A (en) | Tumor key gene identification method based on preference grid and Levy flight multi-target particle swarm algorithm | |
CN110766087A (en) | Method for improving data clustering quality of k-means based on dispersion maximization method | |
CN115017125B (en) | Data processing method and device for improving KNN method | |
CN117648623B (en) | Network classification algorithm based on pooling comparison learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20241011. Address after: Room 1404, Building B, Northwest Guojin Center, Fengcheng Eighth Road, Xi'an Economic and Technological Development Zone, Shaanxi Province 710018. Patentee after: Xi'an Qingtian Zhanchuang Network Technology Co.,Ltd. Country or region after: China. Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 19. Patentee before: XI'AN POLYTECHNIC University. Country or region before: China |