CN116662859B - Intangible cultural heritage data feature selection method - Google Patents
- Publication number
- CN116662859B (CN202310636101.7A)
- Authority
- CN
- China
- Prior art keywords
- firefly
- fitness
- feature
- individuals
- genetic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
The invention discloses a feature selection method for intangible cultural heritage (ICH) data, comprising the following steps: acquiring an ICH data set and constructing a feature selection model for it based on the firefly algorithm; calculating the fitness of individuals in the firefly population from the neighborhood granularity rough entropy and the attribute set importance; moving firefly individuals with lower fitness toward firefly individuals with higher fitness, updating the firefly positions, and recalculating the fitness of the firefly individuals; and outputting the optimal feature subset of the ICH data set corresponding to the globally optimal firefly individual. Compared with the original data, ICH data that has undergone feature selection has lower dimensionality. When the processed data is classified, it carries less redundancy while preserving the completeness of the data's information, which improves the classification of ICH levels, reduces data redundancy, and saves resources.
Description
Technical Field
The invention belongs to the technical field of data mining methods, and particularly relates to a feature selection method for intangible cultural heritage (ICH) data.
Background
In recent years, intangible cultural heritage (ICH) has received increasing attention from the state and society. With the rapid development of information technology, the digitization of ICH has accelerated and ICH information resources continue to emerge. Classifying and analyzing ICH levels can give the relevant departments a sounder basis for assigning levels to ICH items in the future, so that ICH is protected more effectively. However, existing ICH data is high-dimensional, which greatly increases the cost of classifying and analyzing it. In addition, existing ICH data carries a degree of uncertainty, and unimportant features not only add redundancy but also keep ICH level prediction from reaching the desired accuracy. Therefore, to analyze ICH data more effectively and reduce the cost of data processing, feature selection must be applied to reduce the data dimension and eliminate unimportant features.
The firefly algorithm, a swarm intelligence algorithm inspired by the flashing behavior of fireflies, was proposed by Xin-She Yang in 2008. Compared with other intelligent algorithms it performs well, but the fitness function of the standard firefly algorithm generally cannot guarantee that the selected feature subset loses little information, and the algorithm suffers from low search precision and slow convergence during optimization. The neighborhood rough set, by introducing neighborhood granulation and a metric space, converts the equivalence relation of rough set theory into a covering relation of information granules in the neighborhood space, and can effectively measure the uncertainty of data. It is therefore necessary to combine the neighborhood rough set with the firefly algorithm and to improve the algorithm's fitness function construction and search update strategy, so that it can process high-dimensional, complex ICH data.
Disclosure of Invention
The invention aims to provide a feature selection method for intangible cultural heritage (ICH) data that can delete irrelevant or low-importance attributes without affecting the final classification result of the ICH data.
The technical scheme adopted by the invention is as follows: the ICH data feature selection method comprises the following steps:
Step 1, acquiring an ICH data set and constructing a feature selection model for the ICH data set based on the firefly algorithm;
Step 2, calculating the fitness Fit_NGRE of individuals in the firefly population from the neighborhood granularity rough entropy and the attribute set importance;
Step 3, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, updating the firefly positions, and recalculating the fitness of the firefly individuals;
Step 4, judging whether the current iteration has reached the maximum iteration number T_max; if not, returning to step 3; otherwise, outputting the optimal feature subset of the ICH data set corresponding to the globally optimal firefly individual.
The present invention is further characterized as follows.
Step 1 specifically comprises: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired intangible cultural heritage (ICH) data set. The number N of fireflies, i.e. candidate feature subsets of the ICH data set, is 50, and the maximum iteration number T_max is 30. A firefly population FAG = {S_1, S_2, ..., S_N} of size N is randomly initialized, where each firefly has an initial position S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, and d denotes the number of features. The initial attraction β_0, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum iteration number T_max are set. Before the fitness of each firefly individual, i.e. each feature subset, is calculated, each individual is encoded with the sigmoid function so that its values take the binary form {0, 1}; the sigmoid function is defined as:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
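The encoding step can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only says that the sigmoid function converts each position value into 0 or 1, so the thresholding rule used here (select feature d when the sigmoid value exceeds a uniform random draw) is an assumption, as are the names `binarize`, `position` and `mask` and the illustrative dimension d = 8.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Standard logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize(position, rng):
    """Map a real-valued firefly position to a 0/1 feature mask.

    Assumption: feature d is selected when sigmoid(S_id) exceeds a uniform
    random draw; the source text does not state the thresholding rule.
    """
    return [1 if sigmoid(s) > rng.random() else 0 for s in position]


rng = random.Random(0)
# One firefly with d = 8 real-valued coordinates (illustrative values).
position = [rng.uniform(-4.0, 4.0) for _ in range(8)]
mask = binarize(position, rng)  # 1 = feature kept, 0 = feature dropped
```

A deterministic threshold (for example, selecting the feature when the sigmoid value is above 0.5) would be an equally plausible reading of the text.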
In step 2, the neighborhood granularity rough entropy is calculated as:
$$\mathrm{NGRE}(S) = \mathrm{NGK}(D \mid S) \times \mathrm{NE}_r(D \mid S) \tag{2}$$
In formula (2), NGK(D|S) and NE_r(D|S) are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, calculated by formulas (3) and (4):
[Formulas (3) and (4): equation images not present in this text.]
In formulas (3) and (4), δ_S(x_i) is the neighborhood class of sample x_i under the feature subset S, δ_{S∪D}(x_i) is the neighborhood class of sample x_i under the feature subset S together with the decision attribute D, |·| denotes the number of samples in a class, and U is the sample space;
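The neighborhood classes that formulas (3) and (4) are built from can be sketched as below. This is a sketch under stated assumptions rather than the patent's definition: Euclidean distance, the neighborhood radius `delta`, and the rule that δ_{S∪D}(x_i) keeps only those neighbors sharing the decision value of x_i follow common practice in neighborhood rough sets and are not given in this text. The function names and the sample data are hypothetical.

```python
import math


def neighborhood(samples, i, features, delta):
    """Indices j whose distance to sample i, measured on `features` only, is <= delta.

    Assumptions: Euclidean distance and a fixed radius `delta`.
    """
    xi = samples[i]
    members = []
    for j, xj in enumerate(samples):
        dist = math.sqrt(sum((xi[f] - xj[f]) ** 2 for f in features))
        if dist <= delta:
            members.append(j)
    return members


def neighborhood_with_decision(samples, labels, i, features, delta):
    """Neighbors of sample i under S that also share its decision value (assumed rule)."""
    return [j for j in neighborhood(samples, i, features, delta)
            if labels[j] == labels[i]]


# Four samples with two features each; the labels stand for two ICH levels.
samples = [[0.10, 0.90], [0.15, 0.20], [0.80, 0.85], [0.12, 0.50]]
labels = [1, 1, 0, 0]
n_s = neighborhood(samples, 0, features=[0], delta=0.1)
n_sd = neighborhood_with_decision(samples, labels, 0, features=[0], delta=0.1)
```

The cardinalities `len(n_s)` and `len(n_sd)` correspond to |δ_S(x_i)| and |δ_{S∪D}(x_i)|, the quantities the text says enter formulas (3) and (4).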
The fitness of individuals in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance by formula (5):
[Formula (5): equation image not present in this text.]
In formula (5), λ_1 and λ_2 adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with λ_1 + λ_2 = 1; for any firefly, i.e. feature subset S ∈ FAG, |S| is the number of features in S and N is the total number of features; NGRE(S) is the neighborhood granularity rough entropy.
Step 3, comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward those with higher fitness, calculating the attraction between each firefly individual and the others from their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals; specifically:
Step 3.1, comparing the fitness of each firefly individual with that of every other individual in turn; following the principle that individuals with low fitness are attracted by individuals with high fitness, determining which individuals in the population attract each firefly; and calculating the attraction between each firefly and the others from their spatial distance:
$$\beta(r_{ij}) = \beta_0 e^{-\gamma r_{ij}^2} \tag{6}$$
In formula (6), β_0 is the attraction at r = 0, γ is the light absorption coefficient, and r_ij is the distance between firefly individuals x_i and x_j;
Step 3.2, for any two fireflies S_i, S_j ∈ FAG, if the fitness of S_j is higher than that of S_i, firefly S_i moves toward the position of S_j, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,\bigl(S_{jd}(t) - S_{id}(t)\bigr) + \alpha\,(\mathrm{rand} - 1/2) \tag{7}$$
In formula (7), d indexes the spatial dimension of the firefly individual, i.e. the feature dimension; α ∈ [0,1] is the step factor; β(r_ij) is the attraction between fireflies x_i and x_j; (rand - 1/2) is a random number in the interval [-0.5, 0.5]; and t is the iteration number;
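A single move of step 3.2 can be sketched as follows. The position update is formula (7) as written; the attraction uses the standard firefly form β_0·exp(-γ·r²), which matches the symbol definitions given for formula (6) but is assumed here, as is the use of Euclidean distance for r_ij. The function names are hypothetical.

```python
import math
import random


def attractiveness(beta0, gamma, r):
    """Assumed standard firefly attraction: beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * r * r)


def move_towards(s_i, s_j, beta0, gamma, alpha, rng):
    """Apply formula (7) to every dimension d of firefly S_i, moving it toward S_j."""
    r_ij = math.dist(s_i, s_j)  # Euclidean distance between positions (assumption)
    beta = attractiveness(beta0, gamma, r_ij)
    return [
        s_id + beta * (s_jd - s_id) + alpha * (rng.random() - 0.5)
        for s_id, s_jd in zip(s_i, s_j)
    ]


rng = random.Random(1)
s_i = [0.0, 0.0, 0.0]
s_j = [1.0, -1.0, 0.5]  # the brighter (higher-fitness) firefly
s_i_next = move_towards(s_i, s_j, beta0=1.0, gamma=1.0, alpha=0.2, rng=rng)
```

With α = 0 and γ = 0 the attraction equals β_0, so β_0 = 1 moves S_i exactly onto S_j; larger γ weakens the pull of distant fireflies, and α controls the size of the random perturbation.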
Step 3.3, updating the fitness of firefly individual S_i by formula (5), ranking all fireflies, and finding the firefly individual with the best fitness in the current iteration.
Step 4 further comprises dividing the output optimal feature subset R of the intangible cultural heritage (ICH) data set into a training set T and a test set V in a 7:3 ratio and classifying the divided data with a CART decision tree model. During classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets. The Gini index of each feature A in the training set T is calculated as:
$$\mathrm{Gini}(T) = 1 - \sum_{k=1}^{K} \left(\frac{|C_k|}{|T|}\right)^{2}, \qquad \mathrm{Gini}(T, A) = \frac{|T_1|}{|T|}\,\mathrm{Gini}(T_1) + \frac{|T_2|}{|T|}\,\mathrm{Gini}(T_2) \tag{8}$$
In formula (8), |T| is the number of ICH data records in the training set T, |C_k| is the number of ICH data records of the k-th category in T, and K is the number of ICH levels; assuming that the value of feature A divides the training set T into two parts T_1 and T_2, |T_1| and |T_2| are the numbers of ICH data records contained in each part;
For each resulting subset, if all ICH data in the subset belong to the same category, the subset is marked as that category; otherwise, the procedure returns to the Gini index calculation and the above steps are applied recursively to each subset; this process is repeated until the stop condition is satisfied.
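The Gini index used to choose CART split nodes can be sketched as below, using the standard CART definitions in terms of |T|, |C_k|, |T_1| and |T_2| as named in the text. The function names and the example labels ("national", "provincial") are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini(T) = 1 - sum over categories k of (|C_k| / |T|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(labels_1, labels_2):
    """Gini(T, A) = |T1|/|T| * Gini(T1) + |T2|/|T| * Gini(T2) for a binary split."""
    n = len(labels_1) + len(labels_2)
    return (len(labels_1) / n) * gini(labels_1) + (len(labels_2) / n) * gini(labels_2)


# A perfectly separating split has index 0; a useless split keeps the parent's impurity.
pure = gini_split(["national", "national"], ["provincial", "provincial"])
mixed = gini_split(["national", "provincial"], ["national", "provincial"])
```

CART picks, at each node, the feature and split value with the smallest `gini_split`, which is why a subset whose records all share one category (index 0) becomes a leaf.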
The beneficial effects of the invention are as follows: with the intangible cultural heritage (ICH) data feature selection method, ICH data that has undergone feature selection has lower dimensionality than the original data. When the processed ICH data is classified, it carries less redundancy while preserving the completeness of the data's information, which improves the classification of ICH levels, reduces data redundancy, and saves resources.
Drawings
FIG. 1 is a flow chart of the intangible cultural heritage data feature selection method of the present invention;
FIG. 2 shows the comparison between the method of the present invention and the three comparison methods of Example 3 under the AUC, ACC and F1 evaluation criteria;
FIG. 3 shows the comparison between the method of the present invention and the three comparison methods of Example 3 under the feature subset size evaluation metric.
Detailed Description
The invention will be described in detail with reference to the accompanying drawings and detailed description.
Example 1
As shown in fig. 1, the method comprises the following steps:
Step 1, acquiring an intangible cultural heritage (ICH) data set, constructing a feature selection model for the ICH data set based on the firefly algorithm, and initializing parameters such as the firefly population size (i.e. the number of feature subsets), the light absorption coefficient and the maximum iteration number according to the ICH data set. This step is implemented as follows:
The parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired ICH data set. The number N of fireflies (i.e. feature subsets) is 50 and the maximum iteration number T_max is 30. A firefly population FAG = {S_1, S_2, ..., S_N} is randomly initialized, where each firefly has an initial position S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, and d denotes the number of features. The initial attraction β_0, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α and the maximum iteration number T_max are set. Before the fitness of each firefly individual (i.e. each feature subset) is calculated, each individual is encoded with the sigmoid function so that its values take the binary form {0, 1}. The sigmoid function is defined as:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
Step 2, calculating the fitness of individuals in the firefly population from the neighborhood granularity rough entropy and the attribute set importance. This step is implemented as follows:
The neighborhood granularity rough entropy is calculated as:
$$\mathrm{NGRE}(S) = \mathrm{NGK}(D \mid S) \times \mathrm{NE}_r(D \mid S) \tag{2}$$
Here NGK(D|S) and NE_r(D|S) are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, calculated by formulas (3) and (4):
[Formulas (3) and (4): equation images not present in this text.]
Here δ_S(x_i) is the neighborhood class of sample x_i under the attribute subset S, δ_{S∪D}(x_i) is the neighborhood class of sample x_i under the attribute subset S together with the decision attribute D, and U is the sample space.
The fitness of individuals in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance by formula (5):
[Formula (5): equation image not present in this text.]
Here λ_1 and λ_2 adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with λ_1 + λ_2 = 1, and Fit_NGRE is the fitness of individuals in the firefly population. For any firefly (i.e. feature subset) S ∈ FAG, |S| is the number of features in S and N is the total number of features. NGRE(S) is the neighborhood granularity rough entropy.
Step 3, comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward those with higher fitness, calculating the attraction between each firefly individual and the others from their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals. This step is implemented as follows:
The fitness of each firefly individual is compared with that of every other individual in turn. Following the principle that individuals with low fitness are attracted by individuals with high fitness, the individuals that attract each firefly are determined, and the attraction between each firefly and the others is calculated from their spatial distance:
$$\beta(r_{ij}) = \beta_0 e^{-\gamma r_{ij}^2} \tag{6}$$
Here β_0 is the attraction at r = 0, γ is the light absorption coefficient, and r_ij is the distance between firefly individuals x_i and x_j.
For any two fireflies S_i, S_j ∈ FAG, if the fitness of S_j is higher than that of S_i, firefly S_i moves toward the position of S_j, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,\bigl(S_{jd}(t) - S_{id}(t)\bigr) + \alpha\,(\mathrm{rand} - 1/2) \tag{7}$$
Here d indexes the spatial dimension (i.e. the feature dimension) of the firefly individual, α ∈ [0,1] is the step factor, β(r_ij) is the attraction between fireflies x_i and x_j, (rand - 1/2) is a random number in the interval [-0.5, 0.5], and t is the iteration number.
The fitness of firefly individual S_i is updated by formula (5), all fireflies are ranked, and the firefly individual with the best fitness in the current iteration is found.
Step 4, judging whether the current iteration has reached the maximum iteration number T_max (30 in the invention); if not, returning to step 3; otherwise, outputting the feature subset corresponding to the globally optimal firefly individual, which is the optimal feature subset of the intangible cultural heritage data set obtained by the firefly algorithm.
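Steps 1 to 4 can be tied together in a short loop such as the one below. It is a sketch under several assumptions, not the patented procedure: the fitness is passed in as a callable because formula (5) is not reproduced in this text, `toy_fitness` is a made-up stand-in, binarization uses a fixed 0.5 threshold on the sigmoid value, the attraction uses the standard form β_0·exp(-γ·r²), and the default values of β_0, γ and α are illustrative.

```python
import math
import random


def to_mask(position):
    """Binarize a position; assumption: select feature d when sigmoid(S_id) > 0.5."""
    return [1 if 1.0 / (1.0 + math.exp(-s)) > 0.5 else 0 for s in position]


def run_firefly(fitness, n_features, n_fireflies=50, t_max=30,
                beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Return (best feature mask, best fitness) after t_max iterations."""
    rng = random.Random(seed)
    # Step 1: random initial positions, one firefly per candidate feature subset.
    swarm = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
             for _ in range(n_fireflies)]
    # Step 2: initial fitness of every firefly.
    fits = [fitness(to_mask(p)) for p in swarm]
    for _ in range(t_max):  # Step 4: stop after T_max iterations
        # Step 3: each firefly moves toward every brighter firefly.
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if fits[j] > fits[i]:
                    r = math.dist(swarm[i], swarm[j])
                    beta = beta0 * math.exp(-gamma * r * r)
                    swarm[i] = [a + beta * (b - a) + alpha * (rng.random() - 0.5)
                                for a, b in zip(swarm[i], swarm[j])]
                    fits[i] = fitness(to_mask(swarm[i]))
    # The brightest firefly is never moved, so the swarm maximum never decreases.
    best = max(range(n_fireflies), key=lambda k: fits[k])
    return to_mask(swarm[best]), fits[best]


def toy_fitness(mask):
    """Hypothetical stand-in: reward features 0 and 1, penalize subset size."""
    return mask[0] + mask[1] - 0.1 * sum(mask)


best_mask, best_fit = run_firefly(toy_fitness, n_features=6)
```

In a real run, `toy_fitness` would be replaced by a function that evaluates Fit_NGRE on the ICH data set for the features selected by `mask`.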
Example 2
To verify the effectiveness of the intangible cultural heritage (ICH) data feature selection method, the invention applies the CART decision tree algorithm to classify the processed ICH data set and evaluates the classification results. This is implemented as follows:
Whether the optimization result meets the end condition (the maximum iteration number is reached) is judged. If not, the procedure returns to step 3 for the next round of optimization; if so, the optimal feature subset obtained in step 3 is used for feature selection on the ICH data set. The ICH data set R after feature selection is divided into a training set T and a test set V in a 7:3 ratio, and the divided data set is classified and analyzed with a CART decision tree model. During classification, the initial root node is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets. The Gini index of each feature A in the training set T is calculated as:
$$\mathrm{Gini}(T) = 1 - \sum_{k=1}^{K} \left(\frac{|C_k|}{|T|}\right)^{2}, \qquad \mathrm{Gini}(T, A) = \frac{|T_1|}{|T|}\,\mathrm{Gini}(T_1) + \frac{|T_2|}{|T|}\,\mathrm{Gini}(T_2) \tag{8}$$
Here |T| is the number of ICH data records in the training set T, |C_k| is the number of ICH data records of the k-th category (i.e. national level or provincial level) in T, and K is the number of ICH levels, which is 2 in the invention. Assuming that the value of feature A divides the training set T into two parts T_1 and T_2, |T_1| and |T_2| are the numbers of ICH data records contained in each part.
For each resulting subset, if all ICH data in the subset belong to the same category (for example, national level), the subset is marked as that category; otherwise, the procedure returns to the Gini index calculation and the above steps are applied recursively to each subset. This process is repeated until the stop condition is satisfied. The constructed CART decision tree model can then classify the test set V, assigning the ICH data in the test set to the predefined categories.
The classification results are evaluated with AUC, accuracy (hereinafter ACC), F1-score (hereinafter F1), and feature subset size. The AUC value is the area enclosed by the ROC curve and the coordinate axes, which directly reflects the classification performance of the classifier. The closer the AUC value is to 1, the better the classification performance; an AUC value of 0.5 or below indicates classification no better than random guessing. ACC is the classification accuracy, i.e. the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of precision and recall, with value range [0, 1], where 1 represents the best model output and 0 the worst.
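The three classification metrics can be computed directly from their definitions, as in the sketch below. The function names and example labels are illustrative. AUC is computed with the pairwise-ranking formulation (the fraction of positive/negative pairs ranked correctly, ties counted as one half), which equals the area under the ROC curve; it needs classifier scores rather than hard predictions.

```python
def accuracy(y_true, y_pred):
    """ACC: correctly classified samples divided by all samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1: harmonic mean of precision and recall for the `positive` class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 1]          # 1 = national level, 0 = provincial level (illustrative)
y_pred = [1, 0, 0, 1, 1]
acc_value = accuracy(y_true, y_pred)
f1_value = f1(y_true, y_pred)
auc_value = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Note that `auc` assumes both classes are present in `y_true`; with only one class the metric is undefined.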
In this way, the method of the invention performs feature selection on the acquired intangible cultural heritage (ICH) data A and produces feature subsets expressed as vectors [x_1, x_2, ..., x_n], where n is the number of features in the data set and x_i = 0 or 1 indicates whether the i-th feature is selected, so that key features are retained and redundant features are rejected. The invention can generate a group of feature subsets, from which a decision maker can choose according to decision requirements; new ICH data B is then generated from the chosen feature subset and the ICH data A. Data B has a lower dimension than data A. When the ICH data is classified, data B has lower dimensionality yet maintains good classification performance, which saves computing resources.
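Generating data B from data A and a chosen 0/1 vector amounts to dropping the unselected columns, as this small sketch shows. The function name `apply_mask` and the sample values are illustrative assumptions.

```python
def apply_mask(data_a, mask):
    """Keep only the columns of data A whose mask entry x_i equals 1."""
    return [[value for value, keep in zip(row, mask) if keep == 1]
            for row in data_a]


# Two records with n = 4 features; the chosen subset keeps features 1 and 4.
data_a = [[5.1, 0.2, 3.3, 7.0],
          [4.8, 0.1, 3.9, 6.5]]
mask = [1, 0, 0, 1]
data_b = apply_mask(data_a, mask)
```

Data B keeps every record of data A but only `sum(mask)` of its n features, which is the dimensionality reduction the paragraph above describes.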
Example 3
As shown in fig. 1, the method is specifically implemented according to the following steps:
Step 1, acquiring an intangible cultural heritage (ICH) data set, constructing a feature selection model for the ICH data set based on the firefly algorithm, and initializing parameters such as the firefly population size (i.e. the number of feature subsets), the light absorption coefficient and the maximum iteration number according to the ICH data set. Specifically: the parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired ICH data set. The number N of fireflies (i.e. feature subsets) is 50 and the maximum iteration number T_max is 30. A firefly population FAG = {S_1, S_2, ..., S_N} is randomly initialized, where each firefly has an initial position S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, and d denotes the number of features. The initial attraction β_0, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α and the maximum iteration number T_max are set. Before the fitness of each firefly individual (i.e. each feature subset) is calculated, each individual is encoded with the sigmoid function so that its values take the binary form {0, 1}. The sigmoid function is defined as:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
Step 2, calculating the fitness of individuals in the firefly population from the neighborhood granularity rough entropy and the attribute set importance. Specifically, the neighborhood granularity rough entropy is calculated as:
$$\mathrm{NGRE}(S) = \mathrm{NGK}(D \mid S) \times \mathrm{NE}_r(D \mid S) \tag{2}$$
Here NGK(D|S) and NE_r(D|S) are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, calculated by formulas (3) and (4):
[Formulas (3) and (4): equation images not present in this text.]
Here δ_S(x_i) is the neighborhood class of sample x_i under the feature subset S, δ_{S∪D}(x_i) is the neighborhood class of sample x_i under the feature subset S together with the decision attribute D, and U is the sample space.
The fitness of individuals in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance by formula (5):
[Formula (5): equation image not present in this text.]
Here λ_1 and λ_2 adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with λ_1 + λ_2 = 1, and Fit_NGRE is the fitness of individuals in the firefly population. For any firefly (i.e. feature subset) S ∈ FAG, |S| is the number of features in S and N is the total number of features. NGRE(S) is the neighborhood granularity rough entropy.
Step 3, comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward those with higher fitness, calculating the attraction between each firefly individual and the others from their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals. Specifically: the fitness of each firefly individual is compared with that of every other individual in turn. Following the principle that individuals with low fitness are attracted by individuals with high fitness, the individuals that attract each firefly are determined, and the attraction between each firefly and the others is calculated from their spatial distance:
$$\beta(r_{ij}) = \beta_0 e^{-\gamma r_{ij}^2} \tag{6}$$
Here β_0 is the attraction at r = 0, γ is the light absorption coefficient, and r_ij is the distance between firefly individuals x_i and x_j.
For any two fireflies S_i, S_j ∈ FAG, if the fitness of S_j is higher than that of S_i, firefly S_i moves toward the position of S_j, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,\bigl(S_{jd}(t) - S_{id}(t)\bigr) + \alpha\,(\mathrm{rand} - 1/2) \tag{7}$$
Here d indexes the spatial dimension (i.e. the feature dimension) of the firefly individual, α ∈ [0,1] is the step factor, β(r_ij) is the attraction between fireflies x_i and x_j, (rand - 1/2) is a random number in the interval [-0.5, 0.5], and t is the iteration number.
The fitness of firefly individual S_i is updated by formula (5), all fireflies are ranked, and the firefly individual with the best fitness in the current iteration is found.
And 4, judging whether the current iteration reaches the maximum iteration number T max (the maximum iteration number T max is 30 in the invention), if not, returning to the step 3, otherwise, outputting a feature subset corresponding to the globally optimal firefly individual, and finally obtaining the optimal feature subset of the non-genetic culture data set based on the firefly algorithm. In order to verify the effectiveness of the non-genetic cultural data feature selection method, the CART decision tree model is utilized to perform classification operation on the processed non-genetic cultural data set, and the classification result is evaluated. The method is implemented according to the following steps:
Determine whether the optimization result meets the termination condition (the maximum number of iterations has been reached). If not, return to step 3 for the next round of optimization; if so, apply the optimal feature subset obtained in step 3 to the feature selection of the intangible cultural heritage data set. The data set $R$ obtained after feature selection is divided into a training set $T$ and a test set $V$ at a ratio of 7:3, and the divided data set is classified and analyzed with a CART decision tree model. During classification, the initial root node is selected by calculating the Gini index of each feature in the training set $T$, and $T$ is divided into several subsets. The Gini index of each feature $A$ in the training set $T$ is calculated as follows:
$$\mathrm{Gini}(T) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|T|} \right)^2, \qquad \mathrm{Gini}(T, A) = \frac{|T_1|}{|T|}\,\mathrm{Gini}(T_1) + \frac{|T_2|}{|T|}\,\mathrm{Gini}(T_2) \tag{8}$$
where $|T|$ is the number of intangible cultural heritage records in the training set $T$, $|C_k|$ is the number of records of the $k$-th category (national level or provincial level) in $T$, and $K$ is the number of heritage levels, which is 2 in the present invention. Assuming that the value of feature $A$ divides the training set $T$ into two parts $T_1$ and $T_2$, $|T_1|$ and $|T_2|$ are the numbers of records contained in each part.
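The Gini index computation described above can be illustrated with a short Python sketch. It assumes the standard CART definition, $\mathrm{Gini}(T) = 1 - \sum_k (|C_k|/|T|)^2$, with a size-weighted sum over the two parts of a split; the function names `gini` and `gini_split` and the label strings are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini(T) = 1 - sum_k (|C_k| / |T|)^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(labels_1, labels_2):
    """Weighted Gini index after a feature divides T into T_1 and T_2."""
    n = len(labels_1) + len(labels_2)
    return (len(labels_1) / n) * gini(labels_1) + (len(labels_2) / n) * gini(labels_2)


# Two national-level and two provincial-level records: maximally mixed for K = 2.
mixed = gini(["national", "national", "provincial", "provincial"])
# A split that separates the two levels perfectly has a weighted Gini index of 0.
pure = gini_split(["national", "national"], ["provincial", "provincial"])
```

The feature (and split value) with the smallest weighted Gini index is the one chosen for the node.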
For each subset produced by the partition, if all intangible cultural heritage records in the subset belong to the same category (e.g., national level), the subset is marked as that category; otherwise, return to the step of calculating the Gini index of the features and apply the above steps recursively to the subset. This process is repeated until the stop condition is satisfied. The constructed CART decision tree model can then classify the test set $V$, assigning the intangible cultural heritage data in the test set to the predefined categories.
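The recursive construction described above can be sketched as follows. This is a simplified illustration rather than the invention's implementation: it assumes numeric features split by a threshold, uses "subset is pure", "no split lowers the Gini index", or "maximum depth reached" as the stop condition, and all names (`best_split`, `build_tree`, `predict`, `max_depth`) are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for thr in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= thr]
            right = [y for row, y in zip(rows, labels) if row[f] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, thr), score
    return best


def build_tree(rows, labels, depth=0, max_depth=5):
    """Split recursively until a subset is pure or a stop condition is met."""
    split = None
    if len(set(labels)) > 1 and depth < max_depth:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority category
    f, thr = split
    left = [i for i, row in enumerate(rows) if row[f] <= thr]
    right = [i for i, row in enumerate(rows) if row[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's category."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree
```

For example, `build_tree([[0.1], [0.2], [0.8], [0.9]], ["provincial", "provincial", "national", "national"])` produces a single split on feature 0, after which both subsets are pure and become leaves.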
In this embodiment, AUC, ACC, F1, and feature subset size are used as evaluation indicators. The AUC value is the area enclosed by the ROC curve and the coordinate axes, and reflects the classification performance of the classifier: the closer the AUC value is to 1, the better the classification performance, while an AUC value less than or equal to 0.5 indicates poor classification ability. ACC is the classification accuracy, i.e., the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of precision and recall; its value range is [0, 1], where 1 represents the best model output and 0 the worst. The feature subset size is the number of features in the subset obtained after feature selection; the smaller it is, the better.
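The three classification indicators can be computed as in the following sketch. It assumes binary labels with one class treated as positive, and it computes AUC through the equivalent rank formulation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half). The function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """ACC: correctly classified samples divided by all samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Here `y_pred` holds hard class predictions, whereas `scores` holds the classifier's continuous confidence for the positive class, which is what the ROC curve is built from.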
In this embodiment, the present invention is compared with three existing feature selection methods on the four evaluation indicators. The comparison methods are SSA (sparrow search algorithm), HHO (Harris hawks optimization), and RFE (recursive feature elimination), and the comparison results are shown in Figs. 2 and 3. As Figs. 2 and 3 show, the present invention performs best, with clear improvement on all four evaluation indicators. The present invention can therefore effectively obtain feature subsets of high importance and achieve better classification results.
Claims (2)
1. An intangible cultural heritage data feature selection method, characterized by comprising the following steps:
Step 1, acquiring an intangible cultural heritage data set, and constructing a firefly-algorithm-based feature selection model for the data set; specifically: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired data set; the number of feature subsets of the data set, namely the number of fireflies $N$, is 50, and the maximum iteration number $T_{max}$ is 30; randomly initializing a firefly population $FAG = \{S_1, S_2, \ldots, S_N\}$ of size $N$, wherein the initial position of each firefly is $S_i = \{S_{i1}, S_{i2}, \ldots, S_{id}\}$, $1 \le i \le N$, and $d$ represents the number of features; setting the initial attraction $\beta_0$, the light absorption coefficient $\gamma$ of the propagation medium, the step-size disturbance factor $\alpha$, and the maximum iteration number $T_{max}$; before calculating the fitness of each firefly individual, i.e., each feature subset, encoding each individual with a sigmoid function, defined as follows, to convert its values into the binary form {0, 1}:
Step 2, calculating the fitness $Fit_{NGRE}$ of the individuals in the firefly population by using the neighborhood granularity rough entropy and the attribute set importance;
the neighborhood granularity rough entropy is calculated as follows:
$$\mathrm{NGRE}(S) = \mathrm{NGK}(D \mid S) \times \mathrm{NE}_r(D \mid S) \tag{2}$$
In formula (2), $\mathrm{NGK}(D \mid S)$ and $\mathrm{NE}_r(D \mid S)$ are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset $S$ relative to the decision attribute $D$, calculated as follows:
In formulas (3) and (4), $\delta_S(x_i)$ is the neighborhood class of sample $x_i$ under the feature subset $S$, $|\delta_{S \cup D}(x_i)|$ is the cardinality of the neighborhood class of sample $x_i$ under the feature subset $S$ and the decision attribute $D$, and $U$ is the sample space;
the fitness of the individuals in the firefly population is calculated by using the neighborhood granularity rough entropy and the attribute set importance, according to the following formula:
In formula (5), $\lambda_1$ and $\lambda_2$ adjust the degree of influence of the neighborhood granularity rough entropy and the attribute set importance, with $\lambda_1 + \lambda_2 = 1$; for any firefly, i.e., any feature subset $S \in FAG$, $|S|$ is the number of features in $S$, and $N$ is the total number of features; $\mathrm{NGRE}(S)$ is the neighborhood granularity rough entropy;
Step 3, comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, calculating the mutual attraction between each firefly individual and the other firefly individuals according to their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals; specifically comprising the following steps:
Step 3.1, comparing the fitness of each firefly individual with that of the other firefly individuals in turn, determining which firefly individuals in the population attract each firefly individual according to the principle that firefly individuals with low fitness are attracted by firefly individuals with high fitness, and calculating the mutual attraction between each firefly individual and the other fireflies according to their spatial distance, wherein the attraction is calculated as follows:
$$\beta(r_{ij}) = \beta_0 e^{-\gamma r_{ij}^2} \tag{6}$$
In formula (6), $\beta_0$ is the attraction at $r = 0$, $\gamma$ is the light absorption coefficient, and $r_{ij}$ is the distance between firefly individuals $x_i$ and $x_j$;
Step 3.2, for any two fireflies $S_i, S_j \in FAG$, if the fitness of $S_j$ is higher than that of $S_i$, moving firefly $S_i$ toward the position of $S_j$, wherein the position of the firefly individual is updated as follows:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,\bigl(S_{jd}(t) - S_{id}(t)\bigr) + \alpha\,(\mathrm{rand} - 1/2) \tag{7}$$
In formula (7), $d$ is the spatial dimension of the firefly individual, i.e., the feature dimension, $\alpha \in [0, 1]$ is the step-size factor, $\beta(r_{ij})$ is the attraction between fireflies $x_i$ and $x_j$, $(\mathrm{rand} - 1/2)$ is a random number in the interval $[-0.5, 0.5]$, and $t$ is the iteration number;
Step 3.3, updating the fitness of firefly individual $S_i$ by using formula (5), sorting all fireflies by fitness, and identifying the firefly individual with the best fitness in the current iteration;
Step 4, determining whether the current iteration has reached the maximum iteration number $T_{max}$; if not, returning to step 3; otherwise, outputting the optimal feature subset of the intangible cultural heritage data set corresponding to the globally optimal firefly individual.
2. The intangible cultural heritage data feature selection method according to claim 1, wherein step 4 further comprises dividing the output optimal feature subset $R$ of the intangible cultural heritage data set into a training set $T$ and a test set $V$ at a ratio of 7:3, and classifying the divided feature subsets by using a CART decision tree model; during classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set $T$, and the training set $T$ is divided into a plurality of subsets; the Gini index of each feature $A$ in the training set $T$ is calculated as follows:
$$\mathrm{Gini}(T) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|T|} \right)^2, \qquad \mathrm{Gini}(T, A) = \frac{|T_1|}{|T|}\,\mathrm{Gini}(T_1) + \frac{|T_2|}{|T|}\,\mathrm{Gini}(T_2) \tag{8}$$
In formula (8), $|T|$ is the number of intangible cultural heritage records in the training set $T$, $|C_k|$ is the number of records of the $k$-th category in the training set $T$, and $K$ is the number of heritage levels; assuming that the value of feature $A$ divides the training set $T$ into two parts $T_1$ and $T_2$, $|T_1|$ and $|T_2|$ are the numbers of records contained in each part;
for each divided subset, if all records in the subset belong to the same category, marking the subset as that category; otherwise, returning to the step of calculating the Gini index of the features and applying the above steps recursively to the subset; this process is repeated until the stop condition is satisfied.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310636101.7A CN116662859B (en) | 2023-05-31 | 2023-05-31 | Non-cultural-heritage data feature selection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116662859A (en) | 2023-08-29 |
CN116662859B (en) | 2024-04-19 |
Family
ID=87720173
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202310636101.7A | Active | CN116662859B (en) | 2023-05-31 | 2023-05-31 | Non-cultural-heritage data feature selection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116662859B (en) |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20241011; Address after: Room 1404, Building B, Northwest Guojin Center, Fengcheng Eighth Road, Xi'an Economic and Technological Development Zone, Shaanxi Province 710018; Patentee after: Xi'an Qingtian Zhanchuang Network Technology Co.,Ltd.; Country or region after: China; Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 19; Patentee before: XI'AN POLYTECHNIC University; Country or region before: China |