CN116662859B

CN116662859B - Intangible cultural heritage data feature selection method

Info

Publication number: CN116662859B
Application number: CN202310636101.7A
Authority: CN
Inventors: 赵雪青; 杨晗; 师昕; 刘浩; 吴祯鴻
Original assignee: Xian Polytechnic University
Current assignee: Xi'an Qingtian Zhanchuang Network Technology Co.,Ltd.
Priority date: 2023-05-31
Filing date: 2023-05-31
Publication date: 2024-04-19
Anticipated expiration: 2043-05-31
Also published as: CN116662859A

Abstract

The invention discloses a feature selection method for intangible cultural heritage (ICH) data, comprising the following steps: acquiring an ICH data set and constructing a feature selection model for it based on the firefly algorithm; calculating the fitness of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance; moving firefly individuals with lower fitness towards firefly individuals with higher fitness, updating their positions, and recalculating their fitness; and outputting the optimal feature subset of the ICH data set corresponding to the globally optimal firefly individual. Compared with the original data, ICH data that has undergone feature selection has lower dimensionality; when it is classified, it carries less redundancy while retaining good information completeness, which improves ICH level classification, reduces data redundancy, and saves resources.

Description

Intangible cultural heritage data feature selection method

Technical Field

The invention belongs to the technical field of data mining, and in particular relates to a feature selection method for intangible cultural heritage data.

Background

In recent years, intangible cultural heritage (ICH) has received growing attention from the state and society. With the rapid development of information technology, the digitization of ICH has accelerated and a wide variety of ICH information resources continue to emerge. Classifying and analyzing ICH levels can give the relevant departments a more reasonable basis for grading future ICH items, so that ICH is protected more effectively. However, existing ICH data is high-dimensional, which greatly increases the cost of classifying and analyzing it. In addition, existing ICH data carries a degree of uncertainty, and unimportant features not only add redundancy but also prevent ICH level prediction from reaching the desired performance. Therefore, to analyze ICH data more effectively and reduce the cost of data processing, it is necessary to perform feature selection on the data to reduce its dimensionality and eliminate unimportant features.

The firefly algorithm is a swarm intelligence algorithm inspired by the flashing behavior of fireflies and was proposed by Xin-She Yang in 2008. It performs well compared with other intelligent algorithms, but the fitness function of the standard firefly algorithm generally cannot guarantee that the selected feature subset incurs only a small information loss, and the algorithm suffers from low search precision and slow convergence during optimization. The neighborhood rough set introduces neighborhood granulation and a metric space, converting the equivalence relation of classical rough set theory into a covering relation of information granules in the neighborhood space, and can effectively measure the uncertainty of data. It is therefore necessary to combine the neighborhood rough set with the firefly algorithm, and to improve the firefly algorithm's fitness function construction and search update strategy, in order to process high-dimensional, complex intangible cultural heritage data.

Disclosure of Invention

The invention aims to provide a feature selection method for intangible cultural heritage (ICH) data that can delete irrelevant or low-importance attributes without affecting the final classification result of the ICH data.

The technical scheme adopted by the invention is a feature selection method for ICH data comprising the following steps:

Step 1, acquiring an ICH data set, and constructing a feature selection model for the ICH data set based on the firefly algorithm;

Step 2, calculating the fitness Fit_NGRE of the individuals in the firefly population from the neighborhood granularity rough entropy and the attribute set importance;

Step 3, moving firefly individuals with lower fitness towards firefly individuals with higher fitness, updating the firefly positions, and recalculating the fitness of the firefly individuals;

Step 4, judging whether the current iteration has reached the maximum number of iterations T_max; if not, returning to step 3; otherwise, outputting the optimal feature subset of the ICH data set corresponding to the globally optimal firefly individual.

The invention is further characterized as follows.

Step 1 specifically comprises: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired intangible cultural heritage (ICH) data set. The number of fireflies N, i.e. of candidate feature subsets of the ICH data set, is 50, and the maximum number of iterations T_max is 30. A firefly population FAG = {S₁, S₂, ..., S_N} of size N is randomly initialized, where each firefly has an initial position S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, and d is the number of features. The initial attractiveness β₀, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum number of iterations T_max are set. Before the fitness of each firefly individual, i.e. each feature subset, is calculated, each individual is encoded with a sigmoid function that converts its values to 0/1 form; the sigmoid function is defined as follows:
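A plausible reading of formula (1), assuming the standard logistic sigmoid is meant (the text does not spell it out), maps each position component S_id into the interval (0, 1). How that value is then thresholded to 0 or 1 is not stated here.

```latex
\operatorname{sigmoid}(S_{id}) = \frac{1}{1 + e^{-S_{id}}} \qquad (1)
```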

In step 2, the neighborhood granularity rough entropy is calculated as follows:

NGRE(S) = NGK(D|S) × NE_r(D|S) (2)

In formula (2), NGK(D|S) and NE_r(D|S) are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, calculated as follows:

In formulas (3) and (4), δ_S(x_i) is the neighborhood class of sample x_i under the feature subset S, δ_S∪D(x_i) is the neighborhood class of sample x_i under the feature subset S together with the decision attribute D, and U is the sample space;

The fitness of the individuals in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance as follows:

In formula (5), λ₁ and λ₂ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with λ₁ + λ₂ = 1; for any firefly, i.e. feature subset, S ∈ FAG, |S| is the number of features in S, and N is the total number of features; NGRE(S) is the neighborhood granularity rough entropy.

Step 3 compares the fitness of the firefly individuals, moves individuals with lower fitness towards individuals with higher fitness, calculates the mutual attraction between each firefly individual and the others according to their spatial distance, updates the firefly positions, and recalculates the fitness of the firefly individuals. It specifically comprises:

Step 3.1, comparing the fitness of each firefly individual with that of every other firefly individual in turn; determining, on the principle that individuals with lower fitness are attracted by individuals with higher fitness, which individuals in the population attract each firefly; and calculating the mutual attraction between each firefly individual and the others according to their spatial distance, using the following attraction formula:

In formula (6), β₀ is the attractiveness at r = 0, γ is the light absorption coefficient, and r_ij is the distance between firefly individuals x_i and x_j;
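A plausible reading of formula (6), assuming the standard attractiveness function of Yang's firefly algorithm (the text does not spell it out), is given below. Taking r_ij as the Euclidean distance between the two positions is likewise an assumption.

```latex
\beta(r_{ij}) = \beta_0 \, e^{-\gamma r_{ij}^{2}} \qquad (6)
\qquad\text{with}\qquad
r_{ij} = \lVert x_i - x_j \rVert_2 = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^{2}}
```

This form is consistent with the statement that β₀ is the attractiveness at r = 0, since the exponential equals 1 there.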

Step 3.2, for any two fireflies S_i, S_j ∈ FAG, if the fitness of S_j is higher than that of S_i, firefly S_i is moved towards the position of S_j, and the position of the firefly individual is updated as follows:

S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) - S_id(t)) + α(rand - 1/2) (7)

In formula (7), d indexes the spatial dimension of the firefly individual, i.e. the feature dimension, α ∈ [0, 1] is the step factor, β(r_ij) is the attraction between fireflies x_i and x_j, (rand - 1/2) is a random number in the interval [-0.5, 0.5], and t is the iteration number;
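The position update of formula (7) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the exponential form of `attractiveness`, the Euclidean distance, and the default values of `beta0`, `gamma`, and `alpha` are assumptions, and the function names are hypothetical.

```python
import math
import random

def attractiveness(beta0, gamma, r):
    """Assumed standard form beta0 * exp(-gamma * r^2); not confirmed by the text."""
    return beta0 * math.exp(-gamma * r * r)

def move_towards(s_i, s_j, beta0=1.0, gamma=1.0, alpha=0.2, rng=random):
    """Apply formula (7) to every dimension d of firefly s_i, moving it towards s_j."""
    r_ij = math.dist(s_i, s_j)  # Euclidean distance between positions (assumption)
    beta = attractiveness(beta0, gamma, r_ij)
    return [
        x_i + beta * (x_j - x_i) + alpha * (rng.random() - 0.5)
        for x_i, x_j in zip(s_i, s_j)
    ]
```

With `gamma=0` and `alpha=0` the attraction is constant and the noise term vanishes, so the firefly lands exactly on `s_j`; larger `gamma` weakens the pull of distant fireflies.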

Step 3.3, updating the fitness of firefly individual S_i using formula (5), sorting all fireflies, and finding the firefly individual with the best fitness in the current iteration.

Step 4 further comprises dividing the output optimal feature subset R of the intangible cultural heritage (ICH) data set into a training set T and a test set V at a ratio of 7:3, and classifying the divided data with a CART decision tree model. During classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets. The Gini index of each feature A in the training set T is calculated as follows:

In formula (8), |T| is the number of ICH data items in the training set T, |C_k| is the number of ICH data items of the k-th category in T, and K is the number of ICH levels; assuming that the value of feature A divides the training set T into two parts T₁ and T₂, |T₁| and |T₂| are the numbers of ICH data items contained in each part;

For each resulting subset, if all ICH data in the subset belong to the same category, the subset is marked as that category; otherwise, the procedure returns to the Gini index calculation and the above steps are applied recursively to each subset. This process is repeated until the stop condition is satisfied.
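Assuming the index in formula (8) is the standard CART Gini index (the formula itself is not spelled out in the text), the quantities |T|, |C_k|, |T₁|, and |T₂| described above combine as in the following sketch. The function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k (|C_k| / |T|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels_t1, labels_t2):
    """Weighted Gini index of a binary split of T into T1 and T2."""
    n = len(labels_t1) + len(labels_t2)
    return (len(labels_t1) / n) * gini(labels_t1) + (len(labels_t2) / n) * gini(labels_t2)
```

A split that separates the categories perfectly scores 0, and CART prefers the feature and split point with the lowest weighted value.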

The beneficial effects of the invention are as follows: compared with the original data, intangible cultural heritage (ICH) data processed by the feature selection method of the invention has lower dimensionality; when the processed data is classified, it carries less redundancy while retaining good information completeness, which improves ICH level classification, reduces data redundancy, and saves resources.

Drawings

FIG. 1 is a flow chart of the intangible cultural heritage data feature selection method of the present invention;

FIG. 2 shows the results of Example 3 comparing the method of the present invention with three comparison methods on the AUC, ACC, and F1 evaluation criteria;

FIG. 3 shows the results of Example 3 comparing the method of the present invention with three comparison methods on the feature subset size evaluation criterion.

Detailed Description

The invention will be described in detail with reference to the accompanying drawings and detailed description.

Example 1

As shown in fig. 1, the method comprises the following steps:

Step 1, acquiring an intangible cultural heritage (ICH) data set, constructing a feature selection model for the ICH data set based on the firefly algorithm, and initializing parameters such as the firefly population size (i.e. the number of candidate feature subsets), the light absorption coefficient, and the maximum number of iterations according to the ICH data set. This step is implemented as follows:

The parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired ICH data set. The number of fireflies (i.e. feature subsets) N is 50 and the maximum number of iterations T_max is 30. A firefly population FAG = {S₁, S₂, ..., S_N} is randomly initialized, where each firefly has an initial position S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, and d is the number of features. The initial attractiveness β₀, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum number of iterations T_max are set. Before the fitness of each firefly individual (i.e. each feature subset) is computed, each individual is encoded with a sigmoid function that converts its values to 0/1 form. The sigmoid function is defined as follows:
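The initialization and encoding of step 1 can be sketched as follows. This is illustrative only: the logistic form of `sigmoid`, the 0.5 threshold, the sampling range [-1, 1], and the feature count of 8 are all assumptions that the text does not specify, and the function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Standard logistic function (assumed form of the sigmoid in the text)."""
    return 1.0 / (1.0 + math.exp(-x))

def init_population(n_fireflies, n_features, rng):
    """Random real-valued positions S_i = (S_i1, ..., S_id), drawn here from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
            for _ in range(n_fireflies)]

def binarize(position, threshold=0.5):
    """Encode a position as a 0/1 feature mask; the threshold is an assumption."""
    return [1 if sigmoid(x) >= threshold else 0 for x in position]

rng = random.Random(42)
population = init_population(n_fireflies=50, n_features=8, rng=rng)  # N = 50 as in the text
masks = [binarize(s) for s in population]
```

Each mask marks with 1 the features that the corresponding firefly selects, which is the form the fitness calculation in step 2 operates on.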

Step 2, calculating the fitness of the individuals in the firefly population from the neighborhood granularity rough entropy and the attribute set importance. This step is implemented as follows:

The neighborhood granularity rough entropy is calculated as follows:

NGRE(S) = NGK(D|S) × NE_r(D|S) (2)

where NGK(D|S) and NE_r(D|S) are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, calculated as follows:

where δ_S(x_i) is the neighborhood class of sample x_i under the attribute subset S, δ_S∪D(x_i) is the neighborhood class of sample x_i under the attribute subset S together with the decision attribute D, and U is the sample space.

In the fitness formula (5), λ₁ and λ₂ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with λ₁ + λ₂ = 1, and Fit_NGRE is the fitness of an individual in the firefly population. For any firefly (i.e. feature subset) S ∈ FAG, |S| is the number of features in S, and N is the total number of features. NGRE(S) is the neighborhood granularity rough entropy.
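The neighborhood class δ_S(x_i) used in formulas (3) and (4) can be illustrated with the usual neighborhood rough set definition: all samples within a radius of x_i, measured on the features in S only. The sketch below assumes a Euclidean distance and a radius parameter `delta`, neither of which is stated in the text; it does not attempt to reproduce formulas (3) to (5) themselves.

```python
import math

def neighborhood_class(data, i, feature_idx, delta):
    """Indices j with distance(x_i, x_j) <= delta, measured only on the features in S."""
    xi = [data[i][k] for k in feature_idx]
    members = []
    for j, row in enumerate(data):
        xj = [row[k] for k in feature_idx]
        if math.dist(xi, xj) <= delta:
            members.append(j)
    return members

# Hypothetical toy data: three samples, two features.
data = [[0.0, 0.0], [0.1, 5.0], [1.0, 0.0]]
near_on_first = neighborhood_class(data, 0, [0], 0.2)
near_on_both = neighborhood_class(data, 0, [0, 1], 0.2)
```

Adding a feature to S can only shrink or preserve a neighborhood class, which is why finer feature subsets give finer knowledge granularity.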

Step 3, comparing the fitness of the firefly individuals, moving individuals with lower fitness towards individuals with higher fitness, calculating the mutual attraction between each firefly individual and the others according to their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals. This step is implemented as follows:

The fitness of each firefly individual is compared with that of every other firefly individual in turn. On the principle that individuals with lower fitness are attracted by individuals with higher fitness, it is determined which individuals in the population attract each firefly, and the mutual attraction between each firefly individual and the others is calculated according to their spatial distance, using the following attraction formula:

where β₀ is the attractiveness at r = 0, γ is the light absorption coefficient, and r_ij is the distance between firefly individuals x_i and x_j.

For any two fireflies S_i, S_j ∈ FAG, if the fitness of S_j is higher than that of S_i, firefly S_i is moved towards the position of S_j, and the position of the firefly individual is updated as follows:

S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) - S_id(t)) + α(rand - 1/2) (7)

where d indexes the spatial dimension (i.e. the feature dimension) of the firefly individual, α ∈ [0, 1] is the step factor, β(r_ij) is the attraction between fireflies x_i and x_j, (rand - 1/2) is a random number in the interval [-0.5, 0.5], and t is the iteration number.

The fitness of firefly individual S_i is updated using formula (5), all fireflies are sorted, and the firefly individual with the best fitness in the current iteration is found.

Step 4, judging whether the current iteration has reached the maximum number of iterations T_max (30 in the present invention); if not, returning to step 3; otherwise, outputting the feature subset corresponding to the globally optimal firefly individual, which is the optimal feature subset of the intangible cultural heritage data set obtained with the firefly algorithm.
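Steps 1 to 4 can be tied together in a short sketch. It is a simplified illustration under stated assumptions, not the patented implementation: the fitness is supplied by the caller in place of Fit_NGRE, the attraction uses the standard exponential form, positions are thresholded at 0.5 after a logistic sigmoid, and `toy_fitness` with its `target` mask is purely hypothetical.

```python
import math
import random

def to_mask(position):
    # Logistic sigmoid with a 0.5 threshold (assumption; the text only says 0/1 form).
    return [1 if 1.0 / (1.0 + math.exp(-x)) >= 0.5 else 0 for x in position]

def firefly_select(fitness, n_features, n_fireflies=50, t_max=30,
                   beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Steps 1-4 with a caller-supplied fitness(mask) -> float (higher is better)."""
    rng = random.Random(seed)
    # Step 1: random initial positions.
    pop = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_fireflies)]
    # Step 2: initial fitness of every firefly.
    fit = [fitness(to_mask(s)) for s in pop]
    best = max(range(n_fireflies), key=fit.__getitem__)
    best_mask, best_fit = to_mask(pop[best]), fit[best]
    # Step 4: stop after t_max iterations.
    for _ in range(t_max):
        # Step 3: each firefly moves towards every fitter firefly.
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if fit[j] > fit[i]:
                    r = math.dist(pop[i], pop[j])
                    beta = beta0 * math.exp(-gamma * r * r)
                    pop[i] = [a + beta * (b - a) + alpha * (rng.random() - 0.5)
                              for a, b in zip(pop[i], pop[j])]
                    fit[i] = fitness(to_mask(pop[i]))
                    if fit[i] > best_fit:
                        best_mask, best_fit = to_mask(pop[i]), fit[i]
    return best_mask, best_fit

# Toy fitness (hypothetical): fraction of mask bits that match a known target.
target = [1, 0, 1, 0, 0, 1]

def toy_fitness(mask):
    return sum(1 for m, t in zip(mask, target) if m == t) / len(target)

mask, score = firefly_select(toy_fitness, n_features=len(target), n_fireflies=10, t_max=5)
```

Passing the fitness as a callable keeps the search loop separate from the rough-entropy calculation, so the same loop can be reused once formulas (3) to (5) are implemented.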

Example 2

To verify the effectiveness of the intangible cultural heritage (ICH) data feature selection method, a CART decision tree algorithm is used to classify the processed ICH data set, and the classification results are evaluated. This is implemented as follows:

It is judged whether the optimization result meets the end condition (the maximum number of iterations has been reached). If not, the procedure returns to step 3 for the next round of optimization; if so, the optimal feature subset obtained in step 3 is used for feature selection on the ICH data set. The ICH data set R after feature selection is divided into a training set T and a test set V at a ratio of 7:3, and the divided data set is classified and analyzed with a CART decision tree model. During classification, the initial root node is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets. The Gini index of each feature A in the training set T is calculated as follows:

where |T| is the number of ICH data items in the training set T, |C_k| is the number of ICH data items of the k-th category (i.e. national level or provincial level) in T, and K is the number of ICH levels, which is 2 in the present invention. Assuming that the value of feature A divides the training set T into two parts T₁ and T₂, |T₁| and |T₂| are the numbers of ICH data items contained in each part.

For each resulting subset, if all ICH data in the subset belong to the same category (for example, national level), the subset is marked as that category; otherwise, the procedure returns to the Gini index calculation and the above steps are applied recursively to each subset. This process is repeated until the stop condition is satisfied. The constructed CART decision tree model can then classify the test set V, assigning the ICH data in the test set to the predefined categories.

The classification results are evaluated with AUC, accuracy (hereinafter ACC), F1-score (hereinafter F1), and feature subset size. AUC is the area enclosed by the ROC curve and the coordinate axes and summarizes the classification performance of the classifier: the closer the AUC is to 1, the better the classification performance, while an AUC of 0.5 or less indicates that the classifier performs no better than random guessing. ACC is the classification accuracy, i.e. the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of precision and recall, with a value range of [0, 1], where 1 represents the best model output and 0 the worst.
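The three classification metrics can be computed from their standard definitions as in the sketch below. The function names and the choice of label 1 as the positive class are assumptions for illustration; the AUC is computed through its rank interpretation, which equals the area under the ROC curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores, positive=1):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

ACC and F1 take hard class predictions, whereas AUC needs a score per sample, such as the predicted probability of the positive class.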

In this way, the feature selection method of the invention performs feature selection on the acquired intangible cultural heritage (ICH) data A and generates feature subsets, each represented as a vector [x₁, x₂, ..., x_n], where n is the full feature dimension of the data set and x_i = 0 or 1 indicates whether the i-th feature is selected, so that key features in the data are retained and redundant features are rejected. The invention can generate a group of feature subsets; a decision maker can choose among them according to decision requirements and then generate new ICH data B from the chosen feature subset and the ICH data A. The data B has lower dimensionality than the data A. When the ICH data is classified, the data B has lower dimensionality while retaining good classification performance, which saves computing resources.
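Deriving data B from data A with a selected 0/1 vector amounts to keeping the columns whose entry is 1. A minimal sketch, with hypothetical values and a hypothetical function name:

```python
def apply_mask(rows, mask):
    """Keep only the columns whose mask entry x_i is 1."""
    keep = [k for k, bit in enumerate(mask) if bit == 1]
    return [[row[k] for k in keep] for row in rows]

# Hypothetical data A with four features, and a mask selecting features 1 and 3.
data_a = [[5.1, 0.2, 7.0, 1.0],
          [4.9, 0.4, 6.5, 0.0]]
mask = [1, 0, 1, 0]
data_b = apply_mask(data_a, mask)
```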

Example 3

As shown in fig. 1, the method is specifically implemented according to the following steps:

Step 1, acquiring an intangible cultural heritage (ICH) data set, constructing a feature selection model for the ICH data set based on the firefly algorithm, and initializing parameters such as the firefly population size (i.e. the number of candidate feature subsets), the light absorption coefficient, and the maximum number of iterations according to the ICH data set. Specifically, the parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired ICH data set. The number of fireflies (i.e. feature subsets) N is 50 and the maximum number of iterations T_max is 30. A firefly population FAG = {S₁, S₂, ..., S_N} is randomly initialized, where each firefly has an initial position S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, and d is the number of features. The initial attractiveness β₀, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum number of iterations T_max are set. Before the fitness of each firefly individual (i.e. each feature subset) is computed, each individual is encoded with a sigmoid function that converts its values to 0/1 form. The sigmoid function is defined as follows:

Step 2, calculating the fitness of the individuals in the firefly population from the neighborhood granularity rough entropy and the attribute set importance. Specifically, the neighborhood granularity rough entropy is calculated as follows:

NGRE(S) = NGK(D|S) × NE_r(D|S) (2)

where NGK(D|S) and NE_r(D|S) are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, calculated as follows:

where δ_S(x_i) is the neighborhood class of sample x_i under the feature subset S, δ_S∪D(x_i) is the neighborhood class of sample x_i under the feature subset S together with the decision attribute D, and U is the sample space.

Step 3, comparing the fitness of the firefly individuals, moving individuals with lower fitness towards individuals with higher fitness, calculating the mutual attraction between each firefly individual and the others according to their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals. Specifically, the fitness of each firefly individual is compared with that of every other firefly individual in turn; on the principle that individuals with lower fitness are attracted by individuals with higher fitness, it is determined which individuals in the population attract each firefly; and the mutual attraction between each firefly individual and the others is calculated according to their spatial distance, using the following attraction formula:

S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) - S_id(t)) + α(rand - 1/2) (7)

Step 4, judging whether the current iteration has reached the maximum number of iterations T_max (30 in the present invention); if not, returning to step 3; otherwise, outputting the feature subset corresponding to the globally optimal firefly individual, which is the optimal feature subset of the intangible cultural heritage (ICH) data set obtained with the firefly algorithm. To verify the effectiveness of the ICH data feature selection method, a CART decision tree model is used to classify the processed ICH data set, and the classification results are evaluated. This is implemented as follows:

For this embodiment, AUC, ACC, F1, and feature subset size are used for evaluation. AUC is the area enclosed by the ROC curve and the coordinate axes and summarizes the classification performance of the classifier: the closer the AUC is to 1, the better the classification performance, while an AUC of 0.5 or less indicates that the classifier performs no better than random guessing. ACC is the classification accuracy, i.e. the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of precision and recall, with a value range of [0, 1], where 1 represents the best model output and 0 the worst. The feature subset size is the number of features in the subset after feature selection; the smaller the feature subset, the better.

In this example, the present invention is compared with three existing feature selection methods on the four evaluation criteria. The comparison methods are SSA (sparrow search algorithm), HHO (Harris hawks optimization), and RFE (recursive feature elimination); the results are shown in FIG. 2 and FIG. 3. As FIG. 2 and FIG. 3 show, the present invention performs best, with clear improvement on all four evaluation criteria. The method can effectively obtain a feature subset of high importance and achieves a better classification result.

Claims

1. A feature selection method for intangible cultural heritage (ICH) data, characterized by comprising the following steps:

Step 1, acquiring an ICH data set, and constructing a feature selection model for the ICH data set based on the firefly algorithm; specifically: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired ICH data set; the number of fireflies N, i.e. of candidate feature subsets of the ICH data set, is 50, and the maximum number of iterations T_max is 30; a firefly population FAG = {S₁, S₂, ..., S_N} of size N is randomly initialized, where each firefly has an initial position S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, and d is the number of features; the initial attractiveness β₀, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum number of iterations T_max are set; before the fitness of each firefly individual, i.e. each feature subset, is calculated, each individual is encoded with a sigmoid function that converts its values to 0/1 form, the sigmoid function being defined as follows:

the neighborhood granularity rough entropy is calculated as follows:

NGRE(S) = NGK(D|S) × NE_r(D|S) (2)

in formula (5), λ₁ and λ₂ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with λ₁ + λ₂ = 1; for any firefly, i.e. feature subset, S ∈ FAG, |S| is the number of features in S, and N is the total number of features; NGRE(S) is the neighborhood granularity rough entropy;

Step 3, comparing the fitness of the firefly individuals, moving individuals with lower fitness towards individuals with higher fitness, calculating the mutual attraction between each firefly individual and the others according to their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals; specifically comprising:

S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) - S_id(t)) + α(rand - 1/2) (7)

Step 3.3, updating the fitness of firefly individual S_i using formula (5), sorting all fireflies, and finding the firefly individual with the best fitness in the current iteration;

2. The feature selection method for ICH data according to claim 1, wherein step 4 further comprises dividing the output optimal feature subset R of the ICH data set into a training set T and a test set V at a ratio of 7:3, and classifying the divided data with a CART decision tree model; during classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets; the Gini index of each feature A in the training set T is calculated as follows: