CN110991518A

CN110991518A - Two-stage feature selection method and system based on evolution multitask

Info

Publication number: CN110991518A
Application number: CN201911192139.XA
Authority: CN
Inventors: 周风余; 陈科; 孙鸿昌; 尹磊; 刘进; 常致富
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2019-11-28
Filing date: 2019-11-28
Publication date: 2020-04-10
Anticipated expiration: 2039-11-28
Also published as: CN110991518B

Abstract

The disclosure provides an evolution-multitask-based two-stage feature selection method and system. The method comprises a classification task construction stage: acquiring feature data, forming corresponding feature subsets by different features, and storing the feature subsets into a feature database; calling all feature subsets in a feature database, and determining an initial feature subset to be recorded as a task 1; recording all the sorted feature subsets as tasks 2; and (3) optimal feature screening: and for the task 1 and the task 2, searching and outputting the optimal characteristic subset matched with the preset search condition by utilizing a particle group algorithm in the evolution multitask method. The method has the advantages of simple implementation, high classification precision, few adjustable parameters and the like.

Description

Two-stage feature selection method and system based on evolution multitask

Technical Field

The disclosure belongs to the technical field of artificial intelligence, and particularly relates to a two-stage feature selection method and system based on evolution multitask.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

With the rapid development of data collection and knowledge management technologies, the data volume in the fields of machine learning, data mining and the like presents an exponential growth situation. Ideally, the information provided by these data sets is useful for classification goals, but in reality, most of these data are redundant and irrelevant information for classification goals, which seriously affects the performance of the learning algorithm. Therefore, how to rapidly and efficiently mine useful data information from huge data has become a key issue to prevent the development of data application field. Feature selection is an important data preprocessing technology, which can effectively extract key information from a data set and simplify data analysis. The goal of feature selection is to select as few features as possible in order to achieve the highest classification accuracy, i.e. to screen out the best subset of features from the original dataset. At present, the feature selection technology is widely applied to numerous practical applications such as recommendation systems, text classification, pattern recognition and fault diagnosis, and performance improvement is achieved to a greater extent.

In the classification system, due to the existence of redundant and irrelevant feature information, the difficulty and the calculation cost of model training are increased, and in addition, the classifier is easy to fall into the problem of overfitting. Therefore, before the data set is used for constructing the classification model, the optimal feature subset is found out by using the feature selection technology, so that the difficulty and the construction time of classifier training in the classification system can be reduced, and the performance of the classification system can be improved to the maximum extent.

Although feature subset selection methods have been proposed in large numbers, it is still a challenge to search for a feature subset with higher classification performance in high-dimensional data. This is because there is one-way, two-way, and multi-way information interaction between features. In other words, when a feature having a high correlation is combined with other features, the feature is quite likely to become a redundant or weakly correlated feature. Therefore, the optimal feature subset selected from the original data feature set should be a set of feature information with high correlation. The feature selection method can be broadly classified into a filtering type and a wrapping type according to the evaluation manner of the selected feature subset. Filtering is the individual evaluation of features by intrinsic information in the data set, such as distance metrics, correlation metrics, consistency metrics, and information metrics. Finally, the finally selected feature subset is determined by a user-defined parameter. The wrapping type is to perform quality evaluation on the selected feature subset through a learning algorithm, such as K-nearest neighbor, support vector machine, neural network, bayesian network, and the like. Using a wrapping method generally results in a higher classification accuracy of the feature subset, but it is more time consuming. And the filtering type method is used for feature selection, the required calculation cost is low, but the precision is low.

The inventor finds that the current feature selection method has the problem that feature information is lost, so that the adjustable parameters in the later classifier training process are more and the precision is poor.

Disclosure of Invention

In order to solve the above problems, the present disclosure provides a method and a system for selecting two-stage features based on evolutionary multitask, which have the advantages of simple implementation, high classification accuracy, few adjustable parameters, and the like.

In order to achieve the purpose, the following technical scheme is adopted in the disclosure:

a first aspect of the present disclosure provides an evolutionary multitasking-based two-stage feature selection method, comprising:

a classification task construction stage: acquiring feature data, forming corresponding feature subsets by different features, and storing the feature subsets into a feature database; calling all feature subsets in a feature database, and determining an initial feature subset to be recorded as a task 1; recording all the sorted feature subsets as tasks 2;

and (3) optimal feature screening: and for the task 1 and the task 2, searching and outputting the optimal characteristic subset matched with the preset search condition by utilizing a particle group algorithm in the evolution multitask method.

As an embodiment, the two-stage feature selection method based on evolutionary multitasking specifically includes:

step 1: two related classification tasks are determined: acquiring feature data, forming corresponding feature subsets by different features, and storing the feature subsets into a feature database; calling all feature subsets in the feature database, determining an initial feature subset according to an inflection point mechanism, and recording as a task 1; sorting all the feature subsets by using a feature sorting method, and recording all the sorted feature subsets as tasks 2; the step 1 is a classification task construction stage;

step 2: initializing the population size and the maximum iteration times of the particle swarm algorithm, and randomly initializing the initial position and the speed of the particle individual; one feature for each dimension of the particle;

and step 3: the method comprises the steps of representing the particle selection probability in a particle swarm algorithm in a variable range by adopting a characteristic sorting method; limiting the characteristic search space in the task 1 and the task 2 to reduce the particle swarm algorithm search space;

and 4, step 4: calculating the fitness value of each particle in the particle swarm according to a pre-constructed feature subset quality evaluation function, and initializing the individual optimal position and the global optimal position of the task 1 and the task 2;

and 5: updating the inertia weight of the particle swarm algorithm in a linear decreasing mode;

step 6: updating the speed and the position of each particle in the particle swarm by adopting the preset random interaction probability between the task 1 and the task 2 and the updated inertia weight, and further calculating the fitness value of each particle in the updated particle swarm;

and 7: updating the individual optimal position and the global optimal position of the task 1 and the task 2, and further updating the optimal complete solution of the search problem;

and 8: an initial sub-population change mechanism, wherein if the conditions are met, the initial characteristic subset is changed; otherwise, the initial feature subset is unchanged; executing a particle swarm algorithm on the basis of the initial feature subset, judging whether the evolution of the particle swarm reaches the maximum iteration times set by the particle swarm algorithm, if so, stopping searching, and outputting an optimal complete solution as an optimal feature subset of the data set; otherwise, go to step 4.

A second aspect of the present disclosure provides an evolutionary multitasking based two-stage feature selection system comprising:

a classification task construction unit for: acquiring feature data, forming corresponding feature subsets by different features, and storing the feature subsets into a feature database; calling all feature subsets in a feature database, and determining an initial feature subset to be recorded as a task 1; recording all the sorted feature subsets as tasks 2;

an optimal feature screening unit for: and for the task 1 and the task 2, searching and outputting the optimal characteristic subset matched with the preset search condition by utilizing a particle group algorithm in the evolution multitask method.

As an embodiment, the classification task construction unit includes a task marking module, configured to determine an initial feature subset according to a knee mechanism, and mark the initial feature subset as task 1; sorting all the feature subsets by using a feature sorting method, and recording all the sorted feature subsets as tasks 2;

the optimal feature screening unit comprises: the particle swarm algorithm initialization module is used for initializing the size of a population and the maximum iteration times of the particle swarm algorithm and randomly initializing the initial position and the speed of the particle individual; each particle corresponds to a feature;

the particle swarm algorithm representation module is used for representing the particle selection probability in the particle swarm algorithm in a variable range by adopting a characteristic sorting method; limiting the characteristic search space in the task 1 and the task 2 to reduce the particle swarm algorithm search space;

the optimal position calculation module is used for calculating the fitness value of each particle in the particle swarm according to a pre-constructed feature subset quality evaluation function, and initializing the individual optimal position and the global optimal position of the task 1 and the task 2;

the inertia weight calculation module is used for updating the inertia weight of the particle swarm algorithm in a linear decreasing mode;

the particle fitness value updating module is used for updating the speed and the position of each particle in the particle swarm by adopting the preset random interaction probability between the task 1 and the task 2 and the updated inertia weight, and further calculating the fitness value of each particle in the updated particle swarm;

the optimal complete solution updating module is used for updating the individual optimal position and the global optimal position of the task 1 and the task 2 so as to update the optimal complete solution of the search problem;

the optimal characteristic subset output module is used for initiating a sub-population change mechanism, and if the optimal characteristic subset output module meets the conditions, the initial characteristic subset is changed; otherwise, the initial feature subset is unchanged; executing a particle swarm algorithm on the basis of the initial feature subset, judging whether the evolution of the particle swarm reaches the maximum iteration times set by the particle swarm algorithm, if so, stopping searching, and outputting an optimal complete solution as an optimal feature subset of the data set; otherwise, updating the individual optimal position and the global optimal position of the task 1 and the task 2, and continuously calculating the optimal complete solution.

The beneficial effects of this disclosure are:

(1) the present disclosure proposes a knee selection mechanism to solve the problem that the primary selection subset is difficult to determine, and it can adaptively select the primary selection subset for the second stage from the original feature set according to the feature quality, without losing important information in the original data.

(2) In order to better realize feature subset selection by using an evolutionary multitask technology, the disclosure also provides a variable-range particle characterization mechanism and an initial subset change strategy, and the two strategies can well solve the problems existing in the particle swarm optimization, namely the problems that the search space is too large and the local optimization is easy to fall into.

(3) Under the multitask condition, the method is likely to help solve other tasks when processing a single task, because related or complementary information exists among the tasks, and because the multi-particle swarm algorithm is an effective implementation of the evolution multitask in the evolution algorithm, the method adopts the multi-particle swarm algorithm to share the implicit knowledge migration among the tasks through selective mating operation, and finally achieves the effect of mutual promotion.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.

FIG. 1 is a flow diagram of an evolution-multitasking based two-stage feature selection process according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a knee selection mechanism of an embodiment of the present disclosure;

FIG. 3 is a variable range characterization graph based on a feature ranking algorithm according to an embodiment of the present disclosure;

FIG. 4 is a fitness value variation trend graph of the Leukemia data optimized by using the method and the original particle swarm optimization proposed by the present disclosure in the embodiment of the present disclosure;

fig. 5 is a fitness value variation trend graph of the 9Tumor data optimized by using the method and particle swarm optimization proposed by the present disclosure in the embodiment of the present disclosure;

fig. 6 is a fitness value variation trend graph of the data of the protate optimized by using the method and particle swarm optimization proposed by the present disclosure in the embodiment of the present disclosure;

FIG. 7 is a fitness value change trend graph of Brain data optimized by using the method and particle swarm optimization proposed by the present disclosure in the embodiment of the present disclosure;

FIG. 8 is a diagram illustrating a trend of a fitness value change in 11Tumor data optimized by using the method and particle swarm optimization proposed by the present disclosure in the embodiment of the present disclosure;

fig. 9 is a fitness value change trend graph of Lung data optimized by using the method and particle swarm optimization proposed by the present disclosure in the embodiment of the present disclosure.

Detailed Description

The present disclosure is further described with reference to the following drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

In the present disclosure, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only relational terms determined for convenience in describing structural relationships of the parts or elements of the present disclosure, and do not refer to any parts or elements of the present disclosure, and are not to be construed as limiting the present disclosure.

In the present disclosure, terms such as "fixedly connected", "connected", and the like are to be understood in a broad sense, and mean either a fixed connection or an integrally connected or detachable connection; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present disclosure can be determined on a case-by-case basis by persons skilled in the relevant art or technicians, and are not to be construed as limitations of the present disclosure.

In the embodiment, the filtering type feature selection method and the wrapping type feature selection method are combined to perform feature subset selection, so that the mutual advantages can be effectively utilized, and a two-stage mixing method based on evolution multitask is provided to solve the problem of optimal feature subset selection.

Particle swarm optimization has been widely demonstrated to have a high potential in the feature selection problem. This is because it has an algorithm with a strong global search capability, which can effectively handle the feature interaction problem. The particle swarm algorithm is a meta-heuristic search method proposed by Kennedy and Eberhart in 1995 through research on foraging behavior of a bird flock. Like other related population-based algorithms, the particle swarm algorithm randomly initializes the position of a population within a feasible domain and then searches for a global optimal position by adjusting the running trajectory of particles. Unlike other search algorithms, however, the particle swarm algorithm adjusts the running trajectory of the particle itself according to the current global optimal position and the local optimal position. The particle swarm algorithm has the advantages of simple implementation and high operation efficiency, is widely popularized and is applied to solving a plurality of practical optimization problems.

Feature selection is essentially a combinatorial optimization problem, i.e., the classification of data is more accurate and the number of selected features is minimal. The goal is to find an optimal feature combination from the original high-dimensional data in order to obtain the highest classification discrimination capability. The evolutionary multi-task technology is a paradigm of focusing on solving multiple related tasks by sharing experience of solving problems of each task. Under the multi-task condition, a single task is processed, and other tasks are possibly helped to be solved, because relevant or complementary information exists among the tasks, and because the multi-particle swarm algorithm is an effective implementation of the evolution multi-task in the evolution algorithm, the embodiment adopts the multi-particle swarm algorithm to share the implicit knowledge migration among the tasks through selective mating operation, and finally the effect of mutual promotion is achieved.

The two-stage feature selection method and system based on the evolution multitask can be applied to a recommendation system, text classification, mode identification and fault diagnosis.

For example: for the field of text classification, the features in the subset of features are text-related features.

For the field of fault diagnosis, the features in the feature subset are features of the relevant variables for diagnosing a fault.

By utilizing the two-stage feature selection method based on the evolutionary multitask, the optimal feature subset is screened out, and the classifier is trained and the corresponding classification result is output by utilizing the screened feature subset.

The embodiment realizes self-adaptive rapid screening of the feature subset, and improves the rapidness and accuracy of classification such as text classification, mode identification, fault diagnosis and the like.

The following is specifically described by taking text characteristics as an example:

the two-stage feature selection method based on evolution multitask in the embodiment comprises the following steps:

a classification task construction stage: acquiring text characteristic data, forming corresponding text characteristic subsets by different characteristics, and storing the text characteristic subsets into a text characteristic database; calling all feature subsets in a text feature database, and determining an initial text feature subset to be recorded as a task 1; all the sorted text feature subsets are recorded as tasks 2;

and (3) optimal feature screening: and for the task 1 and the task 2, searching and outputting the optimal text feature subset matched with the preset search condition by utilizing a particle group algorithm in the evolution multitask method.

For example: the data in the text feature subset is three-dimensional feature data, namely three-dimensional coordinates, the one-dimensional coordinates are the categories of the word sets, the one-dimensional coordinates are the times of the word sets appearing in the computer text, and the one-dimensional coordinates are the times of the categories of the word sets appearing in the categories;

specifically, a computer text is processed, the computer text is disconnected and processed into a word set, the word set is classified by taking part of speech as a characteristic to obtain a category of the word set, words of the same part of speech are classified into one category, the part of speech is classified into verbs, nouns, adjectives or adverbs, the words in the word set are mapped to points on coordinates, the abscissa on the coordinates is the number of times that the words in the word set appear in the computer text, the ordinate on the coordinates is the number of times that the words in the word set appear in the category of the words, and values of the coordinates are recorded to obtain a three-dimensional characteristic data set of the text.

Specifically, as shown in fig. 1, the method for selecting two-stage features based on evolutionary multitasking of the present embodiment includes:

step 1: two relevant classification tasks are determined. Determining an initial feature subset according to a knee point mechanism, and recording as a task 1; and sequencing all the feature subsets by using a feature sequencing method, and marking all the sequenced feature subsets as tasks 2.

In the process of feature subset selection, since the K-nearest neighbor model has simple and efficient classification characteristics, the present embodiment uses K-nearest neighbors as classifiers for evaluating the performance of the selected feature subset. And ten-fold cross validation is adopted to avoid the influence of uneven distribution on the final classification performance of the classification data set, namely, the data set is divided into 10 parts at random, 1 part is used as a test set, the rest nine parts are used as training sets, and the test is sequentially carried out, so that 10 classification accuracies can be obtained, then the 10 classification accuracies are averaged, and the average accuracy is used for guiding the selection of the feature subset.

In a specific implementation, in the step 1, the process of determining the initial feature subset according to the inflection point mechanism is as follows:

after the features are sorted in descending order according to importance, a curve about the importance of the features is obtained, and after the first and last points are connected by a straight line, the inflection point is the point farthest from the straight line. Thus points above the inflection point, i.e. points with high feature correlation, will be selected to constitute the initial feature subset.

For example: the importance of nouns, verbs, adjectives, and adverbs decreases in order.

The inflection point is a non-reference selection method, and can effectively detect the point of the data information with significant property change. FIG. 2 is a schematic diagram of the corner selection mechanism, and it can be seen from FIG. 2 that the feature weight value greater than the corner feature weight value is selected as the initially selected feature subset.

In a specific implementation, in step 1, all feature subsets are sorted by using a ReliefF algorithm in a feature sorting method.

Among them, the ReliefF algorithm is a simple and efficient feature sorting method for handling the multi-class classification problem. The method solves the problem of feature selection by calculating the correlation between categories and features and assigning different weights to the features. In the evaluation process, the more important the features are, the larger the feature weight is. The present embodiment uses the ReliefF algorithm to reorder the features. In the sorting process, the sorting is performed in a descending order according to the importance of the features and the categories.

Step 2: initializing the population size and the maximum iteration times of the particle swarm algorithm, and randomly initializing the initial position and the speed of the particle individual; one feature for each dimension of the particle.

And step 3: and (3) representing the particle selection probability in the particle swarm algorithm by adopting a variable range of a feature sorting method.

In the particle swarm algorithm, the number of dimensions of a particle is equal to the number of features, and the search range for each dimension is the same. In other words, the probability that each feature in the dataset is selected is the same. In high-dimensional data, because the number of features is large, if the search range of each feature is the same, the operation difficulty of the optimal subset search process is seriously increased. In order to solve the problem, the embodiment provides a variable range characterization strategy based on a feature sorting method, which can effectively adjust the probability of each selected feature, namely, the probability of good feature selection is high, and the probability of poor probability selection is low.

Specifically, in the step 3, the process of characterizing the probability of selecting particles in the particle swarm algorithm by using the feature sorting method in a variable range includes:

two points, namely an inflection point and a demarcation point, are obtained based on a characteristic sorting method; the demarcation point is a point with a characteristic weight value less than 0; obtaining a three-section type characteristic representation method according to the inflection point and the demarcation point, namely when the characteristic weight is greater than the inflection point weight, the search range is [0,1 ]; when the characteristic weight is smaller than the weight of the demarcation point, the search range is [0, a ]; when the search range is between two points, the search range is linearly decreased from [0,1] to [0, a ]; and a is a number between 0 and 1, so that the probability of selecting each feature is effectively adjusted, namely the good probability of selecting the features is high, and the poor probability of selecting the features is low. For example: and a is 0.7.

And limiting the feature search space in the task 1 and the task 2 by adopting variable range representation so as to achieve the aim of effectively reducing the particle swarm algorithm search space.

And 4, step 4: and calculating the fitness value of each particle in the particle swarm according to a pre-constructed feature subset quality evaluation function, and initializing the individual optimal position and the global optimal position of the task 1 and the task 2.

Specifically, in step 4, the pre-constructed feature subset quality evaluation function is:

among them, Fitness_minRepresenting a feature subset quality evaluation function; gamma ray_R(D) Representing the classification error rate of the feature subset R relative to the target data set D, | S | representing the number of selected feature subsets, | N | representing the total number of features in the data set, α and β are parameters that adjust the ratio of the classification error rate and the number of feature subsets.

In step 4, the process of initializing the individual optimal position and the global optimal position of the task 1 and the task 2 includes:

setting the current position of the particle as the individual optimal position pbest; then, considering the ring topology structure of the particle population, comparing the fitness value of each particle with two adjacent particles, keeping the position corresponding to the smaller fitness value, and taking the position of the finally won particle as the global optimal position gbest of the current particle population.

And 5: and updating the inertia weight of the particle swarm algorithm by adopting a linear decreasing mode.

Specifically, in the step 5, updating the inertia weight w of the particle swarm algorithm in a linear decreasing manner is as follows:

wherein iter and Max _ iter respectively represent the current iteration number and the maximum iteration number.

Step 6: and updating the speed and the position of each particle in the particle swarm by adopting the preset random interaction probability between the task 1 and the task 2 and the updated inertia weight, and further calculating the fitness value of each particle in the updated particle swarm.

Specifically, in step 6, the process of updating the speed of each particle in the particle swarm by using the preset random interaction probability between task 1 and task 2 and the updated inertial weight includes:

if the random value is larger than the preset random interaction probability between the task 1 and the task 2, updating the particle speed by using a formula (a); otherwise, the particle velocity is updated using equation (b):

wherein the content of the first and second substances,

representing the current time-of-day velocity of the ith particle,

representing the last moment speed of the ith particle; pbest_idRepresenting the individual optimal position of the ith particle; gbest_dAnd gbest'_dRespectively representing the global optimal positions of the current task 1 and the current task 2;

indicating the position of the ith particle; c. C₁，c₂And c₃Represents an acceleration factor, e.g., set to 1.49445; r is₁，r₂And r₃Are all within the range [0,1]A random number within; d represents the dimension of the particle group.

In addition, c is₁，c₂And c₃Other values may be set.

The formula for updating the position of each particle in the particle swarm by adopting the preset random interaction probability between the task 1 and the task 2 and the updated inertia weight is as follows:

wherein the content of the first and second substances,

indicating the position of the ith particle;

representing the speed of the ith particle at the next moment;

indicating the position of the ith particle at the next time instant.

And 7: and updating the individual optimal position and the global optimal position of the task 1 and the task 2, and further updating the optimal complete solution of the search problem.

Particle swarm optimization has the problems of premature convergence and easy falling into local optimization. In order to solve the problem, the embodiment provides an initial subgroup changing mechanism, which can effectively maintain the diversity of the particle swarm algorithm, help task 1 jump out of local optimality, and explore more possible feature combinations in the searching process.

To further illustrate the superiority of the present embodiment in processing the data feature selection problem, fig. 4-9 show the fitness value convergence curves obtained by the method and particle swarm optimization in processing six high-dimensional classification data sets of leukamia, 9Tumor, protate, Brain, 11Tumor and Lung according to the present embodiment.

FIG. 4 is a fitness convergence curve of the Leukemia classification problem using the method and particle swarm optimization of the present embodiment in the embodiment; FIG. 5 is a fitness convergence curve of the 9Tumor classification problem using the method and particle swarm optimization of the present embodiment in the embodiment; FIG. 6 shows a fitness convergence curve under the prestate classification problem using the method and particle swarm optimization of the embodiment; FIG. 7 is a fitness convergence curve under the Brain classification problem using the method and particle swarm optimization of the embodiment; FIG. 8 is a fitness convergence curve under the 11Tumor classification problem using the method and particle swarm optimization of the present embodiment in the embodiment; FIG. 9 is a curve of fitness convergence under the Lung classification problem using the method and particle swarm optimization of the present embodiment in the embodiment; in the example, 6 classical high-dimensional test data sets are given as an example, the execution times are 30 times, the population number is 100, and the performance index value obtained each time is recorded. Table 1 shows the statistical results of the optimal feature subsets obtained using all the features, the Particle Swarm Optimization (PSO), and the method of this example, where CR (%) indicates the classification accuracy of the feature subsets. As can be seen from the comparison result, the feature selection method provided by the embodiment can better select the feature subset with higher classification accuracy, and the capability of removing redundant and irrelevant features is obviously enhanced. As can be seen from analyzing fig. 4-9, the convergence rate of the method of the present embodiment is significantly improved, which also reflects the rapidity of the proposed feature selection method.

In summary, the two-stage feature selection method based on the multitasking technology provided by the embodiment can effectively solve the problem of data feature selection which generally exists in practice.

TABLE 1

The embodiment proposes a knee selection mechanism to solve the problem that the primary selection subset is difficult to determine, and can adaptively select the primary selection subset for the second stage from the original feature set according to the feature quality, without losing important information in the original data.

In order to better realize feature subset selection by using an evolutionary multitask technology, the embodiment further provides a variable-range particle characterization mechanism and an initial subset change strategy, and the two strategies can well solve the problems existing in the particle swarm optimization, namely the problems that a search space is too large and the local optimization is easy to fall into.

Under the multi-task condition, a single task is processed, and other tasks are possibly helped to be solved, because relevant or complementary information exists among the tasks, and because the multi-particle swarm algorithm is an effective implementation of the evolution multi-task in the evolution algorithm, the embodiment adopts the multi-particle swarm algorithm to share the implicit knowledge migration among the tasks through selective mating operation, and finally the effect of mutual promotion is achieved.

Example 2

The present embodiment provides an evolution-multitasking-based two-stage feature selection system, which includes:

Specifically, the classification task construction unit comprises a task marking module, which is used for determining an initial feature subset according to an inflection point mechanism and marking as a task 1; sorting all the feature subsets by using a feature sorting method, and recording all the sorted feature subsets as tasks 2;

the optimal characteristic subset output module is used for initiating a sub-population change mechanism, and if the optimal characteristic subset output module meets the conditions, the initial characteristic subset is changed; otherwise, the initial feature subset is unchanged; executing a particle swarm algorithm on the basis of the initial feature subset, judging whether the evolution of the particle swarm reaches the maximum iteration times set by the particle swarm algorithm, if so, stopping searching, and outputting an optimal complete solution as an optimal feature subset of the data set; otherwise, the particle selection probability in the particle swarm algorithm is re-characterized, and the optimal complete solution is continuously calculated.

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A two-stage feature selection method based on evolution multitasking is characterized by comprising the following steps:

2. The evolutionary multitask-based two-stage feature selection method as claimed in claim 1, characterized in that the evolutionary multitask-based two-stage feature selection method specifically comprises:

step 2: initializing the population size and the maximum iteration times of the particle swarm algorithm, and randomly initializing the initial position and the speed of the particle individual; each dimension of the particle corresponds to one characteristic;

3. The evolutionary multitask-based two-stage feature selection method according to claim 2, characterized in that in step 1, the process of determining an initial feature subset according to a knee point mechanism is:

after the features are sorted in descending order according to importance, a curve about the importance of the features is obtained, and after the first and last points are connected by a straight line, the inflection point is the point farthest from the straight line. Points above the inflection point, i.e., points with high feature correlation, are selected to form an initial feature subset;

or

In step 1, all feature subsets are sorted by using a ReliefF algorithm in the feature sorting method.

4. The evolutionary multitask-based two-stage feature selection method as claimed in claim 2, wherein in the step 3, the process of characterizing the particle selection probability in the particle swarm optimization algorithm by adopting the feature sorting method in a variable range comprises the following steps:

two points, namely an inflection point and a demarcation point, are obtained based on a characteristic sorting method; the demarcation point is a point with a characteristic weight value less than 0; obtaining a three-section type characteristic representation method according to the inflection point and the demarcation point, namely when the characteristic weight is greater than the inflection point weight, the search range is [0,1 ]; when the characteristic weight is smaller than the weight of the demarcation point, the search range is [0, a ]; when the search range is between two points, the search range is linearly decreased from [0,1] to [0, a ]; and a is a number between 0 and 1, so that the probability of each selected feature is effectively adjusted, namely the good feature selection probability is high, and the poor feature selection probability is low.

5. The evolutionary multitask-based two-stage feature selection method as claimed in claim 2, wherein in the step 4, the pre-constructed feature subset quality evaluation function is:

among them, Fitness_minRepresenting a feature subset quality evaluation function; gamma ray_R(D) Representing a subset of features R relative to target dataThe classification error rate of the set D, | S | represents the number of selected feature subsets, | N | represents the total number of features in the data set, | α and β are parameters that adjust the ratio of the classification error rate and the number of feature subsets;

or in the step 4, the process of initializing the individual optimal position and the global optimal position of the task 1 and the task 2 comprises the following steps:

6. The evolutionary multitask-based two-stage feature selection method according to claim 2, wherein in the step 5, the inertia weight w of the particle swarm algorithm is updated in a linear descending manner as follows:

7. The evolutionary multitask-based two-stage feature selection method as claimed in claim 2, wherein in the step 6, the process of updating the speed of each particle in the particle swarm by using the preset random interaction probability between task 1 and task 2 and the updated inertial weight is as follows:

wherein the content of the first and second substances,

representing the current time-of-day velocity of the ith particle,

indicating the position of the ith particle; c. C₁，c₂And c₃Represents an acceleration coefficient; r is₁，r₂And r₃Are all within the range [0,1]A random number within; d represents the dimension of the particle group.

8. The evolutionary multitask-based two-stage feature selection method according to claim 2, wherein in the step 6, the formula for updating the position of each particle in the particle swarm by using the preset random interaction probability between task 1 and task 2 and the updated inertial weight is as follows:

wherein the content of the first and second substances,

indicating the position of the ith particle;

representing the speed of the ith particle at the next moment;

indicating the position of the ith particle at the next time instant.

9. An evolutionary multitasking based two-stage feature selection system, comprising:

10. The evolutionary multitask-based two-stage feature selection system as claimed in claim 9, wherein said classification task building unit comprises a task labeling module for determining an initial feature subset according to a knee point mechanism, denoted task 1; sorting all the feature subsets by using a feature sorting method, and recording all the sorted feature subsets as tasks 2;