CN110991518B - Two-stage feature selection method and system based on evolutionary multitasking - Google Patents


Info

Publication number
CN110991518B
CN110991518B
Authority
CN
China
Prior art keywords
task
feature
particle
text feature
optimal
Prior art date
Legal status
Active
Application number
CN201911192139.XA
Other languages
Chinese (zh)
Other versions
CN110991518A (en)
Inventor
周风余
陈科
孙鸿昌
尹磊
刘进
常致富
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University
Priority to CN201911192139.XA
Publication of CN110991518A
Application granted
Publication of CN110991518B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Abstract

The present disclosure provides a two-stage feature selection method and system based on evolutionary multitasking. The method comprises a classification task construction stage: acquiring feature data, forming corresponding feature subsets from different features, and storing the feature subsets in a feature database; invoking all feature subsets in the feature database and determining an initial feature subset, which is recorded as task 1; recording all the sorted feature subsets as task 2. It further comprises an optimal feature screening stage: for task 1 and task 2, searching out and outputting an optimal feature subset that matches a preset search condition by means of the particle swarm algorithm within the evolutionary multitasking method. The method has the advantages of simple implementation, high classification accuracy, few adjustable parameters, and the like.

Description

Two-stage feature selection method and system based on evolutionary multitasking
Technical Field
The disclosure belongs to the technical field of artificial intelligence, and particularly relates to a two-stage feature selection method and system based on evolutionary multitasking.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of data collection and knowledge management technologies, the volume of data in fields such as machine learning and data mining is growing exponentially. Ideally, the information provided by these datasets is useful for classifying objects; in practice, however, most of the data is redundant or irrelevant to the classification task, which can seriously degrade the performance of a learning algorithm. How to quickly and efficiently mine useful information from huge volumes of data has therefore become a key problem restricting the development of data-driven applications. Feature selection is an important data preprocessing technique that can effectively extract key information from a dataset and simplify data analysis. Its purpose is to select as few features as possible while obtaining the highest classification accuracy, i.e. to screen the best feature subset out of the original dataset. Feature selection is now widely applied in recommendation systems, text classification, pattern recognition, fault diagnosis and many other practical applications, and has improved their performance considerably.
In a classification system, redundant and irrelevant feature information increases the difficulty and computational cost of model training, and moreover makes the classifier prone to overfitting. Therefore, finding the optimal feature subset with a feature selection technique before the dataset is used to construct the classification model can reduce the training difficulty and construction time of the classifier and improve the performance of the classification system to the greatest extent.
Although feature subset selection methods have been proposed in large numbers, searching for a feature subset with high classification performance in high-dimensional data remains a challenge. This is because features interact with one another in pairwise and higher-order ways. In other words, a feature with high individual relevance may well become redundant or only weakly relevant once it is combined with other features. The best feature subset selected from the original feature set should therefore be a group of strongly relevant, mutually complementary features. Depending on how the selected feature subset is evaluated, feature selection methods can be broadly divided into two types: filter and wrapper. Filter methods evaluate features individually using information inherent in the dataset, such as distance measures, correlation measures, consistency measures and information measures; the finally selected feature subset is determined by a user-defined parameter. Wrapper methods evaluate the quality of the selected feature subset through a learning algorithm, such as K-nearest neighbours, a support vector machine, a neural network or a Bayesian network. Wrapper methods typically yield feature subsets with higher classification accuracy but are time-consuming, whereas filter methods require little computation but achieve lower accuracy.
The inventors have found that current feature selection methods suffer from loss of feature information, so that the subsequent classifier training process has many adjustable parameters and poor accuracy.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a two-stage feature selection method and system based on evolutionary multitasking, which has the advantages of simple implementation, high classification accuracy, few adjustable parameters, and the like.
In order to achieve the above purpose, the present disclosure adopts the following technical scheme:
a first aspect of the present disclosure provides a two-stage feature selection method based on evolutionary multitasking, comprising:
classification task construction stage: acquiring feature data, forming corresponding feature subsets from different features, and storing the feature subsets in a feature database; invoking all feature subsets in the feature database and determining an initial feature subset, recorded as task 1; recording all the sorted feature subsets as task 2;
optimal feature screening stage: for task 1 and task 2, searching out and outputting an optimal feature subset that matches a preset search condition by means of the particle swarm algorithm within the evolutionary multitasking method.
As an implementation manner, the two-stage feature selection method based on evolutionary multitasking specifically includes:
Step 1: two related classification tasks are determined: acquiring feature data, forming corresponding feature subsets from different features, and storing them in a feature database; invoking all feature subsets in the feature database, determining an initial feature subset according to an inflection-point mechanism, and recording it as task 1; sorting all feature subsets with a feature ranking method and recording the sorted feature subsets as task 2; step 1 constitutes the classification task construction stage;
step 2: initializing the population size and maximum iteration number of the particle swarm algorithm, and randomly initializing the initial position and velocity of each particle; each dimension of a particle corresponds to one feature;
step 3: characterizing the probability that each particle dimension is selected with a variable range derived from the feature ranking; restricting the feature search spaces in task 1 and task 2 so as to reduce the search space of the particle swarm algorithm;
step 4: calculating the fitness value of each particle in the swarm according to a pre-constructed feature subset quality evaluation function, and initializing the individual optimal positions and the global optimal positions of task 1 and task 2;
step 5: updating the inertia weight of the particle swarm algorithm in a linearly decreasing manner;
step 6: updating the velocity and position of each particle in the swarm using the preset random interaction probability between task 1 and task 2 together with the updated inertia weight, and then recalculating the fitness value of each particle;
step 7: updating the individual optimal positions and global optimal positions of task 1 and task 2, and thereby updating the optimal complete solution of the search problem;
step 8: applying the initial sub-population change mechanism: if its condition is met, the initial feature subset is changed; otherwise it is left unchanged; on the basis of the initial feature subset, the particle swarm algorithm is executed and it is judged whether the evolution of the swarm has reached the maximum iteration number set for the algorithm; if so, the search stops and the optimal complete solution is output as the optimal feature subset of the dataset; otherwise, go to step 4.
A second aspect of the present disclosure provides a two-stage feature selection system based on evolutionary multitasking, comprising:
a classification task construction unit for: acquiring feature data, forming corresponding feature subsets from different features, and storing them in a feature database; invoking all feature subsets in the feature database and determining an initial feature subset, recorded as task 1; recording all the sorted feature subsets as task 2;
an optimal feature screening unit for: for task 1 and task 2, searching out and outputting an optimal feature subset that matches a preset search condition by means of the particle swarm algorithm within the evolutionary multitasking method.
As one embodiment, the classification task construction unit includes a task tagging module configured to determine an initial feature subset according to an inflection-point mechanism, recorded as task 1, and to sort all feature subsets with a feature ranking method, recording the sorted feature subsets as task 2.
The optimal feature screening unit comprises:
a particle swarm algorithm initialization module, used to initialize the population size and maximum iteration number of the particle swarm algorithm and to randomly initialize the initial position and velocity of each particle, each dimension of a particle corresponding to one feature;
a particle swarm algorithm characterization module, used to characterize the probability that each particle dimension is selected with a variable range derived from the feature ranking, and to restrict the feature search spaces in task 1 and task 2 so as to reduce the search space of the particle swarm algorithm;
an optimal position calculation module, used to calculate the fitness value of each particle according to a pre-constructed feature subset quality evaluation function and to initialize the individual optimal positions and global optimal positions of task 1 and task 2;
an inertia weight calculation module, used to update the inertia weight of the particle swarm algorithm in a linearly decreasing manner;
a particle fitness value updating module, used to update the velocity and position of each particle using the preset random interaction probability between task 1 and task 2 together with the updated inertia weight, and then recalculate the fitness value of each particle;
an optimal complete solution updating module, used to update the individual optimal positions and global optimal positions of task 1 and task 2 and thereby update the optimal complete solution of the search problem;
an optimal feature subset output module, used to apply the initial sub-population change mechanism: if its condition is met, the initial feature subset is changed, otherwise it is left unchanged; on the basis of the initial feature subset the particle swarm algorithm is executed and it is judged whether the evolution of the swarm has reached the maximum iteration number; if so, the search stops and the optimal complete solution is output as the optimal feature subset of the dataset; otherwise the individual optimal positions and global optimal positions of task 1 and task 2 are updated and the optimal complete solution continues to be calculated.
The beneficial effects of the present disclosure are:
(1) The present disclosure proposes an inflection-point selection mechanism to solve the difficulty of determining the preliminary feature subset. It adaptively selects, according to feature quality, a preliminary subset for the second stage from the original feature set without losing important information in the original data.
(2) To better realize feature subset selection with the evolutionary multitasking technique, the disclosure also provides a variable-range particle characterization mechanism and an initial subset change strategy, which together address two problems of the particle swarm algorithm: an overly large search space and susceptibility to local optima.
(3) Under multitasking, the processing of a single task is likely to help solve the other tasks, because related or complementary information exists among them. Since the multi-population particle swarm algorithm is an effective realization of evolutionary multitasking within evolutionary algorithms, this disclosure adopts it to share implicit knowledge across tasks through a selective mating operation, so that the tasks ultimately promote one another.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a two-stage feature selection flow diagram based on evolutionary multitasking in accordance with an embodiment of the disclosure;
FIG. 2 is a schematic diagram of an inflection point selection mechanism of an embodiment of the present disclosure;
FIG. 3 is a graph of a variable range characterization based on a feature ordering algorithm in accordance with an embodiment of the present disclosure;
fig. 4 is a graph of the fitness value variation trend when optimizing the Leukemia data using the proposed method and the original particle swarm algorithm, according to an embodiment of the present disclosure;
FIG. 5 is a graph of the fitness value variation trend when optimizing the 9Tumor data using the proposed method and the particle swarm algorithm, according to an embodiment of the present disclosure;
FIG. 6 is a graph of the fitness value variation trend when optimizing the Prostate data using the proposed method and the particle swarm algorithm, according to an embodiment of the present disclosure;
FIG. 7 is a graph of the fitness value variation trend when optimizing the Brain data using the proposed method and the particle swarm algorithm;
FIG. 8 is a graph of the fitness value variation trend when optimizing the 11Tumor data using the proposed method and the particle swarm algorithm, according to an embodiment of the present disclosure;
fig. 9 is a graph of the fitness value variation trend when optimizing the Lung data using the proposed method and the particle swarm algorithm, according to an embodiment of the present disclosure.
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments in accordance with the present disclosure. As used herein, the singular forms are intended to include the plural unless the context clearly indicates otherwise; furthermore, it is to be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
In the present disclosure, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side" and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are merely relational terms used for convenience in describing the structural relationships of the components or elements of the present disclosure, do not refer to any specific component or element, and are not to be construed as limiting the present disclosure.
In the present disclosure, terms such as "fixedly coupled" and "connected" are to be construed broadly and may refer to a fixed, integral or removable connection, either direct or indirect through an intermediate medium. The specific meaning of such terms may be determined according to circumstances by persons skilled in the art and should not be interpreted as limiting the present disclosure.
This embodiment combines the filter and wrapper feature selection methods to select the feature subset, so that their complementary advantages can be exploited effectively, and solves the optimal feature subset selection problem with a two-stage hybrid method based on evolutionary multitasking.
The particle swarm algorithm has been widely shown to hold high potential for feature selection problems, because its strong global search capability can effectively handle feature interactions. It is a meta-heuristic search method proposed by Kennedy and Eberhart in 1995 out of research on the foraging behaviour of bird flocks. Like other population-based algorithms, it randomly initializes the positions of the population within the feasible region and then searches for the globally optimal position by adjusting the trajectories of the particles. Unlike other search algorithms, however, it adjusts a particle's trajectory according to the current global optimal position and local optimal position. Because the particle swarm algorithm is simple to implement and computationally efficient, it has been widely applied to many practical optimization problems.
Feature selection is essentially a combinatorial optimization problem: classify the data with high accuracy while selecting as few features as possible. The goal is to find an optimal feature combination in the original high-dimensional data that yields the highest classification discrimination capability. Evolutionary multitasking is a paradigm that solves multiple related tasks simultaneously by sharing experience across the individual tasks. Under multitasking, the processing of a single task may help solve the other tasks because related or complementary information exists among them. Since the multi-population particle swarm algorithm is an effective realization of evolutionary multitasking within evolutionary algorithms, this embodiment adopts it to share implicit knowledge across tasks through a selective mating operation, so that the tasks ultimately promote one another.
The two-stage feature selection method and system based on evolutionary multitasking of the embodiment can be applied to recommendation systems, text classification, pattern recognition and fault diagnosis.
For example: for the text classification field, the features in the feature subset are text-related features.
For the field of fault diagnosis, features in the feature subset are features of relevant variables that diagnose the fault.
With the two-stage feature selection method based on evolutionary multitasking of this embodiment, the optimal feature subset is screened out, the classifier is trained with the screened subset, and the corresponding classification result is output.
This embodiment realizes adaptive and rapid screening of feature subsets, improving the speed and accuracy of classification in text classification, pattern recognition, fault diagnosis and the like.
The text features are described in detail below:
the two-stage feature selection method based on evolutionary multitasking of the embodiment comprises the following steps:
classification task construction stage: acquiring text feature data, forming corresponding text feature subsets from different features, and storing the text feature subsets in a text feature database; invoking all feature subsets in the text feature database and determining an initial text feature subset, recorded as task 1; recording all the sorted text feature subsets as task 2;
optimal feature screening stage: for task 1 and task 2, searching out and outputting an optimal text feature subset that matches a preset search condition by means of the particle swarm algorithm within the evolutionary multitasking method.
For example, the data in the text feature subset are three-dimensional feature data, i.e. three-dimensional coordinates: one dimension is the category of the word, one dimension is the number of times the word occurs in the computer text, and one dimension is the number of times the word occurs within its category.
Specifically, a computer text is processed and broken into a word set; the words are classified by part of speech, so that words with the same part of speech, such as verbs, nouns, adjectives or adverbs, fall into one category; each word in the word set is then mapped to a point in the coordinate system, where the abscissa is the number of times the word appears in the computer text and the ordinate is the number of times the word appears within its category; recording these coordinate values yields the three-dimensional text feature dataset.
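As a rough illustration of this construction, the following sketch maps a part-of-speech-tagged text onto the three-dimensional features described above. The function name, the tag names and the plain (word, pos) input format are illustrative assumptions, not part of the patent.

```python
# A minimal sketch of the three-dimensional text feature construction,
# assuming the text has already been tokenized and part-of-speech tagged.
from collections import Counter

def build_text_features(tagged_words):
    """tagged_words: list of (word, pos) pairs, e.g. [("robot", "noun"), ...].
    Returns {word: (category, count_in_text, count_in_category)}."""
    text_counts = Counter(word for word, _ in tagged_words)   # occurrences in the whole text
    pair_counts = Counter(tagged_words)                       # occurrences of the word within its category
    return {word: (pos, text_counts[word], pair_counts[(word, pos)])
            for word, pos in set(tagged_words)}

# Words with the same part of speech fall into one category.
sample = [("robot", "noun"), ("move", "verb"), ("robot", "noun"), ("fast", "adverb")]
print(build_text_features(sample))
```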
Specifically, as shown in fig. 1, a two-stage feature selection method based on evolutionary multitasking of the present embodiment includes:
step 1: two related classification tasks are determined: the initial feature subset is determined according to the inflection-point mechanism and recorded as task 1; all feature subsets are sorted with a feature ranking method, and the sorted feature subsets are recorded as task 2.
In the feature subset selection process, because the K-nearest-neighbour model is simple and efficient at classification, this embodiment uses K-nearest neighbours as the classifier to evaluate the performance of the selected feature subset. Ten-fold cross validation is adopted so that an uneven distribution of the classification dataset does not distort the final classification performance: the dataset is randomly divided into 10 parts, 1 part is used as the test set and the remaining nine as the training set, the test is carried out in turn to obtain 10 classification accuracies, and the average of these accuracies is used to guide the selection of the feature subset.
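The following sketch shows this wrapper evaluation under the stated scheme, assuming scikit-learn is available; K = 5 is an illustrative choice, as the patent does not fix the number of neighbours.

```python
# KNN wrapper evaluation with ten-fold cross validation (a sketch).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def mean_cv_accuracy(X, y, selected):
    """Average 10-fold CV accuracy of KNN on the feature columns in `selected`."""
    if not np.any(selected):
        return 0.0                                  # empty subsets classify nothing
    knn = KNeighborsClassifier(n_neighbors=5)       # K = 5 is an assumed setting
    return cross_val_score(knn, X[:, selected], y, cv=10).mean()
```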
In a specific implementation, in step 1, the initial feature subset is determined according to the inflection-point mechanism as follows:
after the features are arranged in descending order of importance, a curve of feature importance is obtained; a straight line is drawn between the first and the last point, and the inflection point is the point farthest from this line. The points above the inflection point, i.e. the points with high feature relevance, are selected to constitute the initial feature subset.
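A minimal numpy sketch of this knee-point rule follows; it assumes the weights are already sorted in descending order and measures the perpendicular distance of each point of the importance curve to the line through its endpoints.

```python
import numpy as np

def knee_index(weights):
    """Index of the inflection point on a descending feature-importance curve."""
    x = np.arange(len(weights), dtype=float)
    p1 = np.array([x[0], weights[0]])               # first point of the curve
    p2 = np.array([x[-1], weights[-1]])             # last point of the curve
    d = (p2 - p1) / np.linalg.norm(p2 - p1)         # unit direction of the line
    v = np.stack([x, np.asarray(weights, float)], axis=1) - p1
    dist = np.abs(v[:, 0] * d[1] - v[:, 1] * d[0])  # perpendicular distance to the line
    return int(np.argmax(dist))

# Features with indices 0 .. knee_index(weights) form the initial feature subset.
```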
For example, nouns, verbs, adjectives and adverbs may be successively less important.
The inflection point is a parameter-free selection criterion that effectively detects where the properties of the data change significantly. Fig. 2 is a schematic diagram of the inflection-point selection mechanism; as can be seen from fig. 2, the features whose weights are greater than the inflection-point weight are selected into the primary feature subset.
In a specific implementation, in step 1, the full set of feature subsets is ranked using the ReliefF algorithm as the feature ranking method.
ReliefF is a simple and efficient feature ranking method for handling multi-class classification problems. It addresses feature selection by computing the correlation between classes and features and assigning different weights to the features: the more important a feature is in the evaluation process, the greater its weight. This embodiment uses the ReliefF algorithm to reorder the features, sorting them in descending order of their importance to the categories.
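For concreteness, here is a compact ReliefF sketch in plain numpy. It assumes features are scaled to [0, 1] so the per-feature difference reduces to an absolute difference; the iteration and neighbour counts are illustrative defaults, not values from the patent.

```python
import numpy as np

def relieff_weights(X, y, n_iter=100, k=10, seed=None):
    """Simplified ReliefF: larger weight = more important feature."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        xi, ci = X[i], y[i]
        dist = np.abs(X - xi).sum(axis=1)            # Manhattan distance to all instances
        dist[i] = np.inf                             # exclude the sampled instance itself
        same = y == ci
        hits = np.argsort(dist[same])[:k]            # k nearest hits pull weights down
        w -= np.abs(X[same][hits] - xi).mean(axis=0) / n_iter
        for c in classes:                            # k nearest misses per class push up
            if c == ci:
                continue
            mask = y == c
            misses = np.argsort(dist[mask])[:k]
            scale = prior[c] / (1.0 - prior[ci])
            w += scale * np.abs(X[mask][misses] - xi).mean(axis=0) / n_iter
    return w
```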
Step 2: initializing the population size and maximum iteration number of the particle swarm algorithm, and randomly initializing the initial position and velocity of each particle; each dimension of a particle corresponds to one feature.
Step 3: a variable range derived from the feature ranking is used to characterize the probability that each particle dimension is selected.
In the particle swarm algorithm, the dimensionality of a particle equals the number of features, and the search range of every dimension is the same; in other words, every feature in the dataset has the same probability of being selected. In high-dimensional data the number of features is large, so keeping the same search range for every feature greatly increases the difficulty of searching for the optimal subset. To solve this problem, this embodiment proposes a variable-range characterization strategy based on the feature ranking, which effectively adjusts the probability of each feature being selected: good features are selected with high probability and poor features with low probability.
Specifically, in step 3, the probability of particle selection is characterized with the variable range of the feature ranking method as follows:
two points are obtained from the feature ranking: the inflection point and the demarcation point, the latter being the first point whose feature weight is less than 0. From these two points a three-segment feature characterization is obtained: when a feature's weight is greater than the inflection-point weight, its search range is [0,1]; when its weight is smaller than the demarcation-point weight, its search range is [0,a]; between the two points, the search range decreases linearly from [0,1] to [0,a]. Here a is a number between 0 and 1, for example a=0.7. In this way the probability of each feature being selected is adjusted effectively: good features are selected with high probability and poor features with low probability.
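A sketch of this three-segment range computation is given below; a = 0.7 follows the example in the text, while the function and argument names are placeholders.

```python
import numpy as np

def upper_bounds(weights_desc, knee, a=0.7):
    """Upper search bound per dimension; features sorted by descending weight."""
    n = len(weights_desc)
    neg = np.where(np.asarray(weights_desc) < 0)[0]
    boundary = neg[0] if neg.size else n             # first feature with weight < 0
    ub = np.empty(n)
    ub[:knee + 1] = 1.0                              # above the knee: range [0, 1]
    ub[boundary:] = a                                # below the boundary: range [0, a]
    mid = np.arange(knee + 1, boundary)
    if mid.size:                                     # linear decrease in between
        ub[mid] = 1.0 - (1.0 - a) * (mid - knee) / (boundary - knee)
    return ub
```

In a typical binary encoding of PSO-based feature selection, a dimension counts as selected when its position exceeds a fixed threshold, so a dimension confined to [0, a] enters the subset with correspondingly lower probability.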
Variable-range characterization is adopted to restrict the feature search spaces in task 1 and task 2, thereby effectively reducing the search space of the particle swarm algorithm.
Step 4: the fitness value of each particle in the swarm is calculated according to the pre-constructed feature subset quality evaluation function, and the individual optimal positions and global optimal positions of task 1 and task 2 are initialized.
Specifically, in step 4, the pre-constructed feature subset quality evaluation function is
Fitness_min = α·γ_R(D) + β·(S/N),
where Fitness_min denotes the feature subset quality evaluation function; γ_R(D) denotes the classification error rate of the feature subset R with respect to the target dataset D; S denotes the number of selected features; N denotes the total number of features in the dataset; and α and β are parameters that weight the classification error rate against the proportion of selected features.
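Transcribed directly, the evaluation function looks as follows; the weights alpha = 0.9 and beta = 0.1 are a common wrapper setting assumed here for illustration, since the patent leaves the values open.

```python
def fitness(error_rate, n_selected, n_total, alpha=0.9, beta=0.1):
    """Feature subset quality: smaller is better (low error, few features)."""
    return alpha * error_rate + beta * (n_selected / n_total)

# With the KNN wrapper sketched earlier:
# error_rate = 1.0 - mean_cv_accuracy(X, y, selected)
```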
In step 4, the individual optimal positions and global optimal positions of task 1 and task 2 are initialized as follows:
the current position of each particle is set as its individual optimal position pbest; then, considering the ring topology of the particle population, the fitness value of each particle is compared with those of its two neighbours, the position corresponding to the smaller fitness value is retained, and the position of the final winning particle is taken as the global optimal position gbest of the current particle population.
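One way to read this ring-topology initialisation is sketched below: each particle competes with its two ring neighbours, and the best of the local winners becomes gbest. The tie-breaking details are an assumption, as the text is terse on them.

```python
import numpy as np

def ring_gbest(positions, fits):
    """fits: fitness per particle (smaller is better); positions: (n, d) array."""
    n = len(fits)
    winners = [min(((i - 1) % n, i, (i + 1) % n), key=lambda j: fits[j])
               for i in range(n)]                    # local contest on the ring
    best = min(winners, key=lambda j: fits[j])       # best of the local winners
    return positions[best].copy()
```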
Step 5: the inertia weight of the particle swarm algorithm is updated in a linearly decreasing manner.
Specifically, in step 5, the inertia weight w of the particle swarm algorithm is updated in a linearly decreasing manner as
w = w_start - (w_start - w_end)·(iter / max_iter),
where iter and max_iter denote the current iteration number and the maximum iteration number, respectively, and w_start and w_end denote the initial and final inertia weights.
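In code this is a one-liner; the start and end values 0.9 and 0.4 are conventional PSO settings assumed for illustration.

```python
def inertia_weight(it, max_iter, w_start=0.9, w_end=0.4):
    """Linearly decreasing inertia weight."""
    return w_start - (w_start - w_end) * it / max_iter
```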
Step 6: the velocity and position of each particle in the swarm are updated using the preset random interaction probability between task 1 and task 2 together with the updated inertia weight, and the fitness value of each updated particle is then calculated.
Specifically, in step 6, the velocity of each particle in the swarm is updated with the preset random interaction probability between task 1 and task 2 and the updated inertia weight as follows:
if a random value is greater than the preset random interaction probability between task 1 and task 2, the particle velocity is updated with formula (a); otherwise it is updated with formula (b):
v_id(t+1) = w·v_id(t) + c1·r1·(pbest_id - x_id(t)) + c2·r2·(gbest_d - x_id(t))    (a)
v_id(t+1) = w·v_id(t) + c1·r1·(pbest_id - x_id(t)) + c3·r3·(gbest'_d - x_id(t))   (b)
where v_id(t+1) denotes the updated velocity of the i-th particle and v_id(t) its current velocity; pbest_id denotes the individual optimal position of the i-th particle; gbest_d and gbest'_d denote the current global optimal positions of task 1 and task 2, respectively; x_id(t) denotes the position of the i-th particle; c1, c2 and c3 denote acceleration factors, for example set to 1.49445; r1, r2 and r3 are random numbers in the range [0,1]; and d indexes the dimensions of the particle swarm.
The acceleration factors c1, c2 and c3 may also be set to other values.
The formula for updating the position of each particle in the swarm with the preset random interaction probability between task 1 and task 2 and the updated inertia weight is
x_id(t+1) = x_id(t) + v_id(t+1),
where x_id(t) denotes the current position of the i-th particle, v_id(t+1) the velocity of the i-th particle at the next moment, and x_id(t+1) the position of the i-th particle at the next moment.
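The combined update can be sketched as follows. The acceleration factor 1.49445 comes from the text; the interaction probability rmp = 0.3 and the use of scalar random factors are assumptions for illustration.

```python
import numpy as np

def update_particle(x, v, pbest, gbest_own, gbest_other, w,
                    c=(1.49445, 1.49445, 1.49445), rmp=0.3, rng=None):
    """One velocity/position update for a particle of either task."""
    rng = rng if rng is not None else np.random.default_rng()
    c1, c2, c3 = c
    r1, r2, r3 = rng.random(3)
    if rng.random() > rmp:                           # formula (a): follow own task
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest_own - x)
    else:                                            # formula (b): learn across tasks
        v = w * v + c1 * r1 * (pbest - x) + c3 * r3 * (gbest_other - x)
    return x + v, v                                  # position update: x(t+1) = x(t) + v(t+1)
```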
Step 7: the individual optimal positions and global optimal positions of task 1 and task 2 are updated, and the optimal complete solution of the search problem is updated accordingly.
Step 8: the initial sub-population change mechanism is applied: if its condition is met, the initial feature subset is changed; otherwise it is left unchanged. On the basis of the initial feature subset, the particle swarm algorithm is executed and it is judged whether the evolution of the swarm has reached the maximum iteration number set for the algorithm; if so, the search stops and the optimal complete solution is output as the optimal feature subset of the dataset; otherwise, go to step 4.
The particle swarm algorithm suffers from premature convergence and is easily trapped in local optima. To address this, this embodiment proposes an initial sub-population change mechanism, which effectively maintains the diversity of the particle swarm, helps task 1 jump out of local optima, and explores more possible feature combinations during the search.
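A heavily hedged sketch of such a mechanism follows. The trigger condition is not spelled out in the text; stagnation of task 1's best fitness for a fixed number of iterations is assumed here, and the new initial subset is re-sampled with probability proportional to feature weight. Both choices are illustrative assumptions.

```python
import numpy as np

def maybe_change_initial_subset(subset, weights, history, stall_limit=10, rng=None):
    """history: task-1 gbest fitness per iteration (smaller is better)."""
    rng = rng if rng is not None else np.random.default_rng()
    stalled = (len(history) > stall_limit and
               min(history[-stall_limit:]) >= min(history[:-stall_limit]))
    if not stalled:
        return subset                                # condition not met: keep the subset
    p = np.clip(np.asarray(weights, dtype=float), 1e-12, None)
    p /= p.sum()                                     # weight-proportional re-sampling
    return rng.choice(len(weights), size=len(subset), replace=False, p=p)
```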
To further illustrate the superiority of this embodiment in handling data feature selection problems, figs. 4-9 show the convergence curves of the fitness values obtained by the method of this embodiment and the particle swarm algorithm on six high-dimensional classification datasets: Leukemia, 9Tumor, Prostate, Brain, 11Tumor and Lung.
FIG. 4 shows the fitness convergence of the method of this embodiment and the particle swarm algorithm on the Leukemia classification problem; FIG. 5 on the 9Tumor problem; FIG. 6 on the Prostate problem; FIG. 7 on the Brain problem; FIG. 8 on the 11Tumor problem; and FIG. 9 on the Lung problem. In this example, the 6 classic high-dimensional test datasets are taken as examples; each algorithm is executed 30 times with a population size of 100, and the performance index value obtained in each run is recorded. Table 1 gives the statistics of the best feature subsets obtained using all features, particle swarm optimization (PSO) and the method of this embodiment, where CR (%) denotes the classification accuracy of the feature subset. The comparison shows that the proposed feature selection method selects feature subsets with higher classification accuracy and a markedly stronger ability to remove redundant and irrelevant features. Analysis of figs. 4-9 shows that the convergence speed of the method of this embodiment is significantly improved, which also reflects the rapidity of the proposed feature selection method.
In summary, the two-stage feature selection method based on the evolutionary multitasking technique proposed in this embodiment can effectively solve the data feature selection problems commonly encountered in practice.
TABLE 1
This embodiment provides an inflection-point selection mechanism to solve the difficulty of determining the preliminary feature subset. It adaptively selects, according to feature quality, a preliminary subset for the second stage from the original feature set without losing important information in the original data.
To better realize feature subset selection with the evolutionary multitasking technique, this embodiment also provides a variable-range particle characterization mechanism and an initial subset change strategy, which together address two problems of the particle swarm algorithm: an overly large search space and susceptibility to local optima.
Under multitasking, the processing of a single task may help solve the other tasks, because related or complementary information exists among them. Since the multi-population particle swarm algorithm is an effective realization of evolutionary multitasking within evolutionary algorithms, this embodiment adopts it to share implicit knowledge across tasks through a selective mating operation, so that the tasks ultimately promote one another.
Example 2
The present embodiment provides a two-stage feature selection system based on evolutionary multitasking, comprising:
a classification task construction unit for: acquiring feature data, forming corresponding feature subsets from different features, and storing them in a feature database; invoking all feature subsets in the feature database and determining an initial feature subset, recorded as task 1; recording all the sorted feature subsets as task 2;
an optimal feature screening unit for: for task 1 and task 2, searching out and outputting an optimal feature subset that matches a preset search condition by means of the particle swarm algorithm within the evolutionary multitasking method.
Specifically, the classification task construction unit includes a task tagging module for determining an initial feature subset according to the inflection-point mechanism, recorded as task 1, and for sorting all feature subsets with a feature ranking method, recording the sorted feature subsets as task 2.
The optimal feature screening unit comprises:
a particle swarm algorithm initialization module, used to initialize the population size and maximum iteration number of the particle swarm algorithm and to randomly initialize the initial position and velocity of each particle, each dimension of a particle corresponding to one feature;
a particle swarm algorithm characterization module, used to characterize the probability that each particle dimension is selected with a variable range derived from the feature ranking, and to restrict the feature search spaces in task 1 and task 2 so as to reduce the search space of the particle swarm algorithm;
an optimal position calculation module, used to calculate the fitness value of each particle according to a pre-constructed feature subset quality evaluation function and to initialize the individual optimal positions and global optimal positions of task 1 and task 2;
an inertia weight calculation module, used to update the inertia weight of the particle swarm algorithm in a linearly decreasing manner;
a particle fitness value updating module, used to update the velocity and position of each particle using the preset random interaction probability between task 1 and task 2 together with the updated inertia weight, and then recalculate the fitness value of each particle;
an optimal complete solution updating module, used to update the individual optimal positions and global optimal positions of task 1 and task 2 and thereby update the optimal complete solution of the search problem;
an optimal feature subset output module, used to apply the initial sub-population change mechanism: if its condition is met, the initial feature subset is changed, otherwise it is left unchanged; on the basis of the initial feature subset the particle swarm algorithm is executed and it is judged whether the evolution of the swarm has reached the maximum iteration number set for the algorithm; if so, the search stops and the optimal complete solution is output as the optimal feature subset of the dataset; otherwise the probability of particle selection is re-characterized and the optimal complete solution continues to be calculated.
This embodiment provides an inflection-point selection mechanism to solve the difficulty of determining the preliminary feature subset. It adaptively selects, according to feature quality, a preliminary subset for the second stage from the original feature set without losing important information in the original data.
To better realize feature subset selection with the evolutionary multitasking technique, this embodiment also provides a variable-range particle characterization mechanism and an initial subset change strategy, which together address two problems of the particle swarm algorithm: an overly large search space and susceptibility to local optima.
Under multitasking, the processing of a single task may help solve the other tasks, because related or complementary information exists among them. Since the multi-population particle swarm algorithm is an effective realization of evolutionary multitasking within evolutionary algorithms, this embodiment adopts it to share implicit knowledge across tasks through a selective mating operation, so that the tasks ultimately promote one another.
It will be apparent to those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
The foregoing description concerns only the preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A two-stage feature selection method based on evolutionary multitasking, comprising:
classification task construction stage: acquiring text feature data, forming corresponding text feature subsets from different features, and storing the text feature subsets in a text feature database; invoking all feature subsets in the text feature database and determining an initial text feature subset, recorded as task 1; recording all the sorted text feature subsets as task 2; the data in the text feature subset are three-dimensional feature data, i.e. three-dimensional coordinates, wherein one dimension is the category of the word, one dimension is the number of times the word occurs in the computer text, and one dimension is the number of times the word occurs within its category;
optimal feature screening stage: for task 1 and task 2, searching out and outputting an optimal text feature subset that matches a preset search condition by means of the particle swarm algorithm within the evolutionary multitasking method;
the process of searching out the optimal text feature subset that matches the preset search condition with the particle swarm algorithm in the evolutionary multitasking method comprises:
step 1: initializing the population size and maximum iteration number of the particle swarm algorithm, and randomly initializing the initial position and velocity of each particle; each dimension of a particle corresponds to one text feature;
step 2: characterizing the probability that each particle dimension is selected with a variable range derived from the feature ranking; restricting the text feature search spaces in task 1 and task 2 so as to reduce the search space of the particle swarm algorithm;
step 3: calculating the fitness value of each particle in the swarm according to a pre-constructed text feature subset quality evaluation function, and initializing the individual optimal positions and global optimal positions of task 1 and task 2;
step 4: updating the inertia weight of the particle swarm algorithm in a linearly decreasing manner;
step 5: updating the velocity and position of each particle in the swarm using the preset random interaction probability between task 1 and task 2 together with the updated inertia weight, and then recalculating the fitness value of each particle;
step 6: updating the individual optimal positions and global optimal positions of task 1 and task 2, and thereby updating the optimal complete solution of the search problem;
step 7: applying the initial sub-population change mechanism: if its condition is met, the initial text feature subset is changed; otherwise it is left unchanged; on the basis of the initial text feature subset, the particle swarm algorithm is executed and it is judged whether the evolution of the swarm has reached the maximum iteration number set for the algorithm; if so, the search stops and the optimal complete solution is output as the optimal text feature subset of the dataset; otherwise, go to step 3.
2. The two-stage feature selection method based on evolutionary multitasking according to claim 1, wherein, in the classification task construction stage, two related text classification tasks are determined: an initial text feature subset is determined from the inflection-point mechanism and recorded as task 1; all text feature subsets are sorted with a feature ranking method, and the sorted text feature subsets are recorded as task 2.
3. The two-stage feature selection method based on evolutionary multitasking according to claim 2, wherein the initial text feature subset is determined from the inflection-point mechanism as follows:
after the text features are arranged in descending order of importance, a curve of text feature importance is obtained; a straight line is drawn between the first and the last point, and the inflection point is the point farthest from this line, so that the points above the inflection point, i.e. the points with high text feature relevance, are selected to form the initial text feature subset;
or
all the text feature subsets are sorted using the ReliefF algorithm as the feature ranking method.
4. The two-stage feature selection method based on evolutionary multitasking according to claim 1, wherein, in step 2, the probability of particle selection is characterized with the variable range of the feature ranking method as follows:
two points are obtained from the feature ranking: the inflection point and the demarcation point, the latter being the first point whose feature weight is less than 0; from these two points a three-segment text feature characterization is obtained: when a feature's weight is greater than the inflection-point weight, its search range is [0,1]; when its weight is smaller than the demarcation-point weight, its search range is [0,a]; between the two points, the search range decreases linearly from [0,1] to [0,a]; a is a number between 0 and 1, so that the probability of each text feature being selected is adjusted effectively: good text features are selected with high probability and poor text features with low probability.
5. The two-stage feature selection method based on evolutionary multitasking according to claim 1, characterized in that, in said step 3, the pre-constructed text feature subset quality evaluation function is
Fitness_min = α·γ_R(D) + β·(S/N),
where Fitness_min denotes the feature subset quality evaluation function; γ_R(D) denotes the classification error rate of the feature subset R with respect to the target dataset D; S denotes the number of selected features; N denotes the total number of features in the dataset; and α and β are parameters that weight the classification error rate against the proportion of selected features;
or, in said step 3, the individual optimal positions and global optimal positions of task 1 and task 2 are initialized as follows:
the current position of each particle is set as its individual optimal position pbest; then, considering the ring topology of the particle population, the fitness value of each particle is compared with those of its two neighbours, the position corresponding to the smaller fitness value is retained, and the position of the final winning particle is taken as the global optimal position gbest of the current particle population.
6. The two-stage feature selection method based on evolutionary multitasking according to claim 1, wherein, in step 4, the inertia weight w of the particle swarm algorithm is updated in a linearly decreasing manner as
w = w_start - (w_start - w_end)·(iter / max_iter),
where iter and max_iter denote the current iteration number and the maximum iteration number, respectively, and w_start and w_end denote the initial and final inertia weights.
7. The two-stage feature selection method based on evolutionary multitasking according to claim 1, wherein, in step 5, the velocity of each particle in the swarm is updated with the preset random interaction probability between task 1 and task 2 and the updated inertia weight as follows:
if a random value is greater than the preset random interaction probability between task 1 and task 2, the particle velocity is updated with formula (a); otherwise it is updated with formula (b):
v_id(t+1) = w·v_id(t) + c1·r1·(pbest_id - x_id(t)) + c2·r2·(gbest_d - x_id(t))    (a)
v_id(t+1) = w·v_id(t) + c1·r1·(pbest_id - x_id(t)) + c3·r3·(gbest'_d - x_id(t))   (b)
where v_id(t+1) denotes the updated velocity of the i-th particle and v_id(t) its current velocity; pbest_id denotes the individual optimal position of the i-th particle; gbest_d and gbest'_d denote the current global optimal positions of task 1 and task 2, respectively; x_id(t) denotes the position of the i-th particle; c1, c2 and c3 denote acceleration factors; r1, r2 and r3 are random numbers in the range [0,1]; and d indexes the dimensions of the particle swarm.
8. The two-stage feature selection method based on evolutionary multitasking according to claim 1, wherein, in step 5, the formula for updating the position of each particle in the swarm with the preset random interaction probability between task 1 and task 2 and the updated inertia weight is
x_id(t+1) = x_id(t) + v_id(t+1),
where x_id(t) denotes the current position of the i-th particle, v_id(t+1) the velocity of the i-th particle at the next moment, and x_id(t+1) the position of the i-th particle at the next moment.
9. A two-stage feature selection system based on evolutionary multitasking, comprising:
a classification task construction unit for: acquiring text feature data, forming a corresponding text feature subset by different features, and storing the text feature subset into a text feature database; invoking all text feature subsets in the text feature database, and determining an initial text feature subset to be marked as a task 1; marking all the sequenced text feature subsets as task 2; the data in the text feature subset is three-dimensional feature data, namely three-dimensional coordinates, wherein one-dimensional coordinates are categories of word sets, one-dimensional coordinates are the times of the word sets in the computer text, and one-dimensional coordinates are the times of the word sets in the categories;
an optimal feature screening unit for: for task 1 and task 2, searching out an optimal text feature subset matching a preset search condition using the particle swarm algorithm in the evolutionary multitasking method, and outputting the optimal text feature subset;
the optimal feature screening unit comprises: a particle swarm algorithm initialization module for initializing the population size and the maximum iteration number of the particle swarm algorithm and randomly initializing the initial position and velocity of each individual particle, each particle corresponding to a text feature;
a particle swarm algorithm characterization module for characterizing the probability of particle selection in the particle swarm algorithm by the variable range of the feature ranking method, and restricting the text feature search spaces of task 1 and task 2 so as to reduce the search space of the particle swarm algorithm;
an optimal position calculation module for calculating the fitness value of each particle in the particle swarm according to a pre-constructed text feature subset quality evaluation function, and initializing the individual optimal positions and the global optimal positions of task 1 and task 2;
an inertia weight calculation module for updating the inertia weight of the particle swarm algorithm in a linearly decreasing manner;
a particle fitness value updating module for updating the velocity and position of each particle in the particle swarm using the preset random interaction probability between task 1 and task 2 and the updated inertia weight, thereby calculating the fitness value of each particle in the updated particle swarm;
an optimal complete solution updating module for updating the individual optimal positions and the global optimal positions of task 1 and task 2 so as to update the optimal complete solution of the search problem;
the optimal feature subset output module is used for an initial sub-population change mechanism, and if the condition is met, the initial text feature subset is changed; otherwise, the initial text feature subset is unchanged; on the basis of the initial text feature subset, executing a particle swarm algorithm, judging whether the evolution of the particle swarm reaches the maximum iteration number set by the particle swarm algorithm, if so, stopping searching, and outputting an optimal complete solution as an optimal text feature subset of the data set; otherwise, updating the individual optimal positions and the global optimal positions of the task 1 and the task 2, and continuously calculating the optimal complete solution.
10. The two-stage feature selection system based on evolutionary multitasking according to claim 9, wherein the classification task construction unit comprises a task tagging module for determining an initial text feature subset according to an inflection point mechanism, denoted task 1, sorting all the text feature subsets using a feature ranking method, and marking the sorted text feature subsets as task 2.
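Purely illustratively, one plausible inflection-point ("knee") mechanism for picking the initial subset of claim 10; the mutual-information ranking score and the chord-distance knee criterion are assumptions, since the claim fixes neither:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def initial_subset_by_knee(X, y):
    """Rank features by an assumed score, then cut at the knee: the
    point of the sorted-score curve farthest from its end-to-end chord."""
    scores = mutual_info_classif(X, y)
    order = np.argsort(scores)[::-1]                  # best feature first
    s = scores[order]
    n = len(s)
    chord = s[0] + (s[-1] - s[0]) * np.arange(n) / (n - 1)
    knee = int(np.argmax(np.abs(chord - s)))          # farthest point
    return order[: knee + 1]                          # indices for task 1
```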
CN201911192139.XA 2019-11-28 2019-11-28 Two-stage feature selection method and system based on evolutionary multitasking Active CN110991518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911192139.XA CN110991518B (en) 2019-11-28 2019-11-28 Two-stage feature selection method and system based on evolutionary multitasking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911192139.XA CN110991518B (en) 2019-11-28 2019-11-28 Two-stage feature selection method and system based on evolutionary multitasking

Publications (2)

Publication Number Publication Date
CN110991518A CN110991518A (en) 2020-04-10
CN110991518B true CN110991518B (en) 2023-11-21

Family

ID=70088135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911192139.XA Active CN110991518B (en) 2019-11-28 2019-11-28 Two-stage feature selection method and system based on evolutionary multitasking

Country Status (1)

Country Link
CN (1) CN110991518B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523647B (en) * 2020-04-26 2023-11-14 南开大学 Network model training method and device, feature selection model, method and device
CN111563549B (en) * 2020-04-30 2023-07-28 广东工业大学 Medical image clustering method based on multitasking evolutionary algorithm
CN113721565A * 2021-07-31 2021-11-30 盐城蜂群智能技术有限公司 Adjustable industrial internet control device
CN114708608B (en) * 2022-06-06 2022-09-16 浙商银行股份有限公司 Full-automatic characteristic engineering method and device for bank bills
CN117688354B (en) * 2024-02-01 2024-04-26 中国标准化研究院 Text feature selection method and system based on evolutionary algorithm


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732249A * 2015-03-25 2015-06-24 武汉大学 Deep learning image classification method based on manifold learning and chaotic particle swarms
CN106022473A (en) * 2016-05-23 2016-10-12 大连理工大学 Construction method for gene regulatory network by combining particle swarm optimization (PSO) with genetic algorithm
CN106897733A (en) * 2017-01-16 2017-06-27 南京邮电大学 Video stream characteristics selection and sorting technique based on particle swarm optimization algorithm
CN109145960A (en) * 2018-07-27 2019-01-04 山东大学 Based on the data characteristics selection method and system for improving particle swarm algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ke Chen et al.; "An ameliorated particle swarm optimizer for solving numerical optimization problems"; Applied Soft Computing Journal; pp. 482-496 *

Also Published As

Publication number Publication date
CN110991518A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110991518B (en) Two-stage feature selection method and system based on evolutionary multitasking
Aliniya et al. A novel combinatorial merge-split approach for automatic clustering using imperialist competitive algorithm
Isa et al. Using the self organizing map for clustering of text documents
Kianmehr et al. Fuzzy clustering-based discretization for gene expression classification
Mitrofanov et al. An approach to training decision trees with the relearning of nodes
Yu et al. Autonomous knowledge-oriented clustering using decision-theoretic rough set theory
Pashaei et al. Improving medical diagnosis reliability using Boosted C5.0 decision tree empowered by Particle Swarm Optimization
Sheng et al. Multilocal search and adaptive niching based memetic algorithm with a consensus criterion for data clustering
CN113255873A (en) Clustering longicorn herd optimization method, system, computer equipment and storage medium
Yan et al. A novel clustering algorithm based on fitness proportionate sharing
KR102264969B1 (en) Market segmentation firefly algorithm method for big data analysis and the system thereof
CN110309424A (en) Socialization recommendation method based on rough clustering
Zheng et al. Adaptive Particle Swarm Optimization Algorithm Ensemble Model Applied to Classification of Unbalanced Data
Slimene et al. A new pso based kernel clustering method for image segmentation
CN111860755A (en) Improved particle swarm algorithm based on regression of support vector machine
Youssef A new hybrid evolutionary-based data clustering using fuzzy particle swarm optimization
Kiranyaz et al. Collective network of evolutionary binary classifiers for content-based image retrieval
Pereira et al. A lexicographic genetic algorithm for hierarchical classification rule induction
Jamotton et al. Insurance analytics with clustering techniques
Kumar et al. An automated parameter selection approach for simultaneous clustering and feature selection
Kumar et al. Modified particle swarm optimization based adaptive fuzzy k-modes clustering for heterogeneous medical databases
Hamarat et al. A genetic algorithm based feature weighting methodology
Sarafis Data mining clustering of high dimensional databases with evolutionary algorithms
Toghraee Identification of Appropriate Features for Classification Using Clustering Algorithm
Bai et al. Combination of rough sets and genetic algorithms for text classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant