CN113761307A - Feature selection method and device - Google Patents

Feature selection method and device

Info

Publication number
CN113761307A
Authority
CN
China
Prior art keywords
vector
features
feature
redundancy
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011351263.9A
Other languages
Chinese (zh)
Inventor
祖辰
杨立军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202011351263.9A priority Critical patent/CN113761307A/en
Publication of CN113761307A publication Critical patent/CN113761307A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature selection method and device, and relates to the technical field of computers. One embodiment of the method comprises: determining features in each sample, and constructing a sample vector based on values of all the features in each sample; processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features. The global redundancy minimized sparse feature selection algorithm GRMS provided in this embodiment minimizes the redundancy of global features, selects features with strong discriminability, and implements correction on sparsely selected features.

Description

Feature selection method and device
Technical Field
The invention relates to the technical field of computers, in particular to a feature selection method and device.
Background
With the development of information technology, global data is growing explosively, and ever more data needs to be stored and transmitted. To handle massive amounts of high-dimensional data, it is necessary to extract the most informative and valuable content from it. Since high-dimensional data contains a large number of features, those features inevitably contain noise; in this case, feature selection becomes an indispensable data mining technique, and reducing the feature dimensionality can improve the performance of subsequent tasks such as classification or clustering.
Current feature selection methods are classified into three types: filtering methods, wrapper methods and embedded methods. In the process of implementing the invention, the inventors found that these methods generally do not consider the redundancy among the selected features, so the selected features are highly correlated with one another, which is detrimental to subsequent tasks such as clustering or classification. Although the mutual-information-based mRMR (Max-Relevance and Min-Redundancy) feature selection method has been proposed to minimize the redundancy between features, mRMR uses a greedy strategy to find features with minimum redundancy, so the selected features do not achieve global minimization of the redundancy information.
Disclosure of Invention
In view of this, embodiments of the present invention provide a feature selection method and apparatus, which can at least solve the problem that the redundancy of global features cannot be minimized and features with strong discriminability cannot be selected in the prior art.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a feature selection method including:
determining features in each sample, and constructing a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features.
Optionally, the processing the sample vector by using a sparse feature selection algorithm with global redundancy minimization includes:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
Optionally, the processing the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector includes:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform redundancy-removal correction on the different features to obtain a second importance score vector.
Optionally, the method further includes:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
Optionally, the normalizing the values of a feature in different samples to construct a feature vector of the feature includes:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralization matrix to obtain a centralized feature vector corresponding to the feature.
To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided a feature selection apparatus including:
the feature determining module is used for determining the features in each sample and constructing a sample vector based on the values of all the features in each sample; wherein the sample is at least one of image, text and voice data;
the vector processing module is used for processing the sample vector by utilizing a global redundancy minimized sparse feature selection algorithm to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and the feature selection module is used for extracting a predetermined number of features with the maximum importance scores from the importance scores of the features to serve as the selected target features.
Optionally, the vector processing module is configured to:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
Optionally, the vector processing module is configured to:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform redundancy-removal correction on the different features to obtain a second importance score vector.
Optionally, the system further includes a redundancy matrix constructing module, configured to:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
Optionally, the redundancy matrix constructing module is configured to:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralization matrix to obtain a centralized feature vector corresponding to the feature.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a feature selection electronic device.
The electronic device of the embodiment of the invention comprises: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the feature selection methods described above.
To achieve the above object, according to a further aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing any of the feature selection methods described above.
According to the scheme provided by the invention, embodiments of the invention have the following advantages or beneficial effects: the redundancy matrix characterizes which features are non-redundant; norm regularization is used for data feature selection; and feature selection and global redundancy information minimization are unified into one convex objective optimization function, so that feature selection is realized effectively.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIGS. 1(a) - (c) are schematic diagrams of the three existing types of feature selection methods, namely filtering, wrapper and embedded methods;
FIG. 2 is a schematic diagram of an mRMR feature selection;
FIG. 3 is a schematic flow chart diagram of a feature selection method according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a sparse-based global redundancy information minimization method;
FIG. 5 is a schematic diagram of the main blocks of a feature selection apparatus according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 7 is a schematic block diagram of a computer system suitable for use with a mobile device or server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Feature selection has been widely used in many different fields, such as selection of disease causing genes in medical research and image segmentation in computer vision. The use of a small number of features with very rich representation capability can not only accelerate the learning process of the model, but also enhance the generalization capability of the model on unknown data.
The current feature selection methods mainly adopt three modes, whose advantages and disadvantages are analyzed as follows:
1) The filtering feature selection method (as shown in fig. 1 (a)) ranks the features by means of the inherent properties of the data and selects the important features according to that ranking. Notably, the filtering approach is classifier-independent, so it is computationally inexpensive but generally performs worse.
2) The wrapper feature selection method (as shown in fig. 1 (b)) is tied to a specific classifier, and can therefore generally achieve better performance, but it is not suitable for large-scale data due to its high computational cost.
3) The learning process of the embedded feature selection method (as shown in fig. 1 (c)) is embedded into the training of the classification model, and the features to be retained (i.e., the selected features) are finally obtained by solving the optimization problem of the classification algorithm.
Therefore, compared with the filtering and wrapper feature selection methods, the embedded feature selection method obtains both a lower computational cost and satisfactory feature selection performance.
Referring to fig. 2, a schematic diagram of mRMR feature selection is shown. There are four different features on the left, with feature importance scores of 0.95, 0.91, 0.72 and 0.33. Feature 1 has the highest importance score and is correlated with feature 2 and feature 3. Features 2 and 3 may each represent the data from a different angle, so they are not correlated with each other. When the mRMR method selects among the above 4 features, since features 2 and 3 are both correlated with feature 1, features 1 and 4 are finally selected and features 2 and 3 are discarded. In fact, the combination of features 2 and 3 has stronger discriminating power than features 1 and 4. The mRMR method cannot minimize the redundancy of the global features and cannot select the features with strong discriminability (here, features 2 and 3).
Referring to fig. 3, a main flowchart of a feature selection method provided by an embodiment of the present invention is shown, including the following steps:
s301: determining features in each sample, and constructing a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
s302: processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
s303: and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features.
In the above embodiment, regarding steps S301 to S303: since low-dimensional data contains few features, the present solution is applied to feature selection on high-dimensional data (such as voice/image/text), where high-dimensional generally means that the number of features is much larger than the number of samples.
Redundancy means that two or more features provide duplicate information, from which redundant information minimization can be defined: if features i and j are correlated with each other, then preferably one of them is retained and the other discarded during feature selection, thereby minimizing the redundant information of the finally selected features.
Generally, by reducing redundant features and using more informative features, better performance can be achieved in subsequent tasks (e.g., clustering, classification). For example, to distinguish a specific person in a face recognition task, it is desirable to select non-redundant features that are discriminative and representative, such as the eyes, nose and mouth, rather than non-representative redundant features such as the hair. The motivation of the feature selection method with global redundancy minimization proposed by the scheme is to select representative, non-redundant features from the feature set of the samples.
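As an illustration only (not part of the patent text), a minimal Python sketch of steps S301 to S303 might look as follows; the helper grms_scores is a hypothetical stand-in for a GRMS solver, whose objective is described below.

import numpy as np

def select_features(samples, labels, k, grms_scores):
    # S301: one row per sample, one column per feature
    X = np.asarray(samples, dtype=float)
    # S302: importance score vector produced by the (hypothetical) GRMS solver
    s = grms_scores(X, labels)
    # S303: indices of the k features with the largest importance scores
    return np.argsort(s)[::-1][:k]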
Given n samples $\{x_1, x_2, \ldots, x_n\}$, a sample matrix $X \in \mathbb{R}^{n \times d}$ is used to represent the sample vectors, where the k-th row represents the k-th sample $x_k \in \mathbb{R}^d$ and d is the number of features. Taking a face picture as an example, its features can be wavelet features, histogram features, gray-level features and the like extracted from the picture.
The Global Redundancy Minimized Sparse feature selection algorithm (GRMS) provided by the scheme is:

$$\min_{w,\,s}\ \|Xw - y\|_2^2 + \gamma\,\|w\|_{2,1} + \beta\, s^\top A s - \lambda\, w^\top s \qquad \text{s.t.}\ \ s \ge 0,\ \mathbf{1}^\top s = 1$$

Both w and s are obtained by the computation, and the scheme mainly uses s to select features. Here, $w \in \mathbb{R}^d$ is the regression coefficient; it corresponds to the first importance score vector that does not consider feature redundancy, and is also the feature importance weight produced by the sparse feature selection algorithm; s is the second importance score vector, which does take redundancy into account; $\mathbf{1} \in \mathbb{R}^d$ denotes the all-ones vector of size d; and γ, β and λ are regularization parameters used to balance the weights of the terms in the formula. It is worth noting that, to keep the optimization from falling into a trivial solution, the constraints $s \ge 0$ and $\mathbf{1}^\top s = 1$ are imposed on the elements of s (the elements are non-negative and sum to 1).

The first term of GRMS regresses the training samples onto the labels (y, the categories of the samples) through the regression coefficients, and measures the gap between the regressed values and the true values with a squared loss. The second term is a norm of the regression coefficients w that produces a sparsity effect: many rows of the finally solved w are all 0, and the non-zero rows correspond to the retained features, thereby realizing feature selection. The last two terms minimize the global feature redundancy information within the feature selection itself.
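For concreteness, a minimal sketch (an illustration, not the patent's implementation) that evaluates the four terms of the objective as reconstructed above; the function name, its signature and the vector form of w are assumptions.

import numpy as np

def grms_objective(X, y, w, s, A, gamma, beta, lam):
    # first term: squared regression loss ||Xw - y||^2
    fit = np.sum((X @ w - y) ** 2)
    # second term: sparsity-inducing norm of w (for a vector w this reduces to the l1 norm)
    sparsity = np.sum(np.abs(w))
    # third term: global redundancy of the corrected scores, s^T A s
    redundancy = s @ A @ s
    # fourth term: consistency between original and corrected scores, w^T s (to be maximized)
    consistency = w @ s
    return fit + gamma * sparsity + beta * redundancy - lam * consistency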
In the above formula, A is the redundancy matrix: $A \in \mathbb{R}^{d \times d}$ is introduced to describe the redundancy between every two features. Given the sample matrix $X \in \mathbb{R}^{n \times d}$, the i-th and j-th features are $X^{(i)}$ and $X^{(j)}$ ($i, j = 1, 2, \ldots, d$), i.e., the corresponding columns of X. The following centered features can be obtained accordingly:

$$f_i = H X^{(i)}, \qquad f_j = H X^{(j)}$$

where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix, and $f_i$ and $f_j$ represent the feature vectors after the i-th and j-th features have been centered. In actual operation, the centering matrix need not be formed explicitly; the values of a feature across the different samples are simply normalized directly.
The redundancy matrix A may be calculated as follows:

$$A_{ij} = \frac{(f_i^\top f_j)^2}{\|f_i\|_2^2\,\|f_j\|_2^2}$$

It can be seen that the redundancy matrix satisfies $A = B \circ B$, where $B_{ij} = \frac{f_i^\top f_j}{\|f_i\|_2\,\|f_j\|_2}$ and $\circ$ is the Hadamard product. The above can therefore be written in matrix form as $B = D F^\top F D = (FD)^\top (FD)$, where $F = [f_1, f_2, \ldots, f_d]$ and D is a diagonal matrix with diagonal elements $1/\|f_i\|_2$, $i = 1, 2, \ldots, d$. It follows at once that B is a positive semi-definite matrix.

Suppose the i-th and j-th features are highly correlated; then the absolute value of their correlation coefficient is large, whether the correlation is positive or negative. To penalize such high correlations, the squared cosine similarity is used to measure the correlation between two features. Since B is positive semi-definite, the redundancy matrix A is non-negative and, by the Schur product theorem, also positive semi-definite.
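A minimal sketch of this construction, assuming the squared-cosine form derived above (the function name is illustrative):

import numpy as np

def redundancy_matrix(X, eps=1e-12):
    # X: (n, d) data matrix, one row per sample, one column per feature
    F = X - X.mean(axis=0)                       # center each feature: f_i = H X^(i)
    FD = F / (np.linalg.norm(F, axis=0) + eps)   # FD, where D = diag(1 / ||f_i||)
    B = FD.T @ FD                                # B = (FD)^T (FD), positive semi-definite
    return B * B                                 # A = B * B elementwise (Hadamard square)

Under this construction the diagonal of A consists of ones, and a larger off-diagonal entry A_ij indicates a more redundant pair of features i and j.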
Through this sparse feature selection algorithm, the global feature redundancy can be minimized while the weight consistency of the features before and after correction is ensured. Specifically, if feature i is correlated with all the other features, the redundancy term $s^\top A s$ causes the corrected weight score $s_i$ of feature i to become smaller, thus reducing the importance of feature i. On the other hand, if feature i is uncorrelated with all the other features, the corrected weight score $s_i$ of feature i stays consistent with the initial feature weight score $w_i$, i.e., the weight score of feature i remains unchanged.
Taking a face recognition application as an example, suppose the features extracted from the original face data include gray-level features, histogram features, Fourier features, gradient features and the like; the matrix X is composed of these features of the different pictures, and y is the category vector corresponding to the different persons. The redundancy matrix A is calculated from the data matrix X; then the matrix X, the category vector y and the redundancy matrix A are input into the GRMS formula to finally obtain the corrected weight score vector s of the different features, where the larger an element of s, the higher the weight of the corresponding feature.
Referring to the redundancy matrix shown in fig. 4, the darker the color, the stronger the correlation (i.e., redundancy) between the features. It is worth noting that feature selection and feature redundancy minimization are optimized and solved simultaneously.
The method can also be considered from another angle. First, on the basis of the sample vectors, the first importance score of each feature in the feature set, without considering redundancy, is calculated through an evaluation criterion such as an extremely randomized trees (Extra-Trees) algorithm or a random forest, and a first importance score vector $w \in \mathbb{R}^d$ is constructed. Then the first importance score vector is input into the redundant information minimization criterion to perform redundancy-removal correction on the different features, obtaining the second importance score vector s, which includes the second importance score of each feature. The criterion for minimizing redundant information is:

$$\min_s\ s^\top A s - \lambda\, w^\top s \qquad \text{s.t.}\ \ s \ge 0,\ \mathbf{1}^\top s = 1$$

where λ is a regularization parameter used to balance the weights of the first term and the second term, and the redundancy matrix A characterizes the correlation, i.e., the redundancy, between every two features.

Specifically, the first term $s^\top A s$ characterizes the redundancy of the global features, so the smaller the first term, the better. The second term $w^\top s$ characterizes the consistency between the first importance scores w of the original features (redundancy not considered) and the second importance scores s of the corrected features (redundancy considered); since it is desirable that the corrected scores stay as consistent as possible with the feature importance before correction, the second term should be as large as possible.
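The patent does not specify how this criterion is solved; one possibility is a projected-gradient sketch over the probability simplex, as below (the function names, step size and iteration count are assumptions):

import numpy as np

def project_simplex(v):
    # Euclidean projection onto {s : s >= 0, sum(s) = 1} via the standard sorting method
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def grm_correct(w, A, lam=1.0, lr=0.01, n_iter=500):
    # sketch of: min_s s^T A s - lam * w^T s   s.t.  s >= 0, 1^T s = 1
    s = np.full(len(w), 1.0 / len(w))        # start from the uniform score vector
    for _ in range(n_iter):
        grad = 2.0 * A @ s - lam * w         # gradient of the criterion
        s = project_simplex(s - lr * grad)   # gradient step, then project onto the simplex
    return s

Combined with the redundancy_matrix sketch above, s = grm_correct(w, redundancy_matrix(X)) would yield corrected scores whose largest entries indicate the features to retain.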
In the method provided by the above embodiment, the proposed global redundancy minimized sparse feature selection algorithm GRMS ranks the features by importance after calculating the second importance score vector, and at the same time uses the redundancy matrix to characterize which features are non-redundant, thereby correcting the sparsely selected features.
Aiming at the problem that existing methods cannot select a feature subset with globally minimized redundant information, the method provided by the scheme proposes a global-redundancy-minimized sparse feature selection algorithm, so that the features are redundancy-corrected based on the redundancy matrix, and feature selection and feature redundancy minimization are optimized and solved simultaneously.
Referring to fig. 5, a schematic diagram of main modules of a feature selection apparatus 500 according to an embodiment of the present invention is shown, including:
a feature determining module 501, configured to determine features in each sample, and construct a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
a vector processing module 502, configured to process the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
a feature selection module 503, configured to extract a predetermined number of features with the largest importance scores from the importance scores of the features, so as to serve as the selected target features.
In the device for implementing the present invention, the vector processing module 502 is configured to:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
In the device for implementing the present invention, the vector processing module 502 is configured to:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform redundancy-removal correction on the different features to obtain a second importance score vector.
The implementation device of the invention also comprises a redundancy matrix construction module used for:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
In the implementation apparatus of the present invention, the redundancy matrix constructing module is configured to:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralization matrix to obtain a centralized feature vector corresponding to the feature.
In addition, since the detailed implementation of the device according to the embodiment of the present invention has already been described in detail in the above method, it is not repeated here.
FIG. 6 illustrates an exemplary system architecture 600 to which embodiments of the invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605 (by way of example only). The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. Various communication client applications can be installed on the terminal devices 601, 602, 603.
The terminal devices 601, 602, 603 may be various electronic devices having display screens and supporting web browsing, and the server 605 may be a server providing various services.
It should be noted that the method provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the apparatus is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a feature determination module, a vector processing module, and a feature selection module. The names of these modules do not in some cases form a limitation on the module itself, and for example, a vector processing module may also be described as a "module that processes a sample vector to obtain an importance score vector".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise:
determining features in each sample, and constructing a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features.
According to the technical scheme of the embodiment of the invention, a sparse feature selection algorithm based on global redundancy minimization is provided, redundancy correction is carried out on features based on a redundancy matrix, and feature selection and feature redundancy minimization are simultaneously optimized and solved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of feature selection, comprising:
determining features in each sample, and constructing a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features.
2. The method of claim 1, wherein the processing the sample vector using a sparse feature selection algorithm with global redundancy minimization comprises:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
3. The method of claim 1, wherein the processing the sample vector using a sparse feature selection algorithm with minimized global redundancy to obtain a significance score vector comprises:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform redundancy-removal correction on the different features to obtain a second importance score vector.
4. The method of claim 2 or 3, further comprising:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
5. The method of claim 4, wherein normalizing values of a feature in different samples to construct a feature vector of the feature comprises:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralization matrix to obtain a centralized feature vector corresponding to the feature.
6. A feature selection apparatus, comprising:
the characteristic determining module is used for determining the characteristics in each sample and constructing a sample vector based on the values of all the characteristics in each sample; wherein, the sample is at least one of image, text and voice data;
the vector processing module is used for processing the sample vector by utilizing a global redundancy minimized sparse feature selection algorithm to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and the feature selection module is used for extracting a predetermined number of features with the maximum importance scores from the importance scores of the features to serve as the selected target features.
7. The apparatus of claim 6, wherein the vector processing module is configured to:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
8. The apparatus of claim 7, further comprising a redundancy matrix construction module to:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN202011351263.9A 2020-11-25 2020-11-25 Feature selection method and device Pending CN113761307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011351263.9A CN113761307A (en) 2020-11-25 2020-11-25 Feature selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351263.9A CN113761307A (en) 2020-11-25 2020-11-25 Feature selection method and device

Publications (1)

Publication Number Publication Date
CN113761307A true CN113761307A (en) 2021-12-07

Family

ID=78786083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351263.9A Pending CN113761307A (en) 2020-11-25 2020-11-25 Feature selection method and device

Country Status (1)

Country Link
CN (1) CN113761307A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2213523A1 (en) * 1995-03-13 1996-10-31 Metra Biosystems, Inc. Apparatus and method for acoustic analysis of bone using optimized functions of spectral and temporal signal components
US20140207711A1 (en) * 2013-01-21 2014-07-24 International Business Machines Corporation Transductive feature selection with maximum-relevancy and minimum-redundancy criteria
US20140207799A1 (en) * 2013-01-21 2014-07-24 International Business Machines Corporation Hill-climbing feature selection with max-relevancy and minimum redundancy criteria
CN104933733A (en) * 2015-06-12 2015-09-23 西北工业大学 Target tracking method based on sparse feature selection
CN111338950A (en) * 2020-02-25 2020-06-26 北京高质系统科技有限公司 Software defect feature selection method based on spectral clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEIPING NIE et al.: "A General Framework for Auto-Weighted Feature Selection via Global Redundancy Minimization", IEEE Transactions on Image Processing, vol. 28, no. 5, 31 May 2019 (2019-05-31), pages 2429, XP011709576, DOI: 10.1109/TIP.2018.2886761 *
WANG; WEI Jinmao; ZHANG Lu: "Multi-task feature learning algorithm based on preserving classification information" (基于保留分类信息的多任务特征学习算法), Journal of Computer Research and Development (计算机研究与发展), no. 03, 15 March 2017 (2017-03-15) *

Similar Documents

Publication Publication Date Title
US11804069B2 (en) Image clustering method and apparatus, and storage medium
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN107145485B (en) Method and apparatus for compressing topic models
WO2022105117A1 (en) Method and device for image quality assessment, computer device, and storage medium
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
CN113722438B (en) Sentence vector generation method and device based on sentence vector model and computer equipment
WO2023138188A1 (en) Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device
WO2021135449A1 (en) Deep reinforcement learning-based data classification method, apparatus, device, and medium
CN111898703B (en) Multi-label video classification method, model training method, device and medium
CN112863683A (en) Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN112188306A (en) Label generation method, device, equipment and storage medium
CN112418320A (en) Enterprise association relation identification method and device and storage medium
CN114298997B (en) Fake picture detection method, fake picture detection device and storage medium
CN114817612A (en) Method and related device for calculating multi-modal data matching degree and training calculation model
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
WO2021012691A1 (en) Method and device for image retrieval
CN113569018A (en) Question and answer pair mining method and device
CN112784189A (en) Method and device for identifying page image
CN110209895A (en) Vector index method, apparatus and equipment
CN115798661A (en) Knowledge mining method and device in clinical medicine field
CN115392361A (en) Intelligent sorting method and device, computer equipment and storage medium
CN113761307A (en) Feature selection method and device
CN113821687A (en) Content retrieval method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination