CN117909708A - Online feature selection method and device, electronic equipment and storage medium - Google Patents

Online feature selection method and device, electronic equipment and storage medium

Info

Publication number
CN117909708A
Authority
CN
China
Prior art keywords
feature
stream data
features
target
online
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311754585.1A
Other languages
Chinese (zh)
Inventor
游琳敬
黄夏渊
聂祥丽
张波
乔红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Academy of Mathematics and Systems Science of CAS
Original Assignee
Institute of Automation of Chinese Academy of Science
Academy of Mathematics and Systems Science of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science, Academy of Mathematics and Systems Science of CAS filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202311754585.1A priority Critical patent/CN117909708A/en
Publication of CN117909708A publication Critical patent/CN117909708A/en
Pending legal-status Critical Current

Abstract

The invention provides an online feature selection method, an online feature selection device, electronic equipment and a storage medium, which are applied to the technical field of machine learning. The method comprises the following steps: acquiring target stream data; determining a statistical index vector of each feature in the target stream data, and adding the features meeting a first condition in the target stream data to an alternative feature set according to the statistical index vector; calculating a feature index of each feature in the candidate feature set based on the statistical index vector under the condition that the feature quantity of the candidate feature set meets a second condition; selecting a preset number of features from the alternative feature subsets according to the sequence of the feature indexes from large to small, and adding the features to a target feature set; wherein the first condition includes that a degree of correlation of a feature is greater than a degree of correlation threshold, a feature is not present in the candidate feature set; the larger the feature index is, the higher the correlation degree of the feature is, the lower the redundancy is, and the correlation degree is the correlation degree between the feature and the corresponding label.

Description

Online feature selection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to an online feature selection method, an online feature selection device, an electronic device, and a storage medium.
Background
In the field of machine learning and data analysis, feature selection is a key technology for selecting features with the most representativeness and relevance from original data so as to reduce the dimension of the original data and improve the performance and efficiency of a model.
In the prior art, feature selection is generally performed after all data has been collected, and then offline processing and model training are performed based on the selected features, i.e., the conventional feature selection method is mainly based on offline data set analysis.
However, with the rise of data stream applications, offline analysis methods have failed to address the feature selection problem of large-scale, high-speed online data streams.
Disclosure of Invention
The invention provides an online feature selection method, an online feature selection device, electronic equipment and a storage medium, which can solve the problem of feature selection of online data streams.
The invention provides an online feature selection method, which comprises the following steps: acquiring target stream data; determining a statistical index vector of each feature in the target stream data, and adding the features meeting a first condition in the target stream data to an alternative feature set according to the statistical index vector; calculating a feature index of each feature in the candidate feature set based on the statistical index vector under the condition that the feature quantity of the candidate feature set meets a second condition; selecting a preset number of features from the alternative feature subsets according to the sequence of the feature indexes from large to small, and adding the features to a target feature set; wherein the first condition includes that a degree of correlation of a feature is greater than a degree of correlation threshold, a feature is not present in the candidate feature set; the larger the feature index is, the higher the correlation degree of the feature is, the lower the redundancy is, and the correlation degree is the correlation degree between the feature and the corresponding label.
According to the online feature selection method provided by the invention, before the target stream data is acquired, the method further comprises: acquiring initial stream data; determining a statistical index vector of each feature in the initial stream data; and adding all features of the initial stream data to the candidate feature set and to the target feature set.
According to the online feature selection method provided by the invention, the feature classes of the target stream data comprise existing features and newly added features; determining the statistical index vector of each feature in the target stream data comprises: determining a first data block from the target stream data, the features of the first data block being the existing features; and determining a statistical index vector of each feature in the first data block based on the stream data acquired at the previous moment.
According to the online feature selection method provided by the invention, adding the features satisfying the first condition in the target stream data to the candidate feature set according to the statistical index vector comprises: calculating the correlation degree of each feature in the target stream data according to the statistical index vector; determining the minimum correlation degree in the target feature set as the correlation threshold; when the correlation degree of a feature is greater than the correlation threshold, judging whether the candidate feature set already includes the feature; and if not, adding the feature to the candidate feature set.
According to the online feature selection method provided by the invention, calculating the feature index of each feature in the candidate feature set based on the statistical index vector when the number of features in the candidate feature set satisfies the second condition comprises: determining a first threshold according to the number of features to be selected for the target feature set and the candidate feature set coefficient; calculating a similarity matrix of the features in the candidate feature set when the number of features in the candidate feature set is greater than the first threshold; and calculating the feature index of each feature in the candidate feature set according to the similarity matrix and the correlation degree.
The invention also provides an online feature selection device, which comprises: the device comprises an acquisition module and a processing module; the acquisition module is used for acquiring the target stream data; the processing module is used for determining a statistical index vector of each feature in the target stream data and adding the feature meeting the first condition in the target stream data to an alternative feature set according to the statistical index vector; calculating a feature index of each feature in the candidate feature set based on the statistical index vector under the condition that the feature quantity of the candidate feature set meets a second condition; selecting a preset number of features from the alternative feature subsets according to the sequence of the feature indexes from large to small, and adding the features to a target feature set; wherein the first condition includes that a degree of correlation of a feature is greater than a degree of correlation threshold, a feature is not present in the candidate feature set; the larger the feature index is, the higher the correlation degree of the feature is, the lower the redundancy is, and the correlation degree is the correlation degree between the feature and the corresponding label.
According to the online feature selection device provided by the invention, the acquisition module is used for acquiring initial stream data; the processing module is used for determining a statistical index vector of each feature in the initial stream data, and for adding all features of the initial stream data to the candidate feature set and to the target feature set.
According to the online feature selection device provided by the invention, the feature classes of the target stream data comprise existing features and newly added features; the processing module is used for determining a first data block from the target stream data, the features of the first data block being the existing features, and for determining a statistical index vector of each feature in the first data block based on the stream data acquired at the previous moment.
According to the online feature selection device provided by the invention, the processing module is used for calculating the relevance of each feature in the target stream data according to the statistical index vector; determining the minimum correlation in the target feature set as the correlation threshold; judging whether the candidate feature set comprises the features or not under the condition that the correlation degree of the features is larger than the correlation degree threshold value; if not, adding the feature to the alternative feature set.
According to the online feature selection device provided by the invention, the processing module is used for determining a first threshold according to the number of features to be selected for the target feature set and the candidate feature set coefficient; calculating a similarity matrix of the features in the candidate feature set when the number of features in the candidate feature set is greater than the first threshold; and calculating the feature index of each feature in the candidate feature set according to the similarity matrix and the correlation degree.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the online feature selection methods described above when the program is executed.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the online feature selection method as described in any of the above.
The online characteristic selection method, the online characteristic selection device, the electronic equipment and the storage medium provided by the invention can acquire target stream data; determining a statistical index vector of each feature in the target stream data, and adding the features meeting a first condition in the target stream data to an alternative feature set according to the statistical index vector; calculating a feature index of each feature in the candidate feature set based on the statistical index vector under the condition that the feature quantity of the candidate feature set meets a second condition; selecting a preset number of features from the alternative feature subsets according to the sequence of the feature indexes from large to small, and adding the features to a target feature set; wherein the first condition includes that a degree of correlation of a feature is greater than a degree of correlation threshold, a feature is not present in the candidate feature set; the larger the feature index is, the higher the correlation degree of the feature is, the lower the redundancy is, and the correlation degree is the correlation degree between the feature and the corresponding label. 
According to the scheme, the features meeting the first condition in the target stream data can be added to the alternative feature set according to the statistical index vector, and the preset number of features are selected from the alternative feature set and added to the target feature set based on the feature index, so that the problem of data dimension reduction selection can be solved; because the first condition includes that the correlation degree of the feature is larger than the correlation degree threshold value, the feature does not exist in the alternative feature set, the larger the feature index is, the higher the correlation degree of the feature is, and the lower the redundancy degree is, the correlation degree of the feature in the target feature set can be dynamically improved, the redundancy degree of the feature in the target feature set is reduced, and therefore efficient online feature selection is achieved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an online feature selection method provided by the present invention;
FIG. 2 is a schematic diagram of the composition of streaming data provided by the present invention;
FIG. 3 is a schematic diagram of an online feature selection device according to the present invention;
Fig. 4 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present invention is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
In order to clearly describe the technical solution of the embodiment of the present invention, in the embodiment of the present invention, the words "first", "second", etc. are used to distinguish identical items or similar items having substantially the same function and effect, and those skilled in the art will understand that the words "first", "second", etc. are not limited in number and execution order.
Some exemplary embodiments of the invention are described below for illustrative purposes; it should be understood that the invention may be practiced otherwise than as specifically shown in the accompanying drawings.
The foregoing implementations are described in detail below with reference to specific embodiments and accompanying drawings.
In order to solve the feature selection problem of online data streams, the concept of online feature selection is introduced. The online feature selection method can gradually process the data stream and update the feature selection result according to the dynamic change of the data so as to adapt to the change of the data stream.
Traditional online feature selection methods can be classified into three categories according to the dynamic form of the data stream: sample-stream-oriented methods, feature-stream-oriented methods, and trapezoidal-stream-oriented methods. Sample-stream-oriented methods handle the case where samples arrive continuously while the feature dimension stays unchanged; they process each sample step by step and dynamically update the model used for feature selection. In a feature stream, the number of features changes over time as new features are dynamically added. A trapezoidal stream is an extension of the sample stream: the feature dimension of new samples may increase as the number of samples grows, while the feature dimension of old samples remains unchanged.
In the related art there is also a case called a rectangular stream, in which the feature dimension keeps increasing while new samples keep arriving, so that the correlation degree or redundancy of old features changes. Unlike in a trapezoidal stream, the feature dimension of the old samples in a rectangular stream also increases. Conventional online feature selection methods cannot cope effectively with this case. To solve this problem, the present invention proposes an online feature selection method that takes changes in both samples and features into account and effectively selects the most representative and most relevant features.
As shown in fig. 1, an embodiment of the present invention provides an online feature selection method, which may be applied to an online feature selection apparatus. The online feature selection method may include S101-S104:
S101, the online feature selection device acquires target stream data.
Optionally, the target stream data may be one of rectangular stream data, sample stream data, feature stream data, and trapezoidal stream data.
Optionally, before acquiring the target stream data, the online feature selection device may acquire initial stream data; determine a statistical index vector of each feature in the initial stream data; and add all features of the initial stream data to the candidate feature set and to the target feature set.
Specifically, the online feature selection device may acquire the initial stream data $X_0$ and determine, for the i-th feature $f_i$ acquired at time t in the initial stream data, a statistical index vector composed of the following statistics: the mean, the variance, and the number of data of feature $f_i$; the mean, the variance, and the number of data of all labeled data in feature $f_i$; and the mean, the variance, and the number of data of the j-th class data in feature $f_i$. The online feature selection device may then add the initial stream data $X_0$ to the candidate feature set $F_b$ and to the target feature set $F_s$, i.e., $F_s = X_0$, $F_b = F_s$, thereby constructing the candidate feature set $F_b$ and the target feature set $F_s$.
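The statistics listed above can be gathered in a single pass over one feature column. The following is a minimal sketch, not the patented implementation: the function names, the dictionary layout, the use of `None` to mark unlabeled samples, and the choice of population variance are all assumptions made for illustration.

```python
from statistics import fmean, pvariance


def summarize(values):
    """Return (mean, variance, count) of a list of numbers; zeros if empty.

    Population variance is an assumption; the text does not say which
    variance estimator is used.
    """
    if not values:
        return (0.0, 0.0, 0)
    return (fmean(values), pvariance(values), len(values))


def statistical_index_vector(column, labels):
    """Collect the per-feature statistics for one feature column.

    column: values of one feature, one entry per sample.
    labels: class label per sample, or None for an unlabeled sample
            (hypothetical convention).
    """
    labeled = [x for x, y in zip(column, labels) if y is not None]
    per_class = {}
    for x, y in zip(column, labels):
        if y is not None:
            per_class.setdefault(y, []).append(x)
    return {
        "all": summarize(column),          # every value of the feature
        "labeled": summarize(labeled),     # only labeled values
        "per_class": {c: summarize(v) for c, v in per_class.items()},
    }


stats = statistical_index_vector([1.0, 2.0, 3.0, 4.0], [0, 0, 1, None])
```

Keeping only (mean, variance, count) triples means the raw samples do not have to be stored once a block has been summarized.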
Optionally, after acquiring the initial stream data, the online feature selection device may acquire the target stream data, which is newly arrived stream data; that is, the online feature selection device may acquire dynamic stream data.
S102, an online feature selection device determines a statistical index vector of each feature in the target stream data, and adds the features meeting the first condition in the target stream data to an alternative feature set according to the statistical index vector.
Wherein the first condition includes that the correlation of the feature is greater than a correlation threshold, and that the feature is not present in the candidate feature set.
Alternatively, the feature class of the target stream data may include existing features and newly added features; the existing feature refers to a feature already existing in the last-arrived data, and the newly added feature refers to a feature newly added in the target stream data compared to the last-arrived data.
For example, as shown in fig. 2, taking the data that arrived before the target stream data to be the initial stream data $X_0$, i.e., the old features of the old samples, the target stream data may be divided along the two dimensions of old/new features and old/new samples into three data blocks: data block $X_1$, which holds the old features of the new samples; a data block holding the new features of the old samples; and a data block holding the new features of the new samples. The feature class of data block $X_1$ is the existing features, and the feature class of the other two data blocks is the newly added features.
It should be noted that the above features may be understood as characterizing data of different dimensions, for example, for an image, the features may include features of a color dimension and may also include features of a size dimension.
It should be noted that, because newly arrived stream data can be divided along both the old/new feature dimension and the old/new sample dimension, the limitation that traditional online feature selection methods can only process a single type of data stream can be overcome, thereby achieving more efficient and more accurate data dimension reduction in a dynamic rectangular-stream data environment.
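The division of a rectangular-stream arrival into blocks can be sketched as plain list slicing. This is an illustrative sketch only: it assumes the full data matrix at the current moment is stored row-wise with the old samples in the first `n_old` rows and the old features in the first `d_old` columns, and the function name is hypothetical.

```python
def split_rectangular_arrival(full, n_old, d_old):
    """Split the current data matrix into the newly arrived blocks.

    full:  list of rows, all samples x all features at the current moment.
    n_old: number of samples already seen (assumed to be the first rows).
    d_old: number of features already seen (assumed to be the first columns).

    Returns (x1, x2): x1 holds the old features of the new samples; x2
    stacks the new features of the old samples on top of the new features
    of the new samples, so it covers every sample for each new feature.
    """
    x1 = [row[:d_old] for row in full[n_old:]]
    new_features_old_samples = [row[d_old:] for row in full[:n_old]]
    new_features_new_samples = [row[d_old:] for row in full[n_old:]]
    x2 = new_features_old_samples + new_features_new_samples
    return x1, x2


full = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
x1, x2 = split_rectangular_arrival(full, n_old=2, d_old=2)
```

With `n_old=2` and `d_old=2`, the third row contributes its first two columns to `x1`, and the third column of every row goes to `x2`.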
Optionally, the online feature selection device may determine a first data block from the target stream data, the features of the first data block being the existing features, and determine a statistical index vector of each feature in the first data block based on the stream data acquired at the previous moment.
Specifically, the online feature selection device may divide the target stream data into a first data block $X_1$ and a second data block $X_2$ according to the above feature classes, where the second data block $X_2$ is obtained by splicing the data block holding the new features of the old samples with the data block holding the new features of the new samples; the features of the first data block are existing features, and the features of the second data block are newly added features. Thereafter, for each feature of the first data block, the online feature selection device may use the first data block to update the statistical index vector of that existing feature obtained at time t-1, thereby obtaining the statistical index vector of the existing feature at time t, where $X_{ij}$ denotes the element in the i-th row and j-th column of the first data block $X_1$. For each feature of the second data block $X_2$, that is, each feature beyond the first $d_0$ features, where $d_0$ is the old feature dimension at time t-1, the online feature selection device may calculate the statistical index vector of the newly added feature directly from the second data block $X_2$. The statistical index vectors of all features at the current moment are thereby obtained.
Based on this scheme, the statistical index vector of each feature in the target stream data can be determined; because the statistical indices are easy to calculate and update, the features can be dynamically updated with little computation.
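To illustrate why such statistics are cheap to update, the sketch below merges a stored (mean, variance, count) triple with the triple of a newly arrived block without revisiting old samples. The update rule shown is the standard pairwise merge for mean and population variance; it is an assumption standing in for the update formula of the original text, which is not recoverable here, and the function name is hypothetical.

```python
def merge_stats(old, new):
    """Merge two (mean, population variance, count) triples.

    old: statistics stored from the previous moment.
    new: statistics computed on the newly arrived block alone.
    """
    mean_a, var_a, n_a = old
    mean_b, var_b, n_b = new
    n = n_a + n_b
    if n == 0:
        return (0.0, 0.0, 0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Sum of squared deviations of the union, from the two partial sums.
    m2 = var_a * n_a + var_b * n_b + delta * delta * n_a * n_b / n
    return (mean, m2 / n, n)


# Statistics of [1, 2] merged with statistics of [3, 4]
merged = merge_stats((1.5, 0.25, 2), (3.5, 0.25, 2))
```

The merged triple equals the statistics of the four values taken together, so the cost of an update depends only on the size of the new block.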
Optionally, after determining the statistical index vector of each feature in the target stream data, the online feature selection device may calculate the relevance of each feature in the target stream data according to the statistical index vector; determining the minimum correlation in the target feature set as the correlation threshold; judging whether the candidate feature set comprises the features or not under the condition that the correlation degree of the features is larger than the correlation degree threshold value; if not, adding the feature to the alternative feature set.
Specifically, the online feature selection device may calculate, from the statistical index vector, the correlation degree of the i-th feature at time t. The device may then take the smallest correlation degree in the target feature set as the correlation threshold and traverse the correlation degrees of all features: feature i is added to the candidate feature subset if and only if its correlation degree is greater than the correlation threshold and it does not already belong to the candidate feature subset.
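The admission rule can be sketched as follows. The correlation degrees are taken as given inputs, because the formula that derives them from the statistical index vector is not reproduced in this text; the function name and the dictionary and set representations are assumptions for illustration.

```python
def admit_candidates(relevance, target, candidates):
    """Apply the first condition to every feature.

    relevance:  dict mapping feature name -> correlation degree with the label.
    target:     set of features currently in the target feature set.
    candidates: set of features currently in the candidate feature set.

    Returns (threshold, updated_candidates); the inputs are not modified.
    """
    # The threshold is the smallest correlation degree in the target set.
    threshold = min(relevance[f] for f in target)
    updated = set(candidates)
    for feature, score in relevance.items():
        if score > threshold and feature not in updated:
            updated.add(feature)
    return threshold, updated


relevance = {"f1": 0.9, "f2": 0.4, "f3": 0.6, "f4": 0.2}
threshold, candidates = admit_candidates(
    relevance, target={"f1", "f2"}, candidates={"f1", "f2"}
)
```

Here `f3` is admitted because 0.6 exceeds the threshold 0.4, while `f4` is rejected and `f1` is skipped because it is already a candidate.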
S103, the online feature selection device calculates feature indexes of each feature in the alternative feature set based on the statistical index vector under the condition that the feature quantity of the alternative feature set meets a second condition.
The larger the feature index is, the higher the correlation degree of the feature is, the lower the redundancy is, and the correlation degree is the correlation degree between the feature and the corresponding label.
Optionally, the online feature selection device may determine a first threshold according to the number of features to be selected for the target feature set and the candidate feature set coefficient; calculate a similarity matrix of the features in the candidate feature set when the number of features in the candidate feature set is greater than the first threshold; and calculate the feature index of each feature in the candidate feature set according to the similarity matrix and the correlation degree.
Specifically, the online feature selection device may determine a first threshold $(1+\alpha)k$ according to the number k of features to be selected for the target feature set and the candidate feature set coefficient $\alpha$, where $\alpha > 0$. When the number of features in the candidate feature set satisfies $|F_b| > (1+\alpha)k$, the device reconstructs the feature map of the candidate feature set, that is, it calculates the similarity matrix $S$ of the features in the candidate feature set, whose element $S^{(i,j)}$ represents the degree of similarity between features $f_i$ and $f_j$ at time t. Then, the online feature selection device may calculate the feature index $E^{(i)} = \lambda C^{(i)} + (1-\lambda) R^{(i)}$ of each feature in the candidate feature set according to the similarity matrix and the correlation degree, where $C^{(i)}$ is the correlation degree of feature i, $\lambda$ is a parameter balancing correlation and redundancy, $R^{(i)}$ represents the redundancy of feature i, and $\beta$ is the power applied to $S^{(i,j)}$.
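The trigger condition and the weighted combination can be sketched as below. The redundancy term is passed in as a precomputed list, because its formula in terms of the similarity matrix is not recoverable from this text; the function names are hypothetical.

```python
def needs_reselection(num_candidates, k, alpha):
    """Second condition: the candidate set has outgrown (1 + alpha) * k."""
    return num_candidates > (1 + alpha) * k


def feature_index(correlation, redundancy_term, lam):
    """Combine the two terms as E = lam * C + (1 - lam) * R, per feature.

    correlation:     list of C values, one per candidate feature.
    redundancy_term: list of R values (assumed precomputed elsewhere).
    lam:             weight between 0 and 1 balancing the two terms.
    """
    return [lam * c + (1 - lam) * r for c, r in zip(correlation, redundancy_term)]


trigger = needs_reselection(num_candidates=7, k=4, alpha=0.5)
scores = feature_index([0.8, 0.4], [0.2, 0.6], lam=0.5)
```

With `lam=1` the index reduces to the correlation degree alone, and with `lam=0` it depends only on the redundancy term.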
Optionally, the online feature selection device may simplify the calculation by using the fact that a real symmetric matrix is similar to a diagonal matrix: $S^{\beta} = (P \Lambda P^{-1})^{\beta} = P \Lambda^{\beta} P^{-1}$, where $\Lambda$ is a diagonal matrix composed of the eigenvalues $\zeta$ of $S$ and $P$ is an invertible matrix.
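The identity follows from cancelling the inner factors $P^{-1}P$. A short derivation is given here for a positive integer $\beta$ (an assumption; the text does not restrict $\beta$), with $n$ denoting the number of candidate features and $\zeta_1, \dots, \zeta_n$ the eigenvalues of $S$ (both labels introduced for illustration):

```latex
\begin{aligned}
S^{\beta} &= (P \Lambda P^{-1})(P \Lambda P^{-1}) \cdots (P \Lambda P^{-1}) \\
          &= P \Lambda (P^{-1} P) \Lambda (P^{-1} P) \cdots \Lambda P^{-1} \\
          &= P \Lambda^{\beta} P^{-1}, \qquad
\Lambda^{\beta} = \operatorname{diag}\left(\zeta_1^{\beta}, \dots, \zeta_n^{\beta}\right).
\end{aligned}
```

Because $S$ is real symmetric, $P$ can be chosen orthogonal, so that $P^{-1} = P^{\top}$ and no explicit matrix inversion is needed; only the $n$ diagonal entries have to be raised to the power $\beta$.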
S104, the online feature selection device selects a preset number of features from the alternative feature subsets according to the sequence of the feature indexes from large to small, and adds the features to the target feature set.
Optionally, the online feature selection device may select the k features with the largest feature indexes in the candidate feature set to form the target feature set $F_s$.
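Selecting the k features with the largest feature indexes amounts to a sort followed by a slice. A minimal sketch, assuming the feature indexes are held in a dictionary keyed by feature name (a hypothetical representation):

```python
def select_top_k(feature_index, k):
    """Return the k feature names with the largest feature index.

    feature_index: dict mapping feature name -> feature index value.
    Ties keep the dictionary's insertion order, since sorted() is stable.
    """
    ranked = sorted(feature_index, key=feature_index.get, reverse=True)
    return ranked[:k]


target_features = select_top_k({"a": 0.3, "b": 0.9, "c": 0.5}, k=2)
```

If k exceeds the number of candidates, the slice simply returns all of them in ranked order.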
In the embodiment of the invention, the features meeting the first condition in the target stream data can be added to the alternative feature set according to the statistical index vector, and then the preset number of features are selected from the alternative feature set and added to the target feature set based on the feature index, so that the problem of data dimension reduction selection can be solved; because the first condition includes that the correlation degree of the feature is larger than the correlation degree threshold value, the feature does not exist in the alternative feature set, the larger the feature index is, the higher the correlation degree of the feature is, and the lower the redundancy degree is, the correlation degree of the feature in the target feature set can be dynamically improved, the redundancy degree of the feature in the target feature set is reduced, and therefore efficient online feature selection is achieved.
The foregoing has described the solution provided by the embodiments of the present invention mainly from the perspective of the method. To achieve the above functions, the online feature selection device includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as combinations of hardware and computer software. Whether a function is implemented as hardware or as computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
According to the online feature selection method provided by the embodiment of the invention, the execution main body can be an online feature selection device or a control module for online feature selection in the online feature selection device. In the embodiment of the invention, an online feature selection device executes an online feature selection method as an example, and the online feature selection device provided by the embodiment of the invention is described.
It should be noted that, in the embodiment of the present invention, the online feature selection device may be divided into functional modules according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. Optionally, the division of the modules in the embodiment of the present invention is schematic, which is merely a logic function division, and other division manners may be implemented in practice.
As shown in Fig. 3, an embodiment of the present invention provides an online feature selection apparatus 300. The online feature selection apparatus 300 includes an acquisition module 301 and a processing module 302. The acquisition module 301 is configured to acquire target stream data. The processing module 302 is configured to: determine a statistical index vector of each feature in the target stream data, and add, according to the statistical index vector, the features in the target stream data that satisfy a first condition to a candidate feature set; calculate, when the number of features in the candidate feature set satisfies a second condition, a feature index of each feature in the candidate feature set based on the statistical index vector; and select a preset number of features from the candidate feature set in descending order of feature index and add them to a target feature set. The first condition includes that the correlation degree of a feature is greater than a correlation threshold and that the feature is not already present in the candidate feature set. A larger feature index indicates a higher correlation degree and a lower redundancy of the feature, and the correlation degree is the degree of correlation between the feature and its corresponding label.
Optionally, the acquisition module 301 is configured to acquire initial stream data; the processing module 302 is configured to determine a statistical index vector of each feature in the initial stream data, and to add all features of the initial stream data to both the candidate feature set and the target feature set.
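The initialization step above can be sketched as follows. This is only an illustrative sketch: the text does not specify the contents of the statistical index vector, so the six running sums used here (sample count and first- and second-order sums of the feature and the label) are an assumption, as is the helper name `init_feature_sets`.

```python
import numpy as np

def init_feature_sets(X0, y0):
    """Initialise statistics and both feature sets from the initial stream data.

    ASSUMPTION: the statistical index vector of a feature is taken to be
    [n, sum(x), sum(x^2), sum(y), sum(y^2), sum(x*y)]; the source text does
    not define its contents.
    """
    X0 = np.asarray(X0, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    stats = {}
    for j in range(X0.shape[1]):
        x = X0[:, j]
        stats[j] = np.array([x.size, x.sum(), (x * x).sum(),
                             y0.sum(), (y0 * y0).sum(), (x * y0).sum()])
    # Every initial feature enters both the candidate set and the target set.
    candidate_set = set(stats)
    target_set = set(stats)
    return stats, candidate_set, target_set
```

Starting with both sets full means a correlation threshold can be derived from the target feature set as soon as the first block of target stream data arrives.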
Optionally, the features of the target stream data include existing features and newly added features; the processing module 302 is configured to determine a first data block from the target stream data, where the features of the first data block are the existing features, and to determine a statistical index vector of each feature in the first data block based on the stream data acquired at the previous moment.
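One way to read this step is that statistics of existing features are updated incrementally from the values kept at the previous moment, while newly added features start from the current block alone. The sketch below illustrates that reading; the additive six-sum vector and the names `block_stats` and `update_stats` are assumptions, not taken from the source.

```python
import numpy as np

def block_stats(x, y):
    """ASSUMED statistical index vector: [n, Sx, Sxx, Sy, Syy, Sxy]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array([x.size, x.sum(), (x * x).sum(),
                     y.sum(), (y * y).sum(), (x * y).sum()])

def update_stats(prev_stats, X_t, y_t, feature_ids):
    """Return updated statistics after seeing the block (X_t, y_t).

    prev_stats  : dict feature id -> vector from the previous moment
    feature_ids : ids of the columns of X_t (existing and newly added)
    """
    X_t = np.asarray(X_t, dtype=float)
    new_stats = dict(prev_stats)
    for col, fid in enumerate(feature_ids):
        s = block_stats(X_t[:, col], y_t)
        if fid in prev_stats:
            # Existing feature: sums are additive, so no old data is revisited.
            new_stats[fid] = prev_stats[fid] + s
        else:
            # Newly added feature: only the current block is available.
            new_stats[fid] = s
    return new_stats
```

Because the sums are additive, the raw stream data of earlier moments does not need to be stored under this assumption.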
Optionally, the processing module 302 is configured to: calculate the correlation degree of each feature in the target stream data according to the statistical index vector; determine the minimum correlation degree among the features in the target feature set as the correlation threshold; determine, when the correlation degree of a feature is greater than the correlation threshold, whether the candidate feature set already includes the feature; and if not, add the feature to the candidate feature set.
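The candidate-set update described above can be sketched as below. The thresholding logic follows the text; the correlation measure (absolute Pearson correlation between feature and label, computed from an assumed six-sum statistical index vector) and the function names are illustrative assumptions.

```python
import math

def relevance(s):
    """Absolute Pearson correlation from s = (n, Sx, Sxx, Sy, Syy, Sxy).

    ASSUMPTION: the source does not state which correlation measure is used.
    """
    n, sx, sxx, sy, syy, sxy = s
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if var_x <= 0 or var_y <= 0:
        return 0.0  # constant feature or label: treat as uncorrelated
    return abs(cov) / math.sqrt(var_x * var_y)

def update_candidates(stats, candidate_set, target_set):
    """Add features whose correlation degree exceeds the threshold.

    The threshold is the minimum correlation degree within the target
    feature set, which is assumed to be non-empty.
    """
    rel = {f: relevance(s) for f, s in stats.items()}
    threshold = min(rel[f] for f in target_set)
    for f, r in rel.items():
        if r > threshold and f not in candidate_set:
            candidate_set.add(f)
    return rel, threshold
```

Using the weakest feature in the target set as the bar means a new feature only becomes a candidate if it could plausibly displace something already selected.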
Optionally, the processing module 302 is configured to: determine a first threshold according to the number of features to be selected for the target feature set and a candidate feature set coefficient; calculate a similarity matrix of the features in the candidate feature set when the number of features in the candidate feature set is greater than the first threshold; and calculate a feature index of each feature in the candidate feature set according to the similarity matrix and the correlation degree.
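A minimal sketch of this selection step is given below. Several details are assumptions made only for illustration: the first threshold is taken as the product of the target size `k` and the coefficient `coeff`, the similarity matrix is the absolute Pearson correlation between candidate feature columns, and the feature index is the correlation degree minus the mean similarity to the other candidates. The source states only that the index grows with correlation degree and shrinks with redundancy.

```python
import numpy as np

def select_target_features(X_cand, relevance, k, coeff):
    """Return column indices of the k selected features, or None.

    X_cand    : samples x candidate-features matrix (assumed available)
    relevance : correlation degree of each candidate with the label
    k, coeff  : target size and candidate-set coefficient (ASSUMED roles)
    """
    X_cand = np.asarray(X_cand, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    m = X_cand.shape[1]
    first_threshold = k * coeff  # ASSUMPTION: product form
    if m <= first_threshold or m < 2:
        return None  # second condition not met: keep collecting candidates
    # Similarity matrix between candidate features (constant columns would
    # yield NaN here and are assumed to have been filtered out already).
    sim = np.abs(np.corrcoef(X_cand, rowvar=False))
    np.fill_diagonal(sim, 0.0)
    redundancy = sim.sum(axis=1) / (m - 1)
    index = relevance - redundancy  # ASSUMED form of the feature index
    order = np.argsort(-index, kind="stable")  # descending feature index
    return order[:k].tolist()
```

For example, with two perfectly collinear candidates and one independent candidate, the independent one is ranked first even if its correlation degree is somewhat lower, because its redundancy term is zero.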
In the embodiments of the present invention, the features in the target stream data that satisfy the first condition are added to the candidate feature set according to the statistical index vector, and a preset number of features are then selected from the candidate feature set based on the feature index and added to the target feature set, which addresses the problem of dimensionality reduction by feature selection. Because the first condition requires that the correlation degree of a feature be greater than the correlation threshold and that the feature not already be present in the candidate feature set, and because a larger feature index indicates a higher correlation degree and a lower redundancy, the correlation degree of the features in the target feature set can be improved dynamically and their redundancy reduced, thereby achieving efficient online feature selection.
Fig. 4 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include a processor 410, a communication interface 420, a memory 430, and a communication bus 440, where the processor 410, the communication interface 420, and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform an online feature selection method comprising: acquiring target stream data; determining a statistical index vector of each feature in the target stream data, and adding, according to the statistical index vector, the features in the target stream data that satisfy a first condition to a candidate feature set; calculating, when the number of features in the candidate feature set satisfies a second condition, a feature index of each feature in the candidate feature set based on the statistical index vector; and selecting a preset number of features from the candidate feature set in descending order of feature index and adding them to a target feature set; wherein the first condition includes that the correlation degree of a feature is greater than a correlation threshold and that the feature is not already present in the candidate feature set; a larger feature index indicates a higher correlation degree and a lower redundancy of the feature; and the correlation degree is the degree of correlation between the feature and its corresponding label.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the online feature selection method provided above, the method comprising: acquiring target stream data; determining a statistical index vector of each feature in the target stream data, and adding, according to the statistical index vector, the features in the target stream data that satisfy a first condition to a candidate feature set; calculating, when the number of features in the candidate feature set satisfies a second condition, a feature index of each feature in the candidate feature set based on the statistical index vector; and selecting a preset number of features from the candidate feature set in descending order of feature index and adding them to a target feature set; wherein the first condition includes that the correlation degree of a feature is greater than a correlation threshold and that the feature is not already present in the candidate feature set; a larger feature index indicates a higher correlation degree and a lower redundancy of the feature; and the correlation degree is the degree of correlation between the feature and its corresponding label.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the online feature selection method provided above, the method comprising: acquiring target stream data; determining a statistical index vector of each feature in the target stream data, and adding, according to the statistical index vector, the features in the target stream data that satisfy a first condition to a candidate feature set; calculating, when the number of features in the candidate feature set satisfies a second condition, a feature index of each feature in the candidate feature set based on the statistical index vector; and selecting a preset number of features from the candidate feature set in descending order of feature index and adding them to a target feature set; wherein the first condition includes that the correlation degree of a feature is greater than a correlation threshold and that the feature is not already present in the candidate feature set; a larger feature index indicates a higher correlation degree and a lower redundancy of the feature; and the correlation degree is the degree of correlation between the feature and its corresponding label.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An online feature selection method, comprising:
acquiring target stream data;
determining a statistical index vector of each feature in the target stream data, and adding, according to the statistical index vector, the features in the target stream data that satisfy a first condition to a candidate feature set;
calculating, when the number of features in the candidate feature set satisfies a second condition, a feature index of each feature in the candidate feature set based on the statistical index vector;
selecting a preset number of features from the candidate feature set in descending order of feature index, and adding the selected features to a target feature set;
wherein the first condition includes that the correlation degree of a feature is greater than a correlation threshold and that the feature is not already present in the candidate feature set; a larger feature index indicates a higher correlation degree and a lower redundancy of the feature; and the correlation degree is the degree of correlation between the feature and its corresponding label.
2. The online feature selection method of claim 1, wherein prior to acquiring the target stream data, the method further comprises:
acquiring initial stream data;
determining a statistical index vector of each feature in the initial stream data;
adding all features of the initial stream data to both the candidate feature set and the target feature set.
3. The online feature selection method according to claim 1 or 2, wherein the features of the target stream data include existing features and newly added features;
the determining a statistical index vector of each feature in the target stream data comprises:
determining a first data block from the target stream data, the features of the first data block being the existing features;
determining a statistical index vector of each feature in the first data block based on the stream data acquired at the previous moment.
4. The online feature selection method according to claim 1, wherein the adding, according to the statistical index vector, the features in the target stream data that satisfy a first condition to a candidate feature set comprises:
calculating the correlation degree of each feature in the target stream data according to the statistical index vector;
determining the minimum correlation degree among the features in the target feature set as the correlation threshold;
determining, when the correlation degree of a feature is greater than the correlation threshold, whether the candidate feature set includes the feature;
if not, adding the feature to the candidate feature set.
5. The online feature selection method according to claim 4, wherein the calculating, when the number of features in the candidate feature set satisfies a second condition, a feature index of each feature in the candidate feature set based on the statistical index vector comprises:
determining a first threshold according to the number of features to be selected for the target feature set and a candidate feature set coefficient;
calculating a similarity matrix of the features in the candidate feature set when the number of features in the candidate feature set is greater than the first threshold;
calculating a feature index of each feature in the candidate feature set according to the similarity matrix and the correlation degree.
6. An online feature selection apparatus, comprising: an acquisition module and a processing module;
the acquisition module is configured to acquire target stream data;
the processing module is configured to: determine a statistical index vector of each feature in the target stream data, and add, according to the statistical index vector, the features in the target stream data that satisfy a first condition to a candidate feature set; calculate, when the number of features in the candidate feature set satisfies a second condition, a feature index of each feature in the candidate feature set based on the statistical index vector; and select a preset number of features from the candidate feature set in descending order of feature index and add the selected features to a target feature set;
wherein the first condition includes that the correlation degree of a feature is greater than a correlation threshold and that the feature is not already present in the candidate feature set; a larger feature index indicates a higher correlation degree and a lower redundancy of the feature; and the correlation degree is the degree of correlation between the feature and its corresponding label.
7. The online feature selection apparatus of claim 6, wherein
the acquisition module is configured to acquire initial stream data;
the processing module is configured to determine a statistical index vector of each feature in the initial stream data, and to add all features of the initial stream data to both the candidate feature set and the target feature set.
8. The online feature selection apparatus according to claim 6 or 7, wherein the features of the target stream data include existing features and newly added features; the processing module is configured to determine a first data block from the target stream data, the features of the first data block being the existing features, and to determine a statistical index vector of each feature in the first data block based on the stream data acquired at the previous moment.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the online feature selection method according to any one of claims 1 to 5 when the program is executed.
10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps in the online feature selection method according to any one of claims 1 to 5.
CN202311754585.1A 2023-12-19 2023-12-19 Online feature selection method and device, electronic equipment and storage medium Pending CN117909708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311754585.1A CN117909708A (en) 2023-12-19 2023-12-19 Online feature selection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311754585.1A CN117909708A (en) 2023-12-19 2023-12-19 Online feature selection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117909708A true CN117909708A (en) 2024-04-19

Family

ID=90690258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311754585.1A Pending CN117909708A (en) 2023-12-19 2023-12-19 Online feature selection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117909708A (en)

Similar Documents

Publication Publication Date Title
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN109582793B (en) Model training method, customer service system, data labeling system and readable storage medium
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN112487168B (en) Semantic question-answering method and device of knowledge graph, computer equipment and storage medium
CN109919183B (en) Image identification method, device and equipment based on small samples and storage medium
CN111127364B (en) Image data enhancement strategy selection method and face recognition image data enhancement method
CN110941698B (en) Service discovery method based on convolutional neural network under BERT
CN108804577B (en) Method for estimating interest degree of information tag
CN113065525A (en) Age recognition model training method, face age recognition method and related device
CN113377991B (en) Image retrieval method based on most difficult positive and negative samples
CN110705889A (en) Enterprise screening method, device, equipment and storage medium
CN116823782A (en) Reference-free image quality evaluation method based on graph convolution and multi-scale features
CN117909708A (en) Online feature selection method and device, electronic equipment and storage medium
CN115457269A (en) Semantic segmentation method based on improved DenseNAS
CN112463964B (en) Text classification and model training method, device, equipment and storage medium
CN114003707A (en) Problem retrieval model training method and device and problem retrieval method and device
CN109614456B (en) Deep learning-based geographic information positioning and partitioning method and device
JP6993250B2 (en) Content feature extractor, method, and program
CN110309139B (en) High-dimensional neighbor pair searching method and system
CN111125541A (en) Method for acquiring sustainable multi-cloud service combination for multiple users
CN116109650B (en) Point cloud instance segmentation model training method and training device
CN115222945B (en) Deep semantic segmentation network training method based on multi-scale self-adaptive course learning
CN117235448B (en) Data cleaning method, terminal equipment and storage medium
CN117633619A (en) Customer service telephone conversation text classification method and device
CN117216530A (en) Model information determining method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination