CN114268625B - Feature selection method, device, equipment and storage medium - Google Patents


Info

Publication number
CN114268625B
CN114268625B
Authority
CN
China
Prior art keywords
feature
feature set
candidate
selection
selected feature
Prior art date
Legal status
Active
Application number
CN202010960198.3A
Other languages
Chinese (zh)
Other versions
CN114268625A (en)
Inventor
郑立凡
吕培立
董井然
陈守志
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010960198.3A
Publication of CN114268625A
Application granted
Publication of CN114268625B


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a feature selection method, a device, equipment, and a storage medium. The method comprises the following steps: any execution node receives the initial selected feature set broadcast by the management node, together with the feature data corresponding to that set; constructs at least one first candidate feature set; obtains a model performance index for each first candidate feature set; and, in response to a first candidate feature set satisfying the selection condition existing among the at least one first candidate feature set, sends that set to the management node, which uses it to obtain the target selected feature set. Because each execution node stores complete feature data and receives the selected feature set and its feature data by broadcast from the management node, feature selection can be carried out directly, without any exchange of feature data between execution nodes. The feature selection process therefore takes little time, and feature selection is both efficient and effective.

Description

Feature selection method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the field of computers, and in particular to a feature selection method, device, equipment, and storage medium.
Background
As computer storage and computing power continue to grow, the field of machine learning increasingly involves high-dimensional datasets. The original feature set corresponding to a high-dimensional dataset typically contains a large number of redundant features, which degrade the processing performance of a machine learning model. Feature selection picks out, from the original feature set, the features that actually contribute to the machine learning model; the model is then trained or applied using only the data corresponding to the selected features, which reduces the computational complexity of the model and improves its processing performance.
A data set generally comprises complete sample data for each of at least two samples. The complete sample data of a sample contains one piece of feature data for each feature in the original feature set, and the pieces of feature data that a given feature contributes across all complete sample data together form the complete feature data of that feature.
In the related art, the data set is stored across the at least two execution nodes of a distributed system as follows: all complete sample data belonging to the same sample are stored on the same execution node. With this storage mode, feature selection requires the execution nodes to exchange feature data with one another in order to assemble the complete feature data of each feature, and only then can feature selection proceed. This exchange of feature data is time-consuming, so the feature selection process is inefficient and the effect of feature selection is poor.
Disclosure of Invention
The embodiment of the application provides a feature selection method, a device, equipment and a storage medium, which can be used for improving the effect of feature selection. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a feature selection method applied to any execution node in a distributed system, where the execution node stores complete feature data corresponding to at least one first feature, a first feature being any feature of the original feature set other than the features in the initial selected feature set, and the method includes:
receiving the initial selected feature set and feature data corresponding to the initial selected feature set, which are broadcasted by a management node;
constructing at least one first candidate feature set based on at least one first feature corresponding to the stored complete feature data and the initial selected feature set, wherein any first candidate feature set is any first feature set corresponding to the initial selected feature set and the stored complete feature data;
acquiring a model performance index corresponding to the at least one first candidate feature set based on the stored complete feature data and feature data corresponding to the initial selected feature set;
and, in response to the model performance index corresponding to the at least one first candidate feature set indicating that a first candidate feature set satisfying the selection condition exists among the at least one first candidate feature set, sending the first candidate feature set satisfying the selection condition to the management node, where the management node takes that set as a first selected feature set and obtains the target selected feature set based on the first selected feature set.
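The execution-node steps above (construct candidate sets from locally stored features, score each with a model performance index, report the best) can be sketched as follows. This is an illustrative outline only, not the patented implementation; the metric function and the toy data are placeholder assumptions.

```python
def build_candidate_sets(selected, local_features):
    """Each first candidate feature set is the selected set plus one
    locally stored first feature (a feature not already selected)."""
    return [selected | {f} for f in local_features if f not in selected]

def best_candidate(selected, local_features, metric):
    """Score every candidate set with a model performance metric and
    return the best candidate together with its score."""
    candidates = build_candidate_sets(selected, local_features)
    if not candidates:
        return None, float("-inf")
    score, cand = max(((metric(c), c) for c in candidates),
                      key=lambda sc: sc[0])
    return cand, score

# Toy metric: prefer larger sets that include the feature "income".
toy_metric = lambda s: len(s) + (1 if "income" in s else 0)
cand, score = best_candidate({"age"}, ["gender", "income"], toy_metric)
print(sorted(cand), score)  # ['age', 'income'] 3
```

In the patent, the metric is the performance index of a reference model trained on the candidate set's feature data; the lambda above only stands in for it.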
In another aspect, a feature selection method is provided, applied to a management node in a distributed system, where the management node is communicatively connected to at least two execution nodes in the distributed system, and any execution node stores complete feature data corresponding to at least one feature in the original feature set, and the method includes:
determining an initial selected feature set, and acquiring feature data corresponding to the initial selected feature set;
broadcasting the initial selected feature set and feature data corresponding to the initial selected feature set to the at least two executing nodes, wherein the at least two executing nodes are used for determining first candidate feature sets meeting selection conditions in all first candidate feature sets based on the stored complete feature data, the initial selected feature set and the feature data corresponding to the initial selected feature set, and sending the first candidate feature sets meeting the selection conditions to the management node, wherein any first candidate feature set is a set of the initial selected feature set and any first feature, and the first feature is a feature except the feature in the initial selected feature set in the original feature set;
and taking the first candidate feature set satisfying the selection condition as a first selected feature set, and obtaining a target selected feature set based on the first selected feature set.
In another aspect, there is provided a feature selection apparatus, the apparatus comprising:
the receiving unit is used for receiving the initial selected feature set broadcasted by the management node and feature data corresponding to the initial selected feature set;
the construction unit is used for constructing at least one first candidate feature set based on at least one first feature corresponding to the stored complete feature data and the initial selected feature set, wherein any one first candidate feature set is a set of any first feature corresponding to the initial selected feature set and the stored complete feature data;
the obtaining unit is used for obtaining a model performance index corresponding to the at least one first candidate feature set based on the stored complete feature data and the feature data corresponding to the initial selected feature set;
the sending unit is used for responding to the model performance index corresponding to the at least one first candidate feature set to indicate that the first candidate feature set meeting the selection condition exists in the at least one first candidate feature set, sending the first candidate feature set meeting the selection condition to the management node, and the management node is used for taking the first candidate feature set meeting the selection condition as a first selected feature set and acquiring a target selected feature set based on the first selected feature set.
In one possible implementation manner, the acquiring unit is configured to determine feature data corresponding to the at least one first candidate feature set based on the stored complete feature data and feature data corresponding to the initially selected feature set; for any one of the at least one first candidate feature set, acquiring a reference model corresponding to the any one first candidate feature set based on feature data corresponding to the any one first candidate feature set; and performing performance evaluation on the reference model corresponding to any one of the first candidate feature sets, and taking the performance evaluation result of the reference model corresponding to any one of the first candidate feature sets as a model performance index corresponding to any one of the first candidate feature sets.
In one possible implementation manner, the obtaining unit is further configured to obtain test data corresponding to the any one of the first candidate feature sets and a standard result corresponding to the test data; invoking a reference model corresponding to any one of the first candidate feature sets to process the test data to obtain a prediction result corresponding to the test data; and determining a performance evaluation result of the reference model corresponding to any one of the first candidate feature sets based on the prediction result and the standard result.
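The evaluation step described above — call the reference model on test data and compare its predictions with the standard results — can be sketched as below. The use of accuracy as the performance index, and the threshold-style model, are assumptions for illustration; the patent does not fix a particular index or model.

```python
def evaluate_reference_model(model, test_rows, standard_results):
    """Invoke the reference model on each piece of test data and compare
    the prediction with the standard (ground-truth) result; accuracy is
    used here as a stand-in model performance index."""
    predictions = [model(row) for row in test_rows]
    correct = sum(p == s for p, s in zip(predictions, standard_results))
    return correct / len(standard_results)

# Hypothetical reference model: predicts 1 when income >= 15000.
model = lambda row: 1 if row["income"] >= 15000 else 0
rows = [{"income": 10000}, {"income": 15000}, {"income": 20000}]
truth = [0, 1, 1]
print(evaluate_reference_model(model, rows, truth))  # 1.0
```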
In one possible implementation manner, the any executing node further stores at least one complete feature data corresponding to a second feature, where the second feature is a feature in the original feature set other than a feature in the first selected feature set, and the receiving unit is further configured to receive the first selected feature set and the feature data corresponding to the first selected feature set, which are broadcasted by the managing node;
the construction unit is further configured to construct at least one second candidate feature set based on at least one second feature corresponding to the stored complete feature data and the first selected feature set, where any second candidate feature set is a set of any second feature corresponding to the first selected feature set and the stored complete feature data;
the obtaining unit is further configured to obtain a model performance index corresponding to the at least one second candidate feature set based on the stored complete feature data and feature data corresponding to the first selected feature set;
the sending unit is further configured to send, to the management node, the second candidate feature set satisfying the selection condition, in response to the model performance index corresponding to the at least one second candidate feature set indicating that a second candidate feature set satisfying the selection condition exists among the at least one second candidate feature set.
In one possible implementation manner, the model performance index corresponding to the first candidate feature set meeting the selection condition is a maximum model performance index of all model performance indexes, where all model performance indexes include model performance indexes respectively corresponding to all first candidate feature sets acquired by each execution node in the distributed system.
There is also provided a feature selection apparatus, the apparatus comprising:
a determining unit for determining an initial selected feature set;
the acquisition unit is used for acquiring the feature data corresponding to the initial selected feature set;
the broadcasting unit is used for broadcasting the initial selected feature set and feature data corresponding to the initial selected feature set to the at least two executing nodes, the at least two executing nodes are used for determining first candidate feature sets meeting selection conditions in all first candidate feature sets based on the stored complete feature data, the initial selected feature set and the feature data corresponding to the initial selected feature set, the first candidate feature sets meeting the selection conditions are sent to the management node, any one of the first candidate feature sets is a set of the initial selected feature set and any one of the first feature sets, and the first feature is a feature except the feature in the initial selected feature set in the original feature set;
The obtaining unit is further configured to take the first candidate feature set satisfying the selection condition as a first selected feature set, and to obtain the target selected feature set based on the first selected feature set.
In a possible implementation manner, the obtaining unit is further configured to, in response to a selection procedure meeting a termination condition, take the first selected feature set as the target selected feature set; in response to the selection process failing to meet the termination condition, acquiring feature data corresponding to the first selected feature set, broadcasting the first selected feature set and the feature data corresponding to the first selected feature set to the at least two executing nodes, wherein the at least two executing nodes are used for determining second candidate feature sets meeting the selection condition in all second candidate feature sets based on the stored complete feature data, the first selected feature set and the feature data corresponding to the first selected feature set, and transmitting the second candidate feature sets meeting the selection condition to the management node; taking the second candidate feature set meeting the selection condition as a second selected feature set, and acquiring a target selected feature set based on the second selected feature set; wherein any second candidate feature set is a set of the first selected feature set and any second feature that is a feature of the original feature set other than the feature of the first selected feature set.
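The management-node iteration described above — broadcast the current selected set, collect each execution node's best proposal, adopt the overall best as the new selected set, and repeat until the termination condition is met — can be sketched as a forward-selection loop. This is a single-process simulation under assumed interfaces, not the distributed implementation; each "executor" here is just a function returning a (score, candidate set) pair.

```python
def forward_select(initial_selected, executors, terminated):
    """Management-node loop: gather each executor's best candidate set
    and score, adopt the overall best, and stop when the termination
    condition holds or no executor can improve on the current set."""
    selected = set(initial_selected)
    while not terminated(selected):
        proposals = [ex(selected) for ex in executors]
        score, candidate = max(proposals, key=lambda p: p[0])
        if candidate == selected:
            break  # no improvement possible
        selected = candidate
    return selected

# Two toy executors, each owning one local feature.
ex1 = lambda sel: (len(sel | {"age"}), sel | {"age"})
ex2 = lambda sel: (len(sel | {"income"}) + 0.5, sel | {"income"})
result = forward_select(set(), [ex1, ex2], lambda sel: len(sel) >= 2)
print(sorted(result))  # ['age', 'income']
```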
In one possible implementation, the selecting process satisfies a termination condition, including:
the number of features in the first selected feature set is not less than a number threshold; or,
the original feature set is free of features other than the features in the first selected feature set; or,
and the model performance index increment corresponding to the first selected feature set is smaller than a reference threshold value.
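The three termination conditions just listed can be checked with a small predicate like the one below; the concrete threshold values are illustrative placeholders, since the patent leaves them unspecified.

```python
def selection_terminated(selected, original_features, gain,
                         count_threshold=10, gain_threshold=1e-3):
    """True when (a) the selected set is not smaller than the number
    threshold, (b) no unselected features remain in the original feature
    set, or (c) the latest model-performance-index increment is below
    the reference threshold."""
    if len(selected) >= count_threshold:
        return True
    if not set(original_features) - set(selected):
        return True
    if gain < gain_threshold:
        return True
    return False
```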
In a possible implementation manner, the determining unit is configured to use, as the initial selected feature set, a set of at least one reserved feature in response to the existence of the reserved feature; in response to the absence of the reserved feature, the empty set is taken as the initial selected feature set.
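The determination of the initial selected feature set above reduces to one line:

```python
def initial_selected_set(reserved_features=None):
    """The initial selected feature set is the set of reserved features
    when any exist, and the empty set otherwise."""
    return set(reserved_features) if reserved_features else set()

print(initial_selected_set(["age"]))  # {'age'}
print(initial_selected_set())         # set()
```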
In another aspect, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one program code, the at least one program code loaded and executed by the processor to implement any of the feature selection methods described above.
In another aspect, there is provided a computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement any of the above-described feature selection methods.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform any of the feature selection methods described above.
The technical scheme provided by the embodiment of the application at least brings the following beneficial effects:
in the embodiments of the present application, the complete feature data corresponding to a given feature is stored directly on a single execution node. Based on the complete feature data it stores, together with the selected feature set and its corresponding feature data broadcast by the management node, each execution node can carry out feature selection directly, and no feature data needs to be exchanged between execution nodes. The feature selection process therefore takes little time, and feature selection is both efficient and effective.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment of a feature selection method provided in an embodiment of the present application;
FIG. 2 is a flow chart of a feature selection method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a feature selection process provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an implementation of a wraparound feature selection provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a feature selection device according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a feature selection device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In order to facilitate understanding of the technical process of the embodiments of the present application, some terms related to the embodiments of the present application are explained below:
machine learning: machine learning is the science of letting a computer learn and act like a human, learning underlying knowledge under a large amount of data through a model, and optimizing the model with an optimization algorithm. The method is widely applied to various fields, such as shopping recommendation, search ranking, advertisement clicking, credit risk assessment, image recognition, automatic driving and the like.
Characteristic engineering: feature engineering refers to the process of using domain knowledge of data to create features that enable machine learning models to function, and is the basis of machine learning applications. Feature engineering is one of the means to effectively promote machine learning models.
Feature selection: feature selection refers to the process of selecting an appropriate subset of features from an original feature set according to certain criteria in order to enhance the model effect (e.g., enhance generalization ability) when building a machine learning model. Common feature selection methods include Filter feature selection, wrapper feature selection, embedded feature selection, hybrid feature selection, and the like. The feature selection is an important content in the feature engineering, and can simplify the model, reduce the training/reasoning time of the model and improve the generalization capability of the model through feature selection.
KS (Kolmogorov–Smirnov) test: the KS test is a statistical test that compares an empirical frequency distribution f(x) with a theoretical distribution g(x), or compares the distributions of two sets of observed values.
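For the two-sample case mentioned above, the KS statistic is the largest vertical gap between the two empirical cumulative distribution functions. A minimal pure-Python sketch (the patent does not prescribe an implementation; libraries such as SciPy provide an equivalent two-sample test):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum absolute difference between
    the two empirical cumulative distribution functions."""
    def ecdf(sample, x):
        # Fraction of observations less than or equal to x.
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

print(ks_statistic([1, 2, 3], [1, 2, 3]))  # 0.0
print(ks_statistic([1, 2], [10, 20]))      # 1.0
```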
An embodiment of the present application provides a feature selection method, please refer to fig. 1, which illustrates a schematic diagram of an implementation environment of the feature selection method provided in the embodiment of the present application. The implementation environment comprises: distributed system 100. The distributed system 100 includes a management node 110 and at least two execution nodes 120.
Wherein, the management node 110 can broadcast the selected feature set and the feature data corresponding to the selected feature set to each execution node 120, and the management node 110 can also receive the updated selected feature set sent by the execution node 120. The executing node 120 may be configured to receive the selected feature set and feature data corresponding to the selected feature set broadcast by the management node 110, where the executing node 120 may be further configured to obtain a model performance index corresponding to the candidate feature set, and further determine the updated selected feature set according to the model performance index corresponding to the candidate feature set.
In one possible implementation, the management node 110 may be a terminal or a server, which is not limited in the embodiments of the present application. Similarly, the execution node 120 may be a terminal or a server. The terminal may be any electronic product that can interact with a user through one or more of a keyboard, a touch pad, a touch screen, a remote control, voice interaction, or a handwriting device, such as a PC (Personal Computer), a mobile phone, a smartphone, a PDA (Personal Digital Assistant), a wearable device, a handheld PPC (Pocket PC), a tablet computer, a smart in-vehicle device, a smart TV, or a smart speaker. The server may be a single server, a server cluster composed of multiple servers, or a cloud computing service center.
The management node 110 establishes a communication connection with the execution node 120 through a wired or wireless network, and communication connection is also established between different execution nodes 120 through a wired or wireless network.
Those skilled in the art will appreciate that the above described distributed system 100 is by way of example only, and that other existing or future distributed systems may be suitable for use in the present application and are intended to be included within the scope of the present application and are incorporated herein by reference.
Based on the implementation environment shown in fig. 1, the embodiment of the present application provides a feature selection method, which is applied to the interaction process between the management node 110 and any execution node 120 (referred to as a target execution node in the embodiment of the present application) in the distributed system 100. In the embodiment of the application, the distributed system comprises a management node and at least two execution nodes, wherein the management node is used for being in communication connection with the at least two execution nodes; any executing node at least stores the complete feature data corresponding to one feature in the original feature set. As shown in fig. 2, the method provided in the embodiment of the present application includes the following steps:
in step 201, the management node determines an initial selected feature set, and obtains feature data corresponding to the initial selected feature set.
The initial selected feature set is the selected feature set at the beginning of the feature selection process. In one possible implementation, the management node determines the initial selected feature set as follows: in response to the presence of reserved features, take the set of at least one reserved feature as the initial selected feature set; in response to the absence of reserved features, take the empty set as the initial selected feature set.
The reserved features are preset features that are required to be selected; when reserved features exist, the set of all reserved features is taken as the initial selected feature set. When reserved features are present, their number may be one or more, which the embodiments of the present application do not limit. In an exemplary embodiment, the reserved features are features in the original feature set corresponding to the original data set stored in the distributed system. When there are no reserved features, the empty set is directly taken as the initial selected feature set; in this case the initial selected feature set does not include any feature.
Depending on the initial selected feature set, the process of acquiring its corresponding feature data differs. In one possible implementation, when the initial selected feature set is the empty set, its corresponding feature data is empty data; in this case, acquiring the feature data corresponding to the initial selected feature set simply means taking the empty data as that feature data.
In one possible implementation, when the initial selected feature set is a set formed by at least one reserved feature, the initial selected feature set includes at least one feature. In this case, the feature data corresponding to the initial selected feature set refers to the data obtained by aggregating the complete feature data corresponding to each feature in the initial selected feature set. The acquisition process is therefore: obtain the complete feature data corresponding to each feature in the initial selected feature set, and aggregate the complete feature data of all these features to obtain the feature data corresponding to the initial selected feature set.
In the embodiments of the present application, the data set on which the feature selection process operates is referred to as the original data set. The original data set comprises complete sample data for each of at least two samples, and the complete sample data corresponding to any sample includes one piece of feature data for each feature in the original feature set. The original feature set is the set of all features to which the data in the original data set belong. In an exemplary embodiment, the features are attributes of a user, which may be derived from the user's profile and daily activities. For example, a user's features include age, education, income, and so on; based on these features, behaviors such as the user's advertisement clicks, purchases in marketing campaigns, and loan delinquency can be estimated.
The complete sample data corresponding to any one sample includes one piece of feature data for each feature in the original feature set. That is, the pieces of feature data included in the complete sample data belong to different features. For example, as shown in Table 1, the original data set includes complete sample data for three samples, and the corresponding original feature set includes the five features "sample identification", "age", "gender", "education", and "income". Taking the first sample as an example, its complete sample data includes five pieces of feature data, namely "sample 1", "20", "female", "junior college", and "10,000", which belong respectively to the features "sample identification", "age", "gender", "education", and "income".
TABLE 1
Sample identification | Age | Gender | Education      | Income
Sample 1              | 20  | Female | Junior college | 10,000
Sample 2              | 25  | Male   | Bachelor       | 15,000
Sample 3              | 38  | Male   | Master         | 20,000
Before the feature selection is performed by using the method provided by the embodiment of the application, any executing node in the distributed system at least stores complete feature data corresponding to one feature in the original feature set. The complete feature data corresponding to any feature comprises all feature data belonging to the any feature in the complete sample data corresponding to each sample in the original data set.
For example, taking the original data set shown in Table 1, the complete feature data corresponding to the feature "age" consists of the feature data "20" from the complete sample data of the first sample, "25" from that of the second sample, and "38" from that of the third sample.
From the foregoing, it can be seen that the original data set can be regarded as a set of complete sample data corresponding to each sample, and also as a set of complete feature data corresponding to each feature in the original feature set.
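The two views described above — the same data set as complete sample data per sample, or as complete feature data per feature — can be illustrated with a minimal regrouping sketch (feature names and values follow Table 1; the column layout is an illustration, not the patented storage format):

```python
# Row view: complete sample data, one record per sample (cf. Table 1).
rows = [
    {"id": "sample 1", "age": 20, "income": 10000},
    {"id": "sample 2", "age": 25, "income": 15000},
    {"id": "sample 3", "age": 38, "income": 20000},
]

def to_feature_columns(rows):
    """Regroup row-wise complete sample data into column-wise complete
    feature data: one list per feature, holding that feature's value
    from every sample."""
    return {feature: [row[feature] for row in rows] for feature in rows[0]}

columns = to_feature_columns(rows)
print(columns["age"])  # [20, 25, 38]
```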
Any executing node in the distributed system at least stores complete feature data corresponding to one feature in the original feature set, which means that any executing node stores complete feature data corresponding to one or more features in the original feature set. It should be noted that, in order to ensure that the original data set is all stored in the execution nodes and the stored data is not redundant, the intersection between the complete feature data stored in any two execution nodes is an empty set, and the union of the complete feature data stored in each execution node is the original data set.
The distributed system is used to process big data, which is characterized by huge data volume, varied data types, and parallel processing. The implementation framework of the distributed system is not limited in the embodiments of the present application; illustratively, it may be a big-data computing framework such as Hadoop or Spark.
In one possible implementation manner, a storage manner in which any execution node stores complete feature data corresponding to at least one feature in the original feature set is used as the target storage manner. Before storing the original data in the target storage manner, the execution nodes in the distributed system store the original data in a default storage manner. The default storage manner refers to a storage manner in which any executing node stores complete sample data corresponding to at least one sample. In this case, before performing feature selection by using the method provided in the embodiment of the present application, the management node sends a storage mode adjustment instruction to at least two execution nodes, so that the storage manner of the original data set in the at least two execution nodes is converted from the default storage manner to the target storage manner, and the feature selection method provided in the embodiment of the present application is implemented on the basis of the target storage manner.
Under a default storage mode, the complete sample data corresponding to the same sample can be ensured to be stored in the same executing node. It should be noted that, in the default storage manner, only one sample of complete sample data corresponding to one sample may be stored in any execution node, or multiple samples of complete sample data corresponding to multiple samples may be stored in any execution node, which is not limited in the embodiment of the present application. After the storage mode is converted from the default storage mode to the target storage mode, the complete feature data corresponding to the same feature can be ensured to be stored in the same executing node. It should be noted that, in the target storage manner, only the complete feature data corresponding to one feature may be stored in any execution node, or the complete feature data corresponding to a plurality of features may be stored in any execution node, which is not limited in this embodiment of the present application.
In one possible implementation manner, the storage mode adjustment instruction may carry an adjustment operation, and the executing node that receives the storage mode adjustment instruction adjusts according to the adjustment operation, thereby converting the storage mode of the original data set in at least two executing nodes from the default storage mode to the target storage mode. The embodiments of the present application do not limit the adjustment operation. The process of converting the default storage mode into the target storage mode may be regarded as a process of performing transposition processing on the original data set; after the transposition processing, the feature data is converted from row storage into column storage.
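The transposition described above can be illustrated with a hedged sketch: converting a row-stored data set (one record of feature values per sample) into a column-stored layout (one list of values per feature). The data and names here are illustrative, not the patent's actual adjustment operation.

```python
# A hedged illustration of the transposition: row storage (per sample)
# is converted into column storage (per feature).

def rows_to_columns(rows):
    columns = {}
    for row in rows:
        for feature, value in row.items():
            columns.setdefault(feature, []).append(value)
    return columns

rows = [
    {"age": 20, "income": 3000},
    {"age": 25, "income": 5000},
    {"age": 38, "income": 8000},
]
print(rows_to_columns(rows))
# {'age': [20, 25, 38], 'income': [3000, 5000, 8000]}
```

After this conversion, each feature's complete column can be assigned to a single execution node, which is what allows later steps to avoid Shuffle operations.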
Under the default storage mode, the distributed system needs to frequently execute the Shuffle operation to realize the feature selection process, and after the default storage mode is converted into the target storage mode, the Shuffle operation can be reduced or even eliminated, so that the execution efficiency of the feature selection process is improved.
In one possible implementation, the original data set refers to a preprocessed data set, and the preprocessing mode includes at least one of missing value filling and outlier replacement. By preprocessing, the reliability of the original data set can be higher, and the feature selection effect can be improved.
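The two preprocessing modes mentioned above can be sketched as follows. This is a minimal illustration only; the fill value and the outlier bounds are hypothetical, and the patent does not prescribe a particular filling or replacement strategy.

```python
# A minimal preprocessing sketch covering the two modes mentioned above:
# missing-value filling and outlier replacement (here, clamping to a range).

def preprocess(values, fill=0.0, low=None, high=None):
    out = []
    for v in values:
        if v is None:                      # missing value: fill with a default
            v = fill
        if low is not None and v < low:    # outlier below range: replace with bound
            v = low
        if high is not None and v > high:  # outlier above range: replace with bound
            v = high
        out.append(v)
    return out

# Illustrative "age" column: one missing value, one implausible outlier.
print(preprocess([20, None, 250], fill=30, low=0, high=120))  # [20, 30, 120]
```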
As can be seen from the foregoing, in the embodiment of the present application, each execution node stores complete feature data corresponding to at least one feature. Based on this, the process of obtaining, by the management node, the feature data corresponding to the initial selected feature set includes: the management node acquires the complete feature data corresponding to each feature in the initial selected feature set from the reference execution node, and gathers the acquired complete feature data corresponding to each feature to obtain the feature data corresponding to the initial selected feature set.
The reference execution node refers to an execution node storing complete feature data corresponding to features in the initially selected feature set. In one possible implementation manner, the management node stores a corresponding relation between the complete feature data and the execution node identifier, where the corresponding relation between the complete feature data and the execution node identifier is used to indicate in which execution node the complete feature data corresponding to a feature is stored. Based on the corresponding relation between the complete feature data and the execution node identification, the management node can determine the reference execution node, further send a complete feature data acquisition request to the reference execution node, and acquire complete feature data corresponding to the features in the initial selected feature set sent by the reference execution node.
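The correspondence between features and execution node identifiers can be sketched as a simple lookup table held by the management node. The node identifiers and mapping shown here are hypothetical; the patent does not specify the form of the correspondence relation.

```python
# A sketch of the correspondence the management node keeps: feature name ->
# identifier of the execution node that stores that feature's complete data.

feature_to_node = {"age": "node-1", "academic": "node-2", "income": "node-3"}

def reference_nodes(selected_features, mapping):
    """Return the set of reference execution nodes to request data from."""
    return {mapping[f] for f in selected_features}

# The management node would send complete-feature-data requests to these nodes.
print(sorted(reference_nodes({"age", "income"}, feature_to_node)))  # ['node-1', 'node-3']
```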
In one possible implementation, when storing the complete feature data corresponding to each feature, the encoded complete feature data may be stored, so as to save storage space.
In step 202, the management node broadcasts feature data corresponding to the initial selected feature set and the initial selected feature set to at least two execution nodes.
The at least two executing nodes are used for determining first candidate feature sets meeting selection conditions in all first candidate feature sets based on stored complete feature data, initial selected feature sets and feature data corresponding to the initial selected feature sets, and sending the first candidate feature sets meeting the selection conditions to the management node, wherein any one of the first candidate feature sets is a set of the initial selected feature sets and any one of the first feature sets, and the first feature is a feature except for the feature in the initial selected feature set in the original feature set.
After the feature data corresponding to the initial selected feature set is obtained, the management node broadcasts the initial selected feature set and the feature data corresponding to the initial selected feature set to each execution node so that each execution node can determine the first candidate feature set meeting the selection condition in all the first candidate feature sets based on the stored complete feature data, the feature data corresponding to the initial selected feature set and the initial selected feature set.
After the management node broadcasts the initial selected feature set and the feature data corresponding to the initial selected feature set to all the execution nodes, the execution nodes storing the complete feature data corresponding to the first feature and the execution nodes not storing the complete feature data corresponding to the first feature can both receive the feature data corresponding to the initial selected feature set and the initial selected feature set.
The executing nodes that do not store complete feature data corresponding to any first feature are executing nodes that only store complete feature data corresponding to one or some features in the initial selected feature set. After such an executing node receives the initial selected feature set and the feature data corresponding to the initial selected feature set, because its stored complete feature data does not include any complete feature data beyond the received feature data, it ignores the received initial selected feature set and feature data and does not participate in the process of acquiring the first selected feature set.
After receiving the initial selected feature set and the feature data corresponding to the initial selected feature set, the executing node storing the complete feature data corresponding to the first feature participates in the process of acquiring the first selected feature set based on the stored complete feature data, the received initial selected feature set and the feature data corresponding to the initial selected feature set. In the embodiment of the application, each execution node storing at least one complete feature data corresponding to a first feature is called a first execution node.
It should be further noted that, as long as a certain execution node stores complete feature data corresponding to a certain or certain first features, the execution node is referred to as a first execution node. Any first executing node may store, in addition to the complete feature data corresponding to some or some first features, complete feature data corresponding to some or some features in the initially selected feature set, which is not limited in this embodiment of the present application.
In step 203, the target execution node receives the initial selected feature set and feature data corresponding to the initial selected feature set broadcast by the management node.
After the management node broadcasts the initial selected feature set and the feature data corresponding to the initial selected feature set to each execution node, each execution node can receive the feature data corresponding to the initial selected feature set and the initial selected feature set broadcast by the management node. The target execution node refers to any one of the execution nodes storing at least one complete feature data corresponding to the first feature, that is, the target execution node is any one of the first execution nodes.
In an exemplary embodiment, since only each first executing node can participate in the process of acquiring the first selected feature set, the embodiments of the present application are described only from the perspective of any first executing node. That is, the execution subjects of steps 203 to 206 refer to any first execution node. Each first execution node may execute steps 203 to 206 in parallel, so as to improve operation efficiency, reduce resource consumption, and support feature selection of ultra-large scale data.
In step 204, the target execution node constructs at least one first candidate feature set based on the at least one first feature corresponding to the stored complete feature data and the initial selected feature set, where any first candidate feature set is a set of the initial selected feature set and any first feature corresponding to the stored complete feature data.
The target executing node constructs at least one first candidate feature set after receiving the initial selected feature set and feature data corresponding to the initial selected feature set. The at least one first candidate feature set in this step 204 refers to the first candidate feature sets that the target execution node can construct from the at least one first feature corresponding to the stored complete feature data and the received initial selected feature set. Since any first candidate feature set is a set of the initial selected feature set and one first feature corresponding to the complete feature data stored in the target execution node, the number of first candidate feature sets that the target execution node can construct is the same as the number of first features corresponding to the complete feature data stored in the target execution node. That is, if the target execution node stores complete feature data corresponding to N first features, the target execution node can construct N first candidate feature sets.
It should be noted that the number of the first candidate feature sets that can be constructed by different execution nodes may be the same or may be different, which is related to the number of the first features corresponding to the complete feature data stored in each execution node.
In one possible implementation manner, the at least one first feature corresponding to the stored complete feature data may be obtained by parsing the stored complete feature data. And combining any first feature corresponding to the stored complete feature data with the initial selected feature set to obtain a first candidate feature set. Each first feature corresponding to the stored complete feature data may construct a first candidate feature set with the initial selected feature set.
For example, assume that three first features, namely "academic", "gender" and "income", are determined by parsing the stored complete feature data, and that the initially selected feature set is { age }. In this case, the target execution node can construct three first candidate feature sets, namely { age, academy }, { age, gender }, and { age, income }, respectively.
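The construction in the example above can be sketched directly in code: each first feature held by the target execution node is combined with the initial selected feature set to form one first candidate feature set. Feature names follow the example; everything else is illustrative.

```python
# A sketch of step 204: combine each stored first feature with the initial
# selected feature set to form one first candidate feature set apiece.

def build_candidate_sets(selected, first_features):
    return [selected | {f} for f in first_features]

selected = frozenset({"age"})                         # initial selected feature set
candidates = build_candidate_sets(selected, ["academic", "gender", "income"])
for c in candidates:
    print(sorted(c))
# ['academic', 'age'] / ['age', 'gender'] / ['age', 'income']
```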
In step 205, the target execution node obtains a model performance index corresponding to at least one first candidate feature set based on the stored complete feature data and feature data corresponding to the initial selected feature set.
After the at least one first candidate feature set is constructed, the target execution node obtains a model performance index corresponding to the at least one first candidate feature set based on the stored complete feature data and feature data corresponding to the initial selected feature set. The model performance index corresponding to any first candidate feature set is used for indicating the acting degree of the first candidate feature set on realizing the model function. The larger the model performance index corresponding to any candidate feature set, the higher the acting degree of the any candidate feature set on realizing the model function is.
In one possible implementation manner, the process of obtaining, by the target execution node, the model performance index corresponding to the at least one first candidate feature set based on the stored complete feature data and the feature data corresponding to the initial selected feature set includes the following steps a and b:
step a: and determining feature data corresponding to at least one first candidate feature set based on the stored complete feature data and feature data corresponding to the initial selected feature set.
For any one candidate feature set in at least one first candidate feature set, the mode of determining feature data corresponding to the any one candidate feature set by the target executing node is as follows: and determining the complete feature data corresponding to the first feature in the first candidate feature set from the stored complete feature data, and collecting the complete feature data corresponding to the first feature in the first candidate feature set and the feature data corresponding to the initial selected feature set to obtain the feature data corresponding to any one of the first candidate feature sets. According to the method, the characteristic data corresponding to each first candidate characteristic set can be obtained, and then the step b is executed.
Step b: for any one of at least one first candidate feature set, acquiring a reference model corresponding to the any one first candidate feature set based on feature data corresponding to the any one first candidate feature set; and performing performance evaluation on the reference model corresponding to any one of the first candidate feature sets, and taking the performance evaluation result of the reference model corresponding to any one of the candidate feature sets as a model performance index corresponding to any one of the first candidate feature sets.
In one possible implementation manner, based on the feature data corresponding to the any one first candidate feature set, the process of obtaining the reference model corresponding to the any one first candidate feature set is as follows: and training the initial model by utilizing the feature data corresponding to any one of the first candidate feature sets and the standard result corresponding to the feature data, and taking the model obtained by training as the reference model corresponding to any one of the first candidate feature sets. The training process of the initial model is supervised training, and the training process is not expanded in the embodiment of the application.
It should be noted that the model structure of the initial model may be set according to the application requirement; for example, if the application requirement is to predict, by classification, whether a user produces click behavior according to the data, the model structure of the initial model is a classifier. The application requirement reflects the function the model needs to realize, and the degree to which the first candidate feature set contributes to realizing that function can be determined by evaluating the performance of the reference model trained on the feature data corresponding to the first candidate feature set.
The standard results corresponding to the characteristic data comprise standard results respectively corresponding to the samples in the original data set. The standard result is used for guiding the optimization direction of the model, the type of the standard result is determined according to application requirements, and the embodiment of the application is not limited to the standard result. Illustratively, assuming that the application requirements are a classification prediction of whether a user produces click behavior based on data, the standard results include two types: the user generates clicking actions, and the user does not generate clicking actions. In an exemplary embodiment, the standard results may refer to target variables in the original dataset.
It should be noted that, since the feature data corresponding to each first candidate feature set is derived from the original data set, the feature data corresponding to each first candidate feature set corresponds to the same standard result. In the process of acquiring the reference models corresponding to different first candidate feature sets, training is performed based on the same initial model and the same standard result so as to ensure comparability between the reference models corresponding to different first candidate feature sets.
After the reference model corresponding to any one of the first candidate feature sets is obtained, performing performance evaluation on the reference model corresponding to any one of the first candidate feature sets to obtain a performance evaluation result of the reference model corresponding to any one of the first candidate feature sets, and taking the performance evaluation result of the reference model corresponding to any one of the first candidate feature sets as a model performance index corresponding to any one of the first candidate feature sets.
In one possible implementation manner, the performance evaluation process for the reference model corresponding to any first candidate feature set is as follows: acquiring test data corresponding to any one of the first candidate feature sets and standard results corresponding to the test data; invoking a reference model corresponding to any one of the first candidate feature sets to process the test data, so as to obtain a prediction result corresponding to the test data; and determining a performance evaluation result of the reference model corresponding to any one of the first candidate feature sets based on the prediction result and the standard result.
The test data corresponding to any first candidate feature set refers to the data in the complete test data set that belongs to the features in that first candidate feature set; it is thus part of the data in the complete test data set. The feature set corresponding to the complete test data set is identical to the feature set corresponding to the original data set. The complete test data set may be stored in the respective execution nodes in the same manner as the original data set: in any execution node, the complete feature data and test data corresponding to a certain feature or certain features are stored. The process of acquiring the test data corresponding to any one of the first candidate feature sets may refer to the process of acquiring the feature data corresponding to any one of the first candidate feature sets, which is not described in detail in the embodiment of the present application.
The performance evaluation result of the reference model is used for measuring the performance of the reference model and is obtained according to the prediction result and the standard result. The embodiment of the present application does not limit the type of the performance evaluation result of the reference model; illustratively, the performance evaluation result of the reference model refers to an AUC (Area Under the Curve) result, a KS (Kolmogorov-Smirnov) test result, or the like. The manner in which the performance evaluation result corresponding to the reference model is determined based on the prediction result and the standard result is related to the type of the performance evaluation result, which is not limited in the embodiment of the present application.
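As a hedged illustration of the evaluation step, the sketch below computes an AUC-style index from predicted scores and standard results using the rank-based (Mann-Whitney) formulation. This is one common way to compute AUC; the embodiment does not prescribe a particular implementation, and the data here is illustrative.

```python
# A sketch of performance evaluation: given standard results (0/1 labels)
# and the reference model's predicted scores for the test data, compute AUC
# as the probability that a random positive outranks a random negative.

def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]          # standard results: click / no click
scores = [0.9, 0.6, 0.4, 0.2]  # reference model's prediction results
print(auc(labels, scores))     # 1.0 (perfect separation)
```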
After obtaining the performance evaluation result of the reference model corresponding to any one of the first candidate feature sets, the performance evaluation result is taken as the model performance index corresponding to that first candidate feature set. The model performance index corresponding to any first candidate feature set measures the performance of the reference model trained on the feature data corresponding to that first candidate feature set, so that the first candidate feature set to be selected can be determined from the perspective of each feature's influence on model performance, yielding a good feature selection effect.
According to the above process, the target execution node can obtain the model performance indexes respectively corresponding to each first candidate feature set constructed by the target execution node. It should be noted that, after each first executing node receives the initial selected feature set and the feature data corresponding to the initial selected feature set broadcasted by the management node, each first executing node obtains the model performance index corresponding to the first candidate feature set respectively constructed according to the processes of step 204 and step 205. Because the process of each first executing node obtaining the model performance index corresponding to the first candidate feature set constructed respectively is completely completed in the executing node without depending on other executing nodes, the process of each first executing node obtaining the model performance index corresponding to the first candidate feature set constructed respectively can be executed in parallel, which is beneficial to shortening the time consumption of feature selection.
After each first executing node obtains the model performance index corresponding to each constructed first candidate feature set, the model performance indexes are compared among the first executing nodes so as to determine the first candidate feature set satisfying the selection condition. The embodiment of the present application does not limit the manner of comparing model performance indexes among the first execution nodes; an exemplary comparison manner is as follows: first, the model performance indexes are grouped in pairs, and the smaller model performance index in each group is eliminated; then, the remaining model performance indexes are again grouped in pairs, and the smaller model performance index in each group is eliminated; and so on, until only the maximum model performance index remains.
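The pairwise comparison described above can be sketched as a tournament-style reduction. The sketch runs on a single list for clarity; in the distributed system each round would be a round of communication between execution nodes, and the index values are illustrative.

```python
# A sketch of the exemplary comparison manner: indexes are grouped in pairs,
# the smaller index in each pair is eliminated, and the rounds repeat until
# only the maximum model performance index remains.

def tournament_max(indexes):
    values = list(indexes)
    while len(values) > 1:
        nxt = []
        for i in range(0, len(values) - 1, 2):
            nxt.append(max(values[i], values[i + 1]))  # keep the larger of each pair
        if len(values) % 2:        # an unpaired index advances to the next round
            nxt.append(values[-1])
        values = nxt
    return values[0]

print(tournament_max([0.71, 0.83, 0.65, 0.90, 0.77]))  # 0.9
```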
In one possible implementation manner, the model performance index corresponding to the first candidate feature set satisfying the selection condition is the maximum model performance index of all model performance indexes, and all model performance indexes include model performance indexes respectively corresponding to all first candidate feature sets acquired by each execution node in the distributed system. Since only the first executing nodes acquire the first candidate feature sets in the distributed system, all the model performance indexes comprise model performance indexes respectively corresponding to all the first candidate feature sets acquired by each first executing node.
It should be noted that the number of all the first candidate feature sets is the same as the number of all the first features, that is, any one of the all the first features and the initially selected feature set form one of the all the first candidate feature sets. It should be further noted that, since each first executing node may store complete feature data corresponding to one or more features, the number of first executing nodes may be the same as or less than the number of all the first candidate feature sets, which is not limited in this embodiment of the present application.
The number of maximum model performance indicators in the overall model performance indicators may be one or more, and the embodiments of the present application are not limited thereto. In one possible implementation manner, if the number of the maximum model performance indexes in all the model performance indexes is one, the first candidate feature set corresponding to the maximum model performance indexes is directly used as the first candidate feature set meeting the selection condition.
In one possible implementation manner, if the number of the maximum model performance indexes in all the model performance indexes is at least two, the number of the first candidate feature sets corresponding to the maximum model performance indexes is at least two, and at this time, any one of the at least two first candidate feature sets corresponding to the maximum model performance indexes may be used as the first candidate feature set that satisfies the selection condition.
As can be seen from the foregoing, the process of determining the first candidate feature set that satisfies the selection condition is related to the model performance index corresponding to the first candidate feature set, where the first candidate feature set that satisfies the selection condition may be in at least one first candidate feature set acquired by any one of the first executing nodes. Thus, for the target execution node, there are two cases:
Case 1: the model performance index corresponding to the at least one first candidate feature set acquired by the target executing node indicates that the first candidate feature set meeting the selection condition exists in the at least one first candidate feature set.
In this case 1, step 206 is performed.
Case 2: the model performance index corresponding to the at least one first candidate feature set acquired by the target executing node indicates that the first candidate feature set meeting the selection condition does not exist in the at least one first candidate feature set.
In this case 2, it is explained that the first candidate feature set satisfying the selection condition is in at least one first candidate feature set constructed by other first executing nodes, and at this time, the target executing node does not need to execute step 206, and the first candidate feature set satisfying the selection condition is sent to the management node by the other first executing nodes.
In step 206, the target execution node sends the first candidate feature set satisfying the selection condition to the management node in response to the model performance index corresponding to the at least one first candidate feature set indicating that the first candidate feature set satisfying the selection condition exists in the at least one first candidate feature set.
When the model performance index corresponding to the at least one first candidate feature set obtained by the target executing node indicates that the first candidate feature set meeting the selection condition exists in the at least one first candidate feature set, the target executing node sends the first candidate feature set meeting the selection condition to the management node. The first candidate feature set satisfying the selection condition is considered as one of the most suitable selection among all the first candidate feature sets. Since each first candidate feature set is obtained by adding one first feature on the basis of the initial selected feature set, one first candidate feature set most suitable for selection can be regarded as a set of one first feature most suitable for selection and the initial selected feature set.
In step 207, the management node takes the first candidate feature set satisfying the selection condition as a first selected feature set, and obtains a target selected feature set based on the first selected feature set.
After the target executing node sends the first candidate feature set meeting the selection condition to the management node, the management node takes the first candidate feature set meeting the selection condition as a first selected feature set. The first selected feature set refers to the selected feature set after updating the initial selected feature set. The feature selection of the first selected feature set works better than the initial selected feature set.
The target selected feature set refers to the final selected feature set. The process of acquiring the target selected feature set is an iterative process, each iteration is carried out once for updating the selected feature set determined last time, a new feature is added to the updated selected feature set compared with the pre-updated selected feature set obtained in the previous iteration process, and the feature selection effect of the updated selected feature set is better than that of the pre-updated selected feature set obtained in the previous iteration process.
The process of obtaining the first selected feature set based on the initial selected feature set may be regarded as a round of iterative process, and after obtaining the first selected feature set, the target selected feature set is obtained continuously based on the first selected feature set.
In one possible implementation, in the process of acquiring the target selected feature set based on the first selected feature set, it is first determined whether the selection process satisfies the termination condition. And responding to the selection process meeting the termination condition, directly taking the first selected feature set as the target selected feature set, and completing the whole feature selection process. And responding to the selection process not meeting the termination condition, continuing to execute the next iteration process until the selection process meets the termination condition, and taking the selected feature set obtained when the selection termination condition is met as a target selected feature set.
In one possible implementation, for the selection process resulting in the first selected feature set, the selection process satisfying the termination condition includes any of the following:
case 1: the number of features in the first selected feature set is not less than a number threshold.
The number threshold is set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiments of the present application. The number threshold is used to limit the number of features that should ultimately be selected, and if the number of features in the first selected feature set is not less than the number threshold, it is indicated that a sufficient number of features have been selected without selecting a new feature, and the selection process is considered to satisfy the termination condition.
Case 2: the original feature set has no features other than the features in the first selected feature set.
In this case 2, it is explained that all the features in the original feature set are already contained in the first selected feature set, at which point no new selected feature set can be acquired any more, and the selection process is considered to satisfy the termination condition.
Case 3: the model performance index increment corresponding to the first selected feature set is less than the reference threshold.
In the feature selection process, it is desirable to continuously select a selected feature set that yields a better model performance index. Each time a new selected feature set is determined, the difference between the model performance index corresponding to the new selected feature set and the model performance index corresponding to the previous selected feature set may be used as the model performance index increment corresponding to the new selected feature set.
In an exemplary embodiment, the model performance index corresponding to the initial selected feature set may be preset, or may be obtained according to the feature data corresponding to the initial selected feature set, which is not limited in the embodiments of the present application. In the feature selection process, if the model performance index increment corresponding to the first selected feature set is smaller than the reference threshold, the model performance index corresponding to the first selected feature set is only slightly improved compared with that corresponding to the initial selected feature set. In this case, continuing the feature selection process is considered unlikely to yield a selected feature set with better performance, and the selection process is considered to satisfy the termination condition.
It should be noted that the reference threshold is a small value, which may be set empirically or flexibly adjusted according to the application scenario; this is not limited in the embodiments of the present application. Illustratively, the reference threshold is 0.05.
When the selection process satisfies any one of the above three cases, the selection process satisfies the termination condition; at this time, the first selected feature set is taken as the target selected feature set, and the whole feature selection process is completed. Limiting the selection process with a termination condition can prevent the selected features from overfitting the model.
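The three termination cases above can be combined into a single check. The following is an illustrative sketch only; the function name and the default threshold values are assumptions, not identifiers from the embodiment:

```python
def selection_terminated(selected, original, perf_increment,
                         number_threshold=2, reference_threshold=0.05):
    """Return True when any of the three termination cases holds."""
    # Case 1: enough features have been selected.
    if len(selected) >= number_threshold:
        return True
    # Case 2: no candidate feature remains outside the selected set.
    if not (set(original) - set(selected)):
        return True
    # Case 3: the performance-index increment of the last round is too small.
    if perf_increment < reference_threshold:
        return True
    return False
```

For example, with a number threshold of 2, a first selected feature set already containing two features terminates the process under case 1 regardless of the remaining candidates.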
In one possible implementation, when the selection process does not satisfy the termination condition, the process of continuing to perform the next round of iterative process includes the following six steps:
Step 1: the management node acquires feature data corresponding to the first selected feature set, and broadcasts the first selected feature set and the feature data corresponding to the first selected feature set to at least two execution nodes.
The at least two executing nodes are used for determining second candidate feature sets meeting selection conditions in all second candidate feature sets based on the stored complete feature data, the first selected feature set and the feature data corresponding to the first selected feature set, and sending the second candidate feature sets meeting the selection conditions to the management node.
For the process of the management node acquiring the feature data corresponding to the first selected feature set, refer to the process of acquiring the feature data corresponding to the initial selected feature set in step 201, which is not repeated here. It should be noted that, since the first selected feature set contains one more first feature than the initial selected feature set, the first selected feature set includes at least one feature. For the manner in which the management node broadcasts the feature data corresponding to the first selected feature set to the at least two execution nodes, refer to step 202, which is not repeated here.
Step 2: any second execution node receives the first selected feature set broadcasted by the management node and the feature data corresponding to the first selected feature set.
The second execution node is an execution node storing complete feature data corresponding to at least one second feature, where a second feature is a feature in the original feature set other than the features in the first selected feature set. Since the first selected feature set contains one more first feature than the initial selected feature set, the second features are the first features other than the first feature newly added to the first selected feature set. It should be noted that the number of second execution nodes may be the same as the number of first execution nodes, or may be one less than the number of first execution nodes, depending on the actual storage locations of the complete feature data corresponding to each feature, which is not limited in the embodiments of the present application.
Step 3: any second execution node constructs at least one second candidate feature set based on at least one second feature corresponding to the stored complete feature data and the first selected feature set, wherein any second candidate feature set is a set of the first selected feature set and any second feature corresponding to the stored complete feature data.
Step 4: any second executing node obtains a model performance index corresponding to at least one second candidate feature set based on the stored complete feature data and the feature data corresponding to the first selected feature set.
Step 5: any second execution node, in response to the model performance index corresponding to the at least one second candidate feature set indicating that a second candidate feature set satisfying the selection condition exists in the at least one second candidate feature set, sends the second candidate feature set satisfying the selection condition to the management node.
For the implementation process of the above steps 2 to 5, refer to steps 203 to 206, which are not repeated here. In an exemplary embodiment, the target execution node also stores complete feature data corresponding to at least one second feature, in which case any second execution node may be the same execution node as the target execution node. That is, the operations of the above steps 2 to 5 may be performed by the target execution node.
Step 6: the management node takes the second candidate feature set satisfying the selection condition as the second selected feature set.
Through the six steps, a second round of iterative process is completed, and a second selected feature set is obtained. The number of features in the second selected feature set is increased by one compared to the first selected feature set.
After obtaining the second selected feature set, the management node acquires the target selected feature set based on the second selected feature set. In the process of acquiring the target selected feature set based on the second selected feature set, it is judged whether the selection process satisfies the termination condition; if so, the second selected feature set is taken as the target selected feature set. If the selection process does not satisfy the termination condition, the next iteration is executed, and so on, until the selection process satisfies the termination condition; the selected feature set obtained when the termination condition is satisfied is taken as the target selected feature set.
In one possible implementation, for a selection process that is iterated continuously, the selection process satisfying the termination condition includes any one of the following:
case 1: the number of features in the currently selected feature set is not less than the number threshold.
The currently selected feature set refers to the latest selected feature set obtained when it is determined whether the selection process satisfies the termination condition.
Case 2: there are no features in the original feature set other than the features in the currently selected feature set.
Case 3: the model performance index increment corresponding to the currently selected feature set is smaller than the reference threshold.
When the selection process satisfies any one of the above three cases, it is explained that the selection process satisfies the termination condition, and at this time, the currently selected feature set is taken as the target selected feature set, and the entire feature selection process is completed.
The object of the embodiments of the application is to select, based on an original data set, features useful for the model from the original feature set corresponding to the original data set, so as to form the target selected feature set. The target selected feature set is a preferred subset of the original feature set. After determining the target selected feature set, the management node may store the target selected feature set, so that it can subsequently be extracted directly from storage.
In one possible implementation, after determining the target selected feature set, the management node performs a model processing task based on the target selected feature set. In an exemplary embodiment, the manner of performing the model processing task based on the target selected feature set includes, but is not limited to: performing a model training task based on the target selected feature set to obtain a trained model; or performing a model prediction task based on the target selected feature set to obtain a model prediction result. A model training task also needs to perform a model prediction task to obtain a model prediction result, and then updates the model parameters based on the difference between the model prediction result and the labels. The embodiments of the application take performing a model prediction task based on the target selected feature set to obtain a model prediction result as an example.
In one possible implementation, the process of performing a model prediction task based on the target selected feature set to obtain a model prediction result is as follows: acquire application data corresponding to the target selected feature set, and input the application data into the target model for prediction processing to obtain a prediction result corresponding to the application data output by the target model. It should be noted that the application data corresponding to the target selected feature set refers to the data composed of the application data respectively corresponding to each feature in the target selected feature set. In an exemplary embodiment, the generation scenario of the application data is close to that of the original data set and the test data set used in the feature selection process, so as to ensure the reliability of the model prediction result. In an exemplary embodiment, the learning target of the target model is the same as or similar to that of the initial model involved in the feature selection process, so as to ensure that the selected target selected feature set is valid for the prediction process of the target model.
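As a minimal sketch of this prediction step (an illustration under assumptions, not the claimed implementation): each application record is restricted to the features of the target selected feature set before being fed to the target model. `target_model` below is a hypothetical stand-in for the real trained model.

```python
def predict_with_selected(records, selected_features, target_model):
    """Apply the target model to application data restricted to the
    target selected feature set."""
    results = []
    for record in records:
        # Keep only the feature data corresponding to the selected features.
        row = [record[f] for f in selected_features]
        results.append(target_model(row))
    return results
```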
Illustratively, the process of feature selection is as shown in FIG. 3. The management node determines whether reserved features exist; if reserved features exist, the set of reserved features is taken as the initial selected feature set F; if no reserved feature exists, the empty set is taken as the initial selected feature set F. The features in the original feature set other than the features in the initial selected feature set are taken as first features, and the set of first features is recorded as the feature set C to be selected. It is judged whether the selection process satisfies the termination condition; if so, the selected feature set obtained when the termination condition is satisfied is taken as the target selected feature set, and the feature selection process ends.
When the selection process does not satisfy the termination condition, the management node broadcasts the initial selected feature set and the feature data corresponding to the initial selected feature set to each first execution node. Each first execution node adds each first feature in C to the initial selected feature set F to form candidate feature sets F', where the number of candidate feature sets F' is the same as the number of first features in C. The model performance index corresponding to each F' is obtained; the F' satisfying the selection condition is recorded as B; let F = B and C = original feature set - B; then return to the step of judging whether the selection process satisfies the termination condition.
The feature selection manner referred to in the embodiments of the present application broadly pertains to wrapped feature selection. For example, assuming that the original feature set is { age, gender, academy }, the number threshold for limiting the number of features in the selected feature set is 2, a simple wraparound feature selection implementation procedure is shown in table 2.
TABLE 2
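The flow of FIG. 3 and the toy example above can be sketched serially as follows. In the embodiment, the per-candidate evaluations run in parallel on the execution nodes; here they run in one process, and `evaluate` is an assumed stand-in for the model-performance-index computation:

```python
def wrapped_forward_selection(original_features, evaluate,
                              reserved=None, number_threshold=2,
                              reference_threshold=0.0):
    """Serial sketch of the greedy wrapped selection loop of FIG. 3."""
    F = list(reserved) if reserved else []            # initial selected set
    C = [f for f in original_features if f not in F]  # features to be selected
    best = evaluate(F)                                # initial performance index
    while len(F) < number_threshold and C:
        # Each candidate set F' adds one first feature of C to F.
        score, chosen = max((evaluate(F + [c]), c) for c in C)
        if score - best < reference_threshold:
            break                                     # increment too small
        best = score
        F.append(chosen)                              # F = B
        C.remove(chosen)                              # C = original set - B
    return F
```

With the original feature set {age, gender, academy}, a number threshold of 2, and a toy additive evaluator, the loop first selects the single best feature and then the best companion feature, mirroring the procedure of Table 2.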
In addition to wrapped feature selection, common feature selection modes include filtered feature selection, embedded feature selection, and mixed feature selection, and the comparison of wrapped feature selection and feature selection in other modes is shown in table 3.
TABLE 3
As can be seen from Table 3, the wrapped feature selection manner has a good effect but a large calculation amount, so that under the default storage mode of a conventional distributed system it cannot support feature selection on ultra-large-scale data. In the embodiments of the application, the storage mode of the original data set in the distributed system is converted from the default storage mode to the target storage mode, so that the process of evaluating the effect of the candidate feature sets can be executed in parallel among the execution nodes. This shortens the time consumption and reduces the calculation amount to a certain extent, makes it feasible to execute wrapped feature selection in a distributed system, improves the operation efficiency of the feature selection process, reduces resource consumption, and thus supports feature selection on ultra-large-scale data.
The time complexity of wrapped feature selection is O(m × n × t), where m is the total number of features, n is the number of features to select, and t is the running time of a single feature set evaluation. The iterative rounds are serial, but the evaluation of individual feature sets within a round can be parallelized across the execution nodes, so performance is optimized by improving the parallel efficiency. In an exemplary embodiment, if the selected model is a generalized linear model, using the wrapped feature selection manner can, to a certain extent, reduce the impact of feature collinearity on the parameters of the model.
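A rough cost model (an illustrative assumption, not part of the embodiment) makes the O(m × n × t) bound and the benefit of parallelism concrete: round k (k = 0, ..., n-1) evaluates (m - k) candidate sets, the rounds are serial, and within a round the evaluations can be spread over p execution nodes.

```python
import math

def evaluation_count(m, n):
    """Total single-set evaluations over n serial rounds (bounded by m * n)."""
    return sum(m - k for k in range(n))

def wall_time(m, n, t, p):
    """Approximate wall-clock time with p parallel execution nodes,
    each single-set evaluation taking time t."""
    return sum(math.ceil((m - k) / p) * t for k in range(n))
```

For m = 10 features and n = 3 rounds, 27 evaluations are needed; with p = 10 execution nodes the wall-clock cost drops from 27t to 3t.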
Taking Spark as an example of an implementation framework of the distributed system, in any round of the iterative process, schematic diagrams of implementing wrapped feature selection based on the default storage mode and based on the target storage mode are shown in (1) and (2) of FIG. 4, respectively. FIG. 4 (1) is a schematic diagram of implementing wrapped feature selection based on the default storage mode: in the default storage mode, interaction of feature data between execution nodes is required to obtain the complete feature data corresponding to a certain feature, which is then combined with the feature data corresponding to the selected feature set to obtain the feature data corresponding to a candidate feature set. FIG. 4 (2) is a schematic diagram of implementing wrapped feature selection based on the target storage mode: in the target storage mode, the feature data corresponding to the selected feature set is directly broadcast to the execution nodes, so the feature data corresponding to a candidate feature set can be obtained directly at each execution node. In the Spark framework, interaction of feature data between execution nodes is implemented by executing Shuffle operations; that is, the method provided by the embodiments of the application can reduce Shuffle operations and improve the efficiency and overall performance of the feature selection process.
For example, in order to intuitively compare the difference between implementing wrapped feature selection based on the default storage mode and based on the target storage mode, two sets of experiments were performed. In each set of experiments, the time consumption of implementing wrapped feature selection based on the default storage mode and based on the target storage mode was tested on data sets of the same data magnitude. The experimental results are shown in Table 4.
TABLE 4
As can be seen from table 4, the time required for realizing the wrapped feature selection based on the target storage mode is significantly reduced, so that the feature selection efficiency can be significantly improved.
Feature selection is one of the methods for enhancing the generalization capability of a model, shortening the execution time of the model, and improving the interpretability of the model. The embodiments of the application provide an automated multi-dimensional feature selection method applicable to large data scales, with the ability to support automated wrapped feature selection in the case of an ultra-large data volume (e.g., 1,000,000 samples with 100,000 features).
In the embodiment of the application, the complete feature data corresponding to a certain feature is directly stored in one execution node, based on the stored complete feature data and the selected feature set broadcasted by the management node and the feature data corresponding to the selected feature set, feature selection can be directly realized, interaction of the feature data is not needed between the execution nodes, the time consumption of a feature selection process is short, the feature selection efficiency is high, and the feature selection effect is good.
Referring to fig. 5, an embodiment of the present application provides a feature selection apparatus, including:
a receiving unit 501, configured to receive an initial selected feature set broadcasted by a management node and feature data corresponding to the initial selected feature set;
a construction unit 502, configured to construct at least one first candidate feature set based on at least one first feature corresponding to the stored complete feature data and the initial selected feature set, where any first candidate feature set is a set of the initial selected feature set and any first feature corresponding to the stored complete feature data;
an obtaining unit 503, configured to obtain a model performance index corresponding to at least one first candidate feature set based on the stored complete feature data and feature data corresponding to the initially selected feature set;
a sending unit 504, configured to, in response to the model performance index corresponding to the at least one first candidate feature set indicating that a first candidate feature set satisfying the selection condition exists in the at least one first candidate feature set, send the first candidate feature set satisfying the selection condition to the management node, where the management node is configured to take the first candidate feature set satisfying the selection condition as the first selected feature set and acquire the target selected feature set based on the first selected feature set.
In a possible implementation manner, the obtaining unit 503 is configured to determine feature data corresponding to at least one first candidate feature set based on the stored complete feature data and feature data corresponding to the initially selected feature set; for any one of the at least one first candidate feature set, acquiring a reference model corresponding to the any one first candidate feature set based on feature data corresponding to the any one first candidate feature set; and performing performance evaluation on the reference model corresponding to any one of the first candidate feature sets, and taking the performance evaluation result of the reference model corresponding to any one of the first candidate feature sets as a model performance index corresponding to any one of the first candidate feature sets.
In one possible implementation manner, the obtaining unit 503 is further configured to obtain test data corresponding to any one of the first candidate feature sets and a standard result corresponding to the test data; calling a reference model corresponding to any one of the first candidate feature sets to process the test data, so as to obtain a prediction result corresponding to the test data; and determining a performance evaluation result of the reference model corresponding to any one of the first candidate feature sets based on the prediction result and the standard result.
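A minimal sketch of this evaluation path, under the assumption that the reference model is a simple 1-nearest-neighbour classifier (the embodiment trains whichever initial model is configured): project the training and test data onto the candidate feature set, predict on the test data, and score the predictions against the standard results.

```python
def evaluate_candidate_set(train_rows, train_labels,
                           test_rows, test_labels, features):
    """Model performance index (accuracy) of a 1-NN reference model
    restricted to the candidate feature set `features`."""
    def project(row):
        # Keep only the feature data corresponding to the candidate set.
        return [row[f] for f in features]

    def predict(x):
        # Nearest training row under squared distance on the candidate set.
        dists = [sum((a - b) ** 2 for a, b in zip(x, project(tr)))
                 for tr in train_rows]
        return train_labels[dists.index(min(dists))]

    predictions = [predict(project(te)) for te in test_rows]
    correct = sum(p == y for p, y in zip(predictions, test_labels))
    return correct / len(test_labels)
```

A candidate set containing an informative feature then scores higher than one containing only noise, which is exactly the signal used when comparing candidate feature sets.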
In one possible implementation manner, any executing node further stores at least one complete feature data corresponding to a second feature, where the second feature is a feature in the original feature set other than the feature in the first selected feature set, and the receiving unit 501 is further configured to receive the first selected feature set broadcasted by the managing node and feature data corresponding to the first selected feature set;
The construction unit 502 is further configured to construct at least one second candidate feature set based on at least one second feature corresponding to the stored complete feature data and the first selected feature set, where any second candidate feature set is a set of any second feature corresponding to the first selected feature set and the stored complete feature data;
the obtaining unit 503 is further configured to obtain a model performance index corresponding to at least one second candidate feature set based on the stored complete feature data and feature data corresponding to the first selected feature set;
the sending unit 504 is further configured to send the second candidate feature set meeting the selection condition to the management node in response to the model performance index corresponding to the at least one second candidate feature set indicating that the at least one second candidate feature set has the second candidate feature set meeting the selection condition.
In one possible implementation manner, the model performance index corresponding to the first candidate feature set satisfying the selection condition is the maximum model performance index of all model performance indexes, and all model performance indexes include model performance indexes respectively corresponding to all first candidate feature sets acquired by each execution node in the distributed system.
In the embodiment of the application, the complete feature data corresponding to a certain feature is directly stored in one execution node, based on the stored complete feature data and the selected feature set broadcasted by the management node and the feature data corresponding to the selected feature set, feature selection can be directly realized, interaction of the feature data is not needed between the execution nodes, the time consumption of a feature selection process is short, the feature selection efficiency is high, and the feature selection effect is good.
Referring to fig. 6, an embodiment of the present application provides a feature selection apparatus, including:
a determining unit 601, configured to determine an initial selected feature set;
an obtaining unit 602, configured to obtain feature data corresponding to an initially selected feature set;
the broadcasting unit 603 is configured to broadcast feature data corresponding to the initial selected feature set and the initial selected feature set to at least two executing nodes, where the at least two executing nodes are configured to determine, in all first candidate feature sets, a first candidate feature set that satisfies a selection condition based on the stored complete feature data, the feature data corresponding to the initial selected feature set and the initial selected feature set, and send the first candidate feature set that satisfies the selection condition to the management node, where any one of the first candidate feature sets is a set of the initial selected feature set and any one of the first feature, and the first feature is a feature in the original feature set other than the feature in the initial selected feature set;
the obtaining unit 602 is further configured to obtain a target selected feature set based on the first selected feature set, with the first candidate feature set satisfying the selection condition as the first selected feature set.
In a possible implementation, the obtaining unit 602 is further configured to, in response to the selection process meeting the termination condition, regard the first selected feature set as the target selected feature set; in response to the selection process failing to meet the termination condition, acquiring feature data corresponding to the first selected feature set, broadcasting the first selected feature set and the feature data corresponding to the first selected feature set to at least two execution nodes, wherein the at least two execution nodes are used for determining second candidate feature sets meeting the selection condition in all second candidate feature sets based on the stored complete feature data, the first selected feature set and the feature data corresponding to the first selected feature set, and transmitting the second candidate feature sets meeting the selection condition to the management node; taking the second candidate feature set meeting the selection condition as a second selected feature set, and acquiring a target selected feature set based on the second selected feature set; wherein any one of the second candidate feature sets is a set of the first selected feature set and any one of the second features that is a feature of the original feature set other than the feature in the first selected feature set.
In one possible implementation, the selection process satisfies a termination condition, including:
the number of features in the first selected feature set is not less than a number threshold; or,
the original feature set is free of features other than the features in the first selected feature set; or,
the model performance index increment corresponding to the first selected feature set is less than the reference threshold.
In a possible implementation manner, the determining unit 601 is configured to use, in response to the existence of the reserved features, a set of at least one reserved feature as an initial selected feature set; in response to the absence of the reserved feature, the empty set is taken as the initial selected feature set.
In the embodiment of the application, the complete feature data corresponding to a certain feature is directly stored in one execution node, based on the stored complete feature data and the selected feature set broadcasted by the management node and the feature data corresponding to the selected feature set, feature selection can be directly realized, interaction of the feature data is not needed between the execution nodes, the time consumption of a feature selection process is short, the feature selection efficiency is high, and the feature selection effect is good.
It should be noted that, when the apparatus provided in the foregoing embodiment performs the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
In an exemplary embodiment, a computer device is also provided, which may refer to either a management node in a distributed system or any execution node in a distributed system. Referring to fig. 7, the computer device includes a processor 701 and a memory 702, the memory 702 storing at least one piece of program code. The at least one piece of program code is loaded and executed by the one or more processors 701 to implement any of the feature selection methods described above.
In an exemplary embodiment, a computer readable storage medium is also provided, in which at least one program code is stored, which is loaded and executed by a processor of a computer device to implement any of the above-described feature selection methods.
In one possible implementation, the computer readable storage medium may be a Read-Only Memory (ROM), a random-access Memory (Random Access Memory, RAM), a compact disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any of the feature selection methods described above.
It should be noted that the terms "first," "second," and the like in the description and in the claims of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
It should be understood that references herein to "a plurality" are to two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The foregoing description of the exemplary embodiments of the present application is not intended to limit the application to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present application.

Claims (13)

1. A feature selection method, wherein the method is applied to any execution node in a distributed system, the execution node storing complete feature data corresponding to at least one first feature, a first feature being a feature in an original feature set other than the features in an initial selected feature set, and the method comprises:
receiving the initial selected feature set and feature data corresponding to the initial selected feature set, which are broadcast by a management node;
constructing at least one first candidate feature set based on the initial selected feature set and the at least one first feature corresponding to the stored complete feature data, wherein any first candidate feature set is a set of the initial selected feature set and any one first feature corresponding to the stored complete feature data;
acquiring a model performance index corresponding to the at least one first candidate feature set based on the stored complete feature data and the feature data corresponding to the initial selected feature set;
and in response to the model performance indexes corresponding to the at least one first candidate feature set indicating that a first candidate feature set satisfying a selection condition exists among the at least one first candidate feature set, sending the first candidate feature set satisfying the selection condition to the management node, wherein the management node is configured to take the first candidate feature set satisfying the selection condition as a first selected feature set and to acquire a target selected feature set based on the first selected feature set.
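As a worked illustration only, claim 1's executor-side step (form each candidate set by adding one locally stored first feature to the selected set, score it, keep the best) might be sketched as follows; the choice of logistic regression as the model, cross-validated accuracy as the performance index, and every function and parameter name here are assumptions of this sketch, not specified by the claim:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def executor_step(selected, selected_data, local_features, local_data, labels):
    """One forward-selection round on an execution node: for each locally
    stored first feature, form a candidate set (selected set + that feature),
    compute its model performance index, and keep the best candidate."""
    best = None
    for name in local_features:
        # candidate feature matrix: selected-feature columns plus one local column
        cols = [selected_data[f] for f in selected] + [local_data[name]]
        X = np.column_stack(cols)
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X, labels, cv=3).mean()
        if best is None or score > best[1]:
            best = (selected + [name], score)
    return best  # (candidate feature set, model performance index)
```

In the distributed setting of the claim, `best` would then be sent to the management node rather than returned locally.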
2. The method according to claim 1, wherein the obtaining a model performance index corresponding to the at least one first candidate feature set based on the stored complete feature data and the feature data corresponding to the initial selected feature set comprises:
determining feature data corresponding to the at least one first candidate feature set based on the stored complete feature data and the feature data corresponding to the initial selected feature set;
for any one of the at least one first candidate feature set, acquiring a reference model corresponding to that first candidate feature set based on the feature data corresponding to that first candidate feature set;
and performing performance evaluation on the reference model corresponding to that first candidate feature set, and taking the performance evaluation result of the reference model as the model performance index corresponding to that first candidate feature set.
3. The method according to claim 2, wherein the performing performance evaluation on the reference model corresponding to any one of the first candidate feature sets comprises:
acquiring test data corresponding to that first candidate feature set and a standard result corresponding to the test data;
invoking the reference model corresponding to that first candidate feature set to process the test data to obtain a prediction result corresponding to the test data;
and determining a performance evaluation result of the reference model corresponding to that first candidate feature set based on the prediction result and the standard result.
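The evaluation described in claims 2-3 (train a reference model on the candidate set's feature data, invoke it on test data, compare predictions with the standard results) could look like the following sketch; using accuracy as the performance evaluation result and logistic regression as the reference model are assumptions, since the claims fix no particular metric or model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_candidate(X_train, y_train, X_test, y_test):
    """Acquire a reference model from the candidate set's feature data,
    then evaluate it: predict on test data and compare the predictions
    with the standard results, returning accuracy as the index."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predictions = model.predict(X_test)            # prediction results
    return float(np.mean(predictions == y_test))   # performance evaluation result
```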
4. The method according to any one of claims 1-3, wherein the execution node further stores complete feature data corresponding to at least one second feature, a second feature being a feature in the original feature set other than the features in the first selected feature set, and after the first candidate feature set satisfying the selection condition is sent to the management node, the method further comprises:
receiving the first selected feature set and feature data corresponding to the first selected feature set, which are broadcast by the management node;
constructing at least one second candidate feature set based on the first selected feature set and the at least one second feature corresponding to the stored complete feature data, wherein any second candidate feature set is a set of the first selected feature set and any one second feature corresponding to the stored complete feature data;
acquiring a model performance index corresponding to the at least one second candidate feature set based on the stored complete feature data and the feature data corresponding to the first selected feature set;
and sending the second candidate feature set satisfying the selection condition to the management node in response to the model performance indexes corresponding to the at least one second candidate feature set indicating that a second candidate feature set satisfying the selection condition exists among the at least one second candidate feature set.
5. The method according to any one of claims 1-3, wherein the model performance index corresponding to the first candidate feature set satisfying the selection condition is the maximum among all model performance indexes, the all model performance indexes including the model performance indexes respectively corresponding to all first candidate feature sets acquired by each execution node in the distributed system.
6. A feature selection method, wherein the method is applied to a management node in a distributed system, the management node being communicatively connected to at least two execution nodes in the distributed system, each execution node storing complete feature data corresponding to at least one feature in an original feature set, and the method comprises:
determining an initial selected feature set, and acquiring feature data corresponding to the initial selected feature set;
broadcasting the initial selected feature set and the feature data corresponding to the initial selected feature set to the at least two execution nodes, wherein the at least two execution nodes are configured to determine, based on the stored complete feature data, the initial selected feature set, and the feature data corresponding to the initial selected feature set, a first candidate feature set satisfying a selection condition among all first candidate feature sets, and to send the first candidate feature set satisfying the selection condition to the management node, any first candidate feature set being a set of the initial selected feature set and any one first feature, and a first feature being a feature in the original feature set other than the features in the initial selected feature set;
and taking the first candidate feature set satisfying the selection condition as a first selected feature set, and acquiring a target selected feature set based on the first selected feature set.
7. The method of claim 6, wherein the acquiring a target selected feature set based on the first selected feature set comprises:
in response to a selection process satisfying a termination condition, taking the first selected feature set as the target selected feature set;
in response to the selection process failing to satisfy the termination condition, acquiring feature data corresponding to the first selected feature set, and broadcasting the first selected feature set and the feature data corresponding to the first selected feature set to the at least two execution nodes, wherein the at least two execution nodes are configured to determine, based on the stored complete feature data, the first selected feature set, and the feature data corresponding to the first selected feature set, a second candidate feature set satisfying the selection condition among all second candidate feature sets, and to send the second candidate feature set satisfying the selection condition to the management node; and taking the second candidate feature set satisfying the selection condition as a second selected feature set, and acquiring the target selected feature set based on the second selected feature set;
wherein any second candidate feature set is a set of the first selected feature set and any one second feature, a second feature being a feature in the original feature set other than the features in the first selected feature set.
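The management-node iteration of claims 6-7 amounts to a greedy forward-selection driver. In the sketch below, ordinary callables stand in for broadcasting to and receiving from execution nodes, and the function names, the `eps` gain threshold, and the `max_features` cap are all illustrative assumptions:

```python
def manager_loop(executors, initial_selected, get_feature_data,
                 max_features, eps):
    """Driver on the management node: broadcast the current selected set
    (plus its feature data) to all executors, collect each executor's best
    candidate set, keep the global best, and stop on a termination condition."""
    selected, last_score = list(initial_selected), float("-inf")
    while True:
        data = get_feature_data(selected)
        # 'broadcast' step: every executor scores its local candidate sets
        proposals = [ex(selected, data) for ex in executors]
        proposals = [p for p in proposals if p is not None]
        if not proposals:                      # no remaining features anywhere
            return selected
        best_set, best_score = max(proposals, key=lambda p: p[1])
        if best_score - last_score < eps:      # performance gain too small
            return selected
        selected, last_score = best_set, best_score
        if len(selected) >= max_features:      # feature-count threshold reached
            return selected
```

Each element of `executors` plays the role of one execution node's claim-1 step; in the real system the list comprehension would be a network round trip.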
8. The method of claim 7, wherein the selection process satisfying the termination condition comprises:
the number of features in the first selected feature set is not less than a number threshold; or,
no feature other than the features in the first selected feature set remains in the original feature set; or,
the model performance index increment corresponding to the first selected feature set is smaller than a reference threshold.
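Claim 8's three alternative termination conditions translate directly into a small predicate; the concrete threshold values are left open by the claim, so the parameters below are placeholders:

```python
def selection_terminates(selected, original_features, index_increment,
                         number_threshold, reference_threshold):
    """True if any one of claim 8's termination conditions holds."""
    if len(selected) >= number_threshold:            # enough features selected
        return True
    if not set(original_features) - set(selected):   # no candidate features left
        return True
    if index_increment < reference_threshold:        # performance gain too small
        return True
    return False
```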
9. The method of any one of claims 6-8, wherein the determining an initial selected feature set comprises:
in response to the presence of a reserved feature, taking a set of at least one reserved feature as the initial selected feature set;
or, in response to the absence of a reserved feature, taking the empty set as the initial selected feature set.
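Claim 9's two-branch rule for the initial selected feature set is a one-liner; treating the reserved features as an optional list is an assumption of this sketch:

```python
def initial_selected_set(reserved_features=None):
    """Claim 9: start from the reserved features if any exist,
    otherwise start from the empty set."""
    return list(reserved_features) if reserved_features else []
```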
10. A feature selection apparatus, the apparatus comprising:
a receiving unit, configured to receive an initial selected feature set and feature data corresponding to the initial selected feature set, which are broadcast by a management node;
a construction unit, configured to construct at least one first candidate feature set based on the initial selected feature set and at least one first feature corresponding to stored complete feature data, wherein any first candidate feature set is a set of the initial selected feature set and any one first feature corresponding to the stored complete feature data;
an obtaining unit, configured to obtain a model performance index corresponding to the at least one first candidate feature set based on the stored complete feature data and the feature data corresponding to the initial selected feature set;
and a sending unit, configured to, in response to the model performance indexes corresponding to the at least one first candidate feature set indicating that a first candidate feature set satisfying a selection condition exists among the at least one first candidate feature set, send the first candidate feature set satisfying the selection condition to the management node, wherein the management node is configured to take the first candidate feature set satisfying the selection condition as a first selected feature set and to acquire a target selected feature set based on the first selected feature set.
11. A feature selection apparatus, the apparatus comprising:
a determining unit, configured to determine an initial selected feature set;
an acquisition unit, configured to acquire feature data corresponding to the initial selected feature set;
and a broadcasting unit, configured to broadcast the initial selected feature set and the feature data corresponding to the initial selected feature set to at least two execution nodes, wherein the at least two execution nodes are configured to determine, based on stored complete feature data, the initial selected feature set, and the feature data corresponding to the initial selected feature set, a first candidate feature set satisfying a selection condition among all first candidate feature sets, and to send the first candidate feature set satisfying the selection condition to a management node, any first candidate feature set being a set of the initial selected feature set and any one first feature, and a first feature being a feature in an original feature set other than the features in the initial selected feature set;
wherein the acquisition unit is further configured to take the first candidate feature set satisfying the selection condition as a first selected feature set, and to acquire a target selected feature set based on the first selected feature set.
12. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program code that is loaded and executed by the processor to implement the feature selection method of any of claims 1 to 5 or the feature selection method of any of claims 6 to 9.
13. A computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement the feature selection method of any one of claims 1 to 5 or the feature selection method of any one of claims 6 to 9.
CN202010960198.3A 2020-09-14 2020-09-14 Feature selection method, device, equipment and storage medium Active CN114268625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010960198.3A CN114268625B (en) 2020-09-14 2020-09-14 Feature selection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114268625A (en) 2022-04-01
CN114268625B (en) 2024-01-02

Family

ID=80824075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010960198.3A Active CN114268625B (en) 2020-09-14 2020-09-14 Feature selection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114268625B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115840885B (en) * 2023-02-23 2023-05-09 青岛创新奇智科技集团股份有限公司 Feature selection method and device for depth synthesis features

Citations (8)

Publication number Priority date Publication date Assignee Title
EP1405263A2 (en) * 2001-01-31 2004-04-07 Prediction Dynamics Limited Feature selection for neural networks
CN106407304A (en) * 2016-08-30 2017-02-15 北京大学 Mutual information-based data discretization and feature selection integrated method and apparatus
CN107943582A (en) * 2017-11-14 2018-04-20 广东欧珀移动通信有限公司 Characteristic processing method, apparatus, storage medium and electronic equipment
CN108090570A (en) * 2017-12-20 2018-05-29 第四范式(北京)技术有限公司 For selecting the method and system of the feature of machine learning sample
CN108108820A (en) * 2017-12-20 2018-06-01 第四范式(北京)技术有限公司 For selecting the method and system of the feature of machine learning sample
CN109685107A (en) * 2018-11-22 2019-04-26 东软集团股份有限公司 Feature selection approach, system, computer readable storage medium and electronic equipment
CN109740762A (en) * 2018-12-05 2019-05-10 东软集团股份有限公司 Feature selection approach, device, storage medium and electronic equipment
CN110544166A (en) * 2019-09-05 2019-12-06 北京三快在线科技有限公司 Sample generation method, device and storage medium


Also Published As

Publication number Publication date
CN114268625A (en) 2022-04-01

Similar Documents

Publication Publication Date Title
CN111008332B (en) Content item recommendation method, device, server and storage medium
WO2020094060A1 (en) Recommendation method and apparatus
US11704500B2 (en) Techniques to add smart device information to machine learning for increased context
US20220058222A1 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
CN111339443B (en) User label determination method and device, computer equipment and storage medium
CN110807474A (en) Clustering method and device, storage medium and electronic equipment
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN110909222A (en) User portrait establishing method, device, medium and electronic equipment based on clustering
CN114268625B (en) Feature selection method, device, equipment and storage medium
CN108829846B (en) Service recommendation platform data clustering optimization system and method based on user characteristics
CN112818068A (en) Big data and multidimensional feature-based data tracing method and system
CN111368060A (en) Self-learning method, device and system for conversation robot, electronic equipment and medium
CN111291795A (en) Crowd characteristic analysis method and device, storage medium and computer equipment
CN111767419A (en) Picture searching method, device, equipment and computer readable storage medium
CN115587616A (en) Network model training method and device, storage medium and computer equipment
CN115131058A (en) Account identification method, device, equipment and storage medium
CN114528973A (en) Method for generating business processing model, business processing method and device
CN110377721B (en) Automatic question answering method, device, storage medium and electronic equipment
CN114840686B (en) Knowledge graph construction method, device, equipment and storage medium based on metadata
KR102343848B1 (en) Method and operating device for searching conversion strategy using user status vector
CN112862536B (en) Data processing method, device, equipment and storage medium
CN113515383B (en) System resource data distribution method and device
US20230205754A1 (en) Data integrity optimization
Bielik et al. Big Data, Good Data, and Residential Floor Plans: Feature Selection for Maximizing the Information Value and Minimizing Redundancy in Residential Floor Plan Data Sets
CN113901278A (en) Data search method and device based on global multi-detection and adaptive termination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant