WO2021259003A1 - Feature recognition method and apparatus, and computer device and storage medium


Info

Publication number
WO2021259003A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
candidate
features
meta
training
Prior art date
Application number
PCT/CN2021/096980
Other languages
French (fr)
Chinese (zh)
Inventor
孔清扬
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2021259003A1 publication Critical patent/WO2021259003A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • This application relates to the technical field of artificial intelligence, and in particular to a feature recognition method, device, computer equipment, and storage medium.
  • automated machine learning is a major development direction.
  • the goal of automated machine learning is to use automated data-driven methods to make the aforementioned decisions.
  • the automated machine learning system automatically determines the best solution, so domain experts no longer need to worry about learning the various machine learning algorithms.
  • the most important part of automated machine learning is automated feature engineering.
  • the automatic feature construction tool featuretools uses an algorithm called Deep Feature Synthesis (DFS).
  • the DFS algorithm is an automatic method for performing feature engineering on relational and temporal data; it generates composite features by applying operations (including sum, average, and count) to the data. However, such brute-force feature combination causes a dimensionality explosion, which not only fails to improve the accuracy of the model but also degrades its learning ability. Therefore, how to select the features that are truly effective for machine learning from the constructed features is an urgent problem to be solved.
  • the main purpose of this application is to provide a feature recognition method, device, computer equipment, and storage medium to solve the problem of how to select effective features from the constructed features.
  • this application provides a feature recognition method, which includes the following steps:
  • Each of the first meta features is input into a probability model, and the probability that each first meta feature is a preset label is calculated as the target probability of the corresponding candidate feature being a preset label; wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • This application also provides a feature recognition device, including:
  • the first acquiring unit is configured to acquire multiple original features, and generate multiple candidate features from the multiple original features according to a preset candidate feature generation method
  • a first generating unit configured to generate a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method
  • the first calculation unit is configured to input each of the first meta features into a probability model, and calculate the probability that each first meta feature is a preset label as the target probability of the corresponding candidate feature being a preset label;
  • the comparing unit is configured to compare the target probability of each candidate feature with a first preset threshold, and form all the candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
  • a second calculation unit configured to combine each candidate feature in the candidate feature set with a plurality of original features to calculate an evaluation value of each candidate feature
  • the determining unit is configured to compare each of the evaluation values with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, determine that the candidate feature corresponding to the evaluation value is a valid feature.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the steps of the above-mentioned feature recognition method are implemented:
  • Each of the first meta features is input into a probability model, and the probability that each first meta feature is a preset label is calculated as the target probability of the corresponding candidate feature being a preset label; wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • This application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of a feature recognition method are realized:
  • Each of the first meta features is input into a probability model, and the probability that each first meta feature is a preset label is calculated as the target probability of the corresponding candidate feature being a preset label; wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • candidate features are generated from the original features according to the preset candidate feature generation method; a corresponding first meta feature is then generated from each candidate feature, and the first meta features are input into the pre-trained probability model to calculate the probability that each candidate feature is a preset label. Candidate features whose probability is greater than or equal to the first preset threshold form a candidate feature set; each candidate feature in the set is combined with all original features to calculate its evaluation value, and a candidate feature whose evaluation value is greater than the second preset threshold is an effective feature.
  • This application combines meta-learning to evaluate the constructed features and select those that are effective for machine learning.
  • FIG. 1 is a schematic diagram of the steps of a feature recognition method in an embodiment of the present application
  • Fig. 2 is a structural block diagram of a feature recognition device in an embodiment of the present application
  • FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
  • an embodiment of the present application provides a feature recognition method, including the following steps:
  • Step S1 Obtain multiple original features, and generate multiple candidate features according to a preset candidate feature generation method for the multiple original features;
  • Step S2 generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method
  • Step S3 input each of the first meta features into a probability model, and calculate the probability that each of the first meta features is a preset label as the target probability of each of the candidate features being the preset label; wherein the probability model is trained based on a random forest model;
  • Step S4 comparing the target probability of each candidate feature with a first preset threshold value, and composing a candidate feature set for all the candidate features whose target probability is greater than or equal to the first preset threshold value;
  • Step S5 combining each candidate feature in the candidate feature set with a plurality of original features to calculate an evaluation value of each candidate feature
  • Step S6 comparing each of the evaluation values with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
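The six steps above can be sketched end to end as follows. This is a minimal pure-Python illustration; the candidate generator, meta-feature generator, probability model, and evaluator are hypothetical stand-ins supplied by the caller, not implementations from this application, and the threshold defaults are illustrative:

```python
def recognize_features(original_features, gen_candidates, gen_meta,
                       prob_model, evaluate, p_thresh=0.5, e_thresh=0.7):
    # S1: generate candidate features from the original features
    candidates = gen_candidates(original_features)
    # S2: generate a first meta feature for each candidate
    metas = {c: gen_meta(c) for c in candidates}
    # S3: probability that each candidate carries the preset label
    probs = {c: prob_model(m) for c, m in metas.items()}
    # S4: keep candidates whose target probability clears the first threshold
    shortlist = [c for c in candidates if probs[c] >= p_thresh]
    # S5: evaluate each shortlisted candidate together with all original features
    scores = {c: evaluate(original_features + [c]) for c in shortlist}
    # S6: candidates whose evaluation clears the second threshold are valid
    return [c for c in shortlist if scores[c] > e_thresh]
```

The application only requires that both thresholds be configurable; the interesting work happens inside the probability model (steps S3a to S3e below) and the evaluator (steps S501/S50a below).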
  • Feature engineering is essentially an engineering activity whose purpose is to extract features from raw data to the maximum extent for use by algorithms and models.
  • in existing approaches, features are only mechanically combined and merged, and the new features generated in this way do not necessarily improve the accuracy of the model; therefore, it is necessary to select the features that are effective for machine learning from the generated new features.
  • the above-mentioned original features are basic features, and a corresponding candidate feature is generated from each of the original features according to a pre-established candidate feature generation method.
  • the pre-established candidate feature generation method may include one or more of the following:
  • Feature transformation: convert the continuous features and time features among the original features into discrete features, e.g. divide the value range of a numerical feature into multiple equal segments, or convert a time feature into week number, month, year, whether it is a weekend, etc.; or map the range of a continuous feature to a specific distribution, such as using normalization to scale numeric variables to [0, 1];
  • Feature combination: combine two features into a new feature, e.g. apply addition, subtraction, multiplication, and division to two continuous variables, or combine two categorical variables;
  • High-order feature crossing: generate a new feature from no fewer than two features, e.g. maximum, minimum, average, variance, and count aggregations.
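As a concrete illustration of the three generation methods, the sketch below derives candidates from numeric original features only; the function name, feature-name scheme, and restriction to numeric columns are assumptions for illustration:

```python
def generate_candidates(rows, num_cols):
    """Generate candidate features from numeric original features.

    rows: list of dicts (one per sample); num_cols: numeric feature names.
    """
    candidates = {}
    # Feature transformation: normalize each numeric feature to [0, 1]
    for col in num_cols:
        vals = [r[col] for r in rows]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0          # guard against constant features
        candidates["norm(%s)" % col] = [(v - lo) / span for v in vals]
    # Feature combination: pairwise arithmetic on two features
    for i, a in enumerate(num_cols):
        for b in num_cols[i + 1:]:
            candidates["%s+%s" % (a, b)] = [r[a] + r[b] for r in rows]
            candidates["%s*%s" % (a, b)] = [r[a] * r[b] for r in rows]
    # High-order crossing: aggregate over all (>= 2) numeric features per row
    candidates["max(all)"] = [max(r[c] for c in num_cols) for r in rows]
    candidates["mean(all)"] = [sum(r[c] for c in num_cols) / len(num_cols)
                               for r in rows]
    return candidates
```

Discrete and time features would get analogous treatment (binning, calendar decomposition, categorical crossing) following the same pattern.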
  • a corresponding first meta feature is generated for each of the candidate features generated above, and the preset first meta feature generation method is used to generate the following two kinds of meta information:
  • Entropy and statistical tests of candidate features: divide all original features into three subgroups according to the original feature type, namely discrete, numeric, and date-time; use the chi-square test and t-test to calculate the correlation between each group and the candidate feature, and also calculate an entropy-based measure of the candidate feature;
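The entropy-based measure mentioned above can be computed as in this sketch (Shannon entropy over equal-width bins; the binning scheme and bin count are assumptions, not specified by the application):

```python
import math
from collections import Counter

def feature_entropy(values, bins=4):
    """Shannon entropy of a numeric candidate feature, discretized into
    equal-width bins; a constant feature has entropy 0."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / bins) or 1.0    # avoid division by zero
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(binned)
    return -sum((c / n) * math.log2(c / n) for c in Counter(binned).values())
```

A low-entropy candidate carries little information and is a plausible signal that the feature will not help a downstream learner.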
  • each of the first meta features is input into a probability model that has been pre-trained based on a random forest model.
  • the trained probability model calculates the probability that each of the first meta features is a preset label, which is used as the target probability of the corresponding candidate feature being a preset label; the preset label can be "good" or "bad".
  • the first preset threshold is a user-defined value in the interval [0, 1].
  • for example, the first preset threshold is set to 0.5, and all candidate features whose target probabilities are greater than or equal to the first preset threshold form a candidate feature set.
  • each candidate feature in the candidate feature set is combined with multiple original features to calculate the evaluation value of each candidate feature;
  • AUC (Area Under the Curve, a model evaluation index) or accuracy can be used as the evaluation value of the candidate feature.
  • the AUC value is the area under the ROC (Receiver Operating Characteristic) curve, and this AUC value is used as the evaluation value of the candidate feature.
  • Accuracy refers to the percentage of results that are predicted correctly.
  • the evaluation value corresponding to each candidate feature is compared with a second preset threshold.
  • the second preset threshold can be preset by the user or determined by all the original features.
  • if the evaluation value of a candidate feature is greater than the second preset threshold, it is determined that the candidate feature is a valid feature; since each candidate feature is combined with all the original features, the second preset threshold can be determined by all the original features.
  • the evaluation value is greater than the second preset threshold, indicating that the addition of the candidate feature can be beneficial to the use of the algorithm or model.
  • the candidate features constructed from the original features are first input into the probability model to calculate the target probability of the preset label; candidate features whose target probability is greater than or equal to the first preset threshold are selected according to the target probability, the selected candidate features are combined with all the original features to calculate their respective evaluation values, and candidate features whose evaluation value is greater than the second preset threshold are taken as effective features.
  • This application combines meta-learning to quickly and effectively select, from the constructed features, the features that are effective for machine learning.
  • before step S3 of inputting each of the first meta features into the probability model and calculating the probability that each first meta feature is a preset label as the target probability of each candidate feature being a preset label, the method also includes:
  • Step S3a Obtain multiple labeled training sets, the labeled training set containing multiple original training features
  • Step S3b generating multiple candidate training features from multiple original training features according to the preset candidate feature generating method
  • Step S3c Generate a corresponding second meta feature for each of the labeled training sets according to a preset second meta feature generation method, and generate a corresponding third meta feature for each of the plurality of candidate training features according to the preset first meta feature generation method.
  • Step S3d assign a label to each candidate training feature
  • Step S3e Combine the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate training feature to generate new training data sets; input the new training data sets into the random forest model for training, so that the output of the random forest model is the probability of the label, obtaining the trained probability model.
  • step S3a multiple labeled training sets are obtained, the labeled training sets are obtained from a data set storage library, and each labeled training set contains multiple original training features.
  • step S3b the original training features of each labeled training set are generated according to the preset candidate feature generation method to generate multiple candidate features, and the preset candidate feature generation method is specifically as described in the previous embodiment.
  • as described in step S3c, a corresponding second meta feature is generated for each of the labeled training sets according to the preset second meta feature generation method, and third meta features corresponding to the plurality of candidate training features are generated according to the preset first meta feature generation method; the third meta features include the two types of meta information described in the previous embodiment.
  • the preset second meta feature generating method is used to generate the following four types of meta information:
  • General information: general statistical information about the labeled training set, such as the number of original training features, statistics on the data size, and other statistics of the original training features;
  • Initial evaluation: statistics of the current performance when the learning algorithm is applied to the original training features; the generated meta information includes the defined evaluation metric and the running time, where the evaluation metric can be AUC or accuracy;
  • Feature diversity: after dividing all original training features into type-based subgroups, use the chi-square test and t-test to calculate the similarity of each pair of original training features within a group.
  • a label is assigned to each candidate training feature.
  • the label can be "good" or "bad", indicating whether the candidate training feature can improve the performance of the learner; when it can improve the performance of the learner, the label of the candidate training feature is "good".
  • as described in step S3e above, a new training data set is generated from the second meta features corresponding to the multiple labeled training sets and the third meta feature corresponding to one candidate training feature; as many new training data sets are formed as there are third meta features. Each new training data set is then input into the random forest model for training, so that the output of the random forest model is the probability of the label; after all the new training data sets have been input into the random forest model for iterative training, the trained probability model is obtained.
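The assembly of this meta-level training data can be sketched as follows. The key layout (a `(dataset_id, feature_name)` tuple per candidate training feature) is an assumption for illustration; the resulting `(X, y)` pairs would then be fed to a random forest classifier such as scikit-learn's `RandomForestClassifier`, whose `predict_proba` output plays the role of the probability model:

```python
def build_meta_training_set(dataset_metas, candidate_metas, labels):
    """One training example per candidate training feature: the second meta
    features of its source labeled training set concatenated with the
    candidate's third meta features, labeled 'good' or 'bad' (step S3d)."""
    X, y = [], []
    for cand_id, third_meta in candidate_metas.items():
        dataset_id = cand_id[0]          # assumed key layout: (dataset, name)
        X.append(list(dataset_metas[dataset_id]) + list(third_meta))
        y.append(labels[cand_id])
    return X, y
```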
  • the step S3d of assigning a label to each candidate training feature includes:
  • Step S3d1 input a plurality of the labeled training sets into a learner, and calculate a first learning value of the plurality of the labeled training sets;
  • Step S3d2 combining a plurality of the labeled training sets and each candidate training feature into a data set
  • Step S3d3 input each of the data sets into the learner, and calculate the respective second learning value of each of the data sets;
  • Step S3d4 comparing the second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
  • a plurality of the labeled training sets are input to a learner, and the first learning values of the plurality of labeled training sets are calculated.
  • the learner serves as a model evaluation criterion and is used to calculate the learning value of an input data set; the learner can calculate AUC or accuracy.
  • as described in step S3d2, the multiple labeled training sets and one candidate training feature are combined to form a new data set; as many data sets are formed as there are candidate training features, and each data set includes all the labeled training sets.
  • as described in step S3d3, each of the data sets is input into the learner and the second learning value of each new data set is calculated; since each candidate training feature is combined with all the labeled training sets, the second learning value of a data set can be regarded as the second learning value of its candidate training feature.
  • as described in step S3d4, each second learning value is compared with the first learning value. If a second learning value is greater than the first learning value, it indicates that the candidate training feature can improve the performance of the learner, and the candidate training feature corresponding to that second learning value is assigned a first-type label; the first-type label may be "good".
  • when the second learning value is less than or equal to the first learning value, a second-type label is assigned to the candidate training feature corresponding to that second learning value; the second-type label may be "bad".
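Steps S3d1 to S3d4 reduce to a comparison against the baseline learning value; a minimal sketch, assuming the learner scores have already been computed:

```python
def assign_labels(first_learning_value, second_learning_values):
    """A candidate training feature is labeled 'good' (first-type) only when
    adding it raises the learner's score above the baseline; otherwise it is
    labeled 'bad' (second-type)."""
    return {name: ("good" if v > first_learning_value else "bad")
            for name, v in second_learning_values.items()}
```

Note that a tie is labeled "bad", matching the strict "greater than" comparison in step S3d4.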
  • the step S5 of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • Step S501 combine the multiple original features and each candidate feature in the candidate feature set into a first target feature set;
  • Step S502 Calculate the AUC value of the first target feature set, and use the AUC value of the first target feature set as the evaluation value of the candidate feature.
  • as described in step S501, the multiple original features and each candidate feature in the candidate feature set are combined to form a first target feature set; as many first target feature sets are formed as there are candidate features in the candidate feature set.
  • the AUC value of the first target feature set is calculated.
  • Each first target feature set includes one candidate feature and all the original features, and the AUC value of the first target feature set is taken as the evaluation value of that candidate feature.
  • in another embodiment, the step S5 of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • Step S50a combining a plurality of the original features and each candidate feature in the candidate feature set to form a second target feature set
  • Step S50b calculating the AUC value and accuracy of each of the second target feature sets
  • as described in step S50a, the multiple original features and one candidate feature from the candidate feature set are combined to form a second target feature set; as many second target feature sets are formed as there are candidate features in the candidate feature set, and each second target feature set includes all the original features and one candidate feature from the candidate feature set.
  • the candidate feature in each second target feature set is different, so the AUC value and accuracy of a second target feature set can be used as the basis for calculating the evaluation value of its candidate feature.
  • the evaluation value of the candidate feature is then calculated using the formula M = a·k1 + b·k2 defined in this application.
  • this formula combines the original features and calculates the evaluation value of the candidate feature from both the AUC value and the accuracy, so as to improve the accuracy of the evaluation value of the candidate feature.
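The weighted evaluation value M = a·k1 + b·k2 can be sketched directly; the equal default weights here are an assumption, since the application leaves k1 and k2 to the user:

```python
def evaluation_value(auc_value, acc_value, k1=0.5, k2=0.5):
    """M = a*k1 + b*k2: a is the AUC value of the second target feature set,
    b is its accuracy, and k1/k2 are user-chosen weights (defaults assumed)."""
    return auc_value * k1 + acc_value * k2
```

Setting k2 = 0 recovers the pure-AUC variant of steps S501/S502.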
  • an embodiment of the present application also provides a feature recognition device, including:
  • the first acquiring unit 10 is configured to acquire multiple original features, and generate multiple candidate features from the multiple original features according to a preset candidate feature generation method
  • the first generating unit 20 is configured to generate a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method
  • the first calculation unit 30 is configured to input each of the first meta features into the probability model, and calculate the probability that each first meta feature is a preset label as the target probability of the corresponding candidate feature being a preset label;
  • the comparing unit 40 is configured to compare the target probability of each candidate feature with a first preset threshold, and form all the candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
  • the second calculation unit 50 is configured to combine each candidate feature in the candidate feature set with a plurality of original features to calculate an evaluation value of each candidate feature;
  • the determining unit 60 is configured to compare each of the evaluation values with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, determine that the candidate feature corresponding to the evaluation value is a valid feature .
  • the feature recognition device further includes:
  • the second acquiring unit is configured to acquire multiple labeled training sets, where the labeled training set contains multiple original training features;
  • the second generating unit is configured to generate multiple candidate training features from the multiple original training features according to the preset candidate feature generating method
  • the third generating unit is configured to generate a corresponding second meta feature for each of the labeled training sets according to a preset second meta feature generation method, and to generate a corresponding third meta feature for each of the plurality of candidate training features according to the preset first meta feature generation method;
  • An allocation unit configured to allocate a label to each candidate training feature
  • the training unit is configured to combine the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate training feature to generate new training data sets, and to input the new training data sets into the random forest model for training, so that the output of the random forest model is the probability of the label, obtaining the trained probability model.
  • the allocating unit includes:
  • the first calculation subunit is configured to input a plurality of the marked training sets into a learner, and calculate the first learning value of the plurality of the marked training sets;
  • a combination subunit configured to combine a plurality of the labeled training sets and each of the candidate training features into a data set
  • the second calculation subunit is used to input each of the data sets into the learner, and calculate the respective second learning value of each of the data sets;
  • the allocation subunit is used to compare the second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, to assign a first-type label to the candidate training feature added to that data set.
  • the second calculation unit 50 includes:
  • the first composition subunit is used to compose a plurality of the original features and each candidate feature in the candidate feature set into a first target feature set;
  • the third calculation subunit is configured to calculate the AUC value of the first target feature set, and use the AUC value of the first target feature set as the evaluation value of the candidate feature.
  • the second calculation unit 50 includes:
  • the second composition subunit is used to compose a plurality of the original features and each candidate feature in the candidate feature set into a second target feature set;
  • the fourth calculation subunit is used to calculate the AUC value and accuracy of each of the second target feature sets
  • M = a·k1 + b·k2; wherein a is the AUC value of the second target feature set, and b is the accuracy of the second target feature set.
  • k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store the original feature data, the first meta feature data, and so on.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a feature recognition method.
  • the above-mentioned processor executes the steps of the above-mentioned feature recognition method:
  • Each of the first meta features is input into a probability model, and the probability that each first meta feature is a preset label is calculated as the target probability of the corresponding candidate feature being a preset label; wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • before the processor executes the step of inputting each of the first meta features into the probability model and calculating the probability that each first meta feature is a preset label as the target probability of each candidate feature being a preset label, the method also includes:
  • the above-mentioned processor executing the step of assigning a label to each candidate feature includes:
  • Input a plurality of the labeled training sets into a learner, and calculate a first learning value of the plurality of the labeled training sets;
  • the second learning value of each data set is compared with the first learning value; if the second learning value of a data set is greater than the first learning value, the candidate training feature added to that data set is assigned a first-type label.
  • the above-mentioned processor executing the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • the AUC value of the first target feature set is calculated, and the AUC value of the first target feature set is used as the evaluation value of the candidate feature.
  • the above-mentioned processor executing the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • M = ak1 + bk2; wherein, a is the AUC value of the second target feature, b is the accuracy of the second target feature, k1 is the weight of the AUC value of the second target feature, and k2 is the weight of the accuracy of the second target feature.
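As an illustration, the weighted evaluation value above can be sketched as follows; the weights k1 and k2 here are illustrative defaults, not values given in this application:

```python
def evaluation_value(a, b, k1=0.6, k2=0.4):
    """M = a*k1 + b*k2, where a is the AUC value and b is the accuracy
    of the second target feature; k1 and k2 are illustrative weights."""
    return a * k1 + b * k2
```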
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the above-mentioned storage medium may be a non-volatile storage medium or a volatile storage medium.
  • a computer program is stored thereon, and when the computer program is executed by the processor, a feature recognition method is realized, specifically:
  • Each of the first meta features is input into a probability model, and the probability that each of the first meta features is a preset label is calculated as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • before the processor executes the step of inputting each of the first meta features into the probability model and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, the method further includes:
  • the above-mentioned processor executing the step of assigning a label to each candidate training feature includes:
  • Input a plurality of the labeled training sets into a learner, and calculate a first learning value of the plurality of the labeled training sets;
  • the second learning value of each data set is compared with the first learning value; if the second learning value of a data set is greater than the first learning value, the candidate training feature added to that data set is assigned the first category of label.
  • the above-mentioned processor executing the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • the AUC value of the first target feature set is calculated, and the AUC value of the first target feature set is used as the evaluation value of the candidate feature.
  • the above-mentioned processor executing the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • M = ak1 + bk2; wherein, a is the AUC value of the second target feature, b is the accuracy of the second target feature, k1 is the weight of the AUC value of the second target feature, and k2 is the weight of the accuracy of the second target feature.
  • in the feature recognition method, apparatus, computer device and storage medium provided in this application, candidate features are generated from the original features according to the preset candidate feature generation method, and a corresponding first meta feature is then generated from each candidate feature;
  • the first meta features are input into the pre-trained probability model to calculate the probability that each candidate feature is a preset label, and the candidate features whose probability is greater than or equal to the first preset threshold form a candidate feature set;
  • each candidate feature in the candidate feature set is combined with all of the original features to calculate the evaluation value of each candidate feature;
  • a candidate feature whose evaluation value is greater than the second preset threshold is an effective feature;
  • this application combines meta-learning to evaluate the constructed features and select the features that are effective for machine learning.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

Abstract

The present application relates to a feature recognition method and apparatus, and a computer device and a storage medium. The method comprises: generating a plurality of candidate features from a plurality of original features according to a preset candidate feature generation method; generating a corresponding first meta-feature from each candidate feature according to a preset first meta-feature generation method; inputting the first meta-features into a probability model to obtain a target probability that each candidate feature is a preset label; comparing the target probabilities with a first preset threshold value, all candidate features whose target probabilities are greater than or equal to the first preset threshold value constituting a candidate feature set; combining the candidate features in the candidate feature set with the plurality of original features to calculate evaluation values of the candidate features; and comparing the evaluation values with a second preset threshold value, wherein the candidate features whose evaluation values are greater than the second preset threshold value are effective features. By means of the feature recognition method and apparatus, and the computer device and storage medium provided in the present application, features effective for machine learning are selected from constructed features.

Description

Feature recognition method, apparatus, computer device and storage medium

This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on June 23, 2020, with application number 202010583878.8 and entitled "Feature Recognition Method, Device, Computer Equipment and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the technical field of artificial intelligence, and in particular to a feature recognition method and apparatus, a computer device, and a storage medium.
Background

With the continuing popularization and application of big data, machine learning itself is developing toward greater ease of use, a lower technical threshold, and more agile development costs. Automated machine learning is one major direction of this development. The goal of automated machine learning is to use automated, data-driven methods to make these decisions: as long as the user provides the data, the automated machine learning system determines the best solution on its own, and domain experts no longer need to struggle to learn the various machine learning algorithms. The most important part of automated machine learning is automated feature engineering.

The inventor realized that existing automated feature engineering tools can only mechanically combine and merge features; they do not consider whether the constructed features can mine the genuinely useful information in the data, nor can they guarantee that the constructed features will improve the accuracy of the model. For example, the automated feature construction tool Featuretools uses an algorithm called Deep Feature Synthesis (DFS). DFS is an automated method for performing feature engineering on relational and temporal data: it generates synthetic features by applying operations (including sum, mean, and count) to the data. After such brute-force feature combination, however, a curse of dimensionality arises, which not only fails to improve the accuracy of the model but also reduces its learning ability. Therefore, how to select the features that are truly effective for machine learning from the constructed features is an urgent problem to be solved.
Technical problem

The main purpose of this application is to provide a feature recognition method, apparatus, computer device and storage medium, to solve the problem of how to select effective features from the constructed features.

Technical solution

To achieve the above objective, this application provides a feature recognition method, which includes the following steps:
acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;

generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

inputting each of the first meta features into a probability model, and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;

comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to that evaluation value is an effective feature.
This application also provides a feature recognition apparatus, including:

a first acquiring unit, configured to acquire multiple original features and generate multiple candidate features from the multiple original features according to a preset candidate feature generation method;

a first generating unit, configured to generate a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

a first calculating unit, configured to input each of the first meta features into a probability model and calculate the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label;

a comparing unit, configured to compare the target probability of each candidate feature with a first preset threshold, and form a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

a second calculating unit, configured to combine each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

a determining unit, configured to compare each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determine that the candidate feature corresponding to that evaluation value is an effective feature.
This application also provides a computer device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the computer program, implements the steps of the above feature recognition method:

acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;

generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

inputting each of the first meta features into a probability model, and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;

comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to that evaluation value is an effective feature.
This application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of a feature recognition method:

acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;

generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

inputting each of the first meta features into a probability model, and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;

comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to that evaluation value is an effective feature.
Beneficial effects

In the feature recognition method, apparatus, computer device and storage medium provided in this application, candidate features are generated from the original features according to the preset candidate feature generation method, and a corresponding first meta feature is then generated from each candidate feature; the first meta features are input into the pre-trained probability model to calculate the probability that each candidate feature is a preset label; the candidate features whose probability is greater than or equal to the first preset threshold form a candidate feature set; each candidate feature in the candidate feature set is combined with all the original features to calculate the evaluation value of each candidate feature; and a candidate feature whose evaluation value is greater than the second preset threshold is an effective feature. This application combines meta-learning to evaluate the constructed features and select the features that are effective for machine learning.
Description of the drawings

FIG. 1 is a schematic diagram of the steps of a feature recognition method in an embodiment of the present application;

FIG. 2 is a structural block diagram of a feature recognition apparatus in an embodiment of the present application;

FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.

The realization of the objectives, functional characteristics, and advantages of this application will be further described with reference to the embodiments and the accompanying drawings.

Best mode of the invention

To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit it.
Referring to FIG. 1, an embodiment of the present application provides a feature recognition method, including the following steps:

Step S1: acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;

Step S2: generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

Step S3: inputting each of the first meta features into a probability model, and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;

Step S4: comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

Step S5: combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

Step S6: comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to that evaluation value is an effective feature.
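Steps S1 to S6 can be sketched as a single selection routine. This is a hypothetical orchestration, not the implementation of this application: the four callables stand in for the candidate feature generation method, the first meta feature generation method, the trained probability model, and the evaluator.

```python
def select_effective_features(originals, t1=0.5, t2=0.7,
                              generate=None, to_meta=None,
                              prob_model=None, evaluate=None):
    """Hypothetical orchestration of steps S1-S6; the four callables
    stand in for the generation methods, probability model and evaluator."""
    candidates = generate(originals)                           # S1
    metas = {c: to_meta(c) for c in candidates}                # S2
    probs = {c: prob_model(m) for c, m in metas.items()}       # S3
    shortlist = [c for c, p in probs.items() if p >= t1]       # S4
    evals = {c: evaluate(originals + [c]) for c in shortlist}  # S5
    return [c for c, v in evals.items() if v > t2]             # S6
```

Note that the probability filter in S4 uses "greater than or equal to" while the evaluation filter in S6 uses a strict "greater than", matching the wording of the steps above.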
In this embodiment, the above method is applied to feature engineering. Feature engineering is essentially an engineering activity whose purpose is to extract features from raw data to the greatest extent possible for use by algorithms and models. Existing feature engineering only mechanically combines and merges features, and the new features generated in this way do not necessarily improve the accuracy of the model; it is therefore necessary to select, from the generated new features, the features that are effective for machine learning.

Specifically, as described in step S1 above, the original features are basic features, and each original feature is used to generate corresponding candidate features according to the pre-established candidate feature generation method, which may include one or more of the following:

a. Feature transformation: converting continuous features and time features among the original features into discrete features, dividing the value range of a numeric feature into multiple equal segments, converting a time feature into the week number, month, year, whether the day is a weekend, and so on, or fitting the range of a continuous feature to a specific distribution, for example using normalization to scale a numeric variable to [0, 1];

b. Feature combination: merging two features into one new feature, for example applying addition, subtraction, multiplication, or division to two continuous variables, or combining two categorical variables;

c. High-order feature crossing: generating one new feature from no fewer than two features, for example aggregation by maximum, minimum, mean, variance, or count.
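As a minimal sketch (not the implementation of this application), the three candidate feature generation strategies above might look as follows for numeric features; the feature names and helper functions are illustrative:

```python
def normalize(values):
    """Feature transformation: scale a numeric feature to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant features
    return [(v - lo) / span for v in values]

def combine(a, b):
    """Feature combination: merge two numeric features into new ones."""
    return {
        "sum":  [x + y for x, y in zip(a, b)],
        "diff": [x - y for x, y in zip(a, b)],
        "prod": [x * y for x, y in zip(a, b)],
    }

def cross(features):
    """High-order feature crossing: aggregate two or more features row-wise."""
    aggs = {"max": [], "min": [], "mean": []}
    for row in zip(*features):
        aggs["max"].append(max(row))
        aggs["min"].append(min(row))
        aggs["mean"].append(sum(row) / len(row))
    return aggs

# Two hypothetical original features.
age, income = [20.0, 40.0, 60.0], [1.0, 3.0, 5.0]
candidates = {"age_norm": normalize(age)}
candidates.update(combine(age, income))
candidates.update(cross([age, income]))
```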
As described in step S2 above, each of the candidate features generated above is used to generate a corresponding first meta feature according to the pre-established first meta feature generation method, which is used to generate the following two kinds of meta information:

1. Entropy values and statistical tests of the candidate feature: all original features are divided into three subgroups according to their type, namely discrete, numeric, and date-time; the chi-square test and the t-test are used to calculate the correlation between each group and the candidate feature, and an entropy-based measure of the candidate feature is also calculated;

2. Information about the candidate feature, including its maximum, minimum, mean, variance, and so on.
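A minimal sketch of the second kind of meta information, plus an entropy-based measure of the first kind, might look as follows. The function names are illustrative, and the entropy shown is the plain Shannon entropy of a discretized feature, standing in for the specific statistical tests described above:

```python
import math
from statistics import mean, pvariance

def first_meta_features(values):
    """Basic statistics of a candidate feature (second kind of meta information)."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean(values),
        "variance": pvariance(values),
    }

def entropy(discrete_values):
    """Shannon entropy of a discretized candidate feature."""
    n = len(discrete_values)
    counts = {}
    for v in discrete_values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```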
As described in step S3 above, each first meta feature is input into a probability model that has been pre-trained based on a random forest model. The trained probability model can calculate the probability that each first meta feature is a preset label, and this probability is taken as the target probability of each candidate feature being the preset label. The preset label may be "good" or "bad".

As described in step S4 above, each target probability is compared with a first preset threshold, which is a user-defined value in [0, 1]; for example, with the first preset threshold set to 0.5, all candidate features whose target probability is greater than or equal to the first preset threshold form one candidate feature set.

As described in step S5 above, each candidate feature in the candidate feature set is combined with the multiple original features to calculate the evaluation value of each candidate feature. The AUC (Area Under the Curve, a model evaluation index) or the accuracy can be used as the evaluation value of the candidate feature. The AUC value is the area under the ROC (Receiver Operating Characteristic) curve; when the AUC value is used as the evaluation value, the larger the AUC value, the more reliably the candidate feature can be judged an effective feature. Accuracy refers to the proportion of correctly predicted results.

As described in step S6 above, the evaluation value corresponding to each candidate feature is compared with a second preset threshold, which may be preset by the user or determined from all the original features. If the evaluation value of a candidate feature is greater than the second preset threshold, that candidate feature is determined to be an effective feature. Since each candidate feature is combined with all the original features, the second preset threshold can be determined from all the original features: when the evaluation value obtained after adding one candidate feature to all the original features is greater than the second preset threshold, the addition of that candidate feature is beneficial to the algorithm or model. In this application, the candidate features constructed from the original features are first input into the probability model to calculate their target probability of being the preset label; the candidate features whose target probability is greater than or equal to the first preset threshold are selected; the selected candidate features are combined with all the original features to calculate their respective evaluation values; and the candidate features whose evaluation value is greater than the second preset threshold are taken as effective features. This application combines meta-learning to quickly and effectively select, from the constructed features, the features that are effective for machine learning.
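For illustration, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. In practice the labels and scores would come from a model trained on the original features combined with the candidate feature; the data below is invented:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive example is scored
    above a random negative one; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random scoring, and 1.0 to a perfect ranking, which is why a larger evaluation value indicates a more useful candidate feature.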
In an embodiment, before step S3 of inputting each of the first meta features into the probability model and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, the method further includes:

Step S3a: acquiring multiple labeled training sets, each labeled training set containing multiple original training features;

Step S3b: generating multiple candidate training features from the multiple original training features according to the preset candidate feature generation method;

Step S3c: generating a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and generating corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;

Step S3d: assigning a label to each candidate training feature;

Step S3e: combining the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate training feature to generate a new training data set, and inputting the new training data set into a random forest model for training so that the output of the random forest model is the probability of the label, thereby obtaining the trained probability model.
In this embodiment, as described in step S3a above, multiple labeled training sets are acquired from a repository of data sets, and each labeled training set contains multiple original training features.

As described in step S3b above, the original training features of each labeled training set are used to generate multiple candidate training features according to the preset candidate feature generation method, which is specifically as described in the previous embodiment.

As described in step S3c above, a corresponding second meta feature is generated for each labeled training set according to the preset second meta feature generation method, with each labeled training set yielding one second meta feature; the corresponding third meta features are generated from the multiple candidate training features according to the preset first meta feature generation method, and each third meta feature includes the two kinds of meta information described in the previous embodiment. The preset second meta feature generation method is used to generate the following four kinds of meta information:
a、一般信息:分析标记训练集的一般统计信息,如原始训练特征的数量、数据大小的统计信息以及原始训练特征的其他统计信息;a. General information: Analyze the general statistical information of the marked training set, such as the number of original training features, the statistical information of the data size, and other statistical information of the original training features;
b、初始评估:对原始训练特征应用学习算法时当前性能的统计,生成的元信息包括定义的评估机制、运行时间,评估机制可包括AUC或者准确度;b. Initial evaluation: Statistics of the current performance when the learning algorithm is applied to the original training features. The generated meta-information includes the defined evaluation mechanism and running time. The evaluation mechanism can include AUC or accuracy;
c、基于熵的度量:根据原始训练特征的类型,即离散型、数值型、日期-时间型,将所有原始训练特征按照类型划分成三个子组,并计算每个子组中原始训练特征的信息增益;c. Entropy-based measurement: According to the types of original training features, namely discrete, numeric, and date-time, all the original training features are divided into three subgroups according to their types, and the information of the original training features in each subgroup is calculated Gain
d、特征多样性:将所有原始训练特征分成基于类型的子组后,使用卡方检验和t检验来计算组中每对原始训练特征的相似度。d. Feature diversity: After dividing all original training features into type-based subgroups, use chi-square test and t-test to calculate the similarity of each pair of original training features in the group.
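Items a and c above lend themselves to a direct computation. The following Python sketch shows one possible way to derive the "general information" and "entropy-based" meta-information for a labeled training set; the function names, the column-list data layout, and the binary log base are illustrative assumptions, not part of the disclosed method:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """Information gain of `labels` obtained by splitting on a discrete `feature`."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def general_meta_features(features, labels):
    """Item a (general information) plus item c (entropy-based measures)
    for one labeled training set; `features` is a list of feature columns."""
    return {
        "n_features": len(features),   # number of original training features
        "n_rows": len(labels),         # a data-size statistic
        "info_gain": [information_gain(col, labels) for col in features],
    }
```

A feature column identical to the labels yields an information gain of 1 bit on a balanced binary label, while an uninformative column yields 0.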
As described in step S3d above, a label is assigned to each candidate training feature. The label may be "good" or "bad", indicating whether the candidate training feature can improve the performance of the learner; when it can, the candidate training feature is labeled "good".
As described in step S3e above, the second meta-features corresponding to the multiple labeled training sets are combined with the third meta-feature corresponding to one candidate training feature to generate one new training data set, so as many new training data sets are formed as there are third meta-features. The new training data sets are then input into the random forest model for training so that the output of the random forest model is the probability of the label; after every new training data set has been input into the random forest model for iterative training, the trained probability model is obtained.
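Step S3e can be sketched as follows. This illustrative fragment assumes scikit-learn's RandomForestClassifier as the random forest model and uses randomly generated stand-ins for the second and third meta-features; the array shapes and the toy labeling rule are assumptions made for the example only (in the disclosure, the labels come from step S3d):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_candidates = 20
rng = np.random.default_rng(0)
dataset_meta = rng.normal(size=4)                    # second meta-features (shared per labeled training set)
candidate_meta = rng.normal(size=(n_candidates, 3))  # third meta-features, one row per candidate

# One training row per candidate feature:
# [second meta-features | that candidate's third meta-features]
X = np.hstack([np.tile(dataset_meta, (n_candidates, 1)), candidate_meta])
# Toy "good"/"bad" labels (1 = good) via a mean split, for illustration only
y = (candidate_meta[:, 0] > candidate_meta[:, 0].mean()).astype(int)

prob_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Probability that each candidate feature carries the "good" label
p_good = prob_model.predict_proba(X)[:, list(prob_model.classes_).index(1)]
```

At inference time the same concatenation is applied to a new candidate's meta-features, and `predict_proba` yields the target probability compared against the first preset threshold.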
In one embodiment, step S3d of assigning a label to each candidate feature includes:
Step S3d1: inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
Step S3d2: combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
Step S3d3: inputting each data set into the learner, and calculating a second learning value for each data set;
Step S3d4: comparing the second learning value of each data set with the first learning value, and if a data set's second learning value is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
In this embodiment, as described in step S3d1 above, the multiple labeled training sets are input into a learner, and the first learning value of the multiple labeled training sets is calculated. The learner is a model-evaluation criterion used to calculate a learning value for an input data set; it may compute AUC or accuracy.
As described in step S3d2 above, the multiple labeled training sets are combined with one candidate training feature to form one new data set; as many data sets are formed as there are candidate training features, and each data set includes all the labeled training sets.
As described in step S3d3 above, each data set is input into the learner, and the second learning value of each new data set is calculated. Because each candidate training feature is combined with all the labeled training sets when its second learning value is calculated, that second learning value can be regarded as the second learning value of the candidate training feature itself.
As described in step S3d4 above, each second learning value is compared with the first learning value. If the second learning value is greater than the first learning value, the candidate training feature can improve the learner's performance, and the candidate training feature corresponding to that second learning value is assigned a first-type label, which may be "good"; when the second learning value is less than or equal to the first learning value, the candidate feature corresponding to that second learning value is assigned a second-type label, which may be "bad".
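Steps S3d1 through S3d4 amount to scoring the training data once without and once with each candidate training feature and comparing the two scores. The self-contained sketch below substitutes a toy leave-one-out nearest-neighbour accuracy for the learner's AUC/accuracy (an illustrative assumption; the disclosure does not fix the learner's internals):

```python
def learner_score(rows, labels):
    """Toy 'learner': leave-one-out 1-nearest-neighbour accuracy, standing
    in for the AUC/accuracy evaluation criterion described above."""
    hits = 0
    for i, row in enumerate(rows):
        dist = lambda other: sum((a - b) ** 2 for a, b in zip(row, other))
        _, j = min((dist(rows[j]), j) for j in range(len(rows)) if j != i)
        hits += labels[j] == labels[i]
    return hits / len(rows)

def assign_label(base_rows, candidate_column, labels):
    """Steps S3d1-S3d4: 'good' if appending the candidate column raises the
    learner score above the baseline (first learning value), else 'bad'."""
    first_value = learner_score(base_rows, labels)                          # S3d1
    augmented = [row + [c] for row, c in zip(base_rows, candidate_column)]  # S3d2
    second_value = learner_score(augmented, labels)                         # S3d3
    return "good" if second_value > first_value else "bad"                  # S3d4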
In one embodiment, step S5 of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
Step S501: combining the multiple original features with each candidate feature in the candidate feature set respectively to form first target feature sets;
Step S502: calculating the AUC value of each first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
In this embodiment, as described in step S501 above, the multiple original features and each candidate feature in the candidate feature set form a first target feature set; as many first target feature sets are formed as there are candidate features in the candidate feature set.
As described in step S502 above, the AUC value of each first target feature set is calculated. Each first target feature set includes one candidate feature and all the original features, and the AUC value of the first target feature set is used as the evaluation value of that candidate feature.
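For steps S501 and S502 the evaluation value is simply the AUC obtained on the first target feature set. A minimal rank-based (Mann–Whitney) AUC computation might look like this; the function name and the convention that higher scores indicate the positive class are illustrative assumptions:

```python
def auc(scores, labels):
    """AUC via the rank (Mann-Whitney) formulation: the fraction of
    positive/negative pairs ranked correctly, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Here `scores` would be the predictions of a model trained on a first target feature set, so the resulting AUC directly serves as that candidate feature's evaluation value.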
In one embodiment, step S5 of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
Step S50a: combining the multiple original features with each candidate feature in the candidate feature set respectively to form second target feature sets;
Step S50b: calculating the AUC value and the accuracy of each second target feature set;
Step S50c: calculating the evaluation value of the candidate feature according to the formula M = ak1 + bk2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
In this embodiment, as described in step S50a above, the multiple original features and one candidate feature from the candidate feature set form a second target feature set; as many second target feature sets are formed as there are candidate features in the candidate feature set, and each second target feature set includes all the original features and one candidate feature from the candidate feature set.
As described in step S50b above, the AUC value and accuracy of each second target feature set are calculated. Because each second target feature set includes all the original features and one candidate feature from the candidate feature set, and the candidate feature differs between second target feature sets, the AUC value and accuracy of a second target feature set can serve as the basis for calculating that candidate feature's evaluation value.
As described in step S50c above, the above formula is used to calculate the evaluation value of the candidate feature; it incorporates the original features and evaluates the candidate feature in terms of both the AUC value and the accuracy, improving the accuracy of the candidate feature's evaluation value.
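Step S50c's weighted combination can be written out directly. The default weight values below are placeholders, since the disclosure does not fix k1 and k2:

```python
def evaluation_value(auc_value, accuracy, k1=0.6, k2=0.4):
    """Step S50c: M = a*k1 + b*k2, weighting the second target feature set's
    AUC value (a) and accuracy (b). The default weights are illustrative
    assumptions; the disclosure leaves k1 and k2 unspecified."""
    return auc_value * k1 + accuracy * k2
```

Setting k2 = 0 recovers the AUC-only variant of steps S501 and S502 as a special case.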
Referring to Fig. 2, an embodiment of the present application further provides a feature recognition apparatus, including:
a first acquisition unit 10, configured to acquire multiple original features and generate multiple candidate features from the multiple original features according to a preset candidate-feature generation method;
a first generation unit 20, configured to generate a corresponding first meta-feature for each candidate feature according to a preset first meta-feature generation method;
a first calculation unit 30, configured to input each first meta-feature into a probability model and calculate the probability that each first meta-feature corresponds to a preset label, as the target probability that each candidate feature corresponds to the preset label;
a comparison unit 40, configured to compare the target probability of each candidate feature with a first preset threshold and form a candidate feature set from all candidate features whose target probability is greater than or equal to the first preset threshold;
a second calculation unit 50, configured to combine each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature;
a determination unit 60, configured to compare each evaluation value with a second preset threshold and, if an evaluation value is greater than the second preset threshold, determine that the candidate feature corresponding to the evaluation value is a valid feature.
In one embodiment, the feature recognition apparatus further includes:
a second acquisition unit, configured to acquire multiple labeled training sets, each labeled training set containing multiple original training features;
a second generation unit, configured to generate multiple candidate training features from the multiple original training features according to the preset candidate-feature generation method;
a third generation unit, configured to generate a corresponding second meta-feature for each labeled training set according to a preset second meta-feature generation method, and to generate corresponding third meta-features from the multiple candidate training features according to the preset first meta-feature generation method;
an assignment unit, configured to assign a label to each candidate training feature;
a training unit, configured to combine the second meta-features corresponding to the multiple labeled training sets with the third meta-feature corresponding to each candidate feature to generate new training data sets, and to input the new training data sets into a random forest model for training so that the output of the random forest model is the probability of the label, obtaining a trained probability model.
Further, the assignment unit includes:
a first calculation subunit, configured to input the multiple labeled training sets into a learner and calculate a first learning value of the multiple labeled training sets;
a combination subunit, configured to combine the multiple labeled training sets with each candidate training feature respectively to form data sets;
a second calculation subunit, configured to input each data set into the learner and calculate a second learning value for each data set;
an assignment subunit, configured to compare the second learning value of each data set with the first learning value and, if a data set's second learning value is greater than the first learning value, assign a first-type label to the candidate training feature added to that data set.
In one embodiment, the second calculation unit 50 includes:
a first composition subunit, configured to combine the multiple original features with each candidate feature in the candidate feature set respectively to form first target feature sets;
a third calculation subunit, configured to calculate the AUC value of each first target feature set and use the AUC value of the first target feature set as the evaluation value of the candidate feature.
In one embodiment, the second calculation unit 50 includes:
a second composition subunit, configured to combine the multiple original features with each candidate feature in the candidate feature set respectively to form second target feature sets;
a fourth calculation subunit, configured to calculate the AUC value and the accuracy of each second target feature set;
a fifth calculation subunit, configured to calculate the evaluation value of the candidate feature according to the formula M = ak1 + bk2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
In this embodiment, for the specific implementation of each of the above units and subunits, refer to the description in the foregoing method embodiment; details are not repeated here.
Referring to Fig. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program on the non-volatile storage medium. The database of the computer device stores data such as original feature data and first meta-feature data. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer program implements a feature recognition method.
The processor executes the steps of the feature recognition method:
acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate-feature generation method;
generating a corresponding first meta-feature for each candidate feature according to a preset first meta-feature generation method;
inputting each first meta-feature into a probability model, and calculating the probability that each first meta-feature corresponds to a preset label, as the target probability that each candidate feature corresponds to the preset label, where the probability model is trained based on a random forest model;
comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all candidate features whose target probability is greater than or equal to the first preset threshold;
combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature;
comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to the evaluation value is a valid feature.
In one embodiment, before executing the step of inputting each first meta-feature into the probability model and calculating the probability that each first meta-feature corresponds to a preset label as the target probability that each candidate feature corresponds to the preset label, the processor further executes:
acquiring multiple labeled training sets, each labeled training set containing multiple original training features;
generating multiple candidate training features from the multiple original training features according to the preset candidate-feature generation method;
generating a corresponding second meta-feature for each labeled training set according to a preset second meta-feature generation method, and generating corresponding third meta-features from the multiple candidate training features according to the preset first meta-feature generation method;
assigning a label to each candidate training feature;
combining the second meta-features corresponding to the multiple labeled training sets with the third meta-feature corresponding to each candidate feature to generate new training data sets, and inputting the new training data sets into a random forest model for training so that the output of the random forest model is the probability of the label, obtaining a trained probability model.
In one embodiment, the step, executed by the processor, of assigning a label to each candidate feature includes:
inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
inputting each data set into the learner, and calculating a second learning value for each data set;
comparing the second learning value of each data set with the first learning value, and if a data set's second learning value is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
In one embodiment, the step, executed by the processor, of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
combining the multiple original features with each candidate feature in the candidate feature set respectively to form first target feature sets;
calculating the AUC value of each first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
In one embodiment, the step, executed by the processor, of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
combining the multiple original features with each candidate feature in the candidate feature set respectively to form second target feature sets;
calculating the AUC value and the accuracy of each second target feature set;
calculating the evaluation value of the candidate feature according to the formula M = ak1 + bk2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
Those skilled in the art can understand that the structure shown in Fig. 3 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium, which may be a non-volatile storage medium or a volatile storage medium. A computer program is stored thereon, and when executed by a processor the computer program implements a feature recognition method, specifically:
acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate-feature generation method;
generating a corresponding first meta-feature for each candidate feature according to a preset first meta-feature generation method;
inputting each first meta-feature into a probability model, and calculating the probability that each first meta-feature corresponds to a preset label, as the target probability that each candidate feature corresponds to the preset label, where the probability model is trained based on a random forest model;
comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all candidate features whose target probability is greater than or equal to the first preset threshold;
combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature;
comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to the evaluation value is a valid feature.
In one embodiment, before executing the step of inputting each first meta-feature into the probability model and calculating the probability that each first meta-feature corresponds to a preset label as the target probability that each candidate feature corresponds to the preset label, the processor further executes:
acquiring multiple labeled training sets, each labeled training set containing multiple original training features;
generating multiple candidate training features from the multiple original training features according to the preset candidate-feature generation method;
generating a corresponding second meta-feature for each labeled training set according to a preset second meta-feature generation method, and generating corresponding third meta-features from the multiple candidate training features according to the preset first meta-feature generation method;
assigning a label to each candidate training feature;
combining the second meta-features corresponding to the multiple labeled training sets with the third meta-feature corresponding to each candidate feature to generate new training data sets, and inputting the new training data sets into a random forest model for training so that the output of the random forest model is the probability of the label, obtaining a trained probability model.
In one embodiment, the step, executed by the processor, of assigning a label to each candidate feature includes:
inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
inputting each data set into the learner, and calculating a second learning value for each data set;
comparing the second learning value of each data set with the first learning value, and if a data set's second learning value is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
In one embodiment, the step, executed by the processor, of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
combining the multiple original features with each candidate feature in the candidate feature set respectively to form first target feature sets;
calculating the AUC value of each first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
In one embodiment, the step, executed by the processor, of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
combining the multiple original features with each candidate feature in the candidate feature set respectively to form second target feature sets;
calculating the AUC value and the accuracy of each second target feature set;
calculating the evaluation value of the candidate feature according to the formula M = ak1 + bk2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
In summary, in the feature recognition method, apparatus, computer device, and storage medium provided in the embodiments of the present application, candidate features are generated from the original features according to a preset candidate-feature generation method, and corresponding first meta-features are then generated from the candidate features. The first meta-features are input into a pre-trained probability model to calculate the probability that each candidate feature corresponds to a preset label; the candidate features whose probability is greater than or equal to a first preset threshold form a candidate feature set; each candidate feature in the candidate feature set is combined with all the original features to calculate that candidate feature's evaluation value; and a candidate feature whose evaluation value is greater than the second preset threshold is a valid feature. The present application combines meta-learning to evaluate the constructed features and select features that are effective for machine learning.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的和实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可以包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM通过多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双倍数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其它变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其它要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that, in this document, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article, or method. In the absence of further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes that element.
以上所述仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其它相关的技术领域,均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, falls equally within the scope of patent protection of this application.

Claims (20)

  1. 一种特征识别方法,其中,包括以下步骤:A feature recognition method, which includes the following steps:
    获取多个原始特征,将多个所述原始特征按照预设候选特征生成方法生成多个候选特征;Acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;
    将每个所述候选特征按照预设第一元特征生成方法生成对应的第一元特征;Generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;
    将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率;其中,所述概率模型基于随机森林模型训练而成;Inputting each first meta feature into a probability model, and calculating the probability that each first meta feature is a preset label, as the target probability that each candidate feature is a preset label, wherein the probability model is trained based on a random forest model;
    将每个所述候选特征的目标概率与第一预设阀值进行比较,将所述目标概率大于等于所述第一预设阀值的所有所述候选特征组成候选特征集;Comparing the target probability of each candidate feature with a first preset threshold, and grouping all candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
    将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值;Combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate an evaluation value of each of the candidate features;
    将各个所述评估值与第二预设阀值进行比较,若所述评估值大于所述第二预设阀值,则判定所述评估值对应的候选特征为有效特征。Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  2. 根据权利要求1所述的特征识别方法,其中,所述将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率的步骤之前,还包括:2. The feature recognition method according to claim 1, wherein, before the step of inputting each first meta feature into the probability model and calculating the probability that each first meta feature is a preset label as the target probability that each candidate feature is a preset label, the method further comprises:
    获取多个标记训练集,所述标记训练集中包含多个原始训练特征;Acquiring a plurality of labeled training sets, the labeled training set containing a plurality of original training features;
    根据所述预设候选特征生成方法将多个所述原始训练特征生成多个候选训练特征;Generating multiple candidate training features from multiple original training features according to the preset candidate feature generating method;
    将每个所述标记训练集根据预设第二元特征生成方法生成对应的第二元特征,根据所述预设第一元特征生成方法将多个所述候选训练特征生成对应的第三元特征;Generating a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and generating corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;
    为每个所述候选训练特征分配标签;Assign a label to each candidate training feature;
    将多个所述标记训练集所对应的第二元特征与每个所述候选特征所对应的第三元特征组合生成新的训练数据集;将所述新的训练数据集输入至随机森林模型中进行训练,使得所述随机森林模型的输出结果为所述标签的概率,得到训练完成的概率模型。Combining the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate feature to generate a new training data set; and inputting the new training data set into a random forest model for training, so that the output of the random forest model is the probability of the label, thereby obtaining a trained probability model.
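A minimal sketch of the training step in claim 2, assuming scikit-learn is available. The second and third meta features below are random stand-ins (the patent's meta-feature generation methods are not reproduced), and the column widths are arbitrary; the point is only that a random forest is fitted so its output is a label probability:

```python
# Hypothetical sketch: combine second meta features (of the labeled training
# sets) with third meta features (of the candidate training features) into a
# new training data set, then fit a random forest whose output is the
# probability of the assigned label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
second_meta = rng.normal(size=(50, 4))   # stand-in meta features of the labeled training sets
third_meta = rng.normal(size=(50, 3))    # stand-in meta features of the candidate features
X = np.hstack([second_meta, third_meta]) # the combined "new training data set"
y = np.array([0, 1] * 25)                # labels previously assigned to the candidates

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
proba = model.predict_proba(X)[:, 1]     # probability of the positive (first-type) label
```

At inference time, the first meta feature of a new candidate would be fed through the same pipeline to obtain its target probability.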
  3. 根据权利要求2所述的特征识别方法,其中,所述为每个所述候选特征分配标签的步骤,包括:The feature recognition method according to claim 2, wherein the step of assigning a label to each of the candidate features comprises:
    将多个所述标记训练集输入至学习器中,计算多个所述标记训练集的第一学习值;Inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
    将多个所述标记训练集与每个所述候选训练特征分别组合成数据集;Combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
    将每个所述数据集输入至学习器中,计算每个所述数据集各自的第二学习值;Inputting each data set into the learner, and calculating a respective second learning value of each data set;
    将每个所述数据集各自的第二学习值与所述第一学习值进行对比,若所述数据集的第二学习值大于所述第一学习值,则为所述数据集中加入的所述候选训练特征分配第一类标签。Comparing the respective second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
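The label-assignment rule of claim 3 can be sketched as below. The "learning values" are simply scores returned by some learner; here they are given as plain numbers, and the feature names are hypothetical:

```python
# Sketch of claim 3's rule: a candidate training feature receives the
# first-type (positive) label only when adding it to the labeled training
# sets raises the learner's score above the baseline first learning value.

def assign_labels(baseline_score, candidate_scores):
    """baseline_score: first learning value of the labeled training sets.
    candidate_scores: {feature: second learning value with that feature added}.
    Returns {feature: 1} for improving features and {feature: 0} otherwise."""
    return {f: int(s > baseline_score) for f, s in candidate_scores.items()}

labels = assign_labels(0.75, {"sum_ab": 0.80, "ratio_cd": 0.70})
# labels == {"sum_ab": 1, "ratio_cd": 0}
```

These labels are what the random forest of claim 2 is subsequently trained to predict from the meta features.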
  4. 根据权利要求1所述的特征识别方法,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:The feature recognition method according to claim 1, wherein the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第一目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a first target feature set;
    计算所述第一目标特征集的AUC值,将所述第一目标特征集的AUC值作为所述候选特征的评估值。Calculating the AUC value of the first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
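For reference, the AUC value used as the evaluation value above can be computed with the rank-based (Mann–Whitney) formula; the labels and scores below are purely illustrative, standing in for a model evaluated on the first target feature set:

```python
# Pure-Python AUC computation (Mann-Whitney form): the probability that a
# randomly chosen positive example is scored above a randomly chosen
# negative one, with ties counting as one half.

def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give an AUC of 1.0.
value = auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```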
  5. 根据权利要求1所述的特征识别方法,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:The feature recognition method according to claim 1, wherein the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第二目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a second target feature set;
    计算每个所述第二目标特征集的AUC值和准确度;Calculating the AUC value and the accuracy of each second target feature set;
    根据公式M=ak1+bk2计算所述候选特征的评估值;其中,所述a为所述第二目标特征集的AUC值,所述b为所述第二目标特征集的准确度,所述k1为所述第二目标特征集的AUC值的权重,所述k2为所述第二目标特征集的准确度的权重。Calculating the evaluation value of the candidate feature according to the formula M = a·k1 + b·k2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value, and k2 is the weight of the accuracy.
  6. 一种特征识别装置,其中,包括:A feature recognition device, which includes:
    第一获取单元,用于获取多个原始特征,将多个所述原始特征按照预设候选特征生成方法生成多个候选特征;The first acquiring unit is configured to acquire multiple original features, and generate multiple candidate features from the multiple original features according to a preset candidate feature generation method;
    第一生成单元,用于将每个所述候选特征按照预设第一元特征生成方法生成对应的第一元特征;A first generating unit, configured to generate a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;
    第一计算单元,用于将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率;a first calculation unit, configured to input each first meta feature into a probability model and calculate the probability that each first meta feature is a preset label, as the target probability that each candidate feature is a preset label;
    比较单元,用于将每个所述候选特征的目标概率与第一预设阀值进行比较,将所述目标概率大于等于所述第一预设阀值的所有所述候选特征组成候选特征集;a comparison unit, configured to compare the target probability of each candidate feature with a first preset threshold and group all candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
    第二计算单元,用于将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值;A second calculation unit, configured to combine each candidate feature in the candidate feature set with a plurality of original features to calculate an evaluation value of each candidate feature;
    判定单元,用于将各个所述评估值与第二预设阀值进行比较,若所述评估值大于所述第二预设阀值,则判定所述评估值对应的候选特征为有效特征。The determining unit is configured to compare each of the evaluation values with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, determine that the candidate feature corresponding to the evaluation value is a valid feature.
  7. 根据权利要求6所述的特征识别装置,其中,所述特征识别装置还包括:The feature recognition device according to claim 6, wherein the feature recognition device further comprises:
    第二获取单元,用于获取多个标记训练集,所述标记训练集中包含多个原始训练特征;The second acquiring unit is configured to acquire multiple labeled training sets, where the labeled training set contains multiple original training features;
    第二生成单元,用于根据所述预设候选特征生成方法将多个所述原始训练特征生成多个候选训练特征;The second generating unit is configured to generate multiple candidate training features from the multiple original training features according to the preset candidate feature generating method;
    第三生成单元,用于将每个所述标记训练集根据预设第二元特征生成方法生成对应的第二元特征,根据所述预设第一元特征生成方法将多个所述候选训练特征生成对应的第三元特征;a third generating unit, configured to generate a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and to generate corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;
    分配单元,用于为每个所述候选训练特征分配标签;An allocation unit, configured to allocate a label to each candidate training feature;
    训练单元,用于将多个所述标记训练集所对应的第二元特征与每个所述候选特征所对应的第三元特征组合生成新的训练数据集;将所述新的训练数据集输入至随机森林模型中进行训练,使得所述随机森林模型的输出结果为所述标签的概率,得到训练完成的概率模型。a training unit, configured to combine the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate feature to generate a new training data set, and to input the new training data set into a random forest model for training, so that the output of the random forest model is the probability of the label, thereby obtaining a trained probability model.
  8. 根据权利要求6所述的特征识别装置,其中,所述第二计算单元包括:The feature recognition device according to claim 6, wherein the second calculation unit comprises:
    第一组成子单元,用于将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第一目标特征集;The first composition subunit is used to compose a plurality of the original features and each candidate feature in the candidate feature set into a first target feature set;
    第三计算子单元,用于计算所述第一目标特征集的AUC值,将所述第一目标特征集的AUC值作为所述候选特征的评估值。The third calculation subunit is configured to calculate the AUC value of the first target feature set, and use the AUC value of the first target feature set as the evaluation value of the candidate feature.
  9. 根据权利要求7所述的特征识别装置,其中,所述分配单元,包括:9. The feature recognition device according to claim 7, wherein the allocation unit comprises:
    第一计算子单元,用于将多个所述标记训练集输入至学习器中,计算多个所述标记训练集的第一学习值;The first calculation subunit is configured to input a plurality of the marked training sets into a learner, and calculate the first learning value of the plurality of the marked training sets;
    组合子单元,用于将多个所述标记训练集与每个所述候选训练特征分别组合成数据集;A combination subunit, configured to combine a plurality of the labeled training sets and each of the candidate training features into a data set;
    第二计算子单元,用于将每个所述数据集输入至学习器中,计算每个所述数据集各自的第二学习值;The second calculation subunit is used to input each of the data sets into the learner, and calculate the respective second learning value of each of the data sets;
    分配子单元,用于将每个所述数据集各自的第二学习值与所述第一学习值进行对比,若所述数据集的第二学习值大于所述第一学习值,则为所述数据集中加入的所述候选训练特征分配第一类标签。an allocation subunit, configured to compare the respective second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, to assign a first-type label to the candidate training feature added to that data set.
  10. 根据权利要求6所述的特征识别装置,其中,所述第二计算单元,包括:10. The feature recognition device according to claim 6, wherein the second calculation unit comprises:
    第二组成子单元,用于将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第二目标特征集;The second composition subunit is used to compose a plurality of the original features and each candidate feature in the candidate feature set into a second target feature set;
    第四计算子单元,用于计算每个所述第二目标特征集的AUC值和准确度;The fourth calculation subunit is used to calculate the AUC value and accuracy of each of the second target feature sets;
    第五计算子单元,用于根据公式M=ak1+bk2计算所述候选特征的评估值;其中,所述a为所述第二目标特征集的AUC值,所述b为所述第二目标特征集的准确度,所述k1为所述第二目标特征集的AUC值的权重,所述k2为所述第二目标特征集的准确度的权重。a fifth calculation subunit, configured to calculate the evaluation value of the candidate feature according to the formula M = a·k1 + b·k2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value, and k2 is the weight of the accuracy.
  11. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机程序,其中,所述处理器执行所述计算机程序时实现一种特征识别方法的步骤:11. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of a feature recognition method:
    获取多个原始特征,将多个所述原始特征按照预设候选特征生成方法生成多个候选特征;Acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;
    将每个所述候选特征按照预设第一元特征生成方法生成对应的第一元特征;Generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;
    将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率;其中,所述概率模型基于随机森林模型训练而成;Inputting each first meta feature into a probability model, and calculating the probability that each first meta feature is a preset label, as the target probability that each candidate feature is a preset label, wherein the probability model is trained based on a random forest model;
    将每个所述候选特征的目标概率与第一预设阀值进行比较,将所述目标概率大于等于所述第一预设阀值的所有所述候选特征组成候选特征集;Comparing the target probability of each candidate feature with a first preset threshold, and grouping all candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
    将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值;Combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate an evaluation value of each of the candidate features;
    将各个所述评估值与第二预设阀值进行比较,若所述评估值大于所述第二预设阀值,则判定所述评估值对应的候选特征为有效特征。Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  12. 根据权利要求11所述的计算机设备,其中,所述将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率的步骤之前,还包括:12. The computer device according to claim 11, wherein, before the step of inputting each first meta feature into the probability model and calculating the probability that each first meta feature is a preset label as the target probability that each candidate feature is a preset label, the method further comprises:
    获取多个标记训练集,所述标记训练集中包含多个原始训练特征;Acquiring a plurality of labeled training sets, the labeled training set containing a plurality of original training features;
    根据所述预设候选特征生成方法将多个所述原始训练特征生成多个候选训练特征;Generating multiple candidate training features from multiple original training features according to the preset candidate feature generating method;
    将每个所述标记训练集根据预设第二元特征生成方法生成对应的第二元特征,根据所述预设第一元特征生成方法将多个所述候选训练特征生成对应的第三元特征;Generating a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and generating corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;
    为每个所述候选训练特征分配标签;Assign a label to each candidate training feature;
    将多个所述标记训练集所对应的第二元特征与每个所述候选特征所对应的第三元特征组合生成新的训练数据集;将所述新的训练数据集输入至随机森林模型中进行训练,使得所述随机森林模型的输出结果为所述标签的概率,得到训练完成的概率模型。Combining the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate feature to generate a new training data set; and inputting the new training data set into a random forest model for training, so that the output of the random forest model is the probability of the label, thereby obtaining a trained probability model.
  13. 根据权利要求12所述的计算机设备,其中,所述为每个所述候选特征分配标签的步骤,包括:The computer device according to claim 12, wherein the step of assigning a label to each of the candidate features comprises:
    将多个所述标记训练集输入至学习器中,计算多个所述标记训练集的第一学习值;Inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
    将多个所述标记训练集与每个所述候选训练特征分别组合成数据集;Combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
    将每个所述数据集输入至学习器中,计算每个所述数据集各自的第二学习值;Inputting each data set into the learner, and calculating a respective second learning value of each data set;
    将每个所述数据集各自的第二学习值与所述第一学习值进行对比,若所述数据集的第二学习值大于所述第一学习值,则为所述数据集中加入的所述候选训练特征分配第一类标签。Comparing the respective second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
  14. 根据权利要求11所述的计算机设备,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:14. The computer device according to claim 11, wherein the step of combining each candidate feature in the candidate feature set with the plurality of original features to calculate the evaluation value of each candidate feature comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第一目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a first target feature set;
    计算所述第一目标特征集的AUC值,将所述第一目标特征集的AUC值作为所述候选特征的评估值。Calculating the AUC value of the first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
  15. 根据权利要求11所述的计算机设备,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:15. The computer device according to claim 11, wherein the step of combining each candidate feature in the candidate feature set with the plurality of original features to calculate the evaluation value of each candidate feature comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第二目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a second target feature set;
    计算每个所述第二目标特征集的AUC值和准确度;Calculating the AUC value and the accuracy of each second target feature set;
    根据公式M=ak1+bk2计算所述候选特征的评估值;其中,所述a为所述第二目标特征集的AUC值,所述b为所述第二目标特征集的准确度,所述k1为所述第二目标特征集的AUC值的权重,所述k2为所述第二目标特征集的准确度的权重。Calculating the evaluation value of the candidate feature according to the formula M = a·k1 + b·k2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value, and k2 is the weight of the accuracy.
  16. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现一种特征识别方法的步骤:16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of a feature recognition method:
    获取多个原始特征,将多个所述原始特征按照预设候选特征生成方法生成多个候选特征;Acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;
    将每个所述候选特征按照预设第一元特征生成方法生成对应的第一元特征;Generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;
    将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率;其中,所述概率模型基于随机森林模型训练而成;Inputting each first meta feature into a probability model, and calculating the probability that each first meta feature is a preset label, as the target probability that each candidate feature is a preset label, wherein the probability model is trained based on a random forest model;
    将每个所述候选特征的目标概率与第一预设阀值进行比较,将所述目标概率大于等于所述第一预设阀值的所有所述候选特征组成候选特征集;Comparing the target probability of each candidate feature with a first preset threshold, and grouping all candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
    将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值;Combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate an evaluation value of each of the candidate features;
    将各个所述评估值与第二预设阀值进行比较,若所述评估值大于所述第二预设阀值,则判定所述评估值对应的候选特征为有效特征。Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率的步骤之前,还包括:17. The computer-readable storage medium according to claim 16, wherein, before the step of inputting each first meta feature into the probability model and calculating the probability that each first meta feature is a preset label as the target probability that each candidate feature is a preset label, the method further comprises:
    获取多个标记训练集,所述标记训练集中包含多个原始训练特征;Acquiring a plurality of labeled training sets, the labeled training set containing a plurality of original training features;
    根据所述预设候选特征生成方法将多个所述原始训练特征生成多个候选训练特征;Generating multiple candidate training features from multiple original training features according to the preset candidate feature generating method;
    将每个所述标记训练集根据预设第二元特征生成方法生成对应的第二元特征,根据所述预设第一元特征生成方法将多个所述候选训练特征生成对应的第三元特征;Generating a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and generating corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;
    为每个所述候选训练特征分配标签;Assign a label to each candidate training feature;
    将多个所述标记训练集所对应的第二元特征与每个所述候选特征所对应的第三元特征组合生成新的训练数据集;将所述新的训练数据集输入至随机森林模型中进行训练,使得所述随机森林模型的输出结果为所述标签的概率,得到训练完成的概率模型。Combining the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate feature to generate a new training data set; and inputting the new training data set into a random forest model for training, so that the output of the random forest model is the probability of the label, thereby obtaining a trained probability model.
  18. 根据权利要求17所述的计算机可读存储介质,其中,所述为每个所述候选特征分配标签的步骤,包括:18. The computer-readable storage medium of claim 17, wherein the step of assigning a label to each of the candidate features comprises:
    将多个所述标记训练集输入至学习器中,计算多个所述标记训练集的第一学习值;Inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
    将多个所述标记训练集与每个所述候选训练特征分别组合成数据集;Combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
    将每个所述数据集输入至学习器中,计算每个所述数据集各自的第二学习值;Inputting each data set into the learner, and calculating a respective second learning value of each data set;
    将每个所述数据集各自的第二学习值与所述第一学习值进行对比,若所述数据集的第二学习值大于所述第一学习值,则为所述数据集中加入的所述候选训练特征分配第一类标签。Comparing the respective second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
  19. 根据权利要求16所述的计算机可读存储介质,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:19. The computer-readable storage medium according to claim 16, wherein the step of combining each candidate feature in the candidate feature set with the plurality of original features to calculate the evaluation value of each candidate feature comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第一目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a first target feature set;
    计算所述第一目标特征集的AUC值,将所述第一目标特征集的AUC值作为所述候选特征的评估值。Calculating the AUC value of the first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
  20. 根据权利要求16所述的计算机可读存储介质,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:20. The computer-readable storage medium according to claim 16, wherein the step of combining each candidate feature in the candidate feature set with the plurality of original features to calculate the evaluation value of each candidate feature comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第二目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a second target feature set;
    计算每个所述第二目标特征集的AUC值和准确度;Calculating the AUC value and the accuracy of each second target feature set;
    根据公式M=ak1+bk2计算所述候选特征的评估值;其中,所述a为所述第二目标特征集的AUC值,所述b为所述第二目标特征集的准确度,所述k1为所述第二目标特征集的AUC值的权重,所述k2为所述第二目标特征集的准确度的权重。Calculating the evaluation value of the candidate feature according to the formula M = a·k1 + b·k2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value, and k2 is the weight of the accuracy.
PCT/CN2021/096980 2020-06-23 2021-05-28 Feature recognition method and apparatus, and computer device and storage medium WO2021259003A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010583878.8 2020-06-23
CN202010583878.8A CN111832631A (en) 2020-06-23 2020-06-23 Feature recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021259003A1 true WO2021259003A1 (en) 2021-12-30

Family

ID=72898031

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096980 WO2021259003A1 (en) 2020-06-23 2021-05-28 Feature recognition method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111832631A (en)
WO (1) WO2021259003A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114764603A (en) * 2022-05-07 2022-07-19 支付宝(杭州)信息技术有限公司 Method and device for determining characteristics aiming at user classification model and service prediction model

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832631A (en) * 2020-06-23 2020-10-27 平安科技(深圳)有限公司 Feature recognition method and device, computer equipment and storage medium
CN112286980B (en) * 2020-12-03 2021-08-17 北京口袋财富信息科技有限公司 Information pushing method and system based on user behaviors

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931224A (en) * 2016-04-14 2016-09-07 浙江大学 Pathology identification method for routine scan CT image of liver based on random forests
CN109241418A (en) * 2018-08-22 2019-01-18 中国平安人寿保险股份有限公司 Abnormal user recognition methods and device, equipment, medium based on random forest
US10284585B1 (en) * 2016-06-27 2019-05-07 Symantec Corporation Tree rotation in random classification forests to improve efficacy
CN111832631A (en) * 2020-06-23 2020-10-27 平安科技(深圳)有限公司 Feature recognition method and device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114764603A (en) * 2022-05-07 2022-07-19 Alipay (Hangzhou) Information Technology Co., Ltd. Method and apparatus for determining features for a user classification model and a service prediction model

Also Published As

Publication number Publication date
CN111832631A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
WO2021259003A1 (en) Feature recognition method and apparatus, and computer device and storage medium
US11256555B2 (en) Automatically scalable system for serverless hyperparameter tuning
Hien et al. A decision support system for evaluating international student applications
US11810000B2 (en) Systems and methods for expanding data classification using synthetic data generation in machine learning models
Benedetto et al. The creation and use of the SIPP Synthetic Beta
WO2021179445A1 (en) Conversation state prediction-based multi-round conversation method, device, and computer apparatus
EP4075281A1 (en) Ann-based program test method and test system, and application
Minku et al. Clustering Dycom: an online cross-company software effort estimation study
CN115879748B (en) Enterprise informatization management integrated platform based on big data
Lengler et al. The (1+1)-EA on noisy linear functions with random positive weights
US20230072297A1 (en) Knowledge graph based reasoning recommendation system and method
CN114781532A (en) Evaluation method and device of machine learning model, computer equipment and medium
US20210326475A1 (en) Systems and method for evaluating identity disclosure risks in synthetic personal data
Yet et al. Estimating criteria weight distributions in multiple criteria decision making: a Bayesian approach
WO2021212654A1 (en) Physical machine resource allocation model acquisition method and apparatus, and computer device
WO2023134072A1 (en) Default prediction model generation method and apparatus, device, and storage medium
CN113516189B (en) Website malicious user prediction method based on two-stage random forest algorithm
WO2022217712A1 (en) Data mining method and apparatus, and computer device and storage medium
CN115222081A (en) Academic resource prediction method and device and computer equipment
Blasco-Blasco et al. Characterization of university students through indicators of adequacy and excellence. Analysis from gender and socioeconomic status perspective
Ji et al. A two-stage feature weighting method for Naive Bayes and its application in software defect prediction
Kiekhaefer Simulation ranking and selection procedures and applications in network reliability design
Esmaeilzadeh et al. InfoMoD: Information-theoretic Model Diagnostics
CN116227995B (en) Index analysis method and system based on machine learning
Cerulli Methods Based on Selection on Observables

Legal Events

Date Code Title Description

121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 21830143; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 21830143; Country of ref document: EP; Kind code of ref document: A1