WO2021259003A1 - Feature recognition method and apparatus, and computer device and storage medium


Info

Publication number
WO2021259003A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
candidate
features
meta
training
Prior art date
Application number
PCT/CN2021/096980
Other languages
French (fr)
Chinese (zh)
Inventor
孔清扬
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2021259003A1 publication Critical patent/WO2021259003A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • This application relates to the technical field of artificial intelligence, and in particular to a feature recognition method, device, computer equipment, and storage medium.
  • automated machine learning is a major development direction.
  • the goal of automated machine learning is to use automated data-driven methods to make the aforementioned decisions.
  • the automated machine learning system automatically determines the best solution, so domain experts no longer need to worry about learning the various machine learning algorithms.
  • the most important part of automated machine learning is automated feature engineering.
  • the automatic feature construction tool featuretools uses an algorithm called Deep Feature Synthesis (DFS).
  • the DFS algorithm is an automatic method for performing feature engineering on relational and temporal data; it generates composite features by applying operations (including sum, average, and count) to the data. However, such brute-force feature combination causes a dimensionality explosion, which not only fails to improve the accuracy of the model but also degrades its learning ability. Therefore, how to select the features that are truly effective for machine learning from the constructed features is an urgent problem to be solved.
  • the main purpose of this application is to provide a feature recognition method, device, computer equipment, and storage medium to solve the problem of how to select effective features from the constructed features.
  • this application provides a feature recognition method, which includes the following steps:
  • Each of the first meta features is input into a probability model, and the probability that each first meta feature is a preset label is calculated as the target probability of the corresponding candidate feature being a preset label; wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • This application also provides a feature recognition device, including:
  • the first acquiring unit is configured to acquire multiple original features, and generate multiple candidate features from the multiple original features according to a preset candidate feature generation method
  • a first generating unit configured to generate a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method
  • the first calculation unit is configured to input each of the first meta features into a probability model, and calculate the probability that each first meta feature is a preset label as the target probability of the corresponding candidate feature being a preset label;
  • the comparing unit is configured to compare the target probability of each candidate feature with a first preset threshold, and form all the candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
  • a second calculation unit configured to combine each candidate feature in the candidate feature set with a plurality of original features to calculate an evaluation value of each candidate feature
  • the determining unit is configured to compare each of the evaluation values with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, determine that the candidate feature corresponding to the evaluation value is a valid feature.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the steps of the above-mentioned feature recognition method are implemented:
  • Each of the first meta features is input into a probability model, and the probability that each first meta feature is a preset label is calculated as the target probability of the corresponding candidate feature being a preset label; wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • This application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of a feature recognition method are realized:
  • Each of the first meta features is input into a probability model, and the probability that each first meta feature is a preset label is calculated as the target probability of the corresponding candidate feature being a preset label; wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • candidate features are generated from the original features according to the preset candidate feature generation method; a corresponding first meta feature is then generated from each candidate feature, and the first meta features are input into the pre-trained probability model to calculate the probability that each candidate feature is a preset label. Candidate features whose probability is greater than or equal to the first preset threshold form a candidate feature set; each candidate feature in the set is combined with all original features to calculate its evaluation value, and a candidate feature whose evaluation value is greater than the second preset threshold is an effective feature.
  • This application combines meta-learning to evaluate the constructed features and select those that are effective for machine learning.
  • FIG. 1 is a schematic diagram of the steps of a feature recognition method in an embodiment of the present application
  • Fig. 2 is a structural block diagram of a feature recognition device in an embodiment of the present application
  • FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
  • an embodiment of the present application provides a feature recognition method, including the following steps:
  • Step S1 Obtain multiple original features, and generate multiple candidate features according to a preset candidate feature generation method for the multiple original features;
  • Step S2 generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method
  • Step S3 input each of the first meta features into a probability model, and calculate the probability that each of the first meta features is a preset label as the target probability of each of the candidate features being the preset label; wherein the probability model is trained based on a random forest model;
  • Step S4 comparing the target probability of each candidate feature with a first preset threshold value, and composing a candidate feature set for all the candidate features whose target probability is greater than or equal to the first preset threshold value;
  • Step S5 combining each candidate feature in the candidate feature set with a plurality of original features to calculate an evaluation value of each candidate feature
  • Step S6 comparing each of the evaluation values with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
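The six steps above can be sketched end to end as follows. This is a minimal pure-Python illustration; the candidate generator, meta-feature generator, probability model, and evaluator are hypothetical stand-ins supplied by the caller, not implementations from this application, and the threshold defaults are illustrative:

```python
def recognize_features(original_features, gen_candidates, gen_meta,
                       prob_model, evaluate, p_thresh=0.5, e_thresh=0.7):
    # S1: generate candidate features from the original features
    candidates = gen_candidates(original_features)
    # S2: generate a first meta feature for each candidate
    metas = {c: gen_meta(c) for c in candidates}
    # S3: probability that each candidate carries the preset label
    probs = {c: prob_model(m) for c, m in metas.items()}
    # S4: keep candidates whose target probability clears the first threshold
    shortlist = [c for c in candidates if probs[c] >= p_thresh]
    # S5: evaluate each shortlisted candidate together with all original features
    scores = {c: evaluate(original_features + [c]) for c in shortlist}
    # S6: candidates whose evaluation clears the second threshold are valid
    return [c for c in shortlist if scores[c] > e_thresh]
```

The application only requires that both thresholds be configurable; the interesting work happens inside the probability model (steps S3a to S3e below) and the evaluator (steps S501/S50a below).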
  • Feature engineering is essentially an engineering activity whose purpose is to extract features from raw data to the maximum extent for use by algorithms and models.
  • in existing approaches, features are only mechanically combined and merged, and the new features generated in this way do not necessarily improve the accuracy of the model; therefore, it is necessary to select the features that are effective for machine learning from the generated new features.
  • the above-mentioned original features are basic features, and a corresponding candidate feature is generated from each of the original features according to a pre-established candidate feature generation method.
  • the pre-established candidate feature generation method may include one or more of the following:
  • Feature transformation: convert the continuous features and time features among the original features into discrete features, e.g. divide the value range of a numerical feature into multiple equal segments, or convert a time feature into week number, month, year, whether it is a weekend, etc.; or map the range of a continuous feature to a specific distribution, such as using normalization to scale numeric variables to [0, 1];
  • Feature combination: combine two features into a new feature, e.g. apply addition, subtraction, multiplication, and division to two continuous variables, or combine two categorical variables;
  • High-order feature crossing: generate a new feature from no fewer than two features, e.g. maximum, minimum, average, variance, and count aggregations.
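As a concrete illustration of the three generation methods, the sketch below derives candidates from numeric original features only; the function name, feature-name scheme, and restriction to numeric columns are assumptions for illustration:

```python
def generate_candidates(rows, num_cols):
    """Generate candidate features from numeric original features.

    rows: list of dicts (one per sample); num_cols: numeric feature names.
    """
    candidates = {}
    # Feature transformation: normalize each numeric feature to [0, 1]
    for col in num_cols:
        vals = [r[col] for r in rows]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0          # guard against constant features
        candidates["norm(%s)" % col] = [(v - lo) / span for v in vals]
    # Feature combination: pairwise arithmetic on two features
    for i, a in enumerate(num_cols):
        for b in num_cols[i + 1:]:
            candidates["%s+%s" % (a, b)] = [r[a] + r[b] for r in rows]
            candidates["%s*%s" % (a, b)] = [r[a] * r[b] for r in rows]
    # High-order crossing: aggregate over all (>= 2) numeric features per row
    candidates["max(all)"] = [max(r[c] for c in num_cols) for r in rows]
    candidates["mean(all)"] = [sum(r[c] for c in num_cols) / len(num_cols)
                               for r in rows]
    return candidates
```

Discrete and time features would get analogous treatment (binning, calendar decomposition, categorical crossing) following the same pattern.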
  • a corresponding first meta feature is generated for each of the candidate features generated above, and the preset first meta feature generation method is used to generate the following two kinds of meta information:
  • Entropy and statistical tests of candidate features: divide all original features into three subgroups according to the original feature type, namely discrete, numeric, and date-time; use the chi-square test and t-test to calculate the correlation between each group and the candidate feature, and also calculate an entropy-based measure of the candidate feature;
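The entropy-based measure mentioned above can be computed as in this sketch (Shannon entropy over equal-width bins; the binning scheme and bin count are assumptions, not specified by the application):

```python
import math
from collections import Counter

def feature_entropy(values, bins=4):
    """Shannon entropy of a numeric candidate feature, discretized into
    equal-width bins; a constant feature has entropy 0."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / bins) or 1.0    # avoid division by zero
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(binned)
    return -sum((c / n) * math.log2(c / n) for c in Counter(binned).values())
```

A low-entropy candidate carries little information and is a plausible signal that the feature will not help a downstream learner.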
  • each of the first meta features is input into a probability model that has been pre-trained based on a random forest model.
  • the trained probability model calculates the probability that each of the first meta features is a preset label, which is used as the target probability of the corresponding candidate feature being a preset label; the preset label can be "good" or "bad".
  • the first preset threshold is a user-defined value in the interval [0, 1].
  • for example, the first preset threshold is set to 0.5, and all candidate features whose target probabilities are greater than or equal to the first preset threshold form a candidate feature set.
  • each candidate feature in the candidate feature set is combined with multiple original features to calculate the evaluation value of each candidate feature;
  • AUC (Area Under the Curve, a model evaluation index) or accuracy can be used as the evaluation value of the candidate feature.
  • the AUC value is the area under the ROC (Receiver Operating Characteristic) curve, and this AUC value is used as the evaluation value of the candidate feature.
  • Accuracy refers to the percentage of results that are predicted correctly.
  • the evaluation value corresponding to each candidate feature is compared with a second preset threshold.
  • the second preset threshold can be preset by the user or determined by all the original features.
  • if the evaluation value of a candidate feature is greater than the second preset threshold, it is determined that the candidate feature is a valid feature; since each candidate feature is combined with all the original features, the second preset threshold can be determined by all the original features.
  • the evaluation value is greater than the second preset threshold, indicating that the addition of the candidate feature can be beneficial to the use of the algorithm or model.
  • the candidate features constructed from the original features are first input into the probability model to calculate the target probability of the preset label; candidate features whose target probability is greater than or equal to the first preset threshold are selected according to the target probability, the selected candidate features are combined with all the original features to calculate their respective evaluation values, and candidate features whose evaluation value is greater than the second preset threshold are taken as effective features.
  • This application combines meta-learning to quickly and effectively select, from the constructed features, the features that are effective for machine learning.
  • before step S3 of inputting each of the first meta features into the probability model and calculating the probability that each first meta feature is a preset label as the target probability of each candidate feature being a preset label, the method also includes:
  • Step S3a Obtain multiple labeled training sets, the labeled training set containing multiple original training features
  • Step S3b generating multiple candidate training features from multiple original training features according to the preset candidate feature generating method
  • Step S3c Generate a corresponding second meta feature for each of the labeled training sets according to a preset second meta feature generation method, and generate a corresponding third meta feature for each of the plurality of candidate training features according to the preset first meta feature generation method.
  • Step S3d assign a label to each candidate training feature
  • Step S3e Combine the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate training feature to generate new training data sets; input the new training data sets into the random forest model for training, so that the output of the random forest model is the probability of the label, obtaining the trained probability model.
  • step S3a multiple labeled training sets are obtained, the labeled training sets are obtained from a data set storage library, and each labeled training set contains multiple original training features.
  • step S3b the original training features of each labeled training set are generated according to the preset candidate feature generation method to generate multiple candidate features, and the preset candidate feature generation method is specifically as described in the previous embodiment.
  • as described in step S3c, a corresponding second meta feature is generated for each of the labeled training sets according to the preset second meta feature generation method, and third meta features corresponding to the plurality of candidate training features are generated according to the preset first meta feature generation method; the third meta features include the two types of meta information described in the previous embodiment.
  • the preset second meta feature generating method is used to generate the following four types of meta information:
  • General information: general statistical information about the labeled training set, such as the number of original training features, statistics on the data size, and other statistics of the original training features;
  • Initial evaluation: statistics of the current performance when the learning algorithm is applied to the original training features; the generated meta information includes the defined evaluation metric and the running time, where the evaluation metric can be AUC or accuracy;
  • Feature diversity: after dividing all original training features into type-based subgroups, use the chi-square test and t-test to calculate the similarity of each pair of original training features within a group.
  • a label is assigned to each candidate training feature.
  • the label can be "good" or "bad", indicating whether the candidate training feature can improve the performance of the learner; when it can improve the performance of the learner, the label of the candidate training feature is "good".
  • as described in step S3e above, a new training data set is generated from the second meta features corresponding to the multiple labeled training sets and the third meta feature corresponding to one candidate training feature; as many new training data sets are formed as there are third meta features. Each new training data set is then input into the random forest model for training, so that the output of the random forest model is the probability of the label; after all the new training data sets have been input into the random forest model for iterative training, the trained probability model is obtained.
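The assembly of this meta-level training data can be sketched as follows. The key layout (a `(dataset_id, feature_name)` tuple per candidate training feature) is an assumption for illustration; the resulting `(X, y)` pairs would then be fed to a random forest classifier such as scikit-learn's `RandomForestClassifier`, whose `predict_proba` output plays the role of the probability model:

```python
def build_meta_training_set(dataset_metas, candidate_metas, labels):
    """One training example per candidate training feature: the second meta
    features of its source labeled training set concatenated with the
    candidate's third meta features, labeled 'good' or 'bad' (step S3d)."""
    X, y = [], []
    for cand_id, third_meta in candidate_metas.items():
        dataset_id = cand_id[0]          # assumed key layout: (dataset, name)
        X.append(list(dataset_metas[dataset_id]) + list(third_meta))
        y.append(labels[cand_id])
    return X, y
```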
  • the step S3d of assigning a label to each candidate training feature includes:
  • Step S3d1 input a plurality of the labeled training sets into a learner, and calculate a first learning value of the plurality of the labeled training sets;
  • Step S3d2 combining a plurality of the labeled training sets and each candidate training feature into a data set
  • Step S3d3 input each of the data sets into the learner, and calculate the respective second learning value of each of the data sets;
  • Step S3d4 comparing the second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
  • a plurality of the labeled training sets are input to a learner, and the first learning values of the plurality of labeled training sets are calculated.
  • the learner serves as a model evaluation criterion and is used to calculate the learning value of an input data set; the learner can calculate AUC or accuracy.
  • as described in step S3d2, the multiple labeled training sets and one candidate training feature are combined to form a new data set; as many data sets are formed as there are candidate training features, and each data set includes all the labeled training sets.
  • as described in step S3d3, each of the data sets is input into the learner and the second learning value of each new data set is calculated; since each candidate training feature is combined with all the labeled training sets, the second learning value of a data set can be regarded as the second learning value of its candidate training feature.
  • as described in step S3d4, each second learning value is compared with the first learning value. If a second learning value is greater than the first learning value, it indicates that the candidate training feature can improve the performance of the learner, and the candidate training feature corresponding to that second learning value is assigned a first-type label; the first-type label may be "good".
  • when the second learning value is less than or equal to the first learning value, a second-type label is assigned to the candidate training feature corresponding to that second learning value; the second-type label may be "bad".
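Steps S3d1 to S3d4 reduce to a comparison against the baseline learning value; a minimal sketch, assuming the learner scores have already been computed:

```python
def assign_labels(first_learning_value, second_learning_values):
    """A candidate training feature is labeled 'good' (first-type) only when
    adding it raises the learner's score above the baseline; otherwise it is
    labeled 'bad' (second-type)."""
    return {name: ("good" if v > first_learning_value else "bad")
            for name, v in second_learning_values.items()}
```

Note that a tie is labeled "bad", matching the strict "greater than" comparison in step S3d4.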
  • the step S5 of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • Step S501 combine the multiple original features and each candidate feature in the candidate feature set into a first target feature set;
  • Step S502 Calculate the AUC value of the first target feature set, and use the AUC value of the first target feature set as the evaluation value of the candidate feature.
  • as described in step S501, the multiple original features and each candidate feature in the candidate feature set are combined to form a first target feature set; as many first target feature sets are formed as there are candidate features in the candidate feature set.
  • the AUC value of the first target feature set is calculated.
  • Each first target feature set includes one candidate feature and all the original features, and the AUC value of the first target feature set is taken as the evaluation value of that candidate feature.
  • in another embodiment, the step S5 of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • Step S50a combining a plurality of the original features and each candidate feature in the candidate feature set to form a second target feature set
  • Step S50b calculating the AUC value and accuracy of each of the second target feature sets
  • as described in step S50a, the multiple original features and one candidate feature from the candidate feature set are combined to form a second target feature set; as many second target feature sets are formed as there are candidate features in the candidate feature set, and each second target feature set includes all the original features and one candidate feature from the candidate feature set.
  • the candidate feature in each second target feature set is different, so the AUC value and accuracy of a second target feature set can be used as the basis for calculating the evaluation value of its candidate feature.
  • the evaluation value of the candidate feature is then calculated using the formula M = a·k1 + b·k2 defined in this application.
  • this formula combines the original features and calculates the evaluation value of the candidate feature from both the AUC value and the accuracy, so as to improve the accuracy of the evaluation value of the candidate feature.
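The weighted evaluation value M = a·k1 + b·k2 can be sketched directly; the equal default weights here are an assumption, since the application leaves k1 and k2 to the user:

```python
def evaluation_value(auc_value, acc_value, k1=0.5, k2=0.5):
    """M = a*k1 + b*k2: a is the AUC value of the second target feature set,
    b is its accuracy, and k1/k2 are user-chosen weights (defaults assumed)."""
    return auc_value * k1 + acc_value * k2
```

Setting k2 = 0 recovers the pure-AUC variant of steps S501/S502.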
  • an embodiment of the present application also provides a feature recognition device, including:
  • the first acquiring unit 10 is configured to acquire multiple original features, and generate multiple candidate features from the multiple original features according to a preset candidate feature generation method
  • the first generating unit 20 is configured to generate a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method
  • the first calculation unit 30 is configured to input each of the first meta features into the probability model, and calculate the probability that each first meta feature is a preset label as the target probability of the corresponding candidate feature being a preset label;
  • the comparing unit 40 is configured to compare the target probability of each candidate feature with a first preset threshold, and form all the candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
  • the second calculation unit 50 is configured to combine each candidate feature in the candidate feature set with a plurality of original features to calculate an evaluation value of each candidate feature;
  • the determining unit 60 is configured to compare each of the evaluation values with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, determine that the candidate feature corresponding to the evaluation value is a valid feature .
  • the feature recognition device further includes:
  • the second acquiring unit is configured to acquire multiple labeled training sets, where the labeled training set contains multiple original training features;
  • the second generating unit is configured to generate multiple candidate training features from the multiple original training features according to the preset candidate feature generating method
  • the third generating unit is configured to generate a corresponding second meta feature for each of the labeled training sets according to a preset second meta feature generation method, and to generate a corresponding third meta feature for each of the plurality of candidate training features according to the preset first meta feature generation method;
  • An allocation unit configured to allocate a label to each candidate training feature
  • the training unit is configured to combine the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate training feature to generate new training data sets, and to input the new training data sets into the random forest model for training, so that the output of the random forest model is the probability of the label, obtaining the trained probability model.
  • the allocating unit includes:
  • the first calculation subunit is configured to input a plurality of the marked training sets into a learner, and calculate the first learning value of the plurality of the marked training sets;
  • a combination subunit configured to combine a plurality of the labeled training sets and each of the candidate training features into a data set
  • the second calculation subunit is used to input each of the data sets into the learner, and calculate the respective second learning value of each of the data sets;
  • the allocation subunit is used to compare the second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, to assign a first-type label to the candidate training feature added to that data set.
  • the second calculation unit 50 includes:
  • the first composition subunit is used to compose a plurality of the original features and each candidate feature in the candidate feature set into a first target feature set;
  • the third calculation subunit is configured to calculate the AUC value of the first target feature set, and use the AUC value of the first target feature set as the evaluation value of the candidate feature.
  • the second calculation unit 50 includes:
  • the second composition subunit is used to compose a plurality of the original features and each candidate feature in the candidate feature set into a second target feature set;
  • the fourth calculation subunit is used to calculate the AUC value and accuracy of each of the second target feature sets
  • M = a·k1 + b·k2; wherein a is the AUC value of the second target feature set, and b is the accuracy of the second target feature set.
  • k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store the original feature data, the first meta feature data, and so on.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a feature recognition method.
  • the above-mentioned processor executes the steps of the above-mentioned feature recognition method:
  • Each of the first meta features is input into a probability model, and the probability that each first meta feature is a preset label is calculated as the target probability of the corresponding candidate feature being a preset label; wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • before the processor executes the step of inputting each of the first meta features into the probability model and calculating the probability that each first meta feature is a preset label as the target probability of each candidate feature being a preset label, the method also includes:
  • the above-mentioned processor executing the step of assigning a label to each candidate feature includes:
  • Input a plurality of the labeled training sets into a learner, and calculate a first learning value of the plurality of the labeled training sets;
  • the second learning value of each data set is compared with the first learning value; if the second learning value of a data set is greater than the first learning value, the candidate training feature added to that data set is assigned a first-type label.
  • the above-mentioned processor executing the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • the AUC value of the first target feature set is calculated, and the AUC value of the first target feature set is used as the evaluation value of the candidate feature.
  • the above-mentioned processor executing the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • M = ak1 + bk2; wherein, a is the AUC value of the second target feature, b is the accuracy of the second target feature, k1 is the weight of the AUC value of the second target feature, and k2 is the weight of the accuracy of the second target feature.
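As an illustration, the weighted evaluation value above can be sketched as follows; the weights k1 and k2 here are illustrative defaults, not values given in this application:

```python
def evaluation_value(a, b, k1=0.6, k2=0.4):
    """M = a*k1 + b*k2, where a is the AUC value and b is the accuracy
    of the second target feature; k1 and k2 are illustrative weights."""
    return a * k1 + b * k2
```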
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the above-mentioned storage medium may be a non-volatile storage medium or a volatile storage medium.
  • a computer program is stored thereon, and when the computer program is executed by the processor, a feature recognition method is realized, specifically:
  • Each of the first meta features is input into a probability model, and the probability that each of the first meta features is a preset label is calculated as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;
  • Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  • before the processor executes the step of inputting each of the first meta features into the probability model and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, the method further includes:
  • the above-mentioned processor executing the step of assigning a label to each candidate training feature includes:
  • Input a plurality of the labeled training sets into a learner, and calculate a first learning value of the plurality of the labeled training sets;
  • the second learning value of each data set is compared with the first learning value; if the second learning value of a data set is greater than the first learning value, the candidate training feature added to that data set is assigned the first category of label.
  • the above-mentioned processor executing the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • the AUC value of the first target feature set is calculated, and the AUC value of the first target feature set is used as the evaluation value of the candidate feature.
  • the above-mentioned processor executing the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features includes:
  • M = ak1 + bk2; wherein, a is the AUC value of the second target feature, b is the accuracy of the second target feature, k1 is the weight of the AUC value of the second target feature, and k2 is the weight of the accuracy of the second target feature.
  • in the feature recognition method, apparatus, computer device and storage medium provided in this application, candidate features are generated from the original features according to the preset candidate feature generation method, and a corresponding first meta feature is then generated from each candidate feature;
  • the first meta features are input into the pre-trained probability model to calculate the probability that each candidate feature is a preset label, and the candidate features whose probability is greater than or equal to the first preset threshold form a candidate feature set;
  • each candidate feature in the candidate feature set is combined with all of the original features to calculate the evaluation value of each candidate feature;
  • a candidate feature whose evaluation value is greater than the second preset threshold is an effective feature;
  • this application combines meta-learning to evaluate the constructed features and select the features that are effective for machine learning.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

Abstract

The present application relates to a feature recognition method and apparatus, and a computer device and a storage medium. The method comprises: generating a plurality of candidate features from a plurality of original features according to a preset candidate feature generation method; generating a corresponding first meta-feature from each candidate feature according to a preset first meta-feature generation method; inputting the first meta-features into a probability model to obtain a target probability that each candidate feature is a preset label; comparing the target probabilities with a first preset threshold value, all candidate features whose target probabilities are greater than or equal to the first preset threshold value constituting a candidate feature set; combining the candidate features in the candidate feature set with the plurality of original features to calculate evaluation values of the candidate features; and comparing the evaluation values with a second preset threshold value, wherein the candidate features whose evaluation values are greater than the second preset threshold value are effective features. By means of the feature recognition method and apparatus, and the computer device and storage medium provided in the present application, features effective for machine learning are selected from constructed features.

Description

Feature recognition method, apparatus, computer device and storage medium

This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on June 23, 2020, with application number 202010583878.8 and entitled "Feature Recognition Method, Device, Computer Equipment and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the technical field of artificial intelligence, and in particular to a feature recognition method and apparatus, a computer device, and a storage medium.
Background

With the continuing popularization and application of big data, machine learning itself is developing toward greater ease of use, a lower technical threshold, and more agile development costs. Automated machine learning is one major direction of this development. The goal of automated machine learning is to use automated, data-driven methods to make these decisions: as long as the user provides the data, the automated machine learning system determines the best solution on its own, and domain experts no longer need to struggle to learn the various machine learning algorithms. The most important part of automated machine learning is automated feature engineering.

The inventor realized that existing automated feature engineering tools can only mechanically combine and merge features; they do not consider whether the constructed features can mine the genuinely useful information in the data, nor can they guarantee that the constructed features will improve the accuracy of the model. For example, the automated feature construction tool Featuretools uses an algorithm called Deep Feature Synthesis (DFS). DFS is an automated method for performing feature engineering on relational and temporal data: it generates synthetic features by applying operations (including sum, mean, and count) to the data. After such brute-force feature combination, however, a curse of dimensionality arises, which not only fails to improve the accuracy of the model but also reduces its learning ability. Therefore, how to select the features that are truly effective for machine learning from the constructed features is an urgent problem to be solved.
Technical problem

The main purpose of this application is to provide a feature recognition method, apparatus, computer device and storage medium, to solve the problem of how to select effective features from the constructed features.

Technical solution

To achieve the above objective, this application provides a feature recognition method, which includes the following steps:
acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;

generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

inputting each of the first meta features into a probability model, and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;

comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to that evaluation value is an effective feature.
This application also provides a feature recognition apparatus, including:

a first acquiring unit, configured to acquire multiple original features and generate multiple candidate features from the multiple original features according to a preset candidate feature generation method;

a first generating unit, configured to generate a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

a first calculating unit, configured to input each of the first meta features into a probability model and calculate the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label;

a comparing unit, configured to compare the target probability of each candidate feature with a first preset threshold, and form a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

a second calculating unit, configured to combine each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

a determining unit, configured to compare each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determine that the candidate feature corresponding to that evaluation value is an effective feature.
This application also provides a computer device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the computer program, implements the steps of the above feature recognition method:

acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;

generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

inputting each of the first meta features into a probability model, and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;

comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to that evaluation value is an effective feature.
This application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of a feature recognition method:

acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;

generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

inputting each of the first meta features into a probability model, and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;

comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to that evaluation value is an effective feature.
Beneficial effects

In the feature recognition method, apparatus, computer device and storage medium provided in this application, candidate features are generated from the original features according to the preset candidate feature generation method, and a corresponding first meta feature is then generated from each candidate feature; the first meta features are input into the pre-trained probability model to calculate the probability that each candidate feature is a preset label; the candidate features whose probability is greater than or equal to the first preset threshold form a candidate feature set; each candidate feature in the candidate feature set is combined with all the original features to calculate the evaluation value of each candidate feature; and a candidate feature whose evaluation value is greater than the second preset threshold is an effective feature. This application combines meta-learning to evaluate the constructed features and select the features that are effective for machine learning.
Description of the drawings

FIG. 1 is a schematic diagram of the steps of a feature recognition method in an embodiment of the present application;

FIG. 2 is a structural block diagram of a feature recognition apparatus in an embodiment of the present application;

FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.

The realization of the objectives, functional characteristics, and advantages of this application will be further described with reference to the embodiments and the accompanying drawings.

Best mode of the invention

To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit it.
Referring to FIG. 1, an embodiment of the present application provides a feature recognition method, including the following steps:

Step S1: acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;

Step S2: generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;

Step S3: inputting each of the first meta features into a probability model, and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, wherein the probability model is trained based on a random forest model;

Step S4: comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all the candidate features whose target probability is greater than or equal to the first preset threshold;

Step S5: combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value of each candidate feature;

Step S6: comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to that evaluation value is an effective feature.
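Steps S1 to S6 can be sketched as a single selection routine. This is a hypothetical orchestration, not the implementation of this application: the four callables stand in for the candidate feature generation method, the first meta feature generation method, the trained probability model, and the evaluator.

```python
def select_effective_features(originals, t1=0.5, t2=0.7,
                              generate=None, to_meta=None,
                              prob_model=None, evaluate=None):
    """Hypothetical orchestration of steps S1-S6; the four callables
    stand in for the generation methods, probability model and evaluator."""
    candidates = generate(originals)                           # S1
    metas = {c: to_meta(c) for c in candidates}                # S2
    probs = {c: prob_model(m) for c, m in metas.items()}       # S3
    shortlist = [c for c, p in probs.items() if p >= t1]       # S4
    evals = {c: evaluate(originals + [c]) for c in shortlist}  # S5
    return [c for c, v in evals.items() if v > t2]             # S6
```

Note that the probability filter in S4 uses "greater than or equal to" while the evaluation filter in S6 uses a strict "greater than", matching the wording of the steps above.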
In this embodiment, the above method is applied to feature engineering. Feature engineering is essentially an engineering activity whose purpose is to extract features from raw data to the greatest extent possible for use by algorithms and models. Existing feature engineering only mechanically combines and merges features, and the new features generated in this way do not necessarily improve the accuracy of the model; it is therefore necessary to select, from the generated new features, the features that are effective for machine learning.

Specifically, as described in step S1 above, the original features are basic features, and each original feature is used to generate corresponding candidate features according to the pre-established candidate feature generation method, which may include one or more of the following:

a. Feature transformation: converting continuous features and time features among the original features into discrete features, dividing the value range of a numeric feature into multiple equal segments, converting a time feature into the week number, month, year, whether the day is a weekend, and so on, or fitting the range of a continuous feature to a specific distribution, for example using normalization to scale a numeric variable to [0, 1];

b. Feature combination: merging two features into one new feature, for example applying addition, subtraction, multiplication, or division to two continuous variables, or combining two categorical variables;

c. High-order feature crossing: generating one new feature from no fewer than two features, for example aggregation by maximum, minimum, mean, variance, or count.
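As a minimal sketch (not the implementation of this application), the three candidate feature generation strategies above might look as follows for numeric features; the feature names and helper functions are illustrative:

```python
def normalize(values):
    """Feature transformation: scale a numeric feature to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant features
    return [(v - lo) / span for v in values]

def combine(a, b):
    """Feature combination: merge two numeric features into new ones."""
    return {
        "sum":  [x + y for x, y in zip(a, b)],
        "diff": [x - y for x, y in zip(a, b)],
        "prod": [x * y for x, y in zip(a, b)],
    }

def cross(features):
    """High-order feature crossing: aggregate two or more features row-wise."""
    aggs = {"max": [], "min": [], "mean": []}
    for row in zip(*features):
        aggs["max"].append(max(row))
        aggs["min"].append(min(row))
        aggs["mean"].append(sum(row) / len(row))
    return aggs

# Two hypothetical original features.
age, income = [20.0, 40.0, 60.0], [1.0, 3.0, 5.0]
candidates = {"age_norm": normalize(age)}
candidates.update(combine(age, income))
candidates.update(cross([age, income]))
```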
As described in step S2 above, each of the candidate features generated above is used to generate a corresponding first meta feature according to the pre-established first meta feature generation method, which is used to generate the following two kinds of meta information:

1. Entropy values and statistical tests of the candidate feature: all original features are divided into three subgroups according to their type, namely discrete, numeric, and date-time; the chi-square test and the t-test are used to calculate the correlation between each group and the candidate feature, and an entropy-based measure of the candidate feature is also calculated;

2. Information about the candidate feature, including its maximum, minimum, mean, variance, and so on.
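A minimal sketch of the second kind of meta information, plus an entropy-based measure of the first kind, might look as follows. The function names are illustrative, and the entropy shown is the plain Shannon entropy of a discretized feature, standing in for the specific statistical tests described above:

```python
import math
from statistics import mean, pvariance

def first_meta_features(values):
    """Basic statistics of a candidate feature (second kind of meta information)."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean(values),
        "variance": pvariance(values),
    }

def entropy(discrete_values):
    """Shannon entropy of a discretized candidate feature."""
    n = len(discrete_values)
    counts = {}
    for v in discrete_values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```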
As described in step S3 above, each first meta feature is input into a probability model that has been pre-trained based on a random forest model. The trained probability model can calculate the probability that each first meta feature is a preset label, and this probability is taken as the target probability of each candidate feature being the preset label. The preset label may be "good" or "bad".

As described in step S4 above, each target probability is compared with a first preset threshold, which is a user-defined value in [0, 1]; for example, with the first preset threshold set to 0.5, all candidate features whose target probability is greater than or equal to the first preset threshold form one candidate feature set.

As described in step S5 above, each candidate feature in the candidate feature set is combined with the multiple original features to calculate the evaluation value of each candidate feature. The AUC (Area Under the Curve, a model evaluation index) or the accuracy can be used as the evaluation value of the candidate feature. The AUC value is the area under the ROC (Receiver Operating Characteristic) curve; when the AUC value is used as the evaluation value, the larger the AUC value, the more reliably the candidate feature can be judged an effective feature. Accuracy refers to the proportion of correctly predicted results.

As described in step S6 above, the evaluation value corresponding to each candidate feature is compared with a second preset threshold, which may be preset by the user or determined from all the original features. If the evaluation value of a candidate feature is greater than the second preset threshold, that candidate feature is determined to be an effective feature. Since each candidate feature is combined with all the original features, the second preset threshold can be determined from all the original features: when the evaluation value obtained after adding one candidate feature to all the original features is greater than the second preset threshold, the addition of that candidate feature is beneficial to the algorithm or model. In this application, the candidate features constructed from the original features are first input into the probability model to calculate their target probability of being the preset label; the candidate features whose target probability is greater than or equal to the first preset threshold are selected; the selected candidate features are combined with all the original features to calculate their respective evaluation values; and the candidate features whose evaluation value is greater than the second preset threshold are taken as effective features. This application combines meta-learning to quickly and effectively select, from the constructed features, the features that are effective for machine learning.
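For illustration, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. In practice the labels and scores would come from a model trained on the original features combined with the candidate feature; the data below is invented:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive example is scored
    above a random negative one; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random scoring, and 1.0 to a perfect ranking, which is why a larger evaluation value indicates a more useful candidate feature.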
In an embodiment, before step S3 of inputting each of the first meta features into the probability model and calculating the probability that each of the first meta features is a preset label as the target probability of each candidate feature being the preset label, the method further includes:

Step S3a: acquiring multiple labeled training sets, each labeled training set containing multiple original training features;

Step S3b: generating multiple candidate training features from the multiple original training features according to the preset candidate feature generation method;

Step S3c: generating a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and generating corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;

Step S3d: assigning a label to each candidate training feature;

Step S3e: combining the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate training feature to generate a new training data set, and inputting the new training data set into a random forest model for training so that the output of the random forest model is the probability of the label, thereby obtaining the trained probability model.
In this embodiment, as described in step S3a above, multiple labeled training sets are acquired from a repository of data sets, and each labeled training set contains multiple original training features.

As described in step S3b above, the original training features of each labeled training set are used to generate multiple candidate training features according to the preset candidate feature generation method, which is specifically as described in the previous embodiment.

As described in step S3c above, a corresponding second meta feature is generated for each labeled training set according to the preset second meta feature generation method, with each labeled training set yielding one second meta feature; the corresponding third meta features are generated from the multiple candidate training features according to the preset first meta feature generation method, and each third meta feature includes the two kinds of meta information described in the previous embodiment. The preset second meta feature generation method is used to generate the following four kinds of meta information:
a、一般信息:分析标记训练集的一般统计信息,如原始训练特征的数量、数据大小的统计信息以及原始训练特征的其他统计信息;a. General information: Analyze the general statistical information of the marked training set, such as the number of original training features, the statistical information of the data size, and other statistical information of the original training features;
b、初始评估:对原始训练特征应用学习算法时当前性能的统计,生成的元信息包括定义的评估机制、运行时间,评估机制可包括AUC或者准确度;b. Initial evaluation: Statistics of the current performance when the learning algorithm is applied to the original training features. The generated meta-information includes the defined evaluation mechanism and running time. The evaluation mechanism can include AUC or accuracy;
c、基于熵的度量:根据原始训练特征的类型,即离散型、数值型、日期-时间型,将所有原始训练特征按照类型划分成三个子组,并计算每个子组中原始训练特征的信息增益;c. Entropy-based measurement: According to the types of original training features, namely discrete, numeric, and date-time, all the original training features are divided into three subgroups according to their types, and the information of the original training features in each subgroup is calculated Gain
d、特征多样性:将所有原始训练特征分成基于类型的子组后,使用卡方检验和t检验来计算组中每对原始训练特征的相似度。d. Feature diversity: After dividing all original training features into type-based subgroups, use chi-square test and t-test to calculate the similarity of each pair of original training features in the group.
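Items a and c above lend themselves to a direct computation. The following Python sketch shows one possible way to derive the "general information" and "entropy-based" meta-information for a labeled training set; the function names, the column-list data layout, and the binary log base are illustrative assumptions, not part of the disclosed method:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """Information gain of `labels` obtained by splitting on a discrete `feature`."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def general_meta_features(features, labels):
    """Item a (general information) plus item c (entropy-based measures)
    for one labeled training set; `features` is a list of feature columns."""
    return {
        "n_features": len(features),   # number of original training features
        "n_rows": len(labels),         # a data-size statistic
        "info_gain": [information_gain(col, labels) for col in features],
    }
```

A feature column identical to the labels yields an information gain of 1 bit on a balanced binary label, while an uninformative column yields 0.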
As described in step S3d above, a label is assigned to each candidate training feature. The label may be "good" or "bad", indicating whether the candidate training feature can improve the performance of the learner; when it can, the candidate training feature is labeled "good".
As described in step S3e above, the second meta-features corresponding to the multiple labeled training sets are combined with the third meta-feature corresponding to one candidate training feature to generate one new training data set, so as many new training data sets are formed as there are third meta-features. The new training data sets are then input into the random forest model for training so that the output of the random forest model is the probability of the label; after every new training data set has been input into the random forest model for iterative training, the trained probability model is obtained.
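Step S3e can be sketched as follows. This illustrative fragment assumes scikit-learn's RandomForestClassifier as the random forest model and uses randomly generated stand-ins for the second and third meta-features; the array shapes and the toy labeling rule are assumptions made for the example only (in the disclosure, the labels come from step S3d):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_candidates = 20
rng = np.random.default_rng(0)
dataset_meta = rng.normal(size=4)                    # second meta-features (shared per labeled training set)
candidate_meta = rng.normal(size=(n_candidates, 3))  # third meta-features, one row per candidate

# One training row per candidate feature:
# [second meta-features | that candidate's third meta-features]
X = np.hstack([np.tile(dataset_meta, (n_candidates, 1)), candidate_meta])
# Toy "good"/"bad" labels (1 = good) via a mean split, for illustration only
y = (candidate_meta[:, 0] > candidate_meta[:, 0].mean()).astype(int)

prob_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Probability that each candidate feature carries the "good" label
p_good = prob_model.predict_proba(X)[:, list(prob_model.classes_).index(1)]
```

At inference time the same concatenation is applied to a new candidate's meta-features, and `predict_proba` yields the target probability compared against the first preset threshold.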
In one embodiment, step S3d of assigning a label to each candidate feature includes:
Step S3d1: inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
Step S3d2: combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
Step S3d3: inputting each data set into the learner, and calculating a second learning value for each data set;
Step S3d4: comparing the second learning value of each data set with the first learning value, and if a data set's second learning value is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
In this embodiment, as described in step S3d1 above, the multiple labeled training sets are input into a learner, and the first learning value of the multiple labeled training sets is calculated. The learner is a model-evaluation criterion used to calculate a learning value for an input data set; it may compute AUC or accuracy.
As described in step S3d2 above, the multiple labeled training sets are combined with one candidate training feature to form one new data set; as many data sets are formed as there are candidate training features, and each data set includes all the labeled training sets.
As described in step S3d3 above, each data set is input into the learner, and the second learning value of each new data set is calculated. Because each candidate training feature is combined with all the labeled training sets when its second learning value is calculated, that second learning value can be regarded as the second learning value of the candidate training feature itself.
As described in step S3d4 above, each second learning value is compared with the first learning value. If the second learning value is greater than the first learning value, the candidate training feature can improve the learner's performance, and the candidate training feature corresponding to that second learning value is assigned a first-type label, which may be "good"; when the second learning value is less than or equal to the first learning value, the candidate feature corresponding to that second learning value is assigned a second-type label, which may be "bad".
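Steps S3d1 through S3d4 amount to scoring the training data once without and once with each candidate training feature and comparing the two scores. The self-contained sketch below substitutes a toy leave-one-out nearest-neighbour accuracy for the learner's AUC/accuracy (an illustrative assumption; the disclosure does not fix the learner's internals):

```python
def learner_score(rows, labels):
    """Toy 'learner': leave-one-out 1-nearest-neighbour accuracy, standing
    in for the AUC/accuracy evaluation criterion described above."""
    hits = 0
    for i, row in enumerate(rows):
        dist = lambda other: sum((a - b) ** 2 for a, b in zip(row, other))
        _, j = min((dist(rows[j]), j) for j in range(len(rows)) if j != i)
        hits += labels[j] == labels[i]
    return hits / len(rows)

def assign_label(base_rows, candidate_column, labels):
    """Steps S3d1-S3d4: 'good' if appending the candidate column raises the
    learner score above the baseline (first learning value), else 'bad'."""
    first_value = learner_score(base_rows, labels)                          # S3d1
    augmented = [row + [c] for row, c in zip(base_rows, candidate_column)]  # S3d2
    second_value = learner_score(augmented, labels)                         # S3d3
    return "good" if second_value > first_value else "bad"                  # S3d4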
In one embodiment, step S5 of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
Step S501: combining the multiple original features with each candidate feature in the candidate feature set respectively to form first target feature sets;
Step S502: calculating the AUC value of each first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
In this embodiment, as described in step S501 above, the multiple original features and each candidate feature in the candidate feature set form a first target feature set; as many first target feature sets are formed as there are candidate features in the candidate feature set.
As described in step S502 above, the AUC value of each first target feature set is calculated. Each first target feature set includes one candidate feature and all the original features, and the AUC value of the first target feature set is used as the evaluation value of that candidate feature.
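For steps S501 and S502 the evaluation value is simply the AUC obtained on the first target feature set. A minimal rank-based (Mann–Whitney) AUC computation might look like this; the function name and the convention that higher scores indicate the positive class are illustrative assumptions:

```python
def auc(scores, labels):
    """AUC via the rank (Mann-Whitney) formulation: the fraction of
    positive/negative pairs ranked correctly, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Here `scores` would be the predictions of a model trained on a first target feature set, so the resulting AUC directly serves as that candidate feature's evaluation value.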
In one embodiment, step S5 of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
Step S50a: combining the multiple original features with each candidate feature in the candidate feature set respectively to form second target feature sets;
Step S50b: calculating the AUC value and the accuracy of each second target feature set;
Step S50c: calculating the evaluation value of the candidate feature according to the formula M = ak1 + bk2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
In this embodiment, as described in step S50a above, the multiple original features and one candidate feature from the candidate feature set form a second target feature set; as many second target feature sets are formed as there are candidate features in the candidate feature set, and each second target feature set includes all the original features and one candidate feature from the candidate feature set.
As described in step S50b above, the AUC value and accuracy of each second target feature set are calculated. Because each second target feature set includes all the original features and one candidate feature from the candidate feature set, and the candidate feature differs between second target feature sets, the AUC value and accuracy of a second target feature set can serve as the basis for calculating that candidate feature's evaluation value.
As described in step S50c above, the above formula is used to calculate the evaluation value of the candidate feature; it incorporates the original features and evaluates the candidate feature in terms of both the AUC value and the accuracy, improving the accuracy of the candidate feature's evaluation value.
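Step S50c's weighted combination can be written out directly. The default weight values below are placeholders, since the disclosure does not fix k1 and k2:

```python
def evaluation_value(auc_value, accuracy, k1=0.6, k2=0.4):
    """Step S50c: M = a*k1 + b*k2, weighting the second target feature set's
    AUC value (a) and accuracy (b). The default weights are illustrative
    assumptions; the disclosure leaves k1 and k2 unspecified."""
    return auc_value * k1 + accuracy * k2
```

Setting k2 = 0 recovers the AUC-only variant of steps S501 and S502 as a special case.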
Referring to Fig. 2, an embodiment of the present application further provides a feature recognition apparatus, including:
a first acquisition unit 10, configured to acquire multiple original features and generate multiple candidate features from the multiple original features according to a preset candidate-feature generation method;
a first generation unit 20, configured to generate a corresponding first meta-feature for each candidate feature according to a preset first meta-feature generation method;
a first calculation unit 30, configured to input each first meta-feature into a probability model and calculate the probability that each first meta-feature corresponds to a preset label, as the target probability that each candidate feature corresponds to the preset label;
a comparison unit 40, configured to compare the target probability of each candidate feature with a first preset threshold and form a candidate feature set from all candidate features whose target probability is greater than or equal to the first preset threshold;
a second calculation unit 50, configured to combine each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature;
a determination unit 60, configured to compare each evaluation value with a second preset threshold and, if an evaluation value is greater than the second preset threshold, determine that the candidate feature corresponding to the evaluation value is a valid feature.
In one embodiment, the feature recognition apparatus further includes:
a second acquisition unit, configured to acquire multiple labeled training sets, each labeled training set containing multiple original training features;
a second generation unit, configured to generate multiple candidate training features from the multiple original training features according to the preset candidate-feature generation method;
a third generation unit, configured to generate a corresponding second meta-feature for each labeled training set according to a preset second meta-feature generation method, and to generate corresponding third meta-features from the multiple candidate training features according to the preset first meta-feature generation method;
an assignment unit, configured to assign a label to each candidate training feature;
a training unit, configured to combine the second meta-features corresponding to the multiple labeled training sets with the third meta-feature corresponding to each candidate feature to generate new training data sets, and to input the new training data sets into a random forest model for training so that the output of the random forest model is the probability of the label, obtaining a trained probability model.
Further, the assignment unit includes:
a first calculation subunit, configured to input the multiple labeled training sets into a learner and calculate a first learning value of the multiple labeled training sets;
a combination subunit, configured to combine the multiple labeled training sets with each candidate training feature respectively to form data sets;
a second calculation subunit, configured to input each data set into the learner and calculate a second learning value for each data set;
an assignment subunit, configured to compare the second learning value of each data set with the first learning value and, if a data set's second learning value is greater than the first learning value, assign a first-type label to the candidate training feature added to that data set.
In one embodiment, the second calculation unit 50 includes:
a first composition subunit, configured to combine the multiple original features with each candidate feature in the candidate feature set respectively to form first target feature sets;
a third calculation subunit, configured to calculate the AUC value of each first target feature set and use the AUC value of the first target feature set as the evaluation value of the candidate feature.
In one embodiment, the second calculation unit 50 includes:
a second composition subunit, configured to combine the multiple original features with each candidate feature in the candidate feature set respectively to form second target feature sets;
a fourth calculation subunit, configured to calculate the AUC value and the accuracy of each second target feature set;
a fifth calculation subunit, configured to calculate the evaluation value of the candidate feature according to the formula M = ak1 + bk2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
In this embodiment, for the specific implementation of each of the above units and subunits, refer to the description in the foregoing method embodiment; details are not repeated here.
Referring to Fig. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program on the non-volatile storage medium. The database of the computer device stores data such as original feature data and first meta-feature data. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer program implements a feature recognition method.
The processor executes the steps of the feature recognition method:
acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate-feature generation method;
generating a corresponding first meta-feature for each candidate feature according to a preset first meta-feature generation method;
inputting each first meta-feature into a probability model, and calculating the probability that each first meta-feature corresponds to a preset label, as the target probability that each candidate feature corresponds to the preset label, where the probability model is trained based on a random forest model;
comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all candidate features whose target probability is greater than or equal to the first preset threshold;
combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature;
comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to the evaluation value is a valid feature.
In one embodiment, before executing the step of inputting each first meta-feature into the probability model and calculating the probability that each first meta-feature corresponds to a preset label as the target probability that each candidate feature corresponds to the preset label, the processor further executes:
acquiring multiple labeled training sets, each labeled training set containing multiple original training features;
generating multiple candidate training features from the multiple original training features according to the preset candidate-feature generation method;
generating a corresponding second meta-feature for each labeled training set according to a preset second meta-feature generation method, and generating corresponding third meta-features from the multiple candidate training features according to the preset first meta-feature generation method;
assigning a label to each candidate training feature;
combining the second meta-features corresponding to the multiple labeled training sets with the third meta-feature corresponding to each candidate feature to generate new training data sets, and inputting the new training data sets into a random forest model for training so that the output of the random forest model is the probability of the label, obtaining a trained probability model.
In one embodiment, the step, executed by the processor, of assigning a label to each candidate feature includes:
inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
inputting each data set into the learner, and calculating a second learning value for each data set;
comparing the second learning value of each data set with the first learning value, and if a data set's second learning value is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
In one embodiment, the step, executed by the processor, of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
combining the multiple original features with each candidate feature in the candidate feature set respectively to form first target feature sets;
calculating the AUC value of each first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
In one embodiment, the step, executed by the processor, of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
combining the multiple original features with each candidate feature in the candidate feature set respectively to form second target feature sets;
calculating the AUC value and the accuracy of each second target feature set;
calculating the evaluation value of the candidate feature according to the formula M = ak1 + bk2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
Those skilled in the art can understand that the structure shown in Fig. 3 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium, which may be a non-volatile storage medium or a volatile storage medium. A computer program is stored thereon, and when executed by a processor the computer program implements a feature recognition method, specifically:
acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate-feature generation method;
generating a corresponding first meta-feature for each candidate feature according to a preset first meta-feature generation method;
inputting each first meta-feature into a probability model, and calculating the probability that each first meta-feature corresponds to a preset label, as the target probability that each candidate feature corresponds to the preset label, where the probability model is trained based on a random forest model;
comparing the target probability of each candidate feature with a first preset threshold, and forming a candidate feature set from all candidate features whose target probability is greater than or equal to the first preset threshold;
combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature;
comparing each evaluation value with a second preset threshold, and if an evaluation value is greater than the second preset threshold, determining that the candidate feature corresponding to the evaluation value is a valid feature.
In one embodiment, before executing the step of inputting each first meta-feature into the probability model and calculating the probability that each first meta-feature corresponds to a preset label as the target probability that each candidate feature corresponds to the preset label, the processor further executes:
acquiring multiple labeled training sets, each labeled training set containing multiple original training features;
generating multiple candidate training features from the multiple original training features according to the preset candidate-feature generation method;
generating a corresponding second meta-feature for each labeled training set according to a preset second meta-feature generation method, and generating corresponding third meta-features from the multiple candidate training features according to the preset first meta-feature generation method;
assigning a label to each candidate training feature;
combining the second meta-features corresponding to the multiple labeled training sets with the third meta-feature corresponding to each candidate feature to generate new training data sets, and inputting the new training data sets into a random forest model for training so that the output of the random forest model is the probability of the label, obtaining a trained probability model.
In one embodiment, the step, executed by the processor, of assigning a label to each candidate feature includes:
inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
inputting each data set into the learner, and calculating a second learning value for each data set;
comparing the second learning value of each data set with the first learning value, and if a data set's second learning value is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
In one embodiment, the step, executed by the processor, of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
combining the multiple original features with each candidate feature in the candidate feature set respectively to form first target feature sets;
calculating the AUC value of each first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
In one embodiment, the step, executed by the processor, of combining each candidate feature in the candidate feature set with the multiple original features to calculate an evaluation value for each candidate feature includes:
combining the multiple original features with each candidate feature in the candidate feature set respectively to form second target feature sets;
calculating the AUC value and the accuracy of each second target feature set;
calculating the evaluation value of the candidate feature according to the formula M = ak1 + bk2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value of the second target feature set, and k2 is the weight of the accuracy of the second target feature set.
In summary, in the feature recognition method, apparatus, computer device, and storage medium provided in the embodiments of the present application, candidate features are generated from the original features according to a preset candidate-feature generation method, and corresponding first meta-features are then generated from the candidate features. The first meta-features are input into a pre-trained probability model to calculate the probability that each candidate feature corresponds to a preset label; the candidate features whose probability is greater than or equal to a first preset threshold form a candidate feature set; each candidate feature in the candidate feature set is combined with all the original features to calculate that candidate feature's evaluation value; and a candidate feature whose evaluation value is greater than the second preset threshold is a valid feature. The present application combines meta-learning to evaluate the constructed features and select features that are effective for machine learning.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的和实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可以包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM通过多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双倍数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其它变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其它要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that, in this document, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article, or method. In the absence of further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes that element.
以上所述仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其它相关的技术领域,均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, falls equally within the scope of patent protection of this application.

Claims (20)

  1. 一种特征识别方法,其中,包括以下步骤:A feature recognition method, which includes the following steps:
    获取多个原始特征,将多个所述原始特征按照预设候选特征生成方法生成多个候选特征;Acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;
    将每个所述候选特征按照预设第一元特征生成方法生成对应的第一元特征;Generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;
    将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率;其中,所述概率模型基于随机森林模型训练而成;Inputting each first meta feature into a probability model, and calculating the probability that each first meta feature is a preset label, as the target probability that each candidate feature is a preset label, wherein the probability model is trained based on a random forest model;
    将每个所述候选特征的目标概率与第一预设阀值进行比较,将所述目标概率大于等于所述第一预设阀值的所有所述候选特征组成候选特征集;Comparing the target probability of each candidate feature with a first preset threshold, and grouping all candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
    将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值;Combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate an evaluation value of each of the candidate features;
    将各个所述评估值与第二预设阀值进行比较,若所述评估值大于所述第二预设阀值,则判定所述评估值对应的候选特征为有效特征。Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  2. 根据权利要求1所述的特征识别方法,其中,所述将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率的步骤之前,还包括:2. The feature recognition method according to claim 1, wherein, before the step of inputting each first meta feature into the probability model and calculating the probability that each first meta feature is a preset label as the target probability that each candidate feature is a preset label, the method further comprises:
    获取多个标记训练集,所述标记训练集中包含多个原始训练特征;Acquiring a plurality of labeled training sets, the labeled training set containing a plurality of original training features;
    根据所述预设候选特征生成方法将多个所述原始训练特征生成多个候选训练特征;Generating multiple candidate training features from multiple original training features according to the preset candidate feature generating method;
    将每个所述标记训练集根据预设第二元特征生成方法生成对应的第二元特征,根据所述预设第一元特征生成方法将多个所述候选训练特征生成对应的第三元特征;Generating a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and generating corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;
    为每个所述候选训练特征分配标签;Assign a label to each candidate training feature;
    将多个所述标记训练集所对应的第二元特征与每个所述候选特征所对应的第三元特征组合生成新的训练数据集;将所述新的训练数据集输入至随机森林模型中进行训练,使得所述随机森林模型的输出结果为所述标签的概率,得到训练完成的概率模型。Combining the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate feature to generate a new training data set; and inputting the new training data set into a random forest model for training, so that the output of the random forest model is the probability of the label, thereby obtaining a trained probability model.
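A minimal sketch of the training step in claim 2, assuming scikit-learn is available. The second and third meta features below are random stand-ins (the patent's meta-feature generation methods are not reproduced), and the column widths are arbitrary; the point is only that a random forest is fitted so its output is a label probability:

```python
# Hypothetical sketch: combine second meta features (of the labeled training
# sets) with third meta features (of the candidate training features) into a
# new training data set, then fit a random forest whose output is the
# probability of the assigned label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
second_meta = rng.normal(size=(50, 4))   # stand-in meta features of the labeled training sets
third_meta = rng.normal(size=(50, 3))    # stand-in meta features of the candidate features
X = np.hstack([second_meta, third_meta]) # the combined "new training data set"
y = np.array([0, 1] * 25)                # labels previously assigned to the candidates

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
proba = model.predict_proba(X)[:, 1]     # probability of the positive (first-type) label
```

At inference time, the first meta feature of a new candidate would be fed through the same pipeline to obtain its target probability.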
  3. 根据权利要求2所述的特征识别方法,其中,所述为每个所述候选特征分配标签的步骤,包括:The feature recognition method according to claim 2, wherein the step of assigning a label to each of the candidate features comprises:
    将多个所述标记训练集输入至学习器中,计算多个所述标记训练集的第一学习值;Inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
    将多个所述标记训练集与每个所述候选训练特征分别组合成数据集;Combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
    将每个所述数据集输入至学习器中,计算每个所述数据集各自的第二学习值;Inputting each data set into the learner, and calculating a respective second learning value of each data set;
    将每个所述数据集各自的第二学习值与所述第一学习值进行对比,若所述数据集的第二学习值大于所述第一学习值,则为所述数据集中加入的所述候选训练特征分配第一类标签。Comparing the respective second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
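The label-assignment rule of claim 3 can be sketched as below. The "learning values" are simply scores returned by some learner; here they are given as plain numbers, and the feature names are hypothetical:

```python
# Sketch of claim 3's rule: a candidate training feature receives the
# first-type (positive) label only when adding it to the labeled training
# sets raises the learner's score above the baseline first learning value.

def assign_labels(baseline_score, candidate_scores):
    """baseline_score: first learning value of the labeled training sets.
    candidate_scores: {feature: second learning value with that feature added}.
    Returns {feature: 1} for improving features and {feature: 0} otherwise."""
    return {f: int(s > baseline_score) for f, s in candidate_scores.items()}

labels = assign_labels(0.75, {"sum_ab": 0.80, "ratio_cd": 0.70})
# labels == {"sum_ab": 1, "ratio_cd": 0}
```

These labels are what the random forest of claim 2 is subsequently trained to predict from the meta features.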
  4. 根据权利要求1所述的特征识别方法,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:The feature recognition method according to claim 1, wherein the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第一目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a first target feature set;
    计算所述第一目标特征集的AUC值,将所述第一目标特征集的AUC值作为所述候选特征的评估值。Calculating the AUC value of the first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
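For reference, the AUC value used as the evaluation value above can be computed with the rank-based (Mann–Whitney) formula; the labels and scores below are purely illustrative, standing in for a model evaluated on the first target feature set:

```python
# Pure-Python AUC computation (Mann-Whitney form): the probability that a
# randomly chosen positive example is scored above a randomly chosen
# negative one, with ties counting as one half.

def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give an AUC of 1.0.
value = auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```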
  5. 根据权利要求1所述的特征识别方法,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:The feature recognition method according to claim 1, wherein the step of combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate the evaluation value of each of the candidate features comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第二目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a second target feature set;
    计算每个所述第二目标特征集的AUC值和准确度;Calculating the AUC value and the accuracy of each second target feature set;
    根据公式M=ak1+bk2计算所述候选特征的评估值;其中,所述a为所述第二目标特征集的AUC值,所述b为所述第二目标特征集的准确度,所述k1为所述第二目标特征集的AUC值的权重,所述k2为所述第二目标特征集的准确度的权重。Calculating the evaluation value of the candidate feature according to the formula M = a·k1 + b·k2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value, and k2 is the weight of the accuracy.
  6. 一种特征识别装置,其中,包括:A feature recognition device, which includes:
    第一获取单元,用于获取多个原始特征,将多个所述原始特征按照预设候选特征生成方法生成多个候选特征;The first acquiring unit is configured to acquire multiple original features, and generate multiple candidate features from the multiple original features according to a preset candidate feature generation method;
    第一生成单元,用于将每个所述候选特征按照预设第一元特征生成方法生成对应的第一元特征;A first generating unit, configured to generate a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;
    第一计算单元,用于将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率;a first calculation unit, configured to input each first meta feature into a probability model and calculate the probability that each first meta feature is a preset label, as the target probability that each candidate feature is a preset label;
    比较单元,用于将每个所述候选特征的目标概率与第一预设阀值进行比较,将所述目标概率大于等于所述第一预设阀值的所有所述候选特征组成候选特征集;a comparison unit, configured to compare the target probability of each candidate feature with a first preset threshold and group all candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
    第二计算单元,用于将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值;A second calculation unit, configured to combine each candidate feature in the candidate feature set with a plurality of original features to calculate an evaluation value of each candidate feature;
    判定单元,用于将各个所述评估值与第二预设阀值进行比较,若所述评估值大于所述第二预设阀值,则判定所述评估值对应的候选特征为有效特征。The determining unit is configured to compare each of the evaluation values with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, determine that the candidate feature corresponding to the evaluation value is a valid feature.
  7. 根据权利要求6所述的特征识别装置,其中,所述特征识别装置还包括:The feature recognition device according to claim 6, wherein the feature recognition device further comprises:
    第二获取单元,用于获取多个标记训练集,所述标记训练集中包含多个原始训练特征;The second acquiring unit is configured to acquire multiple labeled training sets, where the labeled training set contains multiple original training features;
    第二生成单元,用于根据所述预设候选特征生成方法将多个所述原始训练特征生成多个候选训练特征;The second generating unit is configured to generate multiple candidate training features from the multiple original training features according to the preset candidate feature generating method;
    第三生成单元,用于将每个所述标记训练集根据预设第二元特征生成方法生成对应的第二元特征,根据所述预设第一元特征生成方法将多个所述候选训练特征生成对应的第三元特征;a third generating unit, configured to generate a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and to generate corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;
    分配单元,用于为每个所述候选训练特征分配标签;An allocation unit, configured to allocate a label to each candidate training feature;
    训练单元,用于将多个所述标记训练集所对应的第二元特征与每个所述候选特征所对应的第三元特征组合生成新的训练数据集;将所述新的训练数据集输入至随机森林模型中进行训练,使得所述随机森林模型的输出结果为所述标签的概率,得到训练完成的概率模型。a training unit, configured to combine the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate feature to generate a new training data set, and to input the new training data set into a random forest model for training, so that the output of the random forest model is the probability of the label, thereby obtaining a trained probability model.
  8. 根据权利要求6所述的特征识别装置,其中,所述第二计算单元包括:The feature recognition device according to claim 6, wherein the second calculation unit comprises:
    第一组成子单元,用于将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第一目标特征集;The first composition subunit is used to compose a plurality of the original features and each candidate feature in the candidate feature set into a first target feature set;
    第三计算子单元,用于计算所述第一目标特征集的AUC值,将所述第一目标特征集的AUC值作为所述候选特征的评估值。The third calculation subunit is configured to calculate the AUC value of the first target feature set, and use the AUC value of the first target feature set as the evaluation value of the candidate feature.
  9. 根据权利要求7所述的特征识别装置,其中,所述分配单元,包括:9. The feature recognition device according to claim 7, wherein the allocation unit comprises:
    第一计算子单元,用于将多个所述标记训练集输入至学习器中,计算多个所述标记训练集的第一学习值;The first calculation subunit is configured to input a plurality of the marked training sets into a learner, and calculate the first learning value of the plurality of the marked training sets;
    组合子单元,用于将多个所述标记训练集与每个所述候选训练特征分别组合成数据集;A combination subunit, configured to combine a plurality of the labeled training sets and each of the candidate training features into a data set;
    第二计算子单元,用于将每个所述数据集输入至学习器中,计算每个所述数据集各自的第二学习值;The second calculation subunit is used to input each of the data sets into the learner, and calculate the respective second learning value of each of the data sets;
    分配子单元,用于将每个所述数据集各自的第二学习值与所述第一学习值进行对比,若所述数据集的第二学习值大于所述第一学习值,则为所述数据集中加入的所述候选训练特征分配第一类标签。an allocation subunit, configured to compare the respective second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, to assign a first-type label to the candidate training feature added to that data set.
  10. 根据权利要求6所述的特征识别装置,其中,所述第二计算单元,包括:10. The feature recognition device according to claim 6, wherein the second calculation unit comprises:
    第二组成子单元,用于将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第二目标特征集;The second composition subunit is used to compose a plurality of the original features and each candidate feature in the candidate feature set into a second target feature set;
    第四计算子单元,用于计算每个所述第二目标特征集的AUC值和准确度;The fourth calculation subunit is used to calculate the AUC value and accuracy of each of the second target feature sets;
    第五计算子单元,用于根据公式M=ak1+bk2计算所述候选特征的评估值;其中,所述a为所述第二目标特征集的AUC值,所述b为所述第二目标特征集的准确度,所述k1为所述第二目标特征集的AUC值的权重,所述k2为所述第二目标特征集的准确度的权重。a fifth calculation subunit, configured to calculate the evaluation value of the candidate feature according to the formula M = a·k1 + b·k2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value, and k2 is the weight of the accuracy.
  11. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机程序,其中,所述处理器执行所述计算机程序时实现一种特征识别方法的步骤:11. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of a feature recognition method:
    获取多个原始特征,将多个所述原始特征按照预设候选特征生成方法生成多个候选特征;Acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;
    将每个所述候选特征按照预设第一元特征生成方法生成对应的第一元特征;Generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;
    将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率;其中,所述概率模型基于随机森林模型训练而成;Inputting each first meta feature into a probability model, and calculating the probability that each first meta feature is a preset label, as the target probability that each candidate feature is a preset label, wherein the probability model is trained based on a random forest model;
    将每个所述候选特征的目标概率与第一预设阀值进行比较,将所述目标概率大于等于所述第一预设阀值的所有所述候选特征组成候选特征集;Comparing the target probability of each candidate feature with a first preset threshold, and grouping all candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
    将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值;Combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate an evaluation value of each of the candidate features;
    将各个所述评估值与第二预设阀值进行比较,若所述评估值大于所述第二预设阀值,则判定所述评估值对应的候选特征为有效特征。Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  12. 根据权利要求11所述的计算机设备,其中,所述将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率的步骤之前,还包括:12. The computer device according to claim 11, wherein, before the step of inputting each first meta feature into the probability model and calculating the probability that each first meta feature is a preset label as the target probability that each candidate feature is a preset label, the method further comprises:
    获取多个标记训练集,所述标记训练集中包含多个原始训练特征;Acquiring a plurality of labeled training sets, the labeled training set containing a plurality of original training features;
    根据所述预设候选特征生成方法将多个所述原始训练特征生成多个候选训练特征;Generating multiple candidate training features from multiple original training features according to the preset candidate feature generating method;
    将每个所述标记训练集根据预设第二元特征生成方法生成对应的第二元特征,根据所述预设第一元特征生成方法将多个所述候选训练特征生成对应的第三元特征;Generating a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and generating corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;
    为每个所述候选训练特征分配标签;Assign a label to each candidate training feature;
    将多个所述标记训练集所对应的第二元特征与每个所述候选特征所对应的第三元特征组合生成新的训练数据集;将所述新的训练数据集输入至随机森林模型中进行训练,使得所述随机森林模型的输出结果为所述标签的概率,得到训练完成的概率模型。Combining the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate feature to generate a new training data set; and inputting the new training data set into a random forest model for training, so that the output of the random forest model is the probability of the label, thereby obtaining a trained probability model.
  13. 根据权利要求12所述的计算机设备,其中,所述为每个所述候选特征分配标签的步骤,包括:The computer device according to claim 12, wherein the step of assigning a label to each of the candidate features comprises:
    将多个所述标记训练集输入至学习器中,计算多个所述标记训练集的第一学习值;Inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
    将多个所述标记训练集与每个所述候选训练特征分别组合成数据集;Combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
    将每个所述数据集输入至学习器中,计算每个所述数据集各自的第二学习值;Inputting each data set into the learner, and calculating a respective second learning value of each data set;
    将每个所述数据集各自的第二学习值与所述第一学习值进行对比,若所述数据集的第二学习值大于所述第一学习值,则为所述数据集中加入的所述候选训练特征分配第一类标签。Comparing the respective second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
  14. 根据权利要求11所述的计算机设备,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:14. The computer device according to claim 11, wherein the step of combining each candidate feature in the candidate feature set with the plurality of original features to calculate the evaluation value of each candidate feature comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第一目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a first target feature set;
    计算所述第一目标特征集的AUC值,将所述第一目标特征集的AUC值作为所述候选特征的评估值。Calculating the AUC value of the first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
  15. 根据权利要求11所述的计算机设备,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:15. The computer device according to claim 11, wherein the step of combining each candidate feature in the candidate feature set with the plurality of original features to calculate the evaluation value of each candidate feature comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第二目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a second target feature set;
    计算每个所述第二目标特征集的AUC值和准确度;Calculating the AUC value and the accuracy of each second target feature set;
    根据公式M=ak1+bk2计算所述候选特征的评估值;其中,所述a为所述第二目标特征集的AUC值,所述b为所述第二目标特征集的准确度,所述k1为所述第二目标特征集的AUC值的权重,所述k2为所述第二目标特征集的准确度的权重。Calculating the evaluation value of the candidate feature according to the formula M = a·k1 + b·k2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value, and k2 is the weight of the accuracy.
  16. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现一种特征识别方法的步骤:16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of a feature recognition method:
    获取多个原始特征,将多个所述原始特征按照预设候选特征生成方法生成多个候选特征;Acquiring multiple original features, and generating multiple candidate features from the multiple original features according to a preset candidate feature generation method;
    将每个所述候选特征按照预设第一元特征生成方法生成对应的第一元特征;Generating a corresponding first meta feature for each of the candidate features according to a preset first meta feature generation method;
    将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率;其中,所述概率模型基于随机森林模型训练而成;Inputting each first meta feature into a probability model, and calculating the probability that each first meta feature is a preset label, as the target probability that each candidate feature is a preset label, wherein the probability model is trained based on a random forest model;
    将每个所述候选特征的目标概率与第一预设阀值进行比较,将所述目标概率大于等于所述第一预设阀值的所有所述候选特征组成候选特征集;Comparing the target probability of each candidate feature with a first preset threshold, and grouping all candidate features whose target probability is greater than or equal to the first preset threshold into a candidate feature set;
    将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值;Combining each of the candidate features in the candidate feature set with a plurality of the original features to calculate an evaluation value of each of the candidate features;
    将各个所述评估值与第二预设阀值进行比较,若所述评估值大于所述第二预设阀值,则判定所述评估值对应的候选特征为有效特征。Each of the evaluation values is compared with a second preset threshold value, and if the evaluation value is greater than the second preset threshold value, it is determined that the candidate feature corresponding to the evaluation value is a valid feature.
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述将每个所述第一元特征输入至概率模型中,计算每个所述第一元特征为预设标签的概率,作为每个所述候选特征为预设标签的目标概率的步骤之前,还包括:17. The computer-readable storage medium according to claim 16, wherein, before the step of inputting each first meta feature into the probability model and calculating the probability that each first meta feature is a preset label as the target probability that each candidate feature is a preset label, the method further comprises:
    获取多个标记训练集,所述标记训练集中包含多个原始训练特征;Acquiring a plurality of labeled training sets, the labeled training set containing a plurality of original training features;
    根据所述预设候选特征生成方法将多个所述原始训练特征生成多个候选训练特征;Generating multiple candidate training features from multiple original training features according to the preset candidate feature generating method;
    将每个所述标记训练集根据预设第二元特征生成方法生成对应的第二元特征,根据所述预设第一元特征生成方法将多个所述候选训练特征生成对应的第三元特征;Generating a corresponding second meta feature for each labeled training set according to a preset second meta feature generation method, and generating corresponding third meta features from the multiple candidate training features according to the preset first meta feature generation method;
    为每个所述候选训练特征分配标签;Assign a label to each candidate training feature;
    将多个所述标记训练集所对应的第二元特征与每个所述候选特征所对应的第三元特征组合生成新的训练数据集;将所述新的训练数据集输入至随机森林模型中进行训练,使得所述随机森林模型的输出结果为所述标签的概率,得到训练完成的概率模型。Combining the second meta features corresponding to the multiple labeled training sets with the third meta feature corresponding to each candidate feature to generate a new training data set; and inputting the new training data set into a random forest model for training, so that the output of the random forest model is the probability of the label, thereby obtaining a trained probability model.
  18. 根据权利要求17所述的计算机可读存储介质,其中,所述为每个所述候选特征分配标签的步骤,包括:18. The computer-readable storage medium of claim 17, wherein the step of assigning a label to each of the candidate features comprises:
    将多个所述标记训练集输入至学习器中,计算多个所述标记训练集的第一学习值;Inputting the multiple labeled training sets into a learner, and calculating a first learning value of the multiple labeled training sets;
    将多个所述标记训练集与每个所述候选训练特征分别组合成数据集;Combining the multiple labeled training sets with each candidate training feature respectively to form data sets;
    将每个所述数据集输入至学习器中,计算每个所述数据集各自的第二学习值;Inputting each data set into the learner, and calculating a respective second learning value of each data set;
    将每个所述数据集各自的第二学习值与所述第一学习值进行对比,若所述数据集的第二学习值大于所述第一学习值,则为所述数据集中加入的所述候选训练特征分配第一类标签。Comparing the respective second learning value of each data set with the first learning value, and if the second learning value of a data set is greater than the first learning value, assigning a first-type label to the candidate training feature added to that data set.
  19. 根据权利要求16所述的计算机可读存储介质,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:19. The computer-readable storage medium according to claim 16, wherein the step of combining each candidate feature in the candidate feature set with the plurality of original features to calculate the evaluation value of each candidate feature comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第一目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a first target feature set;
    计算所述第一目标特征集的AUC值,将所述第一目标特征集的AUC值作为所述候选特征的评估值。Calculating the AUC value of the first target feature set, and using the AUC value of the first target feature set as the evaluation value of the candidate feature.
  20. 根据权利要求16所述的计算机可读存储介质,其中,所述将所述候选特征集中的各个所述候选特征与多个所述原始特征进行结合计算各个所述候选特征的评估值的步骤,包括:20. The computer-readable storage medium according to claim 16, wherein the step of combining each candidate feature in the candidate feature set with the plurality of original features to calculate the evaluation value of each candidate feature comprises:
    将多个所述原始特征与所述候选特征集中每个所述候选特征分别组成第二目标特征集;Combining the plurality of original features with each candidate feature in the candidate feature set respectively to form a second target feature set;
    计算每个所述第二目标特征集的AUC值和准确度;Calculating the AUC value and the accuracy of each second target feature set;
    根据公式M=ak1+bk2计算所述候选特征的评估值;其中,所述a为所述第二目标特征集的AUC值,所述b为所述第二目标特征集的准确度,所述k1为所述第二目标特征集的AUC值的权重,所述k2为所述第二目标特征集的准确度的权重。Calculating the evaluation value of the candidate feature according to the formula M = a·k1 + b·k2, where a is the AUC value of the second target feature set, b is the accuracy of the second target feature set, k1 is the weight of the AUC value, and k2 is the weight of the accuracy.
PCT/CN2021/096980 2020-06-23 2021-05-28 Feature recognition method and apparatus, and computer device and storage medium WO2021259003A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010583878.8 2020-06-23
CN202010583878.8A CN111832631A (en) 2020-06-23 2020-06-23 Feature recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021259003A1 true WO2021259003A1 (en) 2021-12-30

Family

ID=72898031

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096980 WO2021259003A1 (en) 2020-06-23 2021-05-28 Feature recognition method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111832631A (en)
WO (1) WO2021259003A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114764603A (en) * 2022-05-07 2022-07-19 支付宝(杭州)信息技术有限公司 Method and device for determining characteristics aiming at user classification model and service prediction model

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832631A (en) * 2020-06-23 2020-10-27 平安科技(深圳)有限公司 Feature recognition method and device, computer equipment and storage medium
CN112286980B (en) * 2020-12-03 2021-08-17 北京口袋财富信息科技有限公司 Information pushing method and system based on user behaviors

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931224A (en) * 2016-04-14 2016-09-07 浙江大学 Pathology identification method for routine scan CT image of liver based on random forests
CN109241418A (en) * 2018-08-22 2019-01-18 中国平安人寿保险股份有限公司 Abnormal user recognition methods and device, equipment, medium based on random forest
US10284585B1 (en) * 2016-06-27 2019-05-07 Symantec Corporation Tree rotation in random classification forests to improve efficacy
CN111832631A (en) * 2020-06-23 2020-10-27 平安科技(深圳)有限公司 Feature recognition method and device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114764603A (en) * 2022-05-07 2022-07-19 Alipay (Hangzhou) Information Technology Co., Ltd. Method and apparatus for determining features for a user classification model and a service prediction model

Also Published As

Publication number Publication date
CN111832631A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
WO2021259003A1 (en) Feature recognition method and apparatus, and computer device and storage medium
US11256555B2 (en) Automatically scalable system for serverless hyperparameter tuning
Hien et al. A decision support system for evaluating international student applications
US11810000B2 (en) Systems and methods for expanding data classification using synthetic data generation in machine learning models
Benedetto et al. The creation and use of the SIPP Synthetic Beta
WO2021179445A1 (en) Conversation state prediction-based multi-round conversation method, device, and computer apparatus
EP4075281A1 (en) Ann-based program test method and test system, and application
Minku et al. Clustering Dycom: an online cross-company software effort estimation study
CN115879748B (en) Enterprise informatization management integrated platform based on big data
Lengler et al. The (1+1)-EA on noisy linear functions with random positive weights
US20230072297A1 (en) Knowledge graph based reasoning recommendation system and method
CN114781532A (en) Evaluation method and device of machine learning model, computer equipment and medium
US20210326475A1 (en) Systems and method for evaluating identity disclosure risks in synthetic personal data
Yet et al. Estimating criteria weight distributions in multiple criteria decision making: a Bayesian approach
WO2021212654A1 (en) Physical machine resource allocation model acquisition method and apparatus, and computer device
WO2023134072A1 (en) Default prediction model generation method and apparatus, device, and storage medium
CN113516189B (en) Website malicious user prediction method based on two-stage random forest algorithm
WO2022217712A1 (en) Data mining method and apparatus, and computer device and storage medium
CN115222081A (en) Academic resource prediction method and device and computer equipment
Blasco-Blasco et al. Characterization of university students through indicators of adequacy and excellence. Analysis from gender and socioeconomic status perspective
Ji et al. A two-stage feature weighting method for Naive Bayes and its application in software defect prediction
Kiekhaefer Simulation ranking and selection procedures and applications in network reliability design
Esmaeilzadeh et al. InfoMoD: Information-theoretic Model Diagnostics
CN116227995B (en) Index analysis method and system based on machine learning
Cerulli Methods Based on Selection on Observables

Legal Events

Date Code Title Description

121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 21830143; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 21830143; Country of ref document: EP; Kind code of ref document: A1