WO2020082865A1 - Feature selection method, apparatus, and device for constructing a machine learning model - Google Patents

Feature selection method, apparatus, and device for constructing a machine learning model Download PDF

Info

Publication number
WO2020082865A1
WO2020082865A1 (PCT/CN2019/101397)
Authority
WO
WIPO (PCT)
Prior art keywords
ranking
training data
features
index
importance
Prior art date
Application number
PCT/CN2019/101397
Other languages
English (en)
French (fr)
Inventor
唐渝洲
金宏
王维强
赵闻飙
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2020082865A1 publication Critical patent/WO2020082865A1/zh
Priority to US17/162,939 priority Critical patent/US11222285B2/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00Subject matter not provided for in other groups of this subclass
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24578Query processing with adaptation to user needs using ranking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning

Definitions

  • One or more embodiments of this specification relate to the field of computer technology, and in particular, to a feature selection method, apparatus, and device for constructing a machine learning model.
  • One or more embodiments of this specification describe a feature selection method, apparatus, and device for constructing a machine learning model, which can screen out more accurate features.
  • a feature selection method for building a machine learning model including:
  • a target feature is selected from the plurality of features.
  • a feature selection device for building a machine learning model including:
  • an acquiring unit, used to acquire the training data set
  • a splitting unit, configured to split the training data set acquired by the acquiring unit according to a preset splitting method to obtain k groups of training data subsets
  • an execution unit, configured to perform the following process k times in parallel on the k groups of training data subsets obtained by the splitting unit:
  • a fusion unit, configured to fuse the k*m groups of index rankings and k groups of importance rankings obtained from the k executions by the execution unit, to obtain a total ranking of the multiple features
  • the selecting unit is configured to select a target feature from the multiple features according to the total ranking obtained by the fusion unit.
  • a feature selection device for building a machine learning model including:
  • one or more processors; and
  • One or more programs wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, and when the programs are executed by the processors, the following steps are implemented:
  • a target feature is selected from the plurality of features.
  • The feature selection method, apparatus, and device for constructing a machine learning model provided by one or more embodiments of this specification acquire a training data set.
  • The training data set is split according to a preset splitting method to obtain k groups of training data subsets. For the k groups of training data subsets, the following process is performed k times in parallel: k-1 groups of training data subsets are selected from the k groups of training data subsets as the current training data set.
  • According to each evaluation index, the multiple features are sorted to obtain m groups of index rankings of the multiple features.
  • Based on the current training data set, a machine learning model is trained to predict a group of importance rankings of the multiple features.
  • The k*m groups of index rankings and k groups of importance rankings obtained from the k executions are fused to obtain a total ranking of the multiple features, and target features are selected from the multiple features according to the total ranking. It can be seen that, after the k groups of training data subsets are obtained by splitting, the selection of the current training data set, the index ranking of the multiple features, and the importance ranking are performed k times in parallel. Thus, the comprehensive performance of the multiple features on each group of training data subsets can be considered, and more accurate features can be screened out.
  • In addition, the feature selection method provided in this specification comprehensively considers multiple evaluation indexes of each feature, and thus can screen out more stable and more effective features.
  • Figure 1 is a schematic diagram of a feature selection system provided by this specification
  • FIG. 2 is a flowchart of a feature selection method for building a machine learning model provided by an embodiment of this specification
  • Figure 3 is a schematic diagram of the feature ranking fusion process provided by this specification.
  • FIG. 4 is a schematic diagram of a feature selection device for constructing a machine learning model provided by an embodiment of the present specification
  • FIG. 5 is a schematic diagram of a feature selection device for constructing a machine learning model provided by an embodiment of the present specification.
  • In the conventional art, the total training data set is first divided into multiple groups. Several groups of training data are then selected from the multiple groups, and features are selected based on those groups. It can be seen that this feature selection method only considers the performance of the features on part of the training data, and does not consider the comprehensive performance of the features on each group of training data. Therefore, the features selected by this method are usually not stable enough.
  • The solution provided in this specification can draw on the practice of k-fold cross validation.
  • The main idea of k-fold cross-validation is as follows: the initial sample is divided into k sub-samples; a single sub-sample is retained as the data for validating the model, and the other k-1 sub-samples are used for training. The cross-validation is repeated k times, each sub-sample being used for validation once, and the k results are averaged or combined in some other way to finally obtain a single estimate. Since this solution is for selecting features rather than for training a model, it may adopt only the sample-division idea and the k-repetition idea of k-fold cross-validation.
  • The training data set may be split into k groups, and the steps of selecting training data subsets from the k groups and ranking the features are then performed k times, where the training data subsets selected each time are k-1 groups.
  • For example, suppose k is 4 and the 4 groups of training data subsets obtained by splitting are training subsets 1-4.
  • The training data subsets selected the first time may be training subsets 2-4, with training subset 1 serving as the test set.
  • The training data subsets selected the second time may be training subset 1 and training subsets 3-4, with training subset 2 serving as the test set.
  • The training data subsets selected the third time may be training subsets 1-2 and training subset 4, with training subset 3 serving as the test set.
  • The training data subsets selected the fourth time may be training subsets 1-3, with training subset 4 serving as the test set.
  • When the selection step is performed k times, every group of training data subsets is selected. That is, the comprehensive performance of the multiple features on each group of training data subsets can be considered, and more accurate features can be screened out.
  • Features can be screened by the filter-based feature selection method.
  • Filter-based feature selection mainly computes an evaluation index for each feature based on the training data set, and then screens features based on that index.
  • This solution comprehensively considers multiple evaluation indexes for each feature.
  • The above evaluation indexes may include, but are not limited to, information value (IV), Gini coefficient (GINI), information gain (IG), mutual information (MI), Relief score, sample stability index (PSI), and the like. It should be noted that the calculation methods of these evaluation indexes are conventional and are not repeated here.
  • The core idea of recursive feature elimination is: in the first round, a model is trained on all features to obtain the importance of all features; in the next round, the single least important feature is removed, the model is retrained on the remaining features, and the importance of the remaining features is obtained; the single least important feature is then again removed from the remaining features, and so on, until the specified number of features is obtained. It can be understood that when the total number of features is 100 and the specified number is 50, the above model training process needs to be performed for 50 rounds.
  • To reduce the consumption of computing resources, this solution may instead eliminate N (e.g., 10) unimportant features after each round of model training; with N = 10, the preceding example needs only 5 rounds.
  • Although this solution increases the number of features eliminated per round, the selection of training data subsets and the feature rankings are performed k times in parallel within each round of feature screening, so the accuracy and stability of the features selected by this solution are not affected.
  • the feature selection method for constructing a machine learning model can be applied to the feature selection system 10 shown in FIG. 1.
  • the feature selection system 10 may include: a data module 102, a function module 104, an analysis module 106 and a decision module 108.
  • the data module 102 is used to divide the training data set into k sets of training data subsets according to a preset splitting method.
  • the preset splitting method here may include, but is not limited to, a time splitting method and a random splitting method.
  • the function module 104 is used to perform the following process k times: selecting k-1 training data subsets from k training data subsets. Based on the selected training data subset, m evaluation indexes of multiple features are calculated. According to each evaluation index, the multiple features are sorted to obtain m sets of multiple feature index rankings. In addition, based on the selected training data subset, a machine learning model is trained to predict the importance ranking of a set of multiple features.
  • The analysis module 106 is used to fuse the index rankings and importance rankings of each feature. Specifically, the k*m groups of index rankings and k groups of importance rankings obtained from the k executions are fused to obtain a total ranking of the multiple features. In addition, index derivation and index fusion may be performed according to the evaluation indexes of each feature calculated by the function module 104. Index derivation means deriving other indexes from the currently calculated evaluation indexes of a feature; for example, the rate of change of the IV value is derived from the k groups of IV values of a feature. Index fusion means fusing multiple evaluation indexes of a feature; for example, the k groups of IV values of a feature are fused into one IV value. The fusion here may take the maximum, minimum, average, or the like of the k groups of IV values.
  • the decision module 108 is used to select target features from multiple features according to the overall ranking of each feature.
  • In practice, target features may also be selected in combination with other configuration information, such as pre-configured variable information (e.g., variable metadata (metaData) and the classification to which a variable belongs) and screening conditions.
  • The variable information configured here facilitates the subsequent configuration of fine-grained screening conditions.
  • The selection methods for the above features may include, but are not limited to, the following two: direct culling and iterative culling.
  • Direct culling means directly eliminating, in one pass according to hard conditions, the features that do not meet the conditions, and screening out the target features that meet the requirements. Iterative culling means iteratively performing the feature screening process multiple times (or for multiple rounds), in which N unimportant features are eliminated in each round of feature screening.
  • FIG. 2 is a flowchart of a feature selection method for building a machine learning model provided by an embodiment of the present specification.
  • the execution subject of the method may be the feature selection system in FIG. 1. As shown in FIG. 2, the method may specifically include:
  • Step 202 Obtain a training data set.
  • The training data set here may be transaction records of multiple users.
  • A record may include user information, transaction amount, transaction time, and the like.
  • the training data set here may be a filtered training data set.
  • Step 204 Split the training data set according to a preset splitting method to obtain k sets of training data subsets.
  • The preset splitting method may include, but is not limited to, a time-based splitting method and a random splitting method. Taking the time-based splitting method as an example, suppose the record times of the training data in the training data set range from January 1, 2017 to January 30, 2017. Then, when k is 3, the training data from January 1, 2017 to January 10, 2017 can be split into one group; the training data from January 11, 2017 to January 20, 2017 into another group; and the training data from January 21, 2017 to January 30, 2017 into a third group.
  • steps 202 and 204 may be performed by the data module 102.
  • Step 206: Perform steps a-d k times in parallel.
  • Step a: Select k-1 groups of training data subsets from the k groups of training data subsets as the current training data set.
  • When the selection step is performed k times, every group of training data subsets is selected. That is, the comprehensive performance of the multiple features on each group of training data subsets can be considered, and more accurate features can be screened out.
  • Step b According to the current training data set, calculate m evaluation indicators of multiple features to be filtered.
  • The multiple features to be screened may be preset by data analysts and data mining engineers based on business experience and their understanding of the data. They may be, for example, a user's identity information, or the user's number of transactions in the past several days, and so on.
  • The above evaluation indexes can be used to characterize the absolute importance of a feature and are independent of other features. They may include, but are not limited to, IV, GINI, IG, MI, Relief score, PSI, and the like.
  • m evaluation indicators may be counted, where m is a positive integer.
  • Feature      CV1_IV     CV1_GINI     CV1_IG
    Feature 1    CV1_IV1    CV1_GINI1    CV1_IG1
    Feature 2    CV1_IV2    CV1_GINI2    CV1_IG2
    Feature 3    CV1_IV3    CV1_GINI3    CV1_IG3
  • each evaluation index in Table 1 is calculated based on the training data subset (represented as CV1) selected once. It can be understood that when calculating various evaluation indexes of each feature based on the training data subset selected k times, k sets of data shown in Table 1 can be obtained.
  • Step c Sort multiple features according to each evaluation index, so as to obtain an index ranking of m sets of multiple features.
  • the sorting result can be: feature 1, feature 2, feature 3.
  • a set of index rankings for multiple features can be obtained: ⁇ 1,2,3 ⁇ , where the first digit represents the index ranking corresponding to feature 1 and the second digit represents the index ranking corresponding to feature 2, And so on.
  • the index rankings of m groups of multiple characteristics can be obtained.
  • m groups of index rankings can be obtained from just one selection of training data subsets. Then, when step c is performed k times, k*m groups of index rankings can be obtained. That is, based on the training data subsets selected k times, k*m groups of index rankings can be obtained.
  • Step d Based on the current training data set, train a machine learning model to predict the importance ranking of a set of multiple features.
  • the importance ranking here is based on the relative importance of each feature. Relative importance, as the name implies, is relative to other features, that is, related to other features. Specifically, when the machine learning model is trained, the importance ranking result of the features may be output after the model is trained. According to the importance ranking result, the importance ranking of a set of multiple features can be obtained. For example, suppose there are 3 features: features 1-3, and the importance ranking results of the 3 features are: feature 2, feature 3, and feature 1. According to the importance ranking result, the importance ranking of a set of features 1-3 can be obtained: ⁇ 3,1,2 ⁇ .
  • the execution order of steps b-c and step d may be interchanged, or they may be executed in parallel, which is not limited in this specification.
  • steps a-d may be performed by the function module 104.
  • Step 208: The k*m groups of index rankings and k groups of importance rankings obtained from the k executions are fused to obtain a total ranking of the multiple features.
  • In one implementation, the k*m groups of index rankings and the k groups of importance rankings can be directly fused to obtain the total ranking of the multiple features.
  • In another implementation, the k*m groups of index rankings may first be fused to obtain a total index ranking of the multiple features, and the k groups of importance rankings fused to obtain a total importance ranking of the multiple features. After that, the total index ranking and the total importance ranking are fused to obtain the total ranking of the multiple features.
  • The total index ranking may be obtained as follows: extract, from the k*m groups of index rankings, the k groups of index rankings obtained according to the same evaluation index.
  • According to the first ranking fusion algorithm, each feature's corresponding rankings in the k groups of index rankings are fused to obtain the feature's comprehensive index ranking corresponding to that evaluation index.
  • The extraction and fusion steps are repeated until m comprehensive index rankings, corresponding to the m evaluation indexes, are obtained for each feature.
  • According to the second ranking fusion algorithm, the m comprehensive index rankings of each feature are respectively fused to obtain the total index ranking of each feature.
  • The first ranking fusion algorithm or the second ranking fusion algorithm may include, but is not limited to, a mean algorithm, a maximum algorithm, a minimum algorithm, a weighted average algorithm, a robust rank aggregation (RRA) algorithm, and so on. It can be understood that the first ranking fusion algorithm and the second ranking fusion algorithm may be the same or different. In this specification, the two are taken to be the same, namely the mean algorithm, as an example.
  • the following is an example of the process of obtaining the ranking of each indicator and the ranking of the overall indicator.
  • In Table 2, taking the second column as an example, the numbers in each row of the second column indicate each feature's ranking after the features are sorted by the IV values obtained from CV1, i.e., a group of index rankings of the features corresponding to the IV value.
  • In Table 3, taking the second column as an example, the numbers in each row of the second column indicate each feature's ranking after the features are sorted by the GINI values obtained from CV1, i.e., a group of index rankings of the features corresponding to the GINI value.
  • In Table 4, taking the second column as an example, the numbers in each row of the second column indicate each feature's ranking after the features are sorted by the IG values obtained from CV1, i.e., a group of index rankings of the features corresponding to the IG value.
  • the total indicator ranking of each feature can be obtained, as shown in Table 5.
  • the importance ranking of each feature can also be obtained.
  • According to the third ranking fusion algorithm, each feature's corresponding rankings in the k groups of importance rankings may be respectively fused to obtain the total importance ranking of each feature.
  • The definition of the third ranking fusion algorithm here may be the same as that of the first or second ranking fusion algorithm above, which is not repeated here.
  • The importance rankings obtained may be as shown in Table 6.
  • In Table 6, taking the second column as an example, the numbers in each row of the second column indicate the importance ranking of each feature output by the machine learning model after it is trained based on CV1, i.e., a group of importance rankings of the features.
  • the overall ranking of each feature can be obtained.
  • the overall index ranking and the overall importance ranking may be fused to obtain the overall ranking of multiple features.
  • the definition of the fourth ranking fusion algorithm here may be the same as that of the first or second ranking fusion algorithm above, which is not repeated here.
  • the total ranking obtained may be as shown in Table 7.
  • step 208 may be performed by the analysis module 106.
  • Step 210 Select target features from multiple features based on the overall ranking.
  • the decision module 108 may perform the screening in combination with pre-configured variable information or screening conditions.
  • When the decision module 108 adopts the iterative culling feature selection method, the above steps 202-210 may be executed repeatedly until the specified number of target features is obtained, with N unimportant features eliminated in each round of feature screening.
  • For the specific fusion process in the other implementation described above, refer to FIG. 3, in which k = 4.
  • The upper-left part of FIG. 3 shows the fusion process of the four groups of index rankings corresponding to the same evaluation index (e.g., IV, GINI, or IG) for each feature; the comprehensive index rankings finally obtained include the IV comprehensive ranking, the GINI comprehensive ranking, the IG comprehensive ranking, and so on.
  • the upper right shows the fusion process of the four groups of importance rankings of each feature, and finally the total importance ranking of each feature is obtained.
  • the bottom part shows that the index ranking of each feature is first fused to obtain the total index ranking. After that, the overall index ranking and the overall importance ranking are merged to obtain the overall ranking of each feature.
  • the target features selected through the embodiments of this specification can be used to construct machine learning models, such as risk control models (models used to identify and prevent risks such as account misappropriation, fraud, and cheating).
  • the feature selection method for constructing a machine learning model provided by the embodiments of this specification makes it possible to consider the comprehensive performance of multiple features on each group of training data subsets, and thus more accurate features can be screened out.
  • the feature selection method provided in this specification also comprehensively considers the absolute importance (e.g., the evaluation indexes) and the relative importance of each feature, so that more stable and more effective features can be selected.
  • an embodiment of this specification also provides a feature selection apparatus for constructing a machine learning model.
  • the apparatus may include:
  • the obtaining unit 402 is used to obtain a training data set.
  • the splitting unit 404 is configured to split the training data set acquired by the acquiring unit 402 according to a preset splitting method to obtain k sets of training data subsets.
  • the preset splitting method here includes any one of the following: a time splitting method and a random splitting method.
  • the execution unit 406 is configured to perform the following process k times in parallel on the k sets of training data subsets obtained by the splitting unit 404:
  • the multiple features are sorted according to each evaluation index to obtain m groups of index rankings of the multiple features.
  • based on the current training data set, a machine learning model is trained to predict a group of importance rankings of the multiple features.
  • the above evaluation indexes may include several of: information value (IV), Gini coefficient (GINI), information gain (IG), mutual information (MI), Relief score, and PSI.
  • the fusion unit 408 is configured to fuse the k*m groups of index rankings and k groups of importance rankings obtained from the k executions by the execution unit 406, to obtain a total ranking of the multiple features.
  • the selecting unit 410 is used to select target features from multiple features according to the overall ranking obtained by the fusion unit 408.
  • the fusion unit 408 may be specifically used for:
  • the k*m groups of index rankings are fused to obtain a total index ranking of the multiple features.
  • the k groups of importance rankings are fused to obtain a total importance ranking of the multiple features.
  • the fusion unit 408 can also be specifically used for:
  • the corresponding rankings of each feature in the k groups of index rankings are respectively fused to obtain each feature's comprehensive index ranking corresponding to the evaluation index.
  • the m comprehensive index rankings of each feature are fused to obtain the total index ranking of each feature.
  • the first ranking fusion algorithm or the second ranking fusion algorithm may include any one of the following: a mean algorithm, a maximum algorithm, a minimum algorithm, a weighted average algorithm, and a robust rank aggregation (RRA) algorithm.
  • the fusion unit 408 can also be specifically used for:
  • the corresponding rankings of each feature in the k group importance rankings are respectively fused to obtain the total importance ranking of each feature.
  • the fusion unit 408 can also be specifically used for:
  • the total index ranking and the total importance ranking are fused to obtain the total ranking of multiple features.
  • the functions of the acquiring unit 402 and the splitting unit 404 can be implemented by the data module 102.
  • the function of the execution unit 406 can be realized by the function module 104.
  • the function of the fusion unit 408 can be realized by the analysis module 106.
  • the function of the selection unit 410 can be implemented by the decision module 108.
  • the feature selection device for constructing a machine learning model provided by an embodiment of this specification can filter out more stable and more effective features.
  • the device may include: a memory 502, one or more processors 504, and one or more programs.
  • the one or more programs are stored in the memory 502, and are configured to be executed by one or more processors 504.
  • when the programs are executed by the processors 504, the following steps are implemented:
  • the training data set is split according to a preset splitting method to obtain k groups of training data subsets.
  • the multiple features are sorted according to each evaluation index to obtain m groups of index rankings of the multiple features.
  • based on the current training data set, a machine learning model is trained to predict a group of importance rankings of the multiple features.
  • the k*m groups of index rankings and k groups of importance rankings obtained from the k executions are fused to obtain a total ranking of the multiple features.
  • a feature selection device for constructing a machine learning model provided by an embodiment of this specification can filter out more stable and effective features.
  • the steps of the method or algorithm described in conjunction with the disclosure of the present specification may be implemented in hardware, or may be implemented by a processor executing software instructions.
  • the software instructions can be composed of corresponding software modules, which can be stored in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor so that the processor can read information from the storage medium and can write information to the storage medium.
  • the storage medium may also be an integral part of the processor.
  • the processor and the storage medium may be located in the ASIC.
  • the ASIC may be located in the server.
  • the processor and the storage medium may also exist as discrete components in the server.
  • Computer-readable media includes computer storage media and communication media, where communication media includes any medium that facilitates transfer of a computer program from one place to another.
  • the storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of this specification provides a feature selection method, apparatus, and device for constructing a machine learning model. In the feature selection method, a filtered training data set is acquired. The training data set is split according to a preset splitting method to obtain k groups of training data subsets. For the k groups of training data subsets, the following process is performed k times in parallel: k-1 groups of training data subsets are selected from the k groups of training data subsets as the current training data set; m evaluation indexes of multiple features to be screened are calculated according to the current training data set; the multiple features are sorted according to each evaluation index to obtain m groups of index rankings of the multiple features; and a machine learning model is trained based on the current training data set to predict a group of importance rankings of the multiple features. The k*m groups of index rankings and k groups of importance rankings obtained from the k executions are fused to obtain a total ranking of the multiple features, and a target feature is selected from the multiple features according to the total ranking.

Description

Feature selection method, apparatus, and device for constructing a machine learning model

Technical Field
One or more embodiments of this specification relate to the field of computer technology, and in particular to a feature selection method, apparatus, and device for constructing a machine learning model.
Background
To construct a machine learning model with optimal performance, data analysts and data mining engineers usually derive features (also called variables) of many dimensions by brute force, based on business experience and their understanding of the data. However, this process often produces much redundant, trivial information that has little value for the machine learning model to be constructed and may even have side effects. Therefore, in the process of constructing a machine learning model, continual experimentation and careful feature screening are needed in order to finally construct an optimal machine learning model.
As for the above feature screening process, performing it manually is usually very labor-intensive and slows down model construction, so it is usually performed in an automated way. In the conventional art, there are mainly the following automated feature selection approaches: filter-based feature selection, embedded feature selection, and wrapper-based feature selection. When screening features, these approaches usually only consider the performance of features on part of the data splits.
Therefore, a feature selection approach that can screen out more accurate features is needed.
Summary
One or more embodiments of this specification describe a feature selection method, apparatus, and device for constructing a machine learning model, which can screen out more accurate features.
In a first aspect, a feature selection method for constructing a machine learning model is provided, including:
acquiring a training data set;
splitting the training data set according to a preset splitting method, to obtain k groups of training data subsets;
performing, on the k groups of training data subsets, the following process k times in parallel:
selecting k-1 groups of training data subsets from the k groups of training data subsets as a current training data set;
calculating, according to the current training data set, m evaluation indexes of multiple features to be screened;
sorting the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features;
training, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features;
fusing the k*m groups of index rankings and k groups of importance rankings obtained from the k executions, to obtain a total ranking of the multiple features; and
selecting a target feature from the multiple features according to the total ranking.
In a second aspect, a feature selection apparatus for constructing a machine learning model is provided, including:
an acquiring unit, configured to acquire a training data set;
a splitting unit, configured to split the training data set acquired by the acquiring unit according to a preset splitting method, to obtain k groups of training data subsets;
an execution unit, configured to perform, on the k groups of training data subsets obtained by the splitting unit, the following process k times in parallel:
selecting k-1 groups of training data subsets from the k groups of training data subsets as a current training data set;
calculating, according to the current training data set, m evaluation indexes of multiple features to be screened;
sorting the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features;
training, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features;
a fusion unit, configured to fuse the k*m groups of index rankings and k groups of importance rankings obtained from the k executions by the execution unit, to obtain a total ranking of the multiple features; and
a selecting unit, configured to select a target feature from the multiple features according to the total ranking obtained by the fusion unit.
In a third aspect, a feature selection device for constructing a machine learning model is provided, including:
a memory;
one or more processors; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and when executed by the processors, the programs implement the following steps:
acquiring a training data set;
splitting the training data set according to a preset splitting method, to obtain k groups of training data subsets;
performing, on the k groups of training data subsets, the following process k times in parallel:
selecting k-1 groups of training data subsets from the k groups of training data subsets as a current training data set;
calculating, according to the current training data set, m evaluation indexes of multiple features to be screened;
sorting the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features;
training, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features;
fusing the k*m groups of index rankings and k groups of importance rankings obtained from the k executions, to obtain a total ranking of the multiple features; and
selecting a target feature from the multiple features according to the total ranking.
The feature selection method, apparatus, and device for constructing a machine learning model provided by one or more embodiments of this specification acquire a training data set. The training data set is split according to a preset splitting method to obtain k groups of training data subsets. For the k groups of training data subsets, the following process is performed k times in parallel: k-1 groups of training data subsets are selected from the k groups of training data subsets as the current training data set; m evaluation indexes of multiple features to be screened are calculated according to the current training data set; the multiple features are sorted according to each evaluation index to obtain m groups of index rankings of the multiple features; and a machine learning model is trained based on the current training data set to predict a group of importance rankings of the multiple features. The k*m groups of index rankings and k groups of importance rankings obtained from the k executions are fused to obtain a total ranking of the multiple features, and target features are selected from the multiple features according to the total ranking. It can be seen that, after the k groups of training data subsets are obtained by splitting, the selection of the current training data set, the index ranking of the multiple features, and the importance ranking are performed k times in parallel. Thus, the comprehensive performance of the multiple features on each group of training data subsets can be considered, and more accurate features can be screened out. In addition, the feature selection method provided in this specification comprehensively considers multiple evaluation indexes of each feature, and thus can screen out more stable and more effective features.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this specification more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are merely some embodiments of this specification, and a person of ordinary skill in the art may further derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a feature selection system provided by this specification;
FIG. 2 is a flowchart of a feature selection method for constructing a machine learning model provided by an embodiment of this specification;
FIG. 3 is a schematic diagram of a feature ranking fusion process provided by this specification;
FIG. 4 is a schematic diagram of a feature selection apparatus for constructing a machine learning model provided by an embodiment of this specification;
FIG. 5 is a schematic diagram of a feature selection device for constructing a machine learning model provided by an embodiment of this specification.
Detailed Description
The solutions provided by this specification are described below with reference to the accompanying drawings.
Before introducing the feature selection method for constructing a machine learning model provided by one or more embodiments of this specification, the inventive concept of the method is described as follows.
Regarding the division of the training data set, in the conventional art, the total training data set is first divided into multiple groups. Several groups of training data are then selected from the multiple groups, and features are selected based on those groups. It can be seen that this feature selection method only considers the performance of the features on part of the training data, and does not consider the comprehensive performance of the features on each group of training data. Therefore, the features selected by this method are usually not stable enough.
To improve the stability of the selected features, the solution provided in this specification (the present solution for short) can draw on the practice of k-fold cross validation. The main idea of k-fold cross-validation is as follows: the initial sample is divided into k sub-samples; a single sub-sample is retained as the data for validating the model, and the other k-1 sub-samples are used for training. The cross-validation is repeated k times, each sub-sample being used for validation once, and the k results are averaged or combined in some other way to finally obtain a single estimate. Since the present solution is for selecting features rather than for training a model, it may adopt only the sample-division idea and the k-repetition idea of k-fold cross-validation. Specifically, the training data set may be split into k groups, and the steps of selecting training data subsets from the k groups and ranking the features are then performed k times, where the training data subsets selected each time are k-1 groups.
For example, suppose k is 4 and the 4 groups of training data subsets obtained by splitting are training subsets 1-4. Then the training data subsets selected the first time may be training subsets 2-4, with training subset 1 serving as the test set; the training data subsets selected the second time may be training subset 1 and training subsets 3-4, with training subset 2 serving as the test set; the training data subsets selected the third time may be training subsets 1-2 and training subset 4, with training subset 3 serving as the test set; and the training data subsets selected the fourth time may be training subsets 1-3, with training subset 4 serving as the test set.
It should be noted that, after the k groups of training data subsets are obtained by splitting, performing the selection step k times ensures that every group of training data subsets is selected. That is, the comprehensive performance of the multiple features on each group of training data subsets can be considered, and more accurate features can be screened out.
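For illustration only, the k selections described above can be sketched in Python as follows; the helper names are assumptions of this sketch, not part of this specification:

    # A minimal sketch of the k-fold-style subset selection described above.
    from typing import List, Tuple

    def split_into_k_groups(dataset: List, k: int) -> List[List]:
        """Round-robin split into k roughly equal groups (stands in for the
        preset splitting method)."""
        return [dataset[i::k] for i in range(k)]

    def select_subsets(groups: List[List]) -> List[Tuple[List, List]]:
        """For each of the k rounds, use k-1 groups as the current training
        data set and hold out the remaining group as the test set."""
        selections = []
        for held_out in range(len(groups)):
            current = [rec for g, grp in enumerate(groups)
                       if g != held_out for rec in grp]
            selections.append((current, groups[held_out]))
        return selections

    groups = split_into_k_groups(list(range(100)), k=4)
    for current_train, test in select_subsets(groups):
        pass  # compute evaluation indexes and train a model on current_train

Since every group is held out exactly once, every group also participates in k-1 of the k current training data sets, which is what lets the scheme consider each feature's performance on all the subsets.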
In addition, as described in the background, features can be screened by the filter-based feature selection method. Filter-based feature selection mainly computes a certain evaluation index for each feature based on the training data set, and then screens features based on that index. However, when feature screening relies on only one evaluation index, the selected features are usually not stable enough. Therefore, the present solution comprehensively considers multiple evaluation indexes for each feature.
It should be noted that the above evaluation indexes may include, but are not limited to, information value (IV), Gini coefficient (GINI), information gain (IG), mutual information (MI), Relief score, sample stability index (PSI), and the like. The calculation methods of these evaluation indexes are conventional and are not repeated here.
Finally, it should be noted that one execution of the present solution completes one round of feature screening. When the number of features to be screened is relatively large, the feature screening process usually needs to be executed iteratively multiple times (or for multiple rounds), that is, the present solution needs to be executed multiple times, to avoid missing some important features when the screening is completed in one pass. Specifically, a certain number of features can be eliminated in each round of screening. The elimination of features can draw on the idea of recursive feature elimination, whose core idea is: in the first round, a model is trained on all features to obtain the importance of all features; in the next round, the single least important feature is removed, the model is retrained on the remaining features, and the importance of the remaining features is obtained; the single least important feature is then again removed from the remaining features, and so on, until the specified number of features is obtained. It can be understood that when the total number of features is 100 and the specified number is 50, the above model training process needs to be performed for 50 rounds.
However, the number of features to be screened is usually in the thousands or tens of thousands, while the specified number may be a few hundred; eliminating only one feature per round as above would consume enormous computing resources. Therefore, to reduce the consumption of computing resources, the present solution may eliminate N (e.g., 10) unimportant features after each round of model training. Taking N = 10 as an example, the model training process in the preceding example needs to be executed for only 5 rounds. It should be noted that, although the present solution increases the number of features eliminated per round, the selection of training data subsets and the feature rankings are performed k times in parallel within each round of feature screening, so the accuracy and stability of the features selected by the present solution are not affected.
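A minimal sketch of this N-per-round elimination loop; the `rank_features` callable is a hypothetical stand-in for the full k-round ranking-and-fusion procedure described later:

    def iterative_elimination(features, rank_features, target_count, n_per_round=10):
        """Repeat the whole selection procedure, dropping the N least
        important features each round, until `target_count` remain.
        `rank_features` maps a feature list to {feature: total rank},
        where a smaller rank means a more important feature."""
        while len(features) > target_count:
            ranking = rank_features(features)
            survivors = sorted(features, key=lambda f: ranking[f])  # best first
            n_drop = min(n_per_round, len(features) - target_count)
            features = survivors[:len(features) - n_drop]
        return features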
The above is the inventive concept of the solution provided by this specification; based on this inventive concept, the solution provided by this specification can be obtained. The solution is further elaborated below:
The feature selection method for constructing a machine learning model provided by one or more embodiments of this specification can be applied to the feature selection system 10 shown in FIG. 1. In FIG. 1, the feature selection system 10 may include: a data module 102, a function module 104, an analysis module 106, and a decision module 108.
The data module 102 is configured to divide the training data set into k groups of training data subsets according to a preset splitting method. The preset splitting method here may include, but is not limited to, a time-based splitting method, a random splitting method, and the like.
The function module 104 is configured to execute the following process k times: select k-1 groups of training data subsets from the k groups of training data subsets; calculate m evaluation indexes of multiple features based on the selected training data subsets; sort the multiple features according to each evaluation index, thereby obtaining m groups of index rankings of the multiple features; and, based on the selected training data subsets, train a machine learning model to predict a group of importance rankings of the multiple features.
The analysis module 106 is configured to fuse the index rankings and importance rankings of each feature. Specifically, the k*m groups of index rankings and k groups of importance rankings obtained from the k executions are fused to obtain a total ranking of the multiple features. In addition, index derivation and index fusion may be performed according to the evaluation indexes of each feature calculated by the function module 104. Index derivation means deriving other indexes from the currently calculated evaluation indexes of a feature; for example, the rate of change of the IV value is derived from the k groups of IV values of a feature. Index fusion means fusing multiple evaluation indexes of a feature; for example, the k groups of IV values of a feature are fused into one IV value. The fusion here may take the maximum, minimum, average, or the like of the k groups of IV values.
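As a rough illustration of the index derivation and index fusion performed by the analysis module 106, a sketch under the assumption that a feature's k IV values are held in a list (the helper names and sample values are illustrative only):

    import statistics

    def derive_iv_change_rate(iv_values):
        """Index derivation: rate of change between consecutive IV values."""
        return [(b - a) / a for a, b in zip(iv_values, iv_values[1:]) if a != 0]

    def fuse_iv(iv_values, how="mean"):
        """Index fusion: collapse k IV values into a single IV value."""
        return {"max": max, "min": min, "mean": statistics.mean}[how](iv_values)

    iv_values = [0.12, 0.15, 0.11, 0.14]        # hypothetical k=4 IV values
    change_rate = derive_iv_change_rate(iv_values)
    fused_iv = fuse_iv(iv_values, how="mean")   # 0.13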
The decision module 108 is configured to select target features from the multiple features according to the total ranking of each feature. Of course, in practical applications, other configuration information may also be combined to select target features. The configuration information here may include pre-configured variable information (e.g., variable metadata (metaData) and the classification to which a variable belongs) and screening conditions (e.g., IV > 0.01, MAX_PSI < 0.25, TOP_N = 100). It should be noted that the variable information configured here facilitates the subsequent configuration of fine-grained screening conditions. In addition, the selection methods for the above features may include, but are not limited to, the following two: direct culling and iterative culling. Direct culling means directly eliminating, in one pass according to hard conditions, the features that do not meet the conditions, and screening out the target features that meet the requirements. Iterative culling means iteratively performing the feature screening process multiple times (or for multiple rounds), in which N unimportant features are eliminated in each round of feature screening.
FIG. 2 is a flowchart of a feature selection method for constructing a machine learning model provided by an embodiment of this specification. The method may be executed by the feature selection system in FIG. 1. As shown in FIG. 2, the method may specifically include:
Step 202: Acquire a training data set.
Taking the constructed machine learning model being a risk control model (a model for identifying and preventing risks such as account misappropriation, fraud, and cheating) as an example, the training data set here may be transaction records of multiple users, and a transaction record may include information such as user information, transaction amount, and transaction time. In addition, the training data set here may be a training data set that has been filtered.
Step 204: Split the training data set according to a preset splitting method to obtain k groups of training data subsets.
Here, k may be a positive integer greater than 1. The preset splitting method may include, but is not limited to, a time-based splitting method, a random splitting method, and the like. Taking the time-based splitting method as an example, suppose the record times of the training data in the training data set range from January 1, 2017 to January 30, 2017. Then, when k is 3, the training data from January 1, 2017 to January 10, 2017 can be split into one group; the training data from January 11, 2017 to January 20, 2017 into another group; and the training data from January 21, 2017 to January 30, 2017 into a third group.
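A minimal sketch of such a time-based split, assuming each record carries a `time` field; the helper name `split_by_time` is illustrative only:

    from datetime import date

    def split_by_time(records, k):
        """Split records into k consecutive, roughly equal-sized groups
        ordered by record time."""
        records = sorted(records, key=lambda r: r["time"])
        size = -(-len(records) // k)  # ceiling division
        return [records[i * size:(i + 1) * size] for i in range(k)]

    records = [{"time": date(2017, 1, d), "amount": 100 + d} for d in range(1, 31)]
    groups = split_by_time(records, k=3)  # Jan 1-10, Jan 11-20, Jan 21-30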
It should be noted that the above step 202 and step 204 may be executed by the data module 102.
Step 206: Perform steps a-d k times in parallel.
Step a: Select k-1 groups of training data subsets from the k groups of training data subsets as the current training data set.
As in the preceding example, 2 of the groups of training data subsets may be selected as the current training data set. It should be noted that, when the selection step is performed k times, every group of training data subsets is selected. That is, the comprehensive performance of the multiple features on each group of training data subsets can be considered, and more accurate features can be screened out.
Step b: Calculate, according to the current training data set, m evaluation indexes of multiple features to be screened.
The multiple features to be screened may be preset by data analysts and data mining engineers based on business experience and their understanding of the data. They may be, for example, a user's identity information, or the user's number of transactions in the past several days, and so on.
The above evaluation indexes can be used to characterize the absolute importance of a feature and are independent of other features. They may include, but are not limited to, IV, GINI, IG, MI, Relief score, PSI, and the like. In this embodiment, m evaluation indexes may be computed, where m is a positive integer. By comprehensively considering multiple evaluation indexes of each feature, the stability and effectiveness of the screened target features can be ensured.
Taking the features to be screened being features 1-3 and the evaluation indexes being IV, GINI, and IG as an example, the calculation results for the three features may be as shown in Table 1.
Table 1

Feature      CV1_IV     CV1_GINI     CV1_IG
Feature 1    CV1_IV1    CV1_GINI1    CV1_IG1
Feature 2    CV1_IV2    CV1_GINI2    CV1_IG2
Feature 3    CV1_IV3    CV1_GINI3    CV1_IG3

It should be noted that the evaluation indexes in Table 1 are calculated based on only one selection of training data subsets (denoted CV1). It can be understood that, when the evaluation indexes of each feature are calculated based on the training data subsets selected k times, k groups of data as shown in Table 1 can be obtained.
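For illustration, step b might be sketched as follows; the binned IV implementation is a common textbook formulation and an assumption of this sketch (not this specification's exact formula), while the mutual information values come from scikit-learn's `mutual_info_classif`:

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def information_value(x, y, bins=10):
        """IV = sum over bins of (good% - bad%) * ln(good% / bad%)."""
        edges = np.quantile(x, np.linspace(0, 1, bins + 1))
        idx = np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)
        iv = 0.0
        for b in range(bins):
            good = ((idx == b) & (y == 0)).sum() / max((y == 0).sum(), 1)
            bad = ((idx == b) & (y == 1)).sum() / max((y == 1).sum(), 1)
            if good > 0 and bad > 0:
                iv += (good - bad) * np.log(good / bad)
        return iv

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))                    # hypothetical features 1-3
    y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
    ivs = [information_value(X[:, j], y) for j in range(3)]
    mis = mutual_info_classif(X, y, random_state=0)  # one MI value per feature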
Step c: Sort the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features.
Taking Table 1 as an example, when the features are sorted according to IV and assuming CV1_IV1 > CV1_IV2 > CV1_IV3, the sorting result may be: feature 1, feature 2, feature 3. From this sorting result, a group of index rankings of the multiple features can be obtained: {1, 2, 3}, where the first number represents the index ranking corresponding to feature 1, the second number represents the index ranking corresponding to feature 2, and so on. Similarly, based on the m ranking indexes, m groups of index rankings of the multiple features can be obtained.
It can be understood that m groups of index rankings can be obtained from just one selection of training data subsets. Then, when step c is performed k times, k*m groups of index rankings can be obtained. That is, based on the training data subsets selected k times, k*m groups of index rankings can be obtained.
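A minimal sketch of step c, using `scipy.stats.rankdata` on hypothetical CV1 metric values (ranking the negated values makes larger metric values rank first):

    import numpy as np
    from scipy.stats import rankdata

    metrics = {                      # hypothetical CV1 values for features 1-3
        "IV":   np.array([0.30, 0.22, 0.15]),
        "GINI": np.array([0.41, 0.18, 0.52]),
        "IG":   np.array([0.09, 0.04, 0.07]),
    }
    index_rankings = {name: rankdata(-values, method="ordinal")
                      for name, values in metrics.items()}
    # e.g. index_rankings["IV"] == array([1, 2, 3])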
Step d: Train, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features.
The importance ranking here is obtained according to the relative importance of each feature. Relative importance, as the name implies, is importance relative to other features, i.e., it is related to other features. Specifically, when the machine learning model is trained, it can be configured to output a result ordering the features by importance after training. From this importance-ordering result, a group of importance rankings of the multiple features can be obtained. For example, suppose there are 3 features, features 1-3, and the importance-ordering result of the 3 features is: feature 2, feature 3, feature 1. From this result, a group of importance rankings of features 1-3 can be obtained: {3, 1, 2}.
It can be understood that, after step d is performed k times, k groups of importance rankings can be obtained.
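A minimal sketch of step d, assuming a tree-ensemble model whose `feature_importances_` attribute supplies the relative importances (the synthetic data is illustrative only):

    import numpy as np
    from scipy.stats import rankdata
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=500) > 0).astype(int)

    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    importance_ranking = rankdata(-model.feature_importances_, method="ordinal")
    # e.g. array([3, 1, 2]): feature 2 most important, then feature 3, feature 1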
It should be noted that, in practical applications, the execution order of the above steps b-c and step d may be interchanged, or they may be executed in parallel; this specification does not limit this. In addition, the above steps a-d may be executed by the function module 104.
Step 208: Fuse the k*m groups of index rankings and k groups of importance rankings obtained from the k executions, to obtain a total ranking of the multiple features.
In one implementation, the k*m groups of index rankings and the k groups of importance rankings may be directly fused to obtain the total ranking of the multiple features.
In another implementation, the k*m groups of index rankings may first be fused to obtain a total index ranking of the multiple features, and the k groups of importance rankings may be fused to obtain a total importance ranking of the multiple features. Then, the total index ranking and the total importance ranking are fused to obtain the total ranking of the multiple features.
The total index ranking may be obtained as follows: extract, from the k*m groups of index rankings, the k groups of index rankings obtained according to the same evaluation index. According to a first ranking fusion algorithm, fuse each feature's corresponding rankings in the k groups of index rankings, to obtain the feature's comprehensive index ranking corresponding to that evaluation index. The above extraction and fusion steps are repeated until m comprehensive index rankings, corresponding to the m evaluation indexes, are obtained for each feature. According to a second ranking fusion algorithm, the m comprehensive index rankings of each feature are respectively fused to obtain the total index ranking of each feature.
Of course, in practical applications, the k*m groups of index rankings may also be fused directly according to a single ranking fusion algorithm; this specification does not limit this.
The first ranking fusion algorithm or the second ranking fusion algorithm may include, but is not limited to, a mean algorithm, a maximum algorithm, a minimum algorithm, a weighted average algorithm, a robust rank aggregation (RRA) algorithm, and the like. It can be understood that the first ranking fusion algorithm and the second ranking fusion algorithm may be the same or different. In this specification, the two are taken to be the same, namely the mean algorithm, as an example.
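Taking the mean algorithm as an example, the two-stage fusion can be sketched as follows; the rank values are taken from Tables 2-4 below, and the outputs reproduce the comprehensive rankings and the total index ranking of Table 5:

    import numpy as np

    # rankings[index][cv] -> rank vector over features 1-3 (from Tables 2-4)
    rankings = {
        "IV":   np.array([[1, 2, 3], [3, 2, 1], [2, 3, 1], [1, 2, 3]]),
        "GINI": np.array([[2, 3, 1], [3, 2, 1], [1, 2, 3], [1, 2, 3]]),
        "IG":   np.array([[1, 3, 2], [1, 2, 3], [3, 2, 1], [3, 2, 1]]),
    }
    # First stage: fuse the k rankings of each evaluation index (mean over CVs)
    comprehensive = {name: r.mean(axis=0) for name, r in rankings.items()}
    # comprehensive["IV"] == [1.75, 2.25, 2.0], matching Table 2
    # Second stage: fuse the m comprehensive rankings into the total index ranking
    total_index_ranking = np.mean(list(comprehensive.values()), axis=0)
    # ~[1.83, 2.25, 1.92], matching Table 5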
The following illustrates, with an example, how each comprehensive index ranking and the total index ranking are obtained.
Suppose there are 3 features, features 1-3, and 3 evaluation indexes: IV, GINI, and IG. Also suppose k = 4, i.e., each feature has 4 IV values, 4 GINI values, and 4 IG values. Then the 3 comprehensive index rankings of each feature, corresponding to the 3 evaluation indexes, may be as shown in Tables 2-4.
Table 2

Feature      CV1_IV   CV2_IV   CV3_IV   CV4_IV   IV comprehensive ranking
Feature 1    1        3        2        1        (1+3+2+1)/4=1.75
Feature 2    2        2        3        2        (2+2+3+2)/4=2.25
Feature 3    3        1        1        3        (3+1+1+3)/4=2

In Table 2, taking the second column as an example, the numbers in each row of the second column indicate each feature's ranking after the features are sorted by the IV values obtained from CV1, i.e., a group of index rankings of the features corresponding to the IV value.
Table 3

Feature      CV1_GINI   CV2_GINI   CV3_GINI   CV4_GINI   GINI comprehensive ranking
Feature 1    2          3          1          1          (2+3+1+1)/4=1.75
Feature 2    3          2          2          2          (3+2+2+2)/4=2.25
Feature 3    1          1          3          3          (1+1+3+3)/4=2

In Table 3, taking the second column as an example, the numbers in each row of the second column indicate each feature's ranking after the features are sorted by the GINI values obtained from CV1, i.e., a group of index rankings of the features corresponding to the GINI value.
Table 4

Feature      CV1_IG   CV2_IG   CV3_IG   CV4_IG   IG comprehensive ranking
Feature 1    1        1        3        3        (1+1+3+3)/4=2
Feature 2    3        2        2        2        (3+2+2+2)/4=2.25
Feature 3    2        3        1        1        (2+3+1+1)/4=1.75

In Table 4, taking the second column as an example, the numbers in each row of the second column indicate each feature's ranking after the features are sorted by the IG values obtained from CV1, i.e., a group of index rankings of the features corresponding to the IG value.
After the comprehensive rankings for the above 3 evaluation indexes are obtained, the total index ranking of each feature can be obtained, as shown in Table 5.

Table 5

Feature      IV comprehensive ranking   GINI comprehensive ranking   IG comprehensive ranking   Total index ranking
Feature 1    1.75                       1.75                         2                          (1.75+1.75+2)/3=1.83
Feature 2    2.25                       2.25                         2.25                       (2.25+2.25+2.25)/3=2.25
Feature 3    2                          2                            1.75                       (2+2+1.75)/3=1.92

It can be understood that the numbers in columns 2-4 of Table 5 are taken from the calculation results in Tables 2-4.
Similar to the process of obtaining the index rankings or the total index ranking above, the total importance ranking of each feature can also be obtained. Specifically, according to a third ranking fusion algorithm, each feature's corresponding rankings in the k groups of importance rankings may be respectively fused to obtain the total importance ranking of each feature. The definition of the third ranking fusion algorithm here may be the same as that of the first or second ranking fusion algorithm above and is not repeated here.
Taking the above example, and assuming the third ranking fusion algorithm is the mean algorithm, the importance rankings obtained may be as shown in Table 6.
Table 6

[Table 6 appears as an image in the original document; it lists, for each feature, the four groups of importance rankings output by the models trained on CV1-CV4 and their fused total importance ranking.]

In Table 6, taking the second column as an example, the numbers in each row of the second column indicate the importance ranking of each feature output by the machine learning model after it is trained based on CV1, i.e., a group of importance rankings of the features.
After the total index ranking and the total importance ranking of each feature are obtained, the total ranking of each feature can be obtained. Specifically, according to a fourth ranking fusion algorithm, the total index ranking and the total importance ranking are fused to obtain the total ranking of the multiple features. The definition of the fourth ranking fusion algorithm here may be the same as that of the first or second ranking fusion algorithm above and is not repeated here.
Taking the above example, and assuming the fourth ranking fusion algorithm is the mean algorithm, the total ranking obtained may be as shown in Table 7.
Table 7

Feature      Total index ranking   Total importance ranking   Total ranking
Feature 1    1.83                  2                          (1.83+2)/2=1.915
Feature 2    2.25                  2.25                       (2.25+2.25)/2=2.25
Feature 3    1.92                  1.75                       (1.92+1.75)/2=1.835

It can be understood that the numbers in columns 2-3 of Table 7 are taken from the calculation results in Tables 5-6.
It should be noted that the above step 208 may be executed by the analysis module 106.
Step 210: Select target features from the multiple features according to the total ranking.
Taking the total ranking results in Table 7 as an example, suppose two features are to be selected; then feature 1 and feature 2 can be selected, i.e., feature 1 and feature 2 are the selected target features. Of course, this selects features purely by ranking. In practical applications, the decision module 108 may perform the screening in combination with pre-configured variable information or screening conditions.
It can be understood that, when the decision module 108 adopts the iterative culling feature selection method, the above steps 202-210 may be executed repeatedly until the specified number of target features is obtained, with N unimportant features eliminated in each round of feature screening.
The specific fusion process of the other implementation described above can be seen in FIG. 3, where k = 4. The upper-left part of FIG. 3 shows the fusion process of the four groups of index rankings corresponding to the same evaluation index (e.g., IV, GINI, or IG) for each feature; the comprehensive index rankings finally obtained include the IV comprehensive ranking, the GINI comprehensive ranking, the IG comprehensive ranking, and so on. The upper-right part shows the fusion process of the four groups of importance rankings of each feature, finally yielding the total importance ranking of each feature. The bottom part shows that the index rankings of each feature are first fused to obtain the total index ranking, after which the total index ranking and the total importance ranking are fused to obtain the total ranking of each feature.
The target features selected by the embodiments of this specification can be used to construct machine learning models, e.g., risk control models (models for identifying and preventing risks such as account misappropriation, fraud, and cheating).
In summary, the feature selection method for constructing a machine learning model provided by the embodiments of this specification makes it possible to consider the comprehensive performance of multiple features on each group of training data subsets, and thus more accurate features can be screened out. In addition, the feature selection method provided in this specification comprehensively considers both the absolute importance (e.g., the evaluation indexes) and the relative importance of each feature, so that more stable and more effective features can be selected.
Corresponding to the above feature selection method for constructing a machine learning model, an embodiment of this specification further provides a feature selection apparatus for constructing a machine learning model. As shown in FIG. 4, the apparatus may include:
an acquiring unit 402, configured to acquire a training data set;
a splitting unit 404, configured to split the training data set acquired by the acquiring unit 402 according to a preset splitting method, to obtain k groups of training data subsets;
the preset splitting method here includes any one of the following: a time-based splitting method and a random splitting method;
an execution unit 406, configured to perform, on the k groups of training data subsets obtained by the splitting unit 404, the following process k times in parallel:
selecting k-1 groups of training data subsets from the k groups of training data subsets as the current training data set;
calculating, according to the current training data set, m evaluation indexes of multiple features to be screened;
sorting the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features;
training, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features;
the above evaluation indexes may include several of: information value (IV), Gini coefficient (GINI), information gain (IG), mutual information (MI), Relief score, and PSI;
a fusion unit 408, configured to fuse the k*m groups of index rankings and k groups of importance rankings obtained from the k executions by the execution unit 406, to obtain a total ranking of the multiple features;
a selecting unit 410, configured to select target features from the multiple features according to the total ranking obtained by the fusion unit 408.
Optionally, the fusion unit 408 may be specifically configured to:
fuse the k*m groups of index rankings to obtain a total index ranking of the multiple features;
fuse the k groups of importance rankings to obtain a total importance ranking of the multiple features;
fuse the total index ranking and the total importance ranking to obtain the total ranking of the multiple features.
The fusion unit 408 may further be specifically configured to:
extract, from the k*m groups of index rankings, the k groups of index rankings obtained according to the same evaluation index;
fuse, according to a first ranking fusion algorithm, each feature's corresponding rankings in the k groups of index rankings, to obtain each feature's comprehensive index ranking corresponding to the evaluation index;
repeat the above extraction and fusion steps until m comprehensive index rankings of each feature, corresponding to the m evaluation indexes, are obtained;
fuse, according to a second ranking fusion algorithm, the m comprehensive index rankings of each feature, to obtain the total index ranking of each feature.
The first ranking fusion algorithm or the second ranking fusion algorithm here may include any one of the following: a mean algorithm, a maximum algorithm, a minimum algorithm, a weighted average algorithm, and a robust rank aggregation (RRA) algorithm.
The fusion unit 408 may further be specifically configured to:
fuse, according to a third ranking fusion algorithm, each feature's corresponding rankings in the k groups of importance rankings, to obtain the total importance ranking of each feature.
The fusion unit 408 may further be specifically configured to:
fuse, according to a fourth ranking fusion algorithm, the total index ranking and the total importance ranking, to obtain the total ranking of the multiple features. It should be noted that the functions of the acquiring unit 402 and the splitting unit 404 may be implemented by the data module 102; the function of the execution unit 406 may be implemented by the function module 104; the function of the fusion unit 408 may be implemented by the analysis module 106; and the function of the selecting unit 410 may be implemented by the decision module 108.
The functions of the functional modules of the apparatus in the above embodiment of this specification may be implemented through the steps of the above method embodiment; therefore, the specific working process of the apparatus provided by an embodiment of this specification is not repeated here.
The feature selection apparatus for constructing a machine learning model provided by an embodiment of this specification can screen out more stable and more effective features.
Corresponding to the above feature selection method for constructing a machine learning model, an embodiment of this specification further provides a feature selection device for constructing a machine learning model. As shown in FIG. 5, the device may include: a memory 502, one or more processors 504, and one or more programs, wherein the one or more programs are stored in the memory 502 and configured to be executed by the one or more processors 504, and when executed by the processors 504, the programs implement the following steps:
Acquire a training data set.
Split the training data set according to a preset splitting method to obtain k groups of training data subsets.
For the k groups of training data subsets, perform the following process k times in parallel:
Select k-1 groups of training data subsets from the k groups of training data subsets as the current training data set.
Calculate, according to the current training data set, m evaluation indexes of multiple features to be screened.
Sort the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features.
Train, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features.
Fuse the k*m groups of index rankings and k groups of importance rankings obtained from the k executions, to obtain a total ranking of the multiple features.
Select target features from the multiple features according to the total ranking.
The feature selection device for constructing a machine learning model provided by an embodiment of this specification can screen out more stable and more effective features.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the device embodiment is described relatively simply because it is substantially similar to the method embodiment, and for relevant parts, reference may be made to the description of the method embodiment.
The steps of the methods or algorithms described in connection with the disclosure of this specification may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a server. Of course, the processor and the storage medium may also exist in a server as discrete components.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and can still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The specific implementations described above further describe the objectives, technical solutions, and beneficial effects of this specification in detail. It should be understood that the above are merely specific implementations of this specification and are not intended to limit the protection scope of this specification. Any modification, equivalent replacement, improvement, etc., made on the basis of the technical solutions of this specification shall fall within the protection scope of this specification.

Claims (17)

  1. A feature selection method for constructing a machine learning model, comprising:
    acquiring a training data set;
    splitting the training data set according to a preset splitting method, to obtain k groups of training data subsets;
    performing, on the k groups of training data subsets, the following process k times in parallel:
    selecting k-1 groups of training data subsets from the k groups of training data subsets as a current training data set;
    calculating, according to the current training data set, m evaluation indexes of multiple features to be screened;
    sorting the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features;
    training, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features;
    fusing the k*m groups of index rankings and k groups of importance rankings obtained from the k executions, to obtain a total ranking of the multiple features; and
    selecting a target feature from the multiple features according to the total ranking.
  2. The method according to claim 1, wherein the fusing of the k*m groups of index rankings and k groups of importance rankings obtained from the k executions to obtain the total ranking of the multiple features comprises:
    fusing the k*m groups of index rankings to obtain a total index ranking of the multiple features;
    fusing the k groups of importance rankings to obtain a total importance ranking of the multiple features; and
    fusing the total index ranking and the total importance ranking to obtain the total ranking of the multiple features.
  3. The method according to claim 2, wherein the fusing of the k*m groups of index rankings to obtain the total index ranking of the multiple features comprises:
    extracting, from the k*m groups of index rankings, k groups of index rankings obtained according to a same evaluation index;
    fusing, according to a first ranking fusion algorithm, each feature's corresponding rankings in the k groups of index rankings, to obtain a comprehensive index ranking of each feature corresponding to the evaluation index;
    repeating the above extraction and fusion steps until m comprehensive index rankings of each feature, corresponding to the m evaluation indexes, are obtained; and
    fusing, according to a second ranking fusion algorithm, the m comprehensive index rankings of each feature, to obtain a total index ranking of each feature.
  4. The method according to claim 2, wherein the fusing of the k groups of importance rankings to obtain the total importance ranking of the multiple features comprises:
    fusing, according to a third ranking fusion algorithm, each feature's corresponding rankings in the k groups of importance rankings, to obtain a total importance ranking of each feature.
  5. The method according to claim 2, wherein the fusing of the total index ranking and the total importance ranking to obtain the total ranking of the multiple features comprises:
    fusing, according to a fourth ranking fusion algorithm, the total index ranking and the total importance ranking, to obtain the total ranking of the multiple features.
  6. The method according to claim 2, wherein the first ranking fusion algorithm or the second ranking fusion algorithm comprises any one of the following: a mean algorithm, a maximum algorithm, a minimum algorithm, a weighted average algorithm, and a robust rank aggregation (RRA) algorithm.
  7. The method according to claim 1, wherein the preset splitting method comprises any one of the following: a time-based splitting method and a random splitting method.
  8. The method according to claim 1, wherein the evaluation indexes comprise several of: information value (IV), Gini coefficient (GINI), information gain (IG), mutual information (MI), Relief score, and sample stability index (PSI).
  9. A feature selection apparatus for constructing a machine learning model, comprising:
    an acquiring unit, configured to acquire a training data set;
    a splitting unit, configured to split the training data set acquired by the acquiring unit according to a preset splitting method, to obtain k groups of training data subsets;
    an execution unit, configured to perform, on the k groups of training data subsets obtained by the splitting unit, the following process k times in parallel:
    selecting k-1 groups of training data subsets from the k groups of training data subsets as a current training data set;
    calculating, according to the current training data set, m evaluation indexes of multiple features to be screened;
    sorting the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features;
    training, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features;
    a fusion unit, configured to fuse the k*m groups of index rankings and k groups of importance rankings obtained from the k executions by the execution unit, to obtain a total ranking of the multiple features; and
    a selecting unit, configured to select a target feature from the multiple features according to the total ranking obtained by the fusion unit.
  10. The apparatus according to claim 9, wherein the fusion unit is specifically configured to:
    fuse the k*m groups of index rankings to obtain a total index ranking of the multiple features;
    fuse the k groups of importance rankings to obtain a total importance ranking of the multiple features; and
    fuse the total index ranking and the total importance ranking to obtain the total ranking of the multiple features.
  11. The apparatus according to claim 10, wherein the fusion unit is further specifically configured to:
    extract, from the k*m groups of index rankings, k groups of index rankings obtained according to a same evaluation index;
    fuse, according to a first ranking fusion algorithm, each feature's corresponding rankings in the k groups of index rankings, to obtain a comprehensive index ranking of each feature corresponding to the evaluation index;
    repeat the above extraction and fusion steps until m comprehensive index rankings of each feature, corresponding to the m evaluation indexes, are obtained; and
    fuse, according to a second ranking fusion algorithm, the m comprehensive index rankings of each feature, to obtain a total index ranking of each feature.
  12. The apparatus according to claim 10, wherein the fusion unit is further specifically configured to:
    fuse, according to a third ranking fusion algorithm, each feature's corresponding rankings in the k groups of importance rankings, to obtain a total importance ranking of each feature.
  13. The apparatus according to claim 10, wherein the fusion unit is further specifically configured to:
    fuse, according to a fourth ranking fusion algorithm, the total index ranking and the total importance ranking, to obtain the total ranking of the multiple features.
  14. The apparatus according to claim 10, wherein the first ranking fusion algorithm or the second ranking fusion algorithm comprises any one of the following: a mean algorithm, a maximum algorithm, a minimum algorithm, a weighted average algorithm, and a robust rank aggregation (RRA) algorithm.
  15. The apparatus according to claim 9, wherein the preset splitting method comprises any one of the following: a time-based splitting method and a random splitting method.
  16. The apparatus according to claim 9, wherein the evaluation indexes comprise several of: information value (IV), Gini coefficient (GINI), information gain (IG), mutual information (MI), Relief score, and sample stability index (PSI).
  17. A feature selection device for constructing a machine learning model, comprising:
    a memory;
    one or more processors; and
    one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and when executed by the processors, the programs implement the following steps:
    acquiring a training data set;
    splitting the training data set according to a preset splitting method, to obtain k groups of training data subsets;
    performing, on the k groups of training data subsets, the following process k times in parallel:
    selecting k-1 groups of training data subsets from the k groups of training data subsets as a current training data set;
    calculating, according to the current training data set, m evaluation indexes of multiple features to be screened;
    sorting the multiple features according to each evaluation index, so as to obtain m groups of index rankings of the multiple features;
    training, based on the current training data set, a machine learning model to predict a group of importance rankings of the multiple features;
    fusing the k*m groups of index rankings and k groups of importance rankings obtained from the k executions, to obtain a total ranking of the multiple features; and
    selecting a target feature from the multiple features according to the total ranking.
PCT/CN2019/101397 2018-10-24 2019-08-19 Feature selection method, apparatus, and device for constructing a machine learning model WO2020082865A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/162,939 US11222285B2 (en) 2018-10-24 2021-01-29 Feature selection method, device and apparatus for constructing machine learning model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811244486.8A CN109460825A (zh) 2018-10-24 2018-10-24 用于构建机器学习模型的特征选取方法、装置以及设备
CN201811244486.8 2018-10-24

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/162,939 Continuation US11222285B2 (en) 2018-10-24 2021-01-29 Feature selection method, device and apparatus for constructing machine learning model

Publications (1)

Publication Number Publication Date
WO2020082865A1 (zh) 2020-04-30

Family

ID=65608270

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/101397 WO2020082865A1 (zh) 2018-10-24 2019-08-19 Feature selection method, apparatus, and device for constructing a machine learning model

Country Status (4)

Country Link
US (1) US11222285B2 (zh)
CN (1) CN109460825A (zh)
TW (1) TWI705388B (zh)
WO (1) WO2020082865A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613983A (zh) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 一种机器建模过程中的特征筛选方法、装置及电子设备
CN114119058A (zh) * 2021-08-10 2022-03-01 国家电网有限公司 用户画像模型的构建方法、设备及存储介质
CN115695502A (zh) * 2022-12-15 2023-02-03 国网浙江省电力有限公司 适用于电力可靠通信的数据处理方法及装置
CN115713224A (zh) * 2022-11-28 2023-02-24 四川京龙光电科技有限公司 适用于lcd装配流程的多次元评估方法及系统
US11593388B2 (en) 2021-03-19 2023-02-28 International Business Machines Corporation Indexing based on feature importance
US11774941B2 (en) 2021-04-22 2023-10-03 Abb Schweiz Ag Method for providing a list of equipment elements in industrial plants

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460825A (zh) * 2018-10-24 2019-03-12 阿里巴巴集团控股有限公司 用于构建机器学习模型的特征选取方法、装置以及设备
CN110334773A (zh) * 2019-07-12 2019-10-15 四川新网银行股份有限公司 基于机器学习的模型入模特征的筛选方法
KR102315074B1 (ko) * 2019-07-26 2021-10-21 주식회사 히타치하이테크 데이터 처리 장치, 방법, 및 반도체 제조 장치
US11373760B2 (en) 2019-10-12 2022-06-28 International Business Machines Corporation False detection rate control with null-hypothesis
US11645555B2 (en) 2019-10-12 2023-05-09 International Business Machines Corporation Feature selection using Sobolev Independence Criterion
TWI762853B (zh) 2020-01-06 2022-05-01 宏碁股份有限公司 利用自動化機制挑選影響力指標的方法及電子裝置
CN111326260A (zh) * 2020-01-09 2020-06-23 上海中科新生命生物科技有限公司 一种医学分析方法、装置、设备及存储介质
CN113130073B (zh) * 2020-01-16 2024-01-19 宏碁股份有限公司 利用自动化机制挑选影响力指标的方法及电子装置
CN111401759B (zh) * 2020-03-20 2022-08-23 支付宝(杭州)信息技术有限公司 数据处理方法、装置、电子设备及存储介质
CN111738297A (zh) * 2020-05-26 2020-10-02 平安科技(深圳)有限公司 特征选择方法、装置、设备及存储介质
CN111783869B (zh) * 2020-06-29 2024-06-04 杭州海康威视数字技术股份有限公司 训练数据筛选方法、装置、电子设备及存储介质
CN111860630B (zh) * 2020-07-10 2023-10-13 深圳无域科技技术有限公司 基于特征重要性的模型建立方法及系统
CN113762005B (zh) * 2020-11-09 2024-06-18 北京沃东天骏信息技术有限公司 特征选择模型的训练、对象分类方法、装置、设备及介质
CN112508378A (zh) * 2020-11-30 2021-03-16 国网北京市电力公司 电力设备生产制造商筛选的处理方法和装置
CN113657481A (zh) * 2021-08-13 2021-11-16 上海晓途网络科技有限公司 一种模型构建系统及方法
US20230128548A1 (en) * 2021-10-25 2023-04-27 International Business Machines Corporation Federated learning data source selection
CN113887089A (zh) * 2021-11-17 2022-01-04 中冶赛迪重庆信息技术有限公司 线棒材力学性能预测方法及计算机可读存储介质
CN114898155B (zh) * 2022-05-18 2024-05-28 平安科技(深圳)有限公司 车辆定损方法、装置、设备及存储介质
CN114936205A (zh) * 2022-06-02 2022-08-23 江苏品生医疗科技集团有限公司 一种特征筛选方法、装置、存储介质及电子设备
CN114996331B (zh) * 2022-06-10 2023-01-20 北京柏睿数据技术股份有限公司 一种数据挖掘控制方法和系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732242A (zh) * 2015-04-08 2015-06-24 苏州大学 一种多分类器构建方法和系统
CN105825081A (zh) * 2016-04-20 2016-08-03 苏州大学 一种基因表达数据分类方法及分类系统
US20160358099A1 (en) * 2015-06-04 2016-12-08 The Boeing Company Advanced analytical infrastructure for machine learning
WO2018107906A1 (zh) * 2016-12-12 2018-06-21 腾讯科技(深圳)有限公司 一种训练分类模型的方法、数据分类的方法及装置
CN108446741A (zh) * 2018-03-29 2018-08-24 中国石油大学(华东) 机器学习超参数重要性评估方法、系统及存储介质
CN109460825A (zh) * 2018-10-24 2019-03-12 阿里巴巴集团控股有限公司 用于构建机器学习模型的特征选取方法、装置以及设备

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7398269B2 (en) * 2002-11-15 2008-07-08 Justsystems Evans Research Inc. Method and apparatus for document filtering using ensemble filters
US20090063646A1 (en) * 2007-09-04 2009-03-05 Nixle, Llc System and method for collecting and organizing popular near real-time data in a virtual geographic grid
US8417715B1 (en) * 2007-12-19 2013-04-09 Tilmann Bruckhaus Platform independent plug-in methods and systems for data mining and analytics
CN107133436A (zh) * 2016-02-26 2017-09-05 阿里巴巴集团控股有限公司 一种多重抽样模型训练方法及装置
US11568170B2 (en) * 2018-03-30 2023-01-31 Nasdaq, Inc. Systems and methods of generating datasets from heterogeneous sources for machine learning
US11544630B2 (en) * 2018-10-15 2023-01-03 Oracle International Corporation Automatic feature subset selection using feature ranking and scalable automatic search

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732242A (zh) * 2015-04-08 2015-06-24 苏州大学 一种多分类器构建方法和系统
US20160358099A1 (en) * 2015-06-04 2016-12-08 The Boeing Company Advanced analytical infrastructure for machine learning
CN105825081A (zh) * 2016-04-20 2016-08-03 苏州大学 一种基因表达数据分类方法及分类系统
WO2018107906A1 (zh) * 2016-12-12 2018-06-21 腾讯科技(深圳)有限公司 一种训练分类模型的方法、数据分类的方法及装置
CN108446741A (zh) * 2018-03-29 2018-08-24 中国石油大学(华东) 机器学习超参数重要性评估方法、系统及存储介质
CN109460825A (zh) * 2018-10-24 2019-03-12 阿里巴巴集团控股有限公司 用于构建机器学习模型的特征选取方法、装置以及设备

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613983A (zh) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 一种机器建模过程中的特征筛选方法、装置及电子设备
CN112613983B (zh) * 2020-12-25 2023-11-21 北京知因智慧科技有限公司 一种机器建模过程中的特征筛选方法、装置及电子设备
US11593388B2 (en) 2021-03-19 2023-02-28 International Business Machines Corporation Indexing based on feature importance
US11774941B2 (en) 2021-04-22 2023-10-03 Abb Schweiz Ag Method for providing a list of equipment elements in industrial plants
CN114119058A (zh) * 2021-08-10 2022-03-01 国家电网有限公司 用户画像模型的构建方法、设备及存储介质
CN114119058B (zh) * 2021-08-10 2023-09-26 国家电网有限公司 用户画像模型的构建方法、设备及存储介质
CN115713224A (zh) * 2022-11-28 2023-02-24 四川京龙光电科技有限公司 适用于lcd装配流程的多次元评估方法及系统
CN115695502A (zh) * 2022-12-15 2023-02-03 国网浙江省电力有限公司 适用于电力可靠通信的数据处理方法及装置

Also Published As

Publication number Publication date
US20210150415A1 (en) 2021-05-20
TWI705388B (zh) 2020-09-21
US11222285B2 (en) 2022-01-11
CN109460825A (zh) 2019-03-12
TW202032440A (zh) 2020-09-01

Similar Documents

Publication Publication Date Title
WO2020082865A1 (zh) Feature selection method, apparatus, and device for constructing a machine learning model
KR102315497B1 (ko) 채점 모델을 구축하고 사용자 신용을 평가하기 위한 방법 및 디바이스
US9195910B2 (en) System and method for classification with effective use of manual data input and crowdsourcing
CN105718490A (zh) 一种用于更新分类模型的方法及装置
CN108229588B (zh) 一种基于深度学习的机器学习识别方法
CN109635010B (zh) 一种用户特征及特征因子抽取、查询方法和系统
CN110991474A (zh) 一种机器学习建模平台
CN110610193A (zh) 标注数据的处理方法及装置
CN110728313B (zh) 一种用于意图分类识别的分类模型训练方法及装置
CN109816043B (zh) 用户识别模型的确定方法、装置、电子设备及存储介质
CN109726764A (zh) 一种模型选择方法、装置、设备和介质
CN114418035A (zh) 决策树模型生成方法、基于决策树模型的数据推荐方法
US20180307720A1 (en) System and method for learning-based group tagging
CN104598632A (zh) 热点事件检测方法和装置
CN110647995A (zh) 规则训练方法、装置、设备及存储介质
CN110458600A (zh) 画像模型训练方法、装置、计算机设备及存储介质
US20190220924A1 (en) Method and device for determining key variable in model
CN108427756A (zh) 基于同类用户模型的个性化查询词补全推荐方法和装置
CN112396211A (zh) 一种数据预测方法及装置、设备和计算机存储介质
CN114428748A (zh) 一种用于真实业务场景的模拟测试方法及系统
CN105447519A (zh) 基于特征选择的模型检测方法
CN111325255B (zh) 特定人群圈定方法、装置、电子设备及存储介质
CN106874286B (zh) 一种筛选用户特征的方法及装置
CN109889981B (zh) 一种基于二分类技术的定位方法及系统
CN114443506B (zh) 一种用于测试人工智能模型的方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19875882

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19875882

Country of ref document: EP

Kind code of ref document: A1