WO2021101463A1 - Computer-implemented methods for selection of features in predictive data modeling - Google Patents

Computer-implemented methods for selection of features in predictive data modeling

Info

Publication number
WO2021101463A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
feature
data
executing
automated
Application number
PCT/TR2019/050984
Other languages
French (fr)
Inventor
Şadi Evren ŞEKER
Original Assignee
Bilkav Eğitim Danışmanlık A.Ş.
Application filed by Bilkav Eğitim Danışmanlık A.Ş.
Priority to PCT/TR2019/050984 priority Critical patent/WO2021101463A1/en
Publication of WO2021101463A1 publication Critical patent/WO2021101463A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks


Abstract

The present invention provides a computer-implemented method for selecting a subset of features in an automated manner, wherein the features correspond to a dataset to be analyzed for the development of a predictive model, in order to improve the performance of the predictive model.

Description

COMPUTER-IMPLEMENTED METHODS FOR SELECTION OF FEATURES
IN PREDICTIVE DATA MODELING
BACKGROUND
The present invention relates generally to the field of predictive data modeling and more specifically to computer-implemented methods for the selection of features for use in predictive data modeling.
Knowledge discovery is the most desirable end product of data collection. Recent advancements in database technology have led to an explosive growth in systems and methods for generating, collecting and storing vast amounts of data. While database technology enables efficient collection and storage of large data sets, the challenge of facilitating human comprehension of the information in this data is growing ever more difficult. With many existing techniques the problem has become unapproachable. Thus, there remains a need for a new generation of automated knowledge discovery tools.
Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition. While recent developments in deep learning and automated processing of images, text, and signals have enabled significant automation in feature engineering for those data types, feature engineering for relational and human behavioral data remains iterative, human-intuition driven, and challenging, and hence, time consuming. At the same time, because the efficacy of a machine learning algorithm relies heavily on the input features, any replacement for a human must be able to engineer them acceptably well.
A common problem in classification, and machine learning in general, is the reduction of dimensionality of feature space to overcome the risk of "overfitting". Data overfitting arises when the number n of features is large and the number of training patterns is comparatively small. In such situations, one can find a decision function that separates the training data, even a linear decision function, but it will perform poorly on test data. The task of choosing the most suitable representation is known as “feature selection”.
A number of different approaches to feature selection exist, where one seeks to identify the smallest set of features that still conveys the essential information contained in the original attributes. This is known as "dimensionality reduction" and can be very beneficial as both computational and generalization performance can degrade as the number of features grows, a phenomenon sometimes referred to as the "curse of dimensionality." Dimension reduction and transformation techniques are known in the literature. For example, PCA (principal component analysis) is useful for unsupervised (unlabeled) data, and LDA (linear discriminant analysis) for supervised (labeled) data. For feature elimination, calculating the p-value and eliminating the features below a predefined significance level is also available in the literature.
The U.S. patent publication no. US 2011/0119213 A1 discloses a method for identification of a determinative subset of features from within a group of features. The disclosed method requires the use of a specific machine learning algorithm, which entails dividing the data set into a training data set and a test data set; this division may lead to time consumption and overfitting problems.
Machine learning algorithms provide a desirable solution for the problem of discovering knowledge from vast amounts of input data. However, the ability of a machine learning algorithm to discover knowledge from a data set is limited in proportion to the information included within the data set. Accordingly, there exists a need for a method and system for pre-processing data so as to augment the data to maximize the knowledge discovery by the available machine learning algorithms and the ones which may be developed in the future.
Additionally, the automation of data pre-processing has some problems. For example, feature selection/elimination requires an objective, and a feature is eliminated if it does not have a sufficient effect on this objective. In the literature, well-known methods like backward or forward elimination have drawbacks: an eliminated feature's positive impact on the machine learning model is lost even if that effect is minor. Accordingly, the need remains for an automated method of preparing data for a number of machine learning algorithms.
SUMMARY
One object of the present invention is to automate the data pre-processing, storage or transfer phases for predictive model development.
Another object of the present invention is to minimize the human errors during data pre-processing, storage or transfer phases for predictive model development.
Another object of the present invention is to minimize the time consumed by the data pre-processing, storage or transfer phases for predictive model development.
Another object of the present invention is to increase the performance of a predictive model.
Another object of the present invention is to apply a machine learning algorithm more advantageously. A further object of the present invention is to facilitate the analysis of data.
A first aspect of the present invention provides a computer-implemented method for selecting a subset of features in an automated manner, wherein the features correspond to a dataset to be analyzed for predictive model development, in order to improve the performance of a predictive model. The method of the present invention facilitates analysis of data by reformatting or augmenting the data prior to its use, in order to allow a machine learning algorithm to be applied more advantageously. The method of the present invention involves enriching the data, converting it into a smaller and still meaningful version of the original data. The original data is processed in an automated manner with statistical techniques and converted to a completely new form of data, which is useful for predictive model development.
A second aspect of the present invention provides a computer-implemented method for the development of a predictive model having an improved performance.
A third aspect of the present invention provides a computer readable storage device for selecting a subset of features in an automated manner, wherein the features correspond to a dataset to be analyzed for predictive model development, in order to improve the performance of a predictive model.
DETAILED DESCRIPTION
A computer-implemented method for selecting a subset of features in an automated manner, wherein the features correspond to a dataset to be analyzed for predictive model development, comprises the steps of:
(i) executing at least one feature engineering technique on the dataset to create new features;
(ii) executing at least one dimension reduction technique for reducing the number of features in the data set; and
(iii) executing at least one feature selection technique on the dataset to constitute a subset of features.
Figure 1 illustrates one example of the method of the present invention. At 101, data loading is realized. This data may come from customers, research facilities, academic institutions, national laboratories, commercial entities or other public or confidential sources. The source of the data and the types of data provided are not crucial to the methods. The data may be collected from one or more local and/or remote sources. The data may be provided through any means such as via the internet, server linkages or disks, CD-ROMs, DVDs or other storage means. The collection of the data may be accomplished manually or by way of an automated process, such as known electronic data transfer methods.
At 102, an automated feature engineering technique is executed on the dataset to create new features. The objective of feature engineering is to create new features that represent as much information as possible from an entire dataset in one table. Typically, this process is done by hand in the state of the art, using pandas operations such as groupby, agg or merge, and can be very tedious. Moreover, manual feature engineering is limited both by human time constraints and imagination. Automated feature engineering aims to help with the problem of feature creation by automatically building hundreds or thousands of new features from a dataset. Automated feature engineering also provides some data transformations for different types of data, like transforming categorical data into numerical data by using encoding techniques such as normalization, quantization, scaling, one-hot encoding and label encoding. For unstructured data such as texts or images, feature extraction techniques like word2vec, tf-idf and n-grams can be applied. In the automated feature engineering step of the present invention the number of columns is increased by any technique known in the art.
According to one example of the present invention, there is a date field among the features of the raw data, for example the visiting dates of a customer to a web page. The feature engineering automatically adds new features like day, month, year, day of week, season, etc.
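By way of illustration only, the following pandas sketch shows this kind of automated date expansion together with a one-hot encoding of a categorical column; the column names and sample values are assumptions of the sketch, not taken from the patent.

```python
import pandas as pd

# Hypothetical raw data: visiting dates of customers to a web page.
# All column names here are illustrative assumptions, not from the patent.
df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "visit_date": ["2019-01-15", "2019-07-04", "2019-11-22"],
    "channel": ["web", "mobile", "web"],
})

# Automated date expansion: derive day, month, year, day of week and season.
dates = pd.to_datetime(df["visit_date"])
df["day"] = dates.dt.day
df["month"] = dates.dt.month
df["year"] = dates.dt.year
df["day_of_week"] = dates.dt.dayofweek
df["season"] = dates.dt.month % 12 // 3  # 0=winter, 1=spring, 2=summer, 3=autumn

# Categorical-to-numerical transformation, e.g. one-hot encoding.
df = pd.get_dummies(df, columns=["channel"])
print(df)
```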
At 103, an aggregation technique is realized. In the present invention an aggregation technique known in the state of the art, such as the sum, product, median, minimum and maximum, quantiles, etc., may be used. In a preferred embodiment of the present invention, repeating IDs are aggregated by using max, min, count and average.
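As a minimal sketch of this preferred embodiment (column names are assumptions), rows sharing an ID can be collapsed with pandas:

```python
import pandas as pd

# Rows with repeating customer IDs are collapsed into one row per ID using
# max, min, count and average, as in the preferred embodiment.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "purchase":    [10.0, 30.0, 5.0, 7.0, 9.0],
})
agg = df.groupby("customer_id")["purchase"].agg(["max", "min", "count", "mean"])
print(agg)
```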
After the aggregation, an automated preprocessing technique may be applied to the problems in the data set to enhance its usefulness. An automated preprocessing technique may comprise a variety of different techniques performed on the data set which are readily apparent to those skilled in the art. Preprocessing the data set may comprise identifying missing or erroneous data points such as missing values, dirty data and noisy data (hereinafter "issues") and taking appropriate steps to correct or fill in the flawed data or, as appropriate, remove the observation or the entire field from the scope of the problem. As an example, missing values may exist if the data for any record in any feature/column is missing; an example of missing data is a customer without a birth date in a data set. As another example, dirty data may exist if a data point has a problem (like not satisfying a precondition on the feature); an example of dirty data is a customer with a birth date ahead of today. As another example, noisy data may exist if a data point has a questionable value; an example of noisy data is a customer 150 years old. In the automated preprocessing step of the present invention any algorithm known in the art can be applied.
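The three issue types can be illustrated with a hedged sketch on the birth-date example; the reference date, the 150-year bound and the column name are assumptions of the illustration:

```python
import pandas as pd

# Flag the three issue types on a hypothetical birth_date column.
today = pd.Timestamp("2019-11-22")
df = pd.DataFrame({"birth_date": ["1985-03-01", None, "2030-01-01", "1860-01-01"]})
bd = pd.to_datetime(df["birth_date"])

missing = bd.isna()                  # missing value: customer without a birth date
dirty = bd > today                   # dirty data: birth date ahead of today
age = (today - bd).dt.days / 365.25
noisy = age > 150                    # noisy data: a 150-year-old customer is questionable

# Correct/fill the flawed data, or remove the observation, as appropriate.
df_clean = df[~(missing | dirty | noisy)]
print(df_clean)
```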
At 104, an automated dimension reduction step is applied for reducing the number of features in a data set. Dimension reduction algorithms develop a small feature subset consisting either of the same types of features as in the original feature set or of new features derived from the original features, depending on the need. More specifically, dimension reduction is applied for reducing the dimensionality of the feature space, i.e., selecting the features which best represent the data. The techniques which may be used for this purpose include Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), two well-known dimension reduction approaches. PCA consists of an orthogonal transformation to convert samples of correlated variables into samples of linearly uncorrelated variables. It can project the data from the original space into a lower dimensional space in an unsupervised manner. The main reason for the use of PCA is that it is a simple nonparametric method for extracting the most relevant information from a set of redundant or noisy data. The PCA approach is described in the state of the art, for example in the article WOLD, SVANTE, KIM ESBENSEN AND PAUL GELADI, "Principal Component Analysis", Chemometrics and Intelligent Laboratory Systems 2.1-3 (1987): 37-52, which is hereby incorporated by reference.
The objective of LDA is to find a sub-space where the projected samples from the same class are close to each other, while the projected samples from different classes are far from each other. As a result, LDA achieves maximum discrimination between classes in its lower-dimensional representation. LDA is a linear dimensionality reduction method, which works well only when the sample data is distributed on a linear subspace in the original space. The LDA approach is described in the state of the art, for example in the article KOEHLER, GARY J., AND S. SELCUK ERENGUC, "Minimizing misclassifications in linear discriminant analysis", Decision Sciences 21.1 (1990): 63-85.
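The contrast between the unsupervised and supervised projections can be sketched with scikit-learn (an illustrative tool choice, not the patent's implementation; the synthetic data is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # 200 samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # binary labels for the supervised case

# PCA: unsupervised orthogonal projection into a lower-dimensional space.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised projection maximizing between-class discrimination;
# it allows at most n_classes - 1 components, hence 1 for binary labels.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print(X_pca.shape, X_lda.shape)           # (200, 2) (200, 1)
```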
Both PCA and LDA can offer lossless or lossy dimension reductions. For example, PCA can reduce 100 features into a 2-dimensional data set, but it does not guarantee that the 2-dimensional data set can be expanded back to the 100 features when the algorithm is configured for a 2-feature output. On the other hand, there are ways in the state of the art for testing the data loss of dimension reduction algorithms. For example, the present invention uses sensitivity mismatch measures for testing the data loss of the dimension reduction algorithms, as described in the document KUCHIMANCHI, GOPI K. et al., "Dimension reduction using feature extraction methods for real-time misuse detection systems", Proceedings from the Fifth Annual IEEE SMC Information Assurance Workshop, IEEE, 2004, which is hereby incorporated by reference.
In a preferred embodiment of the present invention, the dimension reduction technique comprises the application of a dimension reduction method selected from the group consisting of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
In the present invention, the automated dimension reduction step is applied to the data set until the target loss value rises above a predefined threshold. In an exemplary embodiment of the present invention, when the system is built for lossless dimension reduction, the method keeps reducing the number of dimensions iteratively, decreasing the dimension as long as the loss value remains zero. In other embodiments the loss value might be different and parametric.
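A minimal sketch of this lossless variant, assuming PCA as the reduction technique and mean-squared reconstruction error as the loss measure (both assumptions of the illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

def lossless_reduce(X, tol=1e-9):
    # Decrease the dimension iteratively as long as the reconstruction loss
    # stays (numerically) zero; stop once the loss rises above the threshold.
    # The MSE loss and the tolerance are assumptions; any mismatch measure
    # could be substituted.
    reduced = X
    for k in range(X.shape[1] - 1, 0, -1):
        pca = PCA(n_components=k).fit(X)
        Z = pca.transform(X)
        loss = np.mean((X - pca.inverse_transform(Z)) ** 2)
        if loss > tol:
            break
        reduced = Z
    return reduced

# A data set with linearly dependent columns reduces losslessly to its rank.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
X = np.hstack([A, A @ rng.normal(size=(3, 2))])  # 5 columns, rank 3
print(lossless_reduce(X).shape)                  # (100, 3)
```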
At 105, an automated feature selection step is applied after the dimension reduction step. In other words, the features with reduced dimension go to the feature selection. According to another embodiment of the present invention, the feature subsets generated in the feature selection step go to the dimension reduction, as feature selection is one effective means to identify relevant features for dimension reduction.
In machine learning and statistics, feature selection is the process of selecting a subset of relevant features for use in model construction. Feature selection enables the machine learning algorithm to train faster, reduces the complexity of the model and makes it easier to interpret. It improves the accuracy of a model if the right subset is chosen, and it reduces overfitting. Feature selection has three major approaches in the literature: i) filter methods, ii) wrapper methods and iii) embedded/hybrid methods.
Relying on characteristics of the data, the filter methods evaluate features without utilizing any classification algorithm. In the filter method, a feature selection algorithm is applied to all features or subsets and eliminates those that do not satisfy the elimination criteria. Some techniques that might be used as selection criteria are correlation measures such as Pearson's correlation (ρ), p-value and significance, mutual information, variance thresholds, LDA (Linear Discriminant Analysis), ANOVA and Chi-Square. Filter models are easily scalable to very high dimensional datasets, computationally simple and fast.
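Two of these filter criteria, a variance threshold and a Pearson-correlation significance test, can be sketched as follows (the synthetic data and the 5% level are assumptions of the sketch):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
X[:, 4] = 0.0                             # a constant (zero-variance) feature
y = X[:, 0] + 0.1 * rng.normal(size=150)

# Variance threshold: drop near-constant features.
X_vt = VarianceThreshold(threshold=1e-8).fit_transform(X)

# Pearson correlation with the target: keep features whose correlation is
# statistically significant at the (assumed) 5% level.
keep = [j for j in range(X_vt.shape[1]) if pearsonr(X_vt[:, j], y)[1] < 0.05]
print("kept feature indices:", keep)
```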
The wrapper methods utilize a predetermined learning algorithm to evaluate the quality of selected features and offer a simple and powerful way to address the problems of feature selection. In other words, the feature selection gets tested with the learning algorithm, so the accuracy measured by this algorithm is very high. The feature elimination proceeds if it has a positive effect on the reduction of the data set while not reducing the success rate of the algorithm, or if the negative effect of eliminating one feature is negligible. Thus there is a continuous loop iterating over the features, and in each iteration the learning algorithm is executed with the candidate feature eliminated. Some useful techniques for doing this automatically are Backward elimination, Forward Selection and Recursive Feature Elimination (RFE).
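As a hedged example of the wrapper family, RFE with a logistic-regression learner (the learner choice and the synthetic data are assumptions of this sketch):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(int)

# RFE re-fits the learner in a loop, dropping the weakest feature each
# iteration until the requested number of features remains.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print("selected feature mask:", rfe.support_)
```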
The embedded/hybrid methods are a combination of filter and wrapper models, and they inherit features from both: from filter models, being less computationally intensive; from wrapper models, interacting with model construction.
An automated feature selection may be used when there is no or limited domain knowledge. For instance, if the scientist involved in the data science process has decent knowledge of the domain of expertise, then the feature selection can be done manually by going over each feature one by one. As this is time consuming, the process needs to be handled automatically.
An automated feature selection may be used when there is limited time. The majority of machine learning processes have to be completed in a limited time. For example, a system deciding on the advertisement model for a customer in an ad network should react immediately after getting some insights about the customer behavior. In such cases, only a limited time is available for the machine learning model to decide or train itself automatically.
An automated feature selection may be used when the data is sensitive and it is not possible to view the data manually. In such cases, the data is processed by the machine automatically and there is no human interaction, keeping the sensitive data from anybody.
An automated feature selection may be used when the machine learning model is isolated. In some cases, the machine learning model is installed on a remote system that is isolated from the access of the data scientist. This may happen because of sensitive data, secure data centers, limited connections, etc.
An automated feature selection may be used when the demand exceeds the capacity of the data scientists. In some cases, for example, more than 300,000 machine learning algorithms need to be created and the feature selection process needs to be changed for each case (each problem, each person, each time, etc.) automatically.
Besides all the cases above, even if it is possible to make the feature selection by hand, an automated approach may be useful to avoid human mistakes, can provide a standardized alternative to manual feature selection and eliminates time-wasting feature selection routines. Another advantage of an automated approach is to provide a baseline for the data scientists: checking for human mistakes means the machine verifies alternatives to human decisions, while providing a baseline means feeding the data scientist a starting point, because a manual feature selection has a high number of routine steps, like applying the same correlation checks between all the features or calculating the significance level of each feature. Thus the automated feature selection eliminates these routines for data scientists.
The feature selection technique of the present invention comprises the application of one or more automated feature selection methods known in the state of the art. In one preferred embodiment of the present invention, the feature selection method comprises the calculation of feature relations and the finding of highly dependent features by using correlation and/or mutual information. Afterwards, the less significant feature of each dependent pair is identified by using the p-value and is eliminated. In another preferred embodiment of the present invention, the feature selection technique comprises the application of a method selected from the group consisting of Backward elimination, Forward Selection and Recursive Feature Elimination (RFE).
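The mutual-information variant of this preferred embodiment can be sketched as follows; the binning scheme, the 1.0-nat dependency threshold and the use of ANOVA p-values are assumptions of the illustration:

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=300)  # feature 1 nearly duplicates feature 0
y = (X[:, 0] + X[:, 2] > 0).astype(int)

def mi(a, b, bins=10):
    # Mutual information between two discretized features.
    return mutual_info_score(np.digitize(a, np.histogram_bin_edges(a, bins)),
                             np.digitize(b, np.histogram_bin_edges(b, bins)))

_, pvals = f_classif(X, y)  # per-feature significance (ANOVA p-values)
drop = set()
for i in range(X.shape[1]):
    for j in range(i + 1, X.shape[1]):
        if mi(X[:, i], X[:, j]) > 1.0:                 # highly dependent pair
            drop.add(i if pvals[i] > pvals[j] else j)  # drop the less significant
print("eliminated features:", sorted(drop))
```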
A computer-implemented method for the development of a predictive model comprises the steps of:
(i) executing at least one automated feature engineering technique on the dataset to create new features;
(ii) executing at least one automated dimension reduction technique for reducing the number of features;
(iii) executing at least one automated feature selection technique on the dataset to constitute a subset of features; and
(iv) executing at least one machine learning algorithm on the data set resulting from the above steps (i) to (iii).
At 106, the features remaining after feature selection are then used to develop a predictive model by using a machine learning algorithm such as an artificial neural network, a support vector machine (SVM), decision trees or Bayesian networks.
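For illustration, a final learner can be fitted on the surviving features as below; the SVM is one of the listed options, and the synthetic data merely stands in for the selected feature subset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Train a predictive model on the selected features; any of the listed
# algorithms (neural network, decision trees, Bayesian network) could be
# substituted for the SVM in this sketch.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = SVC().fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```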
The present invention merges the feature engineering, data preprocessing, dimension reduction and feature selection steps into an automated pipeline for a machine learning algorithm.
Figure 2 illustrates an exemplary embodiment of the present invention which comprises an application of the feature selection and dimension reduction steps. The process starts with data loading and calculation of a correlation matrix (201). The correlation matrix shows the highly correlated, i.e. highly dependent, features. From each pair of features, the feature carrying little or no additional information beyond that carried by the other feature is redundant and will be eliminated. In other words, if there are highly correlated features in the data set, then one of these features will be eliminated. A dependency measure, the correlation coefficient, is used for this purpose. For instance, Pearson's correlation coefficient (ρ) is used, and the algorithm checks whether the ρ of each feature pair is higher than a predefined threshold (202), e.g. 0.05. If the ρ of a feature pair is higher than the predefined threshold, the p-value is calculated (203) and the less significant feature is eliminated (204). In an exemplary embodiment of the present invention, when the ρ of a pair is greater than the predefined threshold, the feature with maximum variance is discarded and the one with minimum variance is retained to constitute the feature subsets.
If the ρ of a feature pair is not higher than the predefined threshold, the process proceeds to the dimension reduction (205), where the application of LDA (206) is given as an example technique. The algorithm checks whether the loss value is zero (207). If it is determined that the loss value is zero (208), then there is at least one feature to eliminate; this feature is found by calculating the p-value (209), and the less significant feature is eliminated (210). If it is determined that the loss value is not zero, the algorithm keeps reducing the number of dimensions iteratively. Finally, the p-value of each new feature is checked (211) and the features with less than 5% significance are eliminated (210). The final data set is loaded into at least one machine learning algorithm to develop a predictive model (212).
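The correlation-matrix branch of this flow (steps 201 to 204) can be sketched as below; using f_classif for the significance test and the demo data are assumptions of the illustration, while the 0.05 threshold follows the example given in the text:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_classif

def correlation_elimination(X, y, threshold=0.05):
    # For each feature pair whose Pearson correlation exceeds the threshold,
    # eliminate the member with the less significant (larger) p-value.
    _, pvals = f_classif(X, y)
    keep = set(range(X.shape[1]))
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if i in keep and j in keep:
                rho, _ = pearsonr(X[:, i], X[:, j])
                if abs(rho) > threshold:
                    keep.discard(i if pvals[i] > pvals[j] else j)
    return sorted(keep)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)  # a highly correlated pair
y = (X[:, 0] > 0).astype(int)
print("surviving features:", correlation_elimination(X, y))
```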
The present invention is preferably applicable to big data analytics, streaming data analytics or embedded/hardware-level data processing and automated machine learning, preferably in the domains of banking and finance, real estate, customer service, HR, marketing, telecom, energy, retail and tourism.

Claims

1. A computer-implemented method for selecting a subset of features for a machine learning algorithm in an automated manner wherein the features correspond to a dataset to be analyzed for the development of a predictive model, comprising the steps of: a. executing at least one automated feature engineering technique on the dataset to create new features; b. executing at least one automated dimension reduction technique for reducing the number of features; and c. executing at least one automated feature selection technique on the dataset to constitute a subset of features.
2. The method according to claim 1, further comprising the step of executing at least one preprocessing technique for the issues in the dataset such as missing values, dirty data and noisy data.
3. The method according to claim 1 or 2, wherein the feature selection technique comprises the calculation of feature relations and finding the highly dependent features by using correlation and/or mutual information.
4. The method according to claim 3, wherein the feature selection technique comprises the calculation and elimination of the less significant feature by using p-value.
5. The method according to claim 1 or 2, wherein the feature selection technique comprises the application of a wrapper method on the data set.
6. The method according to claim 5, wherein the feature selection technique comprises the application of a method selected from the group consisting of Backward elimination, Forward Selection and Recursive Feature Elimination (RFE).
7. The method according to any one of claims 1 to 6, wherein the dimension reduction technique comprises the application of a method selected from the group consisting of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
8. The method according to claim 7, comprising repeating the steps (a) to (c) until the target loss value is above a predefined threshold.
9. The method according to any one of claims 1 to 8, comprising repeating the steps (a) to (c) as long as the loss value is zero.
10. A computer-implemented method for the development of a predictive model comprising the steps of: a. executing at least one automated feature engineering technique on the dataset to create new features; b. executing at least one automated dimension reduction technique for reducing the number of features; c. executing at least one automated feature selection technique on the dataset to constitute a subset of features; and d. executing at least one machine learning algorithm on the data set resulting from the above steps (a) to (c).
11. The method according to claim 10, wherein the machine learning algorithm is artificial neural network, support vector machine (SVM), decision trees or Bayesian networks.
12. The method according to claim 10 or 11, further comprising the step of executing at least one preprocessing technique for the issues in the dataset such as missing values, dirty data and noisy data.
13. The method according to any one of claims 10 to 12, wherein the feature selection technique comprises the calculation of feature relations and finding the highly dependent features by using correlation and/or mutual information.
14. The method according to claim 13, wherein the feature selection technique comprises the elimination of the less significant feature by using p-value.
15. The method according to any one of claims 10 to 12, wherein the feature selection technique comprises the application of a wrapper method on the data set.
16. The method according to claim 15, wherein the feature selection technique comprises the application of a method selected from the group consisting of Backward elimination, Forward Selection and Recursive Feature Elimination (RFE).
17. The method according to any one of claims 10 to 16, wherein the dimension reduction technique comprises the application of a method selected from the group consisting of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
18. The method according to any one of claims 10 to 17, comprising repeating the steps (a) to (c) until the target loss value is above a threshold.
19. The method according to any one of claims 10 to 18, comprising repeating the steps (a) to (c) as long as loss value is zero.
20. The method according to any one of claims 10 to 19, wherein the predictive model predicts behavior of a current customer with respect to retention of a current service or product of a vendor.
21. The method according to any one of claims 10 to 19, wherein the predictive model predicts behavior of a current customer with respect to risk of asserting claims, loan payment or prepayment to a vendor.
22. The method according to any one of claims 10 to 19, wherein the predictive model predicts behavior of a current customer with respect to usage of a current service or product of a vendor.
23. A computer readable storage device comprising instructions that, in response to execution, cause a system comprising a processor to perform operations, comprising: a. executing at least one automated feature engineering technique on the dataset to create new features; b. executing at least one automated dimension reduction technique for reducing the number of features; and c. executing at least one automated feature selection technique on the dataset to constitute a subset of features.
24. The computer readable storage device according to claim 23, further comprising executing at least one machine learning algorithm on the data set resulting from the above steps (a) to (c).
PCT/TR2019/050984 2019-11-22 2019-11-22 Computer-implemented methods for selection of features in predictive data modeling WO2021101463A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/TR2019/050984 WO2021101463A1 (en) 2019-11-22 2019-11-22 Computer-implemented methods for selection of features in predictive data modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/TR2019/050984 WO2021101463A1 (en) 2019-11-22 2019-11-22 Computer-implemented methods for selection of features in predictive data modeling

Publications (1)

Publication Number Publication Date
WO2021101463A1

Family

ID=75980736

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/TR2019/050984 WO2021101463A1 (en) 2019-11-22 2019-11-22 Computer-implemented methods for selection of features in predictive data modeling

Country Status (1)

Country Link
WO (1) WO2021101463A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140279746A1 (en) * 2008-02-20 2014-09-18 Digital Medical Experts Inc. Expert system for determining patient treatment response
US20180046926A1 (en) * 2014-05-23 2018-02-15 DataRobot, Inc. Systems for time-series predictive data analytics, and related methods and apparatus
US20190188584A1 (en) * 2017-12-19 2019-06-20 Aspen Technology, Inc. Computer System And Method For Building And Deploying Models Predicting Plant Asset Failure
US20190318248A1 (en) * 2018-04-13 2019-10-17 NEC Laboratories Europe GmbH Automated feature generation, selection and hyperparameter tuning from structured data for supervised learning problems


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19953007

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 11/08/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19953007

Country of ref document: EP

Kind code of ref document: A1