WO2023136812A1

WO2023136812A1 - Automatic feature generation and its application in intrusion detection

Info

Publication number: WO2023136812A1
Application number: PCT/US2022/011995
Authority: WO
Inventors: Yongqiang Zhang; Wei Lin
Original assignee: Hitachi Vantara Llc
Priority date: 2022-01-11
Filing date: 2022-01-11
Publication date: 2023-07-20

Abstract

Example implementations described herein are directed to systems and methods for automatically iteratively generating features used to train a machine learning model, which can involve a) deriving features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables; and derive the features from the pre-populated features based on a fitness criteria; b) selecting a subset of the derived features with model-based feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model; c) iteratively executing steps a) to b) until an exit criteria is met; and applying the selected subset of derived features that met the exit criteria to the machine learning model.

Description

AUTOMATIC FEATURE GENERATION AND ITS APPLICATION IN INTRUSION DETECTION

BACKGROUND

Field

[0001] The present disclosure relates generally to machine learning, and more specifically, to automatic feature generation and application in intrusion detection. The example implementations described herein have applications in multiple fields including but not limited to Internet of Things (loT) and Operational Technology (OT).

[0002] A feature is a measurable property of the object. In machine learning, features are individual independent variables that are used as input to the machine learning models. Features captures the relevant information in the data to perform a specific machine learning task. There are two cases to use features for machine learning tasks. In supervised learning tasks, features are used together with the target, which is the dependent variable, to train a machine learning model. Then, only features are used as input to the model to generate outputs during the model inference phase. In unsupervised tasks, only features are used to train the machine learning model, and used as input to the model to generate outputs during the model inference phase.

[0003] Features are critical components to build a machine learning model. The number and quality of the features have a direct and major impact on the quality or performance of the machine learning models. To generate features that are used in the downstream machine learning modeling, there are two steps: feature derivation and feature selection. The feature derivation step first derives features from raw data. Then, some of the derived features will be selected to feed into the machine learning model to optimize the performance of the downstream machine learning models.

[0004] With regards to the feature derivation step, features are usually derived from raw data based on domain knowledge and data analysis. Given a table or a data frame of data, the columns represent the variables, and the rows represent the records or data. The features can be the columns in the raw data; they can also be derived from the raw data based on the problems, domain knowledge and the data analysis result. The process to derive features from raw data is called “feature engineering”, which is usually manual, time consuming and not reliable. It usually cannot guarantee the optimal set of features for the downstream machine learning tasks.

[0005] In most cases, features are not generic across different problems and domains. This means that there is a need to derive a different set of features for each specific problem. This adds more time and efforts to solve the machine learning problems.

[0006] With regards to the feature selection, the features, either obtained directly or derived from the raw data, may not be relevant or used in the downstream machine learning tasks. Unrelated features introduce noise to the data and make the model perform badly. Such unrelated features need to be removed before the machine learning modeling. Feature selection is a technique to select only the relevant features for the particular machine learning task, and remove the unrelated features. This can be done before or after building the model. There are various feature selection techniques: variance based, correlation-based, model-based, forward selection, backward selection or hybrid selection, and so on.

[0007] One example having above mentioned feature related issue is an intrusion detection system (IDS). IDSis a hardware device or software application that monitors a network or system for malicious activity or compliance policy violations. Such malicious activity or compliance policy violation is called an intrusion and is typically collected, managed and reported (e.g., through alert or event) centrally in the intrusion detection system.

[0008] Due to the dynamic nature of the malicious activities and compliance policy violations, the mechanism that is used to detect the intrusion needs to be updated frequently in order not to miss any potentially harmful intrusions. Recently, machine learning techniques such as anomaly detection techniques are commonly used in the intrusion detection systems. In order to achieve better performance for the anomaly detection model, there is a need to derive the features from the raw data and keep updating the features to capture the newly introduced intrusions or anomalies.

[0009] In an example related art implementation, there can be an automatic stochastic method for feature discovery and use of the same in a repeatable process. The algorithm to generate features in such a related art implementation is through a general stochastic method, and in particular an evolutionary algorithm can be used. The method is used in the manufacturing domain, especially in welding industry. The data is mainly on time series data. The first set of candidate features is generated based on domain knowledge. Then the first set of features are refined with evolutionary algorithms. In this related art implementation, the way to generate multiple features is through multiple iterations: essentially one feature is generated from each iteration and added to the candidate feature set, until the feature set does not change. In this sequential way, the feature generated from each iteration may be highly correlated with the previously generated features and as a result the generated features are not good from modeling point of view.

[0010] In another related art implementation, there is feature selection and feature synthesis methods for predictive modeling in a twinned physical system. Such a related art implementation uses evolutionary techniques for both feature selection and feature synthesis (or feature derivation). For feature synthesis, the related art implementation uses information gain as the fitness function. Such related art implementations also utilize algorithms that only supports classification type or numerical dataset with binary labels.

SUMMARY

[0011] Although the related art implementations use evolutionary algorithms to generate new features, such related art implementations are very specific with regards to problem domain, data type, as well as details of the algorithms (e.g., including fitness functions, ways to select parents, ways to select operators, ways to generate children, and so on). No related art implementation utilizes feature selection based on either Bayesian optimization or reinforcement learning. Further, the example implementations described herein are generic to be applied to all domains and any data type.

[0012] Further, the related art implementations involve several limitations and restrictions, which the example implementations described herein address through techniques as described herein.

[0013] One limitation is with respect to the feature derivation. In the related art, features are usually manually derived, based on domain knowledge and data analysis. Such a manual process is time-consuming, error-prone and not reliable for finding the optimal feature(s). Other approaches may try different operators on the raw variables to see which feature(s) can be good candidates for the downstream machine learning models. Such exhaustive approaches are time-consuming and impossible for large dataset with many variables. Therefore, there is a need to find an automated and effective solution to derive features which can lead to more optimized performance for the downstream machine learning model. [0014] Another limitation is with respect to the feature selection. Conventionally, several techniques exist to select relevant features to build machine learning models. Such techniques include the following.

[0015] Manual feature selection is usually based on domain knowledge and data analysis, limited to correlation or variance, and can only capture univariate or pairwise relationship and linear relationship are captured. Such a manual process is time-consuming, error-prone and not reliable for finding the optimal set of features.

[0016] Exhaustive feature selection involves features that can be selected by exhaustively trying out all the combinations of the features. This approach is time-consuming and impossible for large datasets with many features.

[0017] Gradual feature selection involves features that can be selected gradually: via forward selection, backward selection, hybrid selection. In these approaches, one feature is added or eliminated at each step. Some useful features may be eliminated too early, and as a result the global optimal set of features may not be obtained.

[0018] Model based feature selection applies all the features to build models, and let the models determine the importance of each feature. This technique only applies to simple models such as linear or tree-based models. Further, it may not be possible to build models with too many features.

[0019] Therefore, there is a need to find an automated and effective solution to select features which can lead to an optimized performance for the downstream machine learning model.

[0020] Another limitation is with respect to intrusion detection. Conventionally, intrusions are detected through a rule-based model or machine learning models. However, since intrusion patterns keep changing, new patterns may not be captured effectively with the existing approaches in time. The delay to capture the intrusions may lead to big damage to the system. There is a need to build a solution to automatically capture the dynamic intrusions in time.

[0021] In addition, it is preferable if highly correlated features are not used to build a model for most model algorithms. Example implementations described herein address such issues. [0022] To solve the problems of the related art, the example implementations described herein involve the following solutions.

[0023] With respect to feature derivation, example implementations described herein introduce a solution that use an evolutionary optimization approach and automatically derive new features based on raw variables and a predefined list of operators. The fitness function is based on the correlation between the feature and the target, which essentially capture the nonlinear relationship between the target and the raw variables. There are several variations for this approach as will be described herein.

[0024] With respect to the feature selection, example implementations involve two modelbased feature selection solutions; one is based on Bayesian optimization and the other is based on reinforcement learning. There is also an option to include a Bayesian optimization based approach as part of the reinforcement learning based approach.

[0025] With respect to the intrusion detection, example implementations apply the feature derivation and feature selection techniques to automatically generate dynamic features in time for intrusion detection.

[0026] Aspects of the present disclosure can involve a method for automatically iteratively generating features used to train a machine learning model, the method involving a) deriving features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables; and derive the features from the pre-populated features based on a fitness criteria; b) selecting a subset of the derived features with model-based feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model; c) iteratively executing steps a) to b) until an exit criteria is met; and applying the selected subset of derived features that met the exit criteria to the machine learning model.

[0027] Aspects of the present disclosure can involve a computer program, storing instructions for automatically iteratively generating features used to train a machine learning model, the instructions involving a) deriving features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables; and derive the features from the pre-populated features based on a fitness criteria; b) selecting a subset of the derived features with modelbased feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model; c) iteratively executing steps a) to b) until an exit criteria is met; and applying the selected subset of derived features that met the exit criteria to the machine learning model. The computer program and instructions may be stored on a non-transitory computer readable medium and executed by one or more processors.

[0028] Aspects of the present disclosure can involve an apparatus, configured to automatically iteratively generating features used to train a machine learning model, the apparatus involving a processor, configured to execute instructions that include a) deriving features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables; and derive the features from the pre-populated features based on a fitness criteria; b) selecting a subset of the derived features with model-based feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model; c) iteratively executing steps a) to b) until an exit criteria is met; and applying the selected subset of derived features that met the exit criteria to the machine learning model.

[0029] Aspects of the present disclosure can involve a system for automatically iteratively generating features used to train a machine learning model, the system involving means for a) deriving features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables; and derive the features from the pre-populated features based on a fitness criteria; b) selecting a subset of the derived features with model-based feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model; c) iteratively executing steps a) to b) until an exit criteria is met; and means for applying the selected subset of derived features that met the exit criteria to the machine learning model.

BRIEF DESCRIPTION OF DRAWINGS

[0030] FIG. 1 illustrates a solution architecture for automatic feature generation in machine learning, in accordance with an example implementation.

[0031] FIG. 2 illustrates the workflow for the evolutionary feature derivation, in accordance with an example implementation. [0032] FIG. 3 illustrates an example workflow for the feature selection based on Bayesian optimization, in accordance with an example implementation.

[0033] FIG. 4 illustrates a workflow for the feature selection based on reinforcement learning, in accordance with an example implementation.

[0034] FIG. 5 illustrates how to build the intrusion detection models, in accordance with an example implementation.

[0035] FIG. 6 illustrates the solution architecture for monitoring and detection of intrusions in real time, in accordance with an example implementation.

[0036] FIG. 7 illustrates a system involving a plurality of assets networked to a management apparatus, in accordance with an example implementation.

[0037] FIG. 8 illustrates an example computing environment with an example computer device suitable for use in some example implementations.

DETAILED DESCRIPTION

[0038] The following detailed description provides details of the figures and example implementations of the present application. Reference numerals and descriptions of redundant elements between figures are omitted for clarity. Terms used throughout the description are provided as examples and are not intended to be limiting. For example, the use of the term “automatic” may involve fully automatic or semi-automatic implementations involving user or administrator control over certain aspects of the implementation, depending on the desired implementation of one of ordinary skill in the art practicing implementations of the present application. Selection can be conducted by a user through a user interface or other input means, or can be implemented through a desired algorithm. Example implementations as described herein can be utilized either singularly or in combination and the functionality of the example implementations can be implemented through any means according to the desired implementations.

[0039] FIG. 1 illustrates a solution architecture for automatic feature generation in machine learning, in accordance with an example implementation. The features are derived and selected in an iterative manner until the optimized features are generated for the downstream machine learning tasks. [0040] The elements in the solution architecture are described as follows, with further details provided with respect to the other figures. Data (Input) 100 is the raw data that is provided as the input to the automatic feature generator. Data Preprocessing 101 involves preprocessing the raw data with several techniques, including but not limited to: removing highly correlated variables, removing variables with too many missing values, removing variables with the same values, data resampling, and so on. Evolutionary feature derivation 102 is an evolutionary programming technique by which the features are automatically derived in an optimized manner. Optimized feature selection 103 selects the features from the candidate features in order to optimize downstream machine learning tasks. The optimized feature selection 103 utilizes two techniques: one is based on Bayesian optimization, and the other is based on reinforcement learning.

[0041] Selected features 104 are the selected features from the optimized feature selection 103. These selected features 104 will be used as input to evolutionary feature derivation 102 in the next iteration. Note that in the next iteration, the derived features that are not selected features will be removed/filtered out in the population before the evolutionary feature derivation 102 is applied. Features (Output) 105 is the features that is the output of the feature generator.

[0042] The iteration will stop once some predefined exit criteria is met. The exit criteria can be the number of iterations that meets the predefined threshold, the performance of machine learning model based on selected features meets the success criteria, or the set of features does not change from one iteration to the next iteration.

[0043] In the following description, each component in the solution architecture is discussed in detail.

Data 100

[0044] In the machine learning domain, a lot of data is physically collected. For example, in e-commerce, user engagement (page view, page click, image view, review, and so on) with the website can be recorded and this can generate a lot of data. In Internet of Things (loT) systems, the sensors are installed on the assets to collect time-series data at some collection frequency: one minute or even one millisecond. This will generate a lot of data as well. [0045] In another case, the data cannot be physically collected due to some constraints on cost, time, and so on. In this case, synthetic data can be generated based on the domain knowledge. Example implementations described herein can use some transformation (e.g., for example on images and sound data) or some randomization on the numerical data to generate synthetic data. In the loT space, digital twins of the physical systems can be built to generate “virtual sensor” data to complement or validate the physical sensor data.

[0046] Regardless, there is a large amount of data (“Big Data”) and to solve a particular machine learning problem, there is a need to identify what data is useful for the problem. If the raw data cannot be used directly to solve the problem, then there is a need to derive some other variables (or features) to solve the problem.

Data Preprocessing 101

[0047] The raw data usually cannot be directly used for the machine learning tasks due to the availability and quality of the data. For data availability, if the data is not enough for a machine learning task, the example implementations may resample the data. For instance, for a failure prediction task, if there are very few failures (say 0.1% of the data), then there is a need to resample the data to make the amounts of normal data and failure data roughly on the same scale. Resampling technique include up-sampling of the minority class, down-sampling the majority class or generate synthetic data of minority class (like “SMOTE” algorithm).

[0048] For data quality, there are a lot of missing values for one variable (say 99% of the values are missing) and there is a need to impute the missing values for this variable or remove such a variable. Another case is that the data in a variable is all the same (or just with a very small variance). Since there is no benefit to use such a variable in the machine learning task, the example implementations will remove such variable. In another case, if two variables are highly correlated, one of them should be removed for most of the downstream machine learning tasks.

Evolutionary Feature Derivation 102

[0049] In the following description, example implementations involve a solution based on evolutionary programming to automatically derive features in an optimal manner. FIG. 2 illustrates the workflow of this solution. The algorithm workflow is as follows. [0050] At 201, the feature derivation starts with the preprocessed variables based on the raw variables. At 202, the flow initializes the population of features with all the preprocessed variables. The flow also initializes a set of operators that can be applied to the features. The operators can include, but are not limited to: “add”, “subtract”, “multiply”, “divide”, “exponential”, “logarithm”, “power”, “sine”, “cosine”, “tangent”, and so on. They can also be some user-defined functions.

[0051] At 203, the flow calculates the fitness function (i.e., the correlation between each variable and target variable). The correlation can be the Pearson correlation coefficient. Then, the flow checks the calculated correlation coefficient with a predefined threshold. Depending on the desired implementation, other functions can be used as fitness function, such as but not limited to overall accuracy, root mean squared error, user defined metrics, and so on.

[0052] If the absolute value of the correlation coefficient is above a predefined threshold, then the flow adds the variable to the result set. Then, the flow calculates the correlation between this variable with any other feature in the result set. If the absolute value of the correlation coefficient with a feature is above a predefined threshold, then the flow keeps the one feature which has higher correlation with the target.

[0053] Otherwise, and if the variable is a derived feature, then this variable is removed. Usually in machine learning modeling, variables or features that are not linearly correlated with the target cannot be dropped since the features may have non-linear correlation with the target and they are still useful for modeling. However, since here the derived feature is already a nonlinear combination of raw variables, the nonlinear relationship between target and raw variables are evaluated, so it is safe to drop derived features when they are not highly linearly correlated with the target.

[0054] At 204, the flow checks if the result set meets the exit criteria, for example, the number of features in the result set is above a predefined threshold. Other exit criteria can be used in accordance with the desired implementation, such as but not limited to time spent on the whole process, model metrics meeting the success criteria, no change of the set of features from one iteration to the next iteration; and so on.

[0055] If the answer is (yes), then the flow stops the whole process at 205 and returns the result set. Otherwise (no), the flow generates a new feature population by applying operators to the individuals in the feature population by using evolutionary operation techniques at 206. Examples of evolutionary operation techniques can involve selection, crossover, mutation and inversion. The flow then returns to 203 to repeat the process starting with the calculation of the fitness function.

[0056] There are several variations for the above evolutionary feature derivation solution, based on the desired implementation. With regards to the fitness function, besides the correlation coefficient, some model metrics may also be used. With symbolic regression (which is based on the evolutionary programming), other model metrics including but not limited to overall accuracy, root mean squared error, or user-defined metrics can be used. With regards to the exit criteria, this can be the number of features in the result set, time spent on the whole process, model metrics meeting the success criteria, no change of the set of features from one iteration to the next iteration, and so on in accordance with the desired implementation. Further, operators can be basic operators including but not limited to: “add”, “subtract”, “multiply”, “divide”, “exponential”, “logarithm”, “power”, “sine”, “cosine”, “tangent”, and so on. It may also be user-defined functions.

[0057] Depending on the desired implementation, there can also be multiple runs with random initialization, evolutionary programming requires a random seed as an input, which controls how the evolutionary programming perform. Usually the best random seed is not known and there is a need to try several of them and see how each one performs. In this solution, different random seeds can be used to control what variables and operators to use at the start for each run. Each run will generate a result feature set and then the result feature sets from all the runs can be merged. The features in the merged set can be ranked based on the number of their appearances in the feature sets from all the runs and/or their correlation with the target. Then the highly correlated features in the merged set need to be identified and removed from the final feature set based on the rank (i.e., if feature 1 and feature 2 are highly correlated and feature 1 has a higher rank than feature 2, then feature 2 will be removed and feature 1 will be kept in the merged feature set). Finally only a predefined number of features with high ranks will be kept in the final feature set.

[0058] Depending on the desired implementation, to avoid the highly correlated features in the result feature set , in each run, only the features that are derived in the last iteration of each run can be used, without using the features that are derived in the intermediate iterations of each run. [0059] Depending on the desired implementation, there can also be multiple outputs in each iteration: multiple children can be derived in each iteration and used as features.

[0060] Additionally, symbolic regression can also be used as an implementation of convolutionary programming. For example, “gplearn” is an example open-source implementation of symbolic regression. After setting the regression model metrics, symbolic regression can be run against the individuals and the operators, and gradually train and derive features. Once the exit criteria is met, the output can be used as the result feature set. Each feature in the result feature set can be represented by a formula of the original individuals and operators. The features can come from different iterations during this training process, or just the last iteration.

Optimized Feature Selection 103

[0061] Once a set of features are derived from the raw data, there is a need to identify which subset of features are optimal for the downstream machine learning tasks. Example implementations described herein provide two solutions for feature selection: one is based on Bayesian optimization and the other is based on reinforcement learning. Both are model-based, which means that for each selected set of features, a model or surrogate function is used to evaluate the performance of the selected features and decide on which subset of features are optimal.

[0062] FIG. 3 illustrates an example workflow for the feature selection based on Bayesian optimization, in accordance with an example implementation. At 301, the solution starts with the derived features from evolutionary feature derivation 102 of FIG. 2. Then at 302 and 303, the solution randomly samples a subset of features from the derived features via a randomly selected binary mask and trains machine learning models on the sampled features and gets the performance metrics at 304. This flow can be reiterated for several runs M with several subsets of features.

[0063] Then at 305, the flow trains the Gaussian regression as a surrogate of machine learning models by using the subsets of features and the performance metrics from 304. The features for the Gaussian regression are the subset of features, which is represented by a binary sequence of derived features. The target is the performance metrics. At 306, the flow defines and gets the acquisition function and chooses the optimal set of features. At 307, the flow trains the machine learning model for the problem with the optimal set of features and gets the performance metrics. At 308, the flow checks the exit criteria to determine if the process should be stopped. Depending on the desired implementation, the exit criteria can be the number of rounds, whether the model metrics meets the success criteria, and so on in accordance with the desired implementation. If the exit criteria is met (Yes), then the flow ends, otherwise (No) the flow returns to 305.

[0064] There can also be several variations to the features solution of FIG. 3 depending on the desired implementation.

[0065] In an example variation, there can be surrogate functions. While the Gaussian process model is the most common surrogate function for the Bayesian optimization, other surrogate functions for particular business problems may also be used depending on the desired implementation, such as Tree Parzen Estimators (TPE). In another case, if the machine learning model for the downstream tasks is too complex, a simpler machine model (e.g., linear model, tree-based model) can be used as a surrogate of the complex machine learning model. This essentially solves the problems for conventional model-based feature selection techniques which are applicable for simple machine learning model algorithms.

[0066] There can be various acquisition functions in accordance with the desired implementation. Different acquisition functions can be used, including but not limited to: probability of improvement, expected improvement, Bayesian expected losses, upper confidence bounds (UCB), Thompson sampling and any hybrids of such depending on the desired implementation. They all trade-off exploration and exploitation so as to minimize the number of function queries.

[0067] In another variation, there can be feature representation for surrogate functions. Besides binary representation of the features, integer representation can be used as features to train the surrogate model.

[0068] FIG. 4 illustrates a workflow for the feature selection based on reinforcement learning, in accordance with an example implementation. The flow is as follows. The flow begins at 401 with the derived features from the evolutionary feature derivation 102. At 402 and 403, the flow randomly samples a subset of features from the derived features via a randomly selected binary mask, and trains machine learning models on the features to get the performance metrics at 404. This flow can be reiterated for several runs M with several subsets of features. [0069] At 405, for a set of feature lists, the flow obtains the feature importance for each feature. At 406 and 407, the flow selects features through exploitation and exploration. At 406, for exploitation, the flow selects the top Ki important features, where Ki is a predefined number of features selected by exploitation. At 407 for exploration, the flow randomly selects K2 features from the rest of the features, where K2 is a predefined number of features for exploration. At 408, the flow builds a model for the set of K1+K2 features and gets the performance metrics. The flow updates the feature importance for each feature based on this run. At 409, the flow checks whether the exit criteria is met or not, where the exit criteria can be: number of rounds, model metrics meeting the success criteria, number of result features, and so on. If so (yes) then the process ends, otherwise (no), the flow returns back to 405.

[0070] Depending on the desired implementation, there can be several variations for the solution of FIG. 4 in accordance with the desired implementation. For example, the epsilon- greedy algorithm or Thompson’s sampling can be used to select features. In such a variation, the features can be split into two groups based on a predefined feature importance threshold: group A contains features greater than the threshold; group B contains features less than the threshold. A random number is generated and if it is greater than epsilon, the feature from B (or A and B) is randomly selected; otherwise a feature from A is randomly selected. Epsilon controls the tradeoffs between exploration and exploitation. If a probability distribution for each feature importance score is formed, then Thompson’s sampling can be used to select features.

[0071] In another variation, Ki and K2 can be adjusted across the iterations. For example, Ki can be increased for the purpose of more exploitation and K2 can be decreased for the purpose less exploration as the iterations go.

[0072] Feature importance is done for each run and are combined to get a single list of features. Their importance can be done through aggregating the features importance values from multiple lists with aggregation functions such as average, maximum, and so on. Using the performance metric value for each run as weight, example implementations can multiply it with the feature importance values and then aggregate the features.

[0073] In another variation, both of the approaches for feature selection can be combined. In such an example implementation, the feature selection approach is run based on Bayesian optimization and a list of important features, Fb is selected. Then when running the feature selection approach based on reinforcement learning, instead of randomly generate features, training models and identify important features, feature set Fb is used as the important features, and Ki features are selected from this list. As a summary, there are three options to select features based on the two approaches: a. the feature selection approach based on Bayesian optimization; b. the feature selection approach based on reinforcement learning; c. the feature selection approach based on Bayesian optimization and reinforcement learning.

Each one is a separate approach, but option c. takes advantage of both the Bayesian optimization based approach and reinforcement learning based approach, and is preferred.

[0074] In an intrusion detection system (IDS), due to the dynamic nature of the malicious activities and compliance policy violations, the mechanism that is used to detect the intrusion needs to be updated frequently in order not to miss any potentially harmful intrusions. Nowadays, machine learning techniques like anomaly detection techniques are commonly used in the intrusion detection systems. In order to achieve better performance for the anomaly detection model, there is a need to derive the features from the raw data and to keep updating the features to capture the newly introduced intrusions or anomalies.

[0075] The solutions to automatically generate features are described above, as shown in FIG. 1. These solutions are used to automatically and dynamically generate features and feed them into the anomaly detection model for intrusion detection. FIG. 5 illustrates how to build the intrusion detection models while FIG. 6 illustrates how to use the intrusion detection model to monitor and detect intrusion in the real time system.

[0076] FIG. 5 illustrates how to build the intrusion detection models, in accordance with an example implementation. In FIG. 5, there are several components, as described below. Historical Data (Input) 501 involves the historical data that are collected and used to build intrusion detection model. This can be collected from some logs, Internet of Things (loT) sensors, and so on in accordance with the desired implementation.

[0077] Automatic feature generator 502 is the same as that of FIG. 1. The module takes raw data and automatically generates the features that capture the signals in the raw data and are useful for the downstream modeling. Features (Output) 503 is the output of the automatic feature generator 502, which are features that can be used for downstream modeling.

[0078] With regards to intrusion detection model 504, there are two types of intrusion detection models. One type of model is a signature-based detection model which is used to detect known intrusions; the other is an anomaly-based detection model, which is used to detect unknown intrusions. Both types of models can use the generated features to detect intrusions in the system. As part of model building process, there is also a need to evaluate the model performance manually or automatically with the intrusions confirmed by the operators or domain experts.

[0079] With regards to the intrusion mode identification model 505, explainable artificial intelligence (Al) can be applied to identify the root cause of the intrusion and cluster them into an intrusion mode. Explainable Al can be used to derive root causes for each detected intrusion. For example, ELI5 and SHAP are two open-source libraries used to explain the prediction results for machine learning models. Such libraries are designed to explain the result from one example each time.

[0080] In order to keep detecting the intrusions including both previously known and newly introduced unknown intrusions, there is a need to keep generating features to match the newly introduced intrusions, while continuing detecting the previously known intrusions. Thus, there is a need to retrain the models by running the whole process in FIG. 5 based on some schedule, or in a real-time streaming manner.

[0081] FIG. 6 illustrates the solution architecture for monitoring and detection of intrusions in real time, in accordance with an example implementation. In FIG. 6, there are several components as follows.

[0082] Realtime Data (Input) 601 is the data that is collected in realtime and fed into the automatic feature generator 602 module. Automatic feature generator 602 is the module as shown in FIG. 1. The module takes raw data and automatically generate the features. Features (Output) 603 is the output of the automatic feature generator 602, which are features for downstream machine learning modeling. Intrusion detection model 604 is the intrusion detection model constructed from the flow of FIG. 5 and is applied to the generated features to detect intrusions 605. Intrusions 605 is the the output of the intrusion detection modeling 604 and are indicative of anomalies/intrusions. Intrusion mode identification model 606 is the model generated from the flow of FIG. 5 and is applied to the intrusions to identify the root causes of the intrusion, and then cluster the intrusions into an intrusion mode 607. Intrusion mode 607 is the output of the intrusion mode identification model 606 and is the intrusion mode of the detected intrusion.

[0083] Through the example implementations described herein, the feature engineering process can be improved to efficiently and effectively generate features automatically in order to achieve better performance for the downstream machine learning solutions.

[0084] The example implementations described herein also introduce a feature derivation solution based on evolutionary programming. This solution can automatically and dynamically derive features for optimal performance of the downstream machine learning models.

[0085] Example implementations described herein also involve two feature selection techniques to select optimal set of features for the downstream machine learning modeling. One is based on Bayesian optimization and the other is based on reinforcement learning. There is also an option to incude Bayesian optimization based approach as part of the reinforcement learning based approach.

[0086] Further, the example implementations described herein also introduce a solution for intrusion detection, where features are automatically generated in order to detect time-sensitive dynamic intrusions.

[0087] The solutions for automatic feature generation can also be used for loT insurance. loT insurance installs loT devices onto the asset of interest and uses the data collected from loT devices to improve the understanding of potential risks and issues in the asset. Advances in loT can improve productivity, overall profitability of the business, and the risk profile of the portfolio. loT advances can be realized for the full range of products and lines of business, from commercial, to life, property and casualty and health. New types of data allow for increased precision in assessing risk and pricing policies. For example, underwriters can recommend real-time pricing and policy term adjustments through continuous monitoring and assessment of loT data.

[0088] Similar to how the automatic feature generation is used in intrusion detection problem, the solutions for automatic feature generation can be used to generate features based on loT insurance data and feed them into the downstream machine learning model for loT insurance prediction or evaluation. As an example implementation, the downstream machine learning model can be a failure detection model to predict the failures or anomaly for an asset of the interest, based on the features from the automatic feature generation module. The results from the failure detection model can be used to derive some insights and make business decisions.

[0089] FIG. 7 illustrates a system involving a plurality of assets networked to a management apparatus, in accordance with an example implementation. One or more assets 701 are communicatively coupled to a network 700 (e.g., local area network (LAN), wide area network (WAN)) through the corresponding on-board computer or Internet of Things (loT) device of the assets 701, which is connected to a management apparatus 702. The management apparatus 702 manages a database 703, which contains historical data collected from the assets 701 and also facilitates remote control to each of the assets 701. In alternate example implementations, the data from the assets can be stored to a central repository or central database such as proprietary databases that intake data, or systems such as enterprise resource planning systems, and the management apparatus 702 can access or retrieve the data from the central repository or central database. Asset 701 can involve any physical system for use in a physical process such as an assembly line or production line, in accordance with the desired implementation, such as but not limited to servers, programmable logic controllers, air compressors, lathes, robotic arms, and so on in accordance with the desired implementation. The data provided from the sensors of such assets 701 can serve as the data flows as described herein upon which analytics can be conducted.

[0090] FIG. 8 illustrates an example computing environment with an example computer device suitable for use in some example implementations, such as a management apparatus 702 as illustrated in FIG. 7, or as an on-board computer of an asset 701. Computer device 805 in computing environment 800 can include one or more processing units, cores, or processors 810, memory 815 (e.g., RAM, ROM, and/or the like), internal storage 820 (e.g., magnetic, optical, solid state storage, and/or organic), and/or I/O interface 825, any of which can be coupled on a communication mechanism or bus 830 for communicating information or embedded in the computer device 805. I/O interface 825 is also configured to receive images from cameras or provide images to projectors or displays, depending on the desired implementation. [0091] Computer device 805 can be communicatively coupled to input/user interface 835 and output device/interface 840. Either one or both of input/user interface 835 and output device/interface 840 can be a wired or wireless interface and can be detachable. Input/user interface 835 may include any device, component, sensor, or interface, physical or virtual, that can be used to provide input (e.g., buttons, touch-screen interface, keyboard, a pointing/ cursor control, microphone, camera, braille, motion sensor, optical reader, and/or the like). Output device/interface 840 may include a display, television, monitor, printer, speaker, braille, or the like. In some example implementations, input/user interface 835 and output device/interface 840 can be embedded with or physically coupled to the computer device 805. In other example implementations, other computer devices may function as or provide the functions of input/user interface 835 and output device/interface 840 for a computer device 805.

[0092] Examples of computer device 805 may include, but are not limited to, highly mobile devices (e.g., smartphones, devices in vehicles and other machines, devices carried by humans and animals, and the like), mobile devices (e.g., tablets, notebooks, laptops, personal computers, portable televisions, radios, and the like), and devices not designed for mobility (e.g., desktop computers, other computers, information kiosks, televisions with one or more processors embedded therein and/or coupled thereto, radios, and the like).

[0093] Computer device 805 can be communicatively coupled (e.g., via I/O interface 825) to external storage 845 and network 850 for communicating with any number of networked components, devices, and systems, including one or more computer devices of the same or different configuration. Computer device 805 or any connected computer device can be functioning as, providing services of, or referred to as a server, client, thin server, general machine, special-purpose machine, or another label.

[0094] I/O interface 825 can include, but is not limited to, wired and/or wireless interfaces using any communication or I/O protocols or standards (e.g., Ethernet, 802.1 lx, Universal System Bus, WiMax, modem, a cellular network protocol, and the like) for communicating information to and/or from at least all the connected components, devices, and network in computing environment 800. Network 850 can be any network or combination of networks (e.g., the Internet, local area network, wide area network, a telephonic network, a cellular network, satellite network, and the like). [0095] Computer device 805 can use and/or communicate using computer-usable or computer-readable media, including transitory media and non-transitory media. Transitory media include transmission media (e.g., metal cables, fiber optics), signals, carrier waves, and the like. Non-transitory media include magnetic media (e.g., disks and tapes), optical media (e.g., CD ROM, digital video disks, Blu-ray disks), solid state media (e.g., RAM, ROM, flash memory, solid-state storage), and other non-volatile storage or memory.

[0096] Computer device 805 can be used to implement techniques, methods, applications, processes, or computer-executable instructions in some example computing environments. Computer-executable instructions can be retrieved from transitory media, and stored on and retrieved from non-transitory media. The executable instructions can originate from one or more of any programming, scripting, and machine languages (e.g., C, C++, C#, Java, Visual Basic, Python, Perl, JavaScript, and others).

[0097] Processor(s) 810 can execute under any operating system (OS) (not shown), in a native or virtual environment. One or more applications can be deployed that include logic unit 860, application programming interface (API) unit 865, input unit 870, output unit 875, and inter-unit communication mechanism 895 for the different units to communicate with each other, with the OS, and with other applications (not shown). The described units and elements can be varied in design, function, configuration, or implementation and are not limited to the descriptions provided. Processor(s) 810 can be in the form of hardware processors such as central processing units (CPUs) or in a combination of hardware and software units.

[0098] In some example implementations, when information or an execution instruction is received by API unit 865, it may be communicated to one or more other units (e.g., logic unit 860, input unit 870, output unit 875). In some instances, logic unit 860 may be configured to control the information flow among the units and direct the services provided by API unit 865, input unit 870, output unit 875, in some example implementations described above. For example, the flow of one or more processes or implementations may be controlled by logic unit 860 alone or in conjunction with API unit 865. The input unit 870 may be configured to obtain input for the calculations described in the example implementations, and the output unit 875 may be configured to provide output based on the calculations described in example implementations. [0099] In a first aspect, processor(s) 810 can be configured to execute a method or instructions for automatically iteratively generating features used to train a machine learning model, such a method or instructions involving a) deriving features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables; and derive the features from the prepopulated features based on a fitness criteria as illustrated at 102 of FIG. 1; b) selecting a subset of the derived features with model-based feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model as illustrated at 103 and 104 of FIG. 1; c) iteratively executing steps a) to b) until an exit criteria is met as illustrated at 102 to 104 of FIG. 1; and applying the selected subset of derived features that met the exit criteria to the machine learning model as illustrated at 105 of FIG. 1. As described with respect to FIG. 1, the exit criteria can be based on a model evaluation result of the machine learning model, or from other desired exit criteria as described with respect to FIG. 1. As described herein, the machine learning model can be configured to solve the intrusion detection problem, and can also be configured to solve failure detection depending on the desired implementation.

[0100] In a second aspect, processor(s) 810 can be configured to execute the method or instructions as described in the first aspect, wherein the deriving features with the evolutionary optimization process involves using a correlation coefficient as the fitness criteria, the evolutionary optimization process configured to drop derived features based on a linear correlation coefficient as illustrated at 201 to 203 of FIG. 2.

[0101] In a third aspect, processor(s) 810 can be configured to execute the method or instructions as described in any of the above aspects, wherein the deriving features with the evolutionary optimization process involves obtaining the pre-processed variables from preprocessing raw data; calculating the fitness criteria which uses correlations between each of the pre-processed features and a target variable; for an absolute value of a correlation coefficient of the fitness criteria being above a predefined threshold for said each of the pre- processed features, adding said each of the pre-processed features to a result set; calculating another correlation coefficient between said each of the pre-processed features and other features in the result set; for another absolute value of the another coefficient between said each of the pre-processed features with the other features being above another predefined threshold, retaining one of the said each of the pre-processed variables and the other features that have a highest correlation with the target variable; for the absolute value of a correlation coefficient of the fitness criteria not being above the predefined threshold for said each of the pre- processed features, for ones of the pre-processed variables associated with said each of the pre- processed features being a derived feature, removing the ones of the pre-processed variables as described with respect to FIG. 2.

[0102] In a fourth aspect, processor(s) 810 can be configured to execute the method or instructions of any of the above aspects, and further involve for the result set meeting an exit criteria, returning the result set as the derived features; for the result set not meeting the exit criteria, generating a new feature population from the operators by using evolutionary operation techniques; and re-executing the evolutionary optimization process as illustrated in FIG. 2.

[0103] In a fifth aspect, processor(s) 810 can be configured to execute the method or instructions of any of the above aspects, wherein the re-executing the evolutionary optimization process involves multiple runs of multiple random seeds with random initialization and aggregates results as described with respect to FIG. 2.

[0104] In a sixth aspect, processor(s) 810 can be configured to execute the method or instructions of any of the above aspects, wherein the selecting the subset of the derived features is based on the Bayesian optimization, the selecting involving randomly sampling one or more subsets of features from the derived features; obtaining performance metrics of trained machine learning models trained from the randomly sampled one or more subsets of features; training a Gaussian regression model by using the randomly sampled one or more subsets of features and the performance metrics; calculating an acquisition function associated with the trained Gaussian regression model; selecting an optimal set of features based on the acquisition function; and training the machine learning model with the optimal set of features and to obtain additional performance metrics as described with respect to 301 to 305 of FIG. 3.

[0105] In a seventh aspect, processor(s) 810 can be configured to execute the method or instructions as that of any of the above aspects, and further involve, for an exit criteria being met, returning the optimal set of features as the selected subset of the derived features; for the exit criteria not being met, re-executing the training of the Gaussian regression model from the randomly sampled one or more subsets of features, the performance metrics, the optimal set of features, and the additional performance metrics as illustrated at 305 to 308 of FIG. 3. [0106] In an eight aspect, processor(s) 810 can be configured to execute the method or instructions as that of any of the above aspects, wherein selecting the subset of the derived features is based on reinforcement learning, the selecting involving randomly sampling one or more subset of features from the derived features; obtaining performance metrics of trained machine learning models trained from the randomly sampled one or more subsets of features; calculating a feature importance for each feature of the randomly sampled one or more subsets of features; selecting a first set of features from the randomly sampled one or more subsets of features based on importance, and a second set of features from the randomly sampled one or more subset of features exclusive of the first set of features randomly; training the machine learning model with the first set of features and the second set of features to obtain additional performance metrics; updating the feature importance for the each feature based on the additional performance metrics; and stopping the feature selection process if the exit criteria is met; otherwise, continuing the process with selecting features with exploration and exploitation as illustrated in FIG. 4.

[0107] In a ninth aspect, processor(s) 810 can be configured to execute the method or instructions of any of the above aspects, and further involve for an exit criteria being met, returning the first set of features and the second set of features as the selected subset of the derived features; for the exit criteria not being met, reselecting the first set of features and the second set of features based on the updated feature importance; and retraining the machine learning model with the reselected first set of features and the second set of features as illustrated at 405 to 409 from FIG. 4.

[0108] In a tenth aspect, processor(s) 810 can be configured to execute the method or instructions as that of any of the above aspects, and further involve obtaining a list of important features from running the feature selection process, wherein the feature selection process is based on Bayesian optimization; and using the obtained important features for the exploitation.

[0109] In an eleventh aspect, processor(s) 810 can be configured to execute the method or instructions as that of any of the above aspects, wherein the applying the subset of derived features that met the exit criteria to the machine learning model is directed to an intrusion detection problem; the applying involving executing a model building process that applies the selected subset of the derived features to build an intrusion detection model and an intrusion mode identification model; and executing a model application process that generates additional features based on real time data and feeds the additional features into the intrusion detection model and intrusion model identification model to generate an intrusion score and an intrusion mode as illustrated in FIG. 5.

[0110] In a twelfth aspect, processor(s) 810 can be configured to execute the method or instructions according to any of the above aspects, wherein the machine learning model is an intrusion detection model configured to dynamically detect intrusion from input features as illustrated in FIG. 6.

[OHl] In a thirteenth aspect, processor(s) 810 can be configured to execute the method or instructions according to any of the aspects, wherein the machine learning model is a failure detection model configured to conduct failure detection from input features.

[0112] Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations within a computer. These algorithmic descriptions and symbolic representations are the means used by those skilled in the data processing arts to convey the essence of their innovations to others skilled in the art. An algorithm is a series of defined steps leading to a desired end state or result. In example implementations, the steps carried out require physical manipulations of tangible quantities for achieving a tangible result.

[0113] Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” “displaying,” or the like, can include the actions and processes of a computer system or other information processing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system’s registers and memories into other data similarly represented as physical quantities within the computer system’s memories or registers or other information storage, transmission or display devices.

[0114] Example implementations may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include one or more general-purpose computers selectively activated or reconfigured by one or more computer programs. Such computer programs may be stored in a computer readable medium, such as a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may involve tangible mediums such as, but not limited to optical disks, magnetic disks, read-only memories, random access memories, solid state devices and drives, or any other types of tangible or non-transitory media suitable for storing electronic information. A computer readable signal medium may include mediums such as carrier waves. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Computer programs can involve pure software implementations that involve instructions that perform the operations of the desired implementation.

[0115] Various general-purpose systems may be used with programs and modules in accordance with the examples herein, or it may prove convenient to construct a more specialized apparatus to perform desired method steps. In addition, the example implementations are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the techniques of the example implementations as described herein. The instructions of the programming language(s) may be executed by one or more processing devices, e.g., central processing units (CPUs), processors, or controllers.

[0116] As is known in the art, the operations described above can be performed by hardware, software, or some combination of software and hardware. Various aspects of the example implementations may be implemented using circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which if executed by a processor, would cause the processor to perform a method to carry out implementations of the present application. Further, some example implementations of the present application may be performed solely in hardware, whereas other example implementations may be performed solely in software. Moreover, the various functions described can be performed in a single unit, or can be spread across a number of components in any number of ways. When performed by software, the methods may be executed by a processor, such as a general purpose computer, based on instructions stored on a computer-readable medium. If desired, the instructions can be stored on the medium in a compressed and/or encrypted format.

[0117] Moreover, other implementations of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the techniques of the present application. Various aspects and/or components of the described example implementations may be used singly or in any combination. It is intended that the specification and example implementations be considered as examples only, with the true scope and spirit of the present application being indicated by the following claims.

Claims

CLAIMS What is claimed is:

1. A method for automatically iteratively generating features used to train a machine learning model, the method comprising: a) deriving the features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables; and derive the features from the pre-populated features based on a fitness criteria; b) selecting a subset of the derived features with model-based feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model; c) iteratively executing steps a) to b) until an exit criteria is met; and applying the selected subset of derived features that met the exit criteria to the machine learning model.

2. The method of claim 1, wherein the deriving the features with evolutionary optimization process comprises using a correlation coefficient as the fitness criteria, the evolutionary optimization process configured to drop ones of the features based on a linear correlation coefficient.

3. The method of claim 2, wherein the deriving the features with the evolutionary optimization process comprises: obtaining the pre-processed variables from preprocessing raw data;

- 26 - calculating the fitness criteria which uses correlations between each of the pre- processed features and a target variable; for an absolute value of the correlation coefficient of the fitness criteria being above a predefined threshold for said each of the pre-processed features: add said each of the pre-processed features to a result set; calculating another correlation coefficient between said each of the pre- processed features and other features in the result set; for another absolute value of the another coefficient between said each of the pre-processed features with the other features being above another predefined threshold, retaining one of the said each of the pre-processed variables and the other features that have a highest correlation with the target variable; for the absolute value of the correlation coefficient of the fitness criteria not being above the predefined threshold for said each of the pre-processed features: for ones of the pre-processed variables associated with said each of the pre- processed features being a derived feature, removing the ones of the pre-processed variables.

4. The method of claim 3, future comprising: for the result set meeting the exit criteria, returning the result set as the derived features; for the result set not meeting the exit criteria, generating a new feature population from the operators by using evolutionary operation techniques; and re-executing the evolutionary optimization process.

5. The method of claim 4, wherein the re-executing the evolutionary optimization process involves multiple runs of multiple random seeds with random initialization and aggregates results.

6. The method of claim 1, wherein the selecting the subset of the derived features is based on the Bayesian optimization, the selecting comprising: randomly sampling one or more subsets of the features from the derived features; obtaining performance metrics of trained machine learning models trained from the randomly sampled one or more subsets of the features; training a Gaussian regression model by using the randomly sampled one or more subsets of the features and the performance metrics; calculating an acquisition function associated with the trained Gaussian regression model; selecting an optimal set of features based on the acquisition function; training the machine learning model with the optimal set of features and to obtain additional performance metrics.

7. The method of claim 6, further comprising: for the exit criteria being met, returning the optimal set of features as the selected subset of the derived features; for the exit criteria not being met, re-executing the training of the Gaussian regression model from the randomly sampled one or more subsets of the features, the performance metrics, the optimal set of features, and the additional performance metrics.

8. The method of claim 1, wherein selecting the subset of the derived features is based on reinforcement learning, the selecting comprising: randomly sampling one or more subset of features from the derived features; obtaining performance metrics of trained machine learning models trained from the randomly sampled one or more subsets of features; calculating a feature importance for each feature of the randomly sampled one or more subsets of features; selecting a first set of features from the randomly sampled one or more subsets of features based on importance, and a second set of features from the randomly sampled one or more subset of features exclusive of the first set of features randomly; training the machine learning model with the first set of features and the second set of features to obtain additional performance metrics; updating the feature importance for the each feature based on the additional performance metrics; stopping the feature selection process if the exit criteria is met; otherwise, continuing the process with selecting features with exploration and exploitation.

9. The method of claim 8, further comprising: for the exit criteria being met, returning the first set of features and the second set of features as the selected subset of the derived features; for the exit criteria not being met:

- 29 - reselecting the first set of features and the second set of features based on the updated feature importance; and retraining the machine learning model with the reselected first set of features and the second set of features.

10. The method of claim 8, further comprising: obtaining a list of important features from running the feature selection process, wherein the feature selection process is based on Bayesian optimization; and using the obtained important features for the exploitation.

11. The method of claim 1, wherein the applying the subset of the derived features that met the exit criteria to the machine learning model is directed to an intrusion detection problem; the applying comprising: executing a model building process that applies the selected subset of the derived features to build an intrusion detection model and an intrusion mode identification model; and executing a model application process that generates additional features based on real time data and feeds the additional features into the intrusion detection model and intrusion mode identification model to generate an intrusion score and an intrusion mode.

12. The method of claim 1, wherein the machine learning model is an intrusion detection model configured to dynamically detect intrusion from input features.

13. The method of claim 1, wherein the machine learning model is a failure detection model configured to conduct failure detection from input features.

- 30 -

14. A computer program, storing instructions for automatically iteratively generating features used to train a machine learning model, the instructions comprising: a) deriving features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables; and derive the features from the pre-populated features based on a fitness criteria; b) selecting a subset of the derived features with model-based feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model; c) iteratively executing steps a) to b) until an exit criteria is met; and applying the selected subset of derived features that met the exit criteria to the machine learning model.

15. An apparatus, configured to automatically iteratively generating features used to train a machine learning model, the apparatus comprising: a processor, configured to execute instructions comprising: a) deriving features with an evolutionary optimization process configured to: pre-populate the features from pre-processed variables and operators associated with the pre-processed variables;

- 31 - and derive the features from the pre-populated features based on a fitness criteria; b) selecting a subset of the derived features with model-based feature selection techniques based on one of Bayesian optimization or reinforcement learning as tested against the machine learning model; c) iteratively executing steps a) to b) until an exit criteria is met; and applying the selected subset of derived features that met the exit criteria to the machine learning model.

- 32 -