US20200356904A1 - Machine Learning Model Evaluation - Google Patents

Machine Learning Model Evaluation

Info

Publication number
US20200356904A1
Authority
US
United States
Prior art keywords
samples
data
abnormal data
attacks
normal
Legal status
Abandoned
Application number
US16/945,697
Inventor
Eamon Hirata Jordan
Chad Kumao Takahashi
Ryan Susumu Ito
Current Assignee
RESURGO LLC
Original Assignee
RESURGO LLC
Application filed by RESURGO LLC
Priority to US16/945,697
Publication of US20200356904A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 - Detecting local intrusion or implementing counter-measures
    • G06F 21/552 - Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G06N 7/005
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 - Computing arrangements based on specific mathematical models
    • G06N 7/01 - Probabilistic graphical models, e.g. probabilistic networks
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 - Arrangements for monitoring or testing data switching networks
    • H04L 43/08 - Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 - Arrangements for monitoring or testing data switching networks
    • H04L 43/16 - Threshold monitoring
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 - Network architectures or network communication protocols for network security
    • H04L 63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1425 - Traffic logging, e.g. anomaly detection


Abstract

Testing machine learning sensors by adding obfuscated training data to test data, and performing real-time model fit analysis on live network traffic to determine whether to retrain.

Description

  • This application is a continuation of U.S. patent application Ser. No. 15/373,425, now U.S. Pat. No. 10,733,530, which is incorporated herein by reference.
  • TECHNICAL FIELD
  • The present invention relates to evaluation of machine learning models that are used in sensors for defending computer networks against cyber attacks. The invention is also applicable to any other fields in which machine learning sensors are used to detect obfuscated abnormal data in sets of data containing both normal and abnormal data, such as in so-called data mining and “big data”, including bio-informatics, financial analysis, forgery detection, and other fields.
  • BACKGROUND ART
  • Intrusion detection for cyber defense of networks (defense of computer networks against cyber attacks) has two main pitfalls that result in malicious penetration: computer networks are always evolving, resulting in new, unknown vulnerabilities that are subject to new attacks, and hackers are always obfuscating known (legacy) attack delivery methods to bypass security mechanisms. Intrusion detection sensors that utilize machine learning models are now being deployed to identify these new attacks and these obfuscated legacy attack delivery methods (Peter, M., Sabu, T., Zakirul, B., Ryan, K., Robin, D., & Jose, C., “Security in Computing and Communications”, 4th International Symposium, SSCC 2016 (p. 400). Jaipur, India: Springer (2016), incorporated herein by reference, and U.S. Pat. No. 8,887,285 to Resurgo entitled “Heterogeneous Sensors for Network Defense”, incorporated herein by reference).
  • State-of-the-art machine learning model evaluation uses statistical analysis to determine model fit for a particular set of data. This works well for typical uses of machine learning (e.g. speech recognition, weather forecasting), but fails to meet cyber defense standards when using machine learning models to detect cyber attacks on networks. Currently, statistical analysis of cyber defense models (machine learning models for defense against cyber attacks) does not test for obfuscated attacks, and is only applied to archived data sets, not to (1) real-time, evolving network traffic; or (2) real-time attack detection.
  • A. Background of Obfuscation to Thwart Cyber Defense
  • Cyber attacks, such as malware, should not be thought of as a single unit based on result. Instead, cyber attacks can be broken into their functional components: propagation (i.e. attack delivery) method, exploit, and payload (Herr, T., “PrEP: A Framework for Malware & Cyber Weapons”, George Washington University, Political Science Department & Cyber Security and Policy Research Institute, Washington D.C. (2014), incorporated herein by reference). The propagation (attack delivery) method is the means of transporting malicious code from origin to target. The exploit component takes advantage of vulnerabilities in the target system to enable infection and the operation of the payload. The payload is code written to achieve some desired malicious end, such as deleting data. Using this three-part component framework for analyzing cyber attacks becomes important when trying to detect and protect against cyber attacks. Many signatures (for signature-based intrusion detection systems) are based on the payload portion of the attack, because the propagation method and exploit components can vary substantially from target to target.
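  • For illustration only, this three-part component framework can be modeled as a simple record. The following Python sketch uses assumed field names (they are not from the patent) and mirrors the trained/obfuscated attack example given later in this description:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CyberAttack:
        propagation: str  # means of transporting malicious code from origin to target
        exploit: str      # takes advantage of a vulnerability to enable infection
        payload: str      # code written to achieve the malicious end

    trained = CyberAttack("website user input field", "buffer overflow", "trojan")
    # An obfuscated variant may keep the same propagation method and exploit
    # but carry a different payload:
    variant = CyberAttack("website user input field", "buffer overflow", "virus")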
  • U.S. Pat. No. 8,887,285 to Resurgo, incorporated herein by reference, discloses how to combine signature-based and machine learning sensors for a more comprehensive computer network defense. This patent details how to build a data set using detection evasion techniques to train the machine learning sensor to cover the “blind-spot” of the signature-based sensor.
  • Detection evasion techniques in all fields (not just network security) can include obfuscation, fragmentation, encryption, other ways to change form or add variant forms (sometimes called polymorphous or polymorphic), and other detection evasion techniques, and are all referred to in this specification and claims collectively and singly as “obfuscating”, “obfuscation” or “obfuscated attacks”.
  • Detection evasion techniques as applied to network security were described in more detail in the Resurgo patent's background section:
  • The blind spot problem for signature-based sensors is compounded by the fact that use of evasion techniques by hackers has proven very effective at enabling known exploits to escape detection. Evasion techniques allow a hacker to sufficiently modify the pattern of an attack so that the signature will fail to produce a match (during intrusion detection). The most common evasion techniques are obfuscation, fragmentation, and encryption. Obfuscation is hiding intended meaning in communication, making communication confusing, willfully ambiguous, and harder to interpret. In network security, obfuscation refers to methods used to obscure an attack payload from inspection by network protection systems. For instance, an attack payload can be hidden in web protocol traffic. Fragmentation is breaking a data stream into segments and sending those segments out of order through the computer network. The segments are reassembled in correct order at the receiving side. The shuffling of the order of data stream segments can change the known attack signature due to the reordering of communication bits. Encryption is the process of encoding messages (or information) in such a way that eavesdroppers or hackers cannot read it, but that authorized parties can. Both the authorized sender and receiver must have an encryption key and a decryption key in order to encode and decode communication. In network attacks, the attack payload can often be encoded/encrypted such that the signature is no longer readable by detection systems. While each evasion technique changes the attack pattern differently, it is important to note that the goal is the same: change the attack pattern enough to no longer match published attack signatures and hence to avoid intrusion detection.
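  • As a toy illustration of why such evasion works, consider a naive detector that inspects traffic for an exact byte pattern. A simple encoding step (standing in here for obfuscation or encryption) removes the match; the signature and payload below are hypothetical:

    import base64

    SIGNATURE = b"MALICIOUS-PAYLOAD"  # hypothetical published attack signature

    def signature_match(traffic: bytes) -> bool:
        # Naive signature-based inspection: exact substring match.
        return SIGNATURE in traffic

    attack = b"GET /form HTTP/1.1 ... MALICIOUS-PAYLOAD ..."
    print(signature_match(attack))                    # True: known pattern detected
    print(signature_match(base64.b64encode(attack)))  # False: encoding evades the match
    # Fragmentation similarly reorders stream segments so that the contiguous
    # pattern never appears in any single inspected segment.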
  • Hackers use the above methods, among others, to vary the propagation method, exploit, and payload components to create obfuscated attacks, even though the name of the cyber attack, and its result, may be the same. The process of U.S. Pat. No. 8,887,285 focuses on the training of machine learning models with obfuscated attacks, but does not consider any methods for evaluating the resulting models beyond standard statistical techniques.
  • U.S. Pat. No. 9,497,204 B2 to UT Battelle, LLC, incorporated herein by reference, and provisional patent application 61/872,047 (from which U.S. Pat. No. 9,497,204 B2 claims priority), incorporated herein by reference, disclose a semi-supervised learning module connected to a network node. The learning module uses labeled and unlabeled data to train a semi-supervised machine learning sensor. However, semi-supervised learning is a model training technique and not appropriate for model performance evaluation or identifying obfuscated attacks.
  • B. Background of Machine Learning
  • Learning models, statistical models, analytical models and essentially all mathematical models can be used to explain, predict, automate, and analyze information about the nature of things. Fields of study such as machine learning, statistical analysis, and pattern recognition are actively researching new ways to process and handle datasets. To develop these models, a sufficient number of samples which have been (to the best of ability) properly labeled (i.e. classified) must be supplied. The correctly labeled (i.e. classified) samples are known as the “ground truths”. These samples do not change, so they are static. For machine learning purposes, these samples are not sorted or segregated by their classifications: a set of samples includes both normal network traffic and cyber attacks. The samples of normal network traffic preferably are obtained from the network that is to be protected. The samples of cyber attacks are preferably also obtained from the network that is to be protected, but if there is an insufficient number of samples of cyber attacks on that network, then some or all of the samples of cyber attacks can be provided from an existing repository of cyber attacks.
  • At a basic level, samples of ground truths can be split into two different datasets: training set and test set. The training set is a sufficiently large number of samples, which contains a sufficiently large number of ground truths, used to train (i.e. generate models using some model generating algorithm, and then select from those models). The training set is used to tune the parameters of a model generation algorithm (Bishop, C. (n.d.), “Pattern Recognition and Machine Learning”, New York: Springer, incorporated herein by reference). In machine learning nomenclature, this is known as the training phase, or learning phase. This training phase involves providing a model generating algorithm with the training data; generating multiple models (that is, generating multiple models using multiple different sets of tuned parameters) that each segregate the training data by label or classification; performing a statistical analysis on the performance of each model (or set of parameters) to determine whether the training data has been accurately segregated; and then selecting the model (or set of parameters) that provides the most accurate segregation, which becomes the trained model.
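  • A minimal sketch of this training phase, assuming scikit-learn and synthetic data (the patent does not prescribe any particular library or algorithm):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Synthetic stand-ins: X holds feature vectors of samples, y holds their
    # ground-truth labels (0 = normal traffic, 1 = cyber attack).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = rng.integers(0, 2, size=1000)

    # Split the ground truths into a training set and a test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Generate multiple models from multiple different parameter sets, then
    # select the one that segregates the training data most accurately.
    candidates = [SVC(C=c).fit(X_train, y_train) for c in (0.1, 1.0, 10.0)]
    trained_model = max(candidates, key=lambda m: m.score(X_train, y_train))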
  • The multiple different sets of tuned parameters used to generate multiple models are preferably generated by methods such as exhaustive cross-validation, grid search K-fold cross validation, leave-p-out cross-validation, leave-one-out cross-validation, k-fold cross-validation, or repeated random sub-sampling validation.
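  • For example, grid search with k-fold cross-validation (one of the methods listed above) generates and scores candidate parameter sets automatically; a sketch continuing the previous example, with scikit-learn assumed:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Each grid point is one candidate parameter set; cv=5 scores each set
    # by 5-fold cross-validation on the training data.
    param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01]}
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X_train, y_train)  # X_train, y_train from the sketch above
    best_model = search.best_estimator_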
  • The training data must have a sufficient number of samples with ground truths (having correct labels or classifications) to train a model that performs well in supervised learning. This is true for all machine learning; however, some training techniques try to utilize unlabeled data. Unlabeled data is simply data that has not been classified by an expert and is therefore unknown in content. Supplementing ground truth data with unknown data in the training phase can be a very cost effective approach because creating or classifying ground truths can be an exhaustive process. This approach, called semi-supervised learning, is part of the training phase of machine learning and is also disclosed in U.S. Pat. No. 9,497,204 B2 to UT Battelle, LLC, incorporated herein by reference, and provisional patent application 61/872,047 (from which U.S. Pat. No. 9,497,204 B2 claims priority), incorporated herein by reference. Machine learning utilizing only unlabeled data (i.e. no ground truths), also known as unsupervised learning, is closer to anomaly detection but still can be referred to as machine learning. No matter what percentage of ground truth is used in training, the algorithms work to classify the data based on optimizing parameters.
  • Different model generating algorithms, such as support vector machine learning algorithms, generate models in their own unique ways. A support vector machine learning algorithm labels or classifies training data by finding the best separating hyperplane (where a hyperplane is a point, line, plane, or volume, depending on the number of dimensions) in a multidimensional space into which the training data has been mapped, separating the space into two or more parts, with each part corresponding to a label or class of the ground truths of the training data. Optionally, support vector machine learning algorithms have tunable parameters which can be adjusted to alter how ambiguous data points are labelled or classified.
  • A support vector machine learning algorithm generates a mathematical model which can assign or determine classifications of new data based on the classifications determined during training (Burges, C., “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery, 121-167 (1998), incorporated herein by reference). Other model generating algorithms, such as Logistic Regression, Decision Trees, Naive Bayes, Random Forests, Adaboost, Neural Networks, and K-Means Clustering, similarly generate models that segregate training data into patterns, which can allow labeling or classification of data.
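  • A small sketch of the hyperplane view described above: a linear support vector machine learns a separating hyperplane w·x + b = 0, which the fitted scikit-learn model (an assumed tooling choice) exposes through coef_ and intercept_:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Four training samples in a 2-dimensional space, two classes.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([0, 0, 1, 1])

    svm = LinearSVC().fit(X, y)
    w, b = svm.coef_[0], svm.intercept_[0]  # the separating hyperplane (a line here)

    # New data is classified by which side of the hyperplane it falls on.
    print(svm.predict([[0.1, 0.0], [1.0, 0.9]]))  # -> [0 1]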
  • In machine learning, a trained algorithm will output a model that is useful to make better decisions or predictions, discover patterns in data, or describe the nature of something. Generally, the model generation algorithms are designed to minimize an error equation or optimize some other model parameters. The error rate of a model evaluated on a training set is known as the training error.
  • The second set of samples containing ground truths is referred to as the test set. The test set may include the remaining ground truths that were not used in the training set, or only a random portion of the remaining ground truths. The model's performance is evaluated on the test set by measuring the error of the test set. This is known as test error (Friedman, J., Tibshirani, R., & Hastie, T., “The Elements of Statistical Learning”, New York: Springer (2001), incorporated herein by reference). The test error of a model is determined by the same methods as the training error. A test set is important because it determines the model's ability to perform on new data (data that did not train the model). In most applications, comparing test error of different generated models allows selecting an optimal model for real-world application.
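  • Training error and test error reduce to the same misclassification-rate measurement applied to different sample sets; a sketch reusing the trained model and data splits from the earlier example:

    def error_rate(model, X, y):
        # Fraction of samples whose predicted label differs from the ground truth.
        return 1.0 - model.score(X, y)

    training_error = error_rate(trained_model, X_train, y_train)
    test_error = error_rate(trained_model, X_test, y_test)
    # Comparing the test error of different generated models guides selection
    # of an optimal model for real-world application.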
  • C. Background of Machine Learning in Cyber Defense
  • Once cyber defense machine learning models/sensors graduate from the previously described static data testing, they are inserted onto actual computer networks that need cyber defense. At this point, the statistical predictive or determinative calculations that were used previously no longer apply. The predictive or determinative calculations of training errors and test errors only work with a given data set containing ground truths. Actual computer network traffic has no ground truths (because it has not been previously classified) or answer keys, and therefore no means by which to assess the fit of the machine learning model. Model fit testing (i.e. goodness of fit) is still required though, because cyber networks can change overnight or over a prolonged timeframe, thus requiring periodic model retraining. In machine learning, model retraining is effectively starting from scratch, and requires building a brand new data set with ground truths (from which new training data and new test data can be obtained) for training and testing, which can be so burdensome as to prohibit using machine learning sensors at all.
  • Model fit, referred to as goodness of fit, is how well a trained model does its designed task (e.g. predicts, determines, decides, classifies, or any other model performing function). Model fit traditionally has been calculated with different techniques such as chi-squared test, Kolmogorov-Smirnov test, coefficient of determination, lack of fit sum of squares, etc. The essence of these techniques is to obtain a value that describes the model fit and to compare that to some desired confidence value (i.e. a model fit value that is better than the desired confidence value is determined to be a good fit for the given data). Model fit can also be determined by the test set error relative to some desired confidence value (i.e. if the model produces a low test error when given a test set, then the model may be a good fit). Of course, model fit has to be recalculated for any new data with ground truths because model fit is relative to a static data set.
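  • As one hedged example, a two-sample Kolmogorov-Smirnov test (one of the techniques listed above) can compare a feature's distribution in new ground-truth data against the training data, judging the result against a desired confidence value; SciPy and the threshold below are assumptions:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    training_feature = rng.normal(0.0, 1.0, size=500)  # feature values in the training data
    new_feature = rng.normal(0.3, 1.2, size=500)       # the same feature in new data

    statistic, p_value = stats.ks_2samp(training_feature, new_feature)
    ALPHA = 0.05  # assumed desired confidence value
    good_fit = p_value > ALPHA  # a small p-value means the distributions differ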
  • In current practice, model fit analysis on live networks does not take place, and model retraining is triggered by human subjective assessment based on observables, including the following:
  • Machine learning model output (i.e. attack alert log) quantity becomes too great for human analysis. In other words, the model alerts on attacks (or false attacks) too often for human analysts to be able to research or react to every alert. Model retraining is implemented to reduce the quantity of alerts.
  • Forensic analysis of the network traffic (sometimes initiated by the machine learning model's alert logs) determines that the machine learning model had false-positives. In sensing, false-positive alerts are those in which the sensor (or model) indicates an event when none really occurred. Model retraining is enacted as a reactionary step to reduce the quantity of false-positives.
  • Forensic analysis of the network traffic (sometimes initiated by a signature-based sensor alert or by a network disruption) determines that the machine learning model had false-negatives. In sensing, false-negatives are a lack of alert when an event actually occurred. Model retraining is enacted as a reactionary step to reduce the quantity of false-negatives.
  • These three current methods for determining the need for model retraining are based on human subjective assessment and not on actual model fit to the data (actual network traffic). As a result, decisions are being made in machine learning cyber defense without appropriate context and without knowledge of the cyber defense implications. For example, if hackers know a network is employing machine learning sensors, they could create attacks that generate a large quantity of alerts. The sudden increase and sheer quantity of alerts would cause the analysts to request model retraining while they either ignore the machine learning alerts or set them aside for retroactive, forensic analysis. With the machine learning sensing being ignored, the hackers could then be free to go after their primary goal, until the model has been retrained and reinstalled in the network.
  • D. Need For A New Validation Test for Pre-Deployment (i.e. Laboratory) Model Fit Testing
  • Because most cyber network attacks from sophisticated actors (i.e. the really dangerous hackers) are actually the obfuscated type (Du, H., “Probabilistic Modeling and Inference for Obfuscated Network Attack Sequences”, Thesis, Rochester Institute of Technology (2014), incorporated herein by reference), it was realized that conventional methods of model evaluation (training and test errors) were inadequate for intrusion detection. Conventional use of training errors and test errors only considers trained attacks (attacks based on training data) and untrained attacks (attacks not based on training data), while cyber defense model fit testing must include trained, untrained, and obfuscated attacks (attacks based on training data that has been obfuscated), for a more accurate prediction of model performance on actual cyber networks.
  • Conventional model testing considers obfuscated attacks to be part of untrained test data. However, this skews test error results because these attacks were not truly untrained data points. Effectively, the attacks were indeed used in model training, just not those obfuscated instances of the attacks. An example: a buffer overflow of a website input field is used in training, but an obfuscated version uses fragmentation of the communication stream to reorder the packets of the very same attack. Conventional model testing would consider the obfuscated attack as part of untrained (or new) data. However, this categorization is not valid for cyber network defense because hackers have countless methods for obfuscating known attacks: the known attacks can take many different forms, that is, they can be polymorphous or polymorphic. In cyber intrusion detection, machine learning model fit analysis needs to consider training errors and test errors for the following types of attacks:
  • Known attacks contained in training data;
  • Zero-day attacks or unknown attacks, unrelated to training data; and
  • Obfuscated (including polymorphous or polymorphic) attacks that are similar in propagation method, exploit, or payload components to training data. Example: a trained attack contains a website user input field (propagation method), buffer overflow (exploit), and a trojan (payload); an obfuscated attack could contain the same website user input field and buffer overflow exploit, but contain a different payload such as a virus.
  • Disclosure of the Invention
  • The present invention includes a process for improving a machine learning sensor for defending a computer network, comprising:
  • obtaining samples of normal network traffic from the network;
  • providing samples of cyber attacks from the network or from a repository of cyber attacks;
  • classifying each of the samples as either the normal network traffic or the cyber attacks, to create ground truths for the samples;
  • splitting the samples into a training set and a test set, with each of the sets containing samples of the normal network traffic and of the cyber attacks;
  • using a model generating algorithm to generate a variety of models for distinguishing between the normal network traffic and the cyber attacks in the training set;
  • obfuscating a portion of samples of the cyber attacks in the training set to create obfuscated attack samples;
  • adding the obfuscated attack samples to the test set to form an enhanced test set;
  • performing statistical analysis on performance of the models with the enhanced test set to determine intrusion detection capability error;
  • selecting based on the intrusion detection capability error one of the models that optimizes a desired model parameter as an optimal model for distinguishing between the normal network traffic and the cyber attacks; and
  • installing the optimal model in the sensor.
  • Optionally, the process further includes:
  • determining model fit of the sensor by applying anomaly detection, measuring similarity between live network traffic and the training set, and determining model overfit;
  • assigning thresholds and aggregating results of the thresholds to identify model fit in real time on a scale of model fits; and
  • activating model retraining based on the scale of model fits.
  • Optionally, the intrusion detection capability error is the sum of the following:
  • ratio of misclassified known attacks to total misclassified attacks, multiplied by test error of known attacks;
  • ratio of misclassified unknown attacks to total misclassified attacks, multiplied by test error of unknown attacks; and
  • ratio of misclassified obfuscated attacks to total misclassified attacks, multiplied by test error of obfuscated attacks.
  • The invention also includes a process for improving sensors using machine learning models for defending a computer network, comprising:
  • installing the sensors in the computer network;
  • determining model fit of the models by applying techniques of anomaly detection, measuring similarity between live network traffic and training data, and determining model overfit;
  • assigning thresholds and aggregating results of the thresholds for each of the techniques, to identify model fit in real time on a scale of model fits; and
  • activating model retraining based on the scale of model fits.
  • Preferably, the step of determining of model fit is performed using a technique selected from the group consisting of chi-squared test, Kolmogorov-Smirnov test, coefficient of determination, and lack of fit sum of squares.
  • Preferably also, the step of determining model fit is performed using generalization complexity measures selected from the group consisting of Rademacher Complexity, Kolmogorov Complexity, and Vapnik Chervonenkis Entropy.
  • Preferably also, the step of assigning thresholds on a scale of model fits is performed using a scale of model fits in which:
  • model fit is determined to be good if all three of the techniques are below said thresholds;
  • model fit is determined to be deteriorating if one of the three is above said threshold;
  • model fit is determined to be lacking if two of the three are above said threshold;
  • model fit is determined to be extremely lacking if all of the three are above the threshold. A sketch of this aggregation appears below.
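  • A minimal sketch of that aggregation in Python (the technique names, measurements, and threshold values are assumptions for illustration):

    FIT_SCALE = {0: "good", 1: "deteriorating", 2: "lacking", 3: "extremely lacking"}

    def model_fit(measurements: dict, thresholds: dict) -> str:
        # Count how many techniques exceed their assigned thresholds and map
        # the aggregate 0-3 score onto the scale of model fits.
        exceeded = sum(measurements[t] > thresholds[t] for t in thresholds)
        return FIT_SCALE[exceeded]

    print(model_fit(
        {"anomaly_detection": 0.8, "similarity": 0.2, "model_overfit": 0.4},
        {"anomaly_detection": 0.5, "similarity": 0.5, "model_overfit": 0.5},
    ))  # -> "deteriorating": one of the three techniques is above its threshold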
  • Preferably also, the activating of model retraining step is based on the aggregate of the goodness of model fit measurements for all 3 of said techniques.
  • Alternatively, the activating of model retraining is based on goodness of model fit for only one of the techniques.
  • Preferably, the selecting step is performed using a desired model parameter selected from the group consisting of intrusion detection capability error and test error.
  • The invention also includes a system, comprising:
  • a network of computers; and
  • a machine-learning based sensor deployed in the network, wherein the machine-learning based sensor has been trained by a process comprising:
  • obtaining samples of normal network traffic from the network;
  • providing samples of cyber attacks from the network or from a repository of cyber attacks;
  • classifying each of the samples as either the normal network traffic or the cyber attacks, to create ground truths for the samples;
  • splitting the samples into a training set and a test set, with each of the sets containing samples of the normal network traffic and of the cyber attacks;
  • using a model generating algorithm to generate a variety of models for distinguishing between the normal network traffic and the cyber attacks in the training set;
  • obfuscating a portion of samples of the cyber attacks in the training set to create obfuscated attack samples;
  • adding the obfuscated attack samples to the test set to form an enhanced test set;
  • performing statistical analysis on performance of the models with the enhanced test set to determine intrusion detection capability error;
  • selecting based on the intrusion detection capability error one of the models that optimizes a desired model parameter as an optimal model for distinguishing between the normal network traffic and the cyber attacks.
  • Preferably, the selecting step is performed using a desired model parameter selected from the group consisting of intrusion detection capability error and test error.
  • The invention further includes a computer program product stored in a computer readable medium for protecting a network, comprising:
  • a machine learning based sensor that has been trained by a process comprising:
  • obtaining samples of normal network traffic from the network;
  • providing samples of cyber attacks from the network or from a repository of cyber attacks;
  • classifying each of the samples as either the normal network traffic or the cyber attacks, to create ground truths for the samples;
  • splitting the samples into a training set and a test set, with each of the sets containing samples of the normal network traffic and of the cyber attacks;
  • using a model generating algorithm to generate a variety of models for distinguishing between the normal network traffic and the cyber attacks in the training set;
  • obfuscating a portion of samples of the cyber attacks in the training set to create obfuscated attack samples;
  • adding the obfuscated attack samples to the test set to form an enhanced test set;
  • performing statistical analysis on performance of the models with the enhanced test set to determine intrusion detection capability error;
  • selecting based on the intrusion detection capability error one of the models that optimizes a desired model parameter as an optimal model for distinguishing between the normal network traffic and the cyber attacks.
  • The invention further includes a process for improving a machine learning sensor, comprising:
  • obtaining samples of normal data from a set of data containing normal and abnormal data;
  • providing samples of abnormal data from the set or from a repository of abnormal data;
  • classifying each of the samples as either the normal data or the abnormal data, to create ground truths for the samples;
  • splitting the samples into a training set and a test set, with each of the sets containing samples of the normal data and of the abnormal data;
  • using a model generating algorithm to generate a variety of models for distinguishing between the normal data and the abnormal data in the training set;
  • obfuscating a portion of samples of the abnormal data in the training set to create obfuscated abnormal data samples;
  • adding the obfuscated abnormal data samples to the test set to form an enhanced test set;
  • performing statistical analysis on performance of the models with the enhanced test set to determine intrusion detection capability error;
  • selecting one of the models that optimizes a desired model parameter as an optimal model for distinguishing between the normal data and the abnormal data; and
  • installing the optimal model in the sensor.
  • The invention further includes a system, comprising:
  • a computer; and
  • a machine-learning based sensor deployed in the computer, wherein the machine-learning based sensor has been trained by a process comprising:
  • obtaining samples of normal data from a set of data containing normal and abnormal data;
  • providing samples of abnormal data from the set or from a repository of abnormal data;
  • classifying each of the samples as either the normal data or the abnormal data, to create ground truths for the samples;
  • splitting the samples into a training set and a test set, with each of the sets containing samples of the normal data and of the abnormal data;
  • using a model generating algorithm to generate a variety of models for distinguishing between the normal data and the abnormal data in the training set;
  • obfuscating a portion of samples of the abnormal data in the training set to create obfuscated abnormal data samples;
  • adding the obfuscated abnormal data samples to the test set to form an enhanced test set;
  • performing statistical analysis on performance of the models with the enhanced test set to determine intrusion detection capability error;
  • selecting one of the models that optimizes a desired model parameter as an optimal model for distinguishing between the normal data and the abnormal data; and
  • installing the optimal model in the computer.
  • The invention further includes a computer program product stored in a computer readable medium, comprising:
  • a machine learning based sensor that has been trained by a process comprising:
  • obtaining samples of normal data from a set of data containing normal and abnormal data;
  • providing samples of abnormal data from the set or from a repository of abnormal data;
  • classifying each of the samples as either the normal data or the abnormal data, to create ground truths for the samples;
  • splitting the samples into a training set and a test set, with each of the sets containing samples of the normal data and of the abnormal data;
  • using a model generating algorithm to generate a variety of models for distinguishing between the normal data and the abnormal data in the training set;
  • obfuscating a portion of samples of the abnormal data in the training set to create obfuscated abnormal data samples;
  • adding the obfuscated abnormal data samples to the test set to form an enhanced test set;
  • performing statistical analysis on performance of the models with the enhanced test set to determine intrusion detection capability error;
  • selecting one of the models that optimizes a desired model parameter as an optimal model for distinguishing between the normal data and the abnormal data.
  • BRIEF DESCRIPTION OF DRAWINGS
  • FIG. 1 is a flow diagram showing the steps of the present invention.
  • BEST MODE FOR CARRYING OUT THE INVENTION
  • The presently preferred best mode for practicing the present invention is illustrated by way of example in FIG. 1, which shows a new method (steps A, B, X, C, Y in FIG. 1) for evaluating machine learning model fit that adds two revolutionary steps (X and Y) to the steps of the prior art process (A, B, and C). Steps X and Y fill current model evaluation voids: step X adds model evaluation using obfuscated attacks, while step Y introduces model evaluation using real-time network traffic.
  • In step X, after the initial data is split into trained and untrained portions (i.e. known and unknown attacks, that is, training and test data) for model training, which is prior art, a sampling of evasion techniques is used to obfuscate the attacks in the training data, creating obfuscated training data. Those obfuscated attacks are then added to the test data, with the end result being an enhanced set of test data that now includes known, unknown, and obfuscated attacks (from the test data and the obfuscated training data).
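  • A minimal sketch of step X under stated assumptions (toy byte-string samples; the obfuscation functions are stand-ins, since real evasion techniques operate on actual network traffic):

    import base64
    import random

    def fragment(attack: bytes) -> bytes:
        # Toy stand-in for fragmentation: shuffle fixed-size segments.
        segments = [attack[i:i + 4] for i in range(0, len(attack), 4)]
        random.shuffle(segments)
        return b"".join(segments)

    def obfuscate(attack: bytes) -> bytes:
        # A sampling of evasion techniques (base64 encoding stands in for
        # obfuscation/encryption).
        return random.choice([base64.b64encode, fragment])(attack)

    training_attacks = [b"attack-sample-01", b"attack-sample-02", b"attack-sample-03"]
    test_data = [b"unknown-attack-99"]  # already holds known and unknown attacks

    obfuscated_training_data = [obfuscate(a) for a in random.sample(training_attacks, k=2)]
    enhanced_test_data = test_data + obfuscated_training_data
    # enhanced_test_data now covers known, unknown, and obfuscated attacks.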
A validation test then preferably calculates “Intrusion Detection Capability Error,” because this measure is more accurate for real-time operation of the machine learning model. Intrusion Detection Capability Error analyzes all categories of attack, including obfuscated attacks (which are overlooked in conventional testing):

Intrusion Detection Capability Error = α * Test Error[Known Attacks] + β * Test Error[Unknown Attacks] + γ * Test Error[Obfuscated Attacks]

where

α = (# Misclassified Known Attacks) / (Total # Misclassified Attacks), β = (# Misclassified Unknown Attacks) / (Total # Misclassified Attacks), γ = (# Misclassified Obfuscated Attacks) / (Total # Misclassified Attacks)
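A minimal Python sketch of this calculation follows, assuming the per-category test errors and misclassification counts have already been tallied from the enhanced test set; the dictionary keys and function name are illustrative.

```python
# Weighted error over known, unknown, and obfuscated attack categories,
# with each weight equal to that category's share of all misclassifications.
def intrusion_detection_capability_error(test_error, misclassified):
    """test_error and misclassified are dicts keyed by
    'known', 'unknown', 'obfuscated'."""
    total = sum(misclassified.values())
    if total == 0:
        return 0.0
    return sum(
        (misclassified[cat] / total) * test_error[cat]
        for cat in ("known", "unknown", "obfuscated")
    )

# Example: errors of 2%, 10%, 25% with 5, 10, 35 misclassifications.
err = intrusion_detection_capability_error(
    {"known": 0.02, "unknown": 0.10, "obfuscated": 0.25},
    {"known": 5, "unknown": 10, "obfuscated": 35},
)
print(round(err, 4))  # 0.197 -- dominated by obfuscated-attack error
```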
The inclusion of obfuscated attacks in the machine learning evaluation process is non-obvious, because new data is created from the training set data to perform model fit analysis with an enhanced test data set.
Conventional methods of evaluation create all data first, then assign most (~70%) as training data, with the rest (~30%) assigned as test data. Creating new data is non-obvious, and creating new data that is a variant of training data is counter-intuitive. In conventional model fit analysis, there is no statistical test to account for variant training data (i.e. obfuscated training attacks), and therefore no previous framework for this procedure or its statistical analysis. Hence, using variant training data for statistical analysis is counter-intuitive because it breaks conventional rules of statistics (such as independence of data). Additionally, data is viewed by machine learning experts and statisticians as black or white (i.e. trained or untrained/unknown).
There is no prior disclosure or suggestion of manipulating original training data, and then using the manipulated training data, together with the original training data, for machine learning.
Conventional machine learning techniques consider data points permanent and attempt to analyze only those data points. For example, machine learning facial recognition software would consider a person's face as a data point. Drastic changes to a person's face via plastic surgery or mutilation would be uncommon, and the result would be considered a new, untrained data point. By contrast, hackers could morph a cyber data point in innumerable ways, thus requiring a new, counter-intuitive methodology for model fit testing.
In step Y, real-time traffic and model analysis gives a proper quantitative context (i.e. situational awareness) to cyber defense analysts when making model retraining decisions. It is also a new step, extending quantitative analysis into the arena of live sensor operation. This quantitative approach, described below, allows for proactive, objective management of machine learning models, instead of the retroactive, subjective human assessment that is the current state of the art.
Determining model fit on live network models/sensors is non-obvious because conventional model fit testing is done only on static data. Statisticians and machine learning experts would never try to calculate model fit on dynamic, live data, due to the ever-changing baseline and lack of ground truths. Using real-time calculations and comparisons of traffic behaviors to determine model fit is an additional and counter-intuitive step because the results cannot be verified (without static, ground truth data).
To perform real-time traffic evaluation, at least the following three mathematical techniques, which work well with new, live data, can be used (a sketch of all three follows the list):
A. Anomaly detection techniques to compare traffic metrics and protocol utilizations from live network traffic to the original model training data.
B. Measuring the similarity between two data sets: live network traffic and the original training data, using the process disclosed in U.S. Pat. No. 8,799,399 (entitled “Device for and Method of Measuring Similarity Between Sets”, incorporated herein by reference). A lack of similarity between the two indicates a lack of model fit due to changing network conditions, operating usage, computer architecture, or traffic composition (among others).
C. Determining model overfit of the original training data compared to new datasets from live network traffic, using generalization complexity measures (such as Rademacher Complexity, Kolmogorov Complexity, or Vapnik-Chervonenkis Entropy). Model overfitting occurs when a model is too complex compared to the new data, and the predictive performance of the model is poor.
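The following Python sketch illustrates the three dimensions using common stand-ins only: a z-score against the training baseline for A, the Jensen-Shannon distance for B (in place of the patented similarity measure of U.S. Pat. No. 8,799,399, which is not implemented here), and a train-versus-live accuracy gap for C (in place of the named complexity measures). All names and inputs are illustrative assumptions.

```python
# Illustrative proxies for the three model-fit dimensions A, B, and C.
import numpy as np
from scipy.spatial.distance import jensenshannon

def anomaly_score(live_metrics, train_mean, train_std):
    """A. Largest z-score of live traffic metrics (e.g., packets/sec,
    flow counts) against the training-data baseline."""
    z = np.abs((np.asarray(live_metrics) - train_mean) / train_std)
    return float(z.max())

def dissimilarity(live_hist, train_hist):
    """B. Jensen-Shannon distance between protocol-utilization
    histograms of live traffic and the original training data."""
    return float(jensenshannon(live_hist, train_hist))

def overfit_gap(train_accuracy, live_accuracy):
    """C. Gap between training and live performance, a simple proxy
    for generalization complexity measures."""
    return train_accuracy - live_accuracy
```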
Using at least the above techniques, real-time model fit can be determined by defining a scale of model fits: thresholds are assigned for the mathematical techniques of anomaly detection, measuring similarity, and determining model overfit (and optionally other mathematical techniques), and the results of whether the model is above or below the assigned threshold for each technique are aggregated, for example, as follows (a sketch of this aggregation follows the formulas):

Goodness of Model Fit = 0 = Below Threshold[A] + Below Threshold[B] + Below Threshold[C]

Deteriorating Model Fit = 1 = Above Threshold[A] + Below Threshold[B] + Below Threshold[C] = Below Threshold[A] + Above Threshold[B] + Below Threshold[C] = etc.

Lack of Model Fit = 2 = Above Threshold[A] + Above Threshold[B] + Below Threshold[C] = Above Threshold[A] + Below Threshold[B] + Above Threshold[C] = etc.

Extreme Lack of Model Fit = 3 = Above Threshold[A] + Above Threshold[B] + Above Threshold[C]
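A minimal sketch of this 0-3 aggregation, assuming per-dimension scores and thresholds like those returned by the functions above; all values here are illustrative.

```python
# One point per dimension whose score exceeds its assigned threshold,
# matching the aggregation formulas above.
def model_fit_scale(scores, thresholds):
    return sum(scores[d] > thresholds[d] for d in ("A", "B", "C"))

LABELS = {0: "Goodness of Model Fit", 1: "Deteriorating Model Fit",
          2: "Lack of Model Fit", 3: "Extreme Lack of Model Fit"}

scores = {"A": 4.2, "B": 0.08, "C": 0.01}      # from the three techniques
thresholds = {"A": 3.0, "B": 0.15, "C": 0.05}  # assigned per dimension
fit = model_fit_scale(scores, thresholds)
print(fit, "->", LABELS[fit])  # 1 -> Deteriorating Model Fit (only A is above)
```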
Each of these mathematical techniques can be considered to be a dimension of model fit, so that this is really a multi-dimensional analysis of model fit to real-time data.
This resulting real-time model fit identifier allows for in-situ quantitative analysis of the machine learning models while they are deployed on live networks. Network analysts can use the aggregate 0-3 scale alone to activate model retraining, or they can rely on whether one or more individual mathematical techniques is above its threshold, either to determine whether to activate model retraining or for additional situational awareness of their networks.
This solution addresses the previously described problem (among others) of hackers using volume-based attacks to flood human cyber defense analysts with alerts. Volume-based attacks, while generating a lot of alert logs, do not necessarily alter the network traffic characteristics (denial of service attacks can, but these are more obvious to cyber defenders). On cyber networks, attacks are a very small percentage of the overall traffic volume. The above use of thresholds for each dimension of mathematical analysis would show that the models still have real-time goodness of fit. Cyber defense analysts can then more accurately determine whether and when they actually need to react to the intrusion detection system alert logs (as opposed to ignoring them as in the previous example).
While the present invention has been disclosed in connection with the presently preferred best modes described herein, it should be understood that the best modes include words of description and illustration, rather than words of limitation. There may be other embodiments which fall within the spirit and scope of the invention as defined by the claims. For example, this invention can be used in detecting obfuscated DNA sequences in normal DNA, for purposes such as cancer detection, seeking causes for genetic diseases, detecting new virus variants, or other genetic analysis, or other areas of bio-informatics. For a further example, this invention can be used in detecting a falsely attributed work (forgery) from among a composer's or author's or artist's collected works. For a further example, this invention can be used in detecting obfuscated transactions being used to manipulate financial or other markets, such as insider trading or pump and dump schemes. It is not necessary that the obfuscation of the abnormal data be caused by humans; it could be naturally occurring, such as in evolution of viruses.
Accordingly, no limitations are to be implied or inferred in this invention except as specifically and explicitly set forth in the claims.
INDUSTRIAL APPLICABILITY
This invention is applicable wherever it is desired to protect computer networks against attacks. This invention is also applicable wherever it is desired to use machine learning sensors to detect obfuscated abnormal data.

Claims (13)

What is claimed is:
1. A process for improving a machine learning sensor for defending a computer network, comprising:
obtaining samples of normal network traffic from said network;
providing samples of cyber attacks from said network or from a repository of cyber attacks, wherein said samples of cyber attacks constitute known attacks;
sample classifying each of said samples as either said normal network traffic or said known attacks, to create ground truths for said samples;
splitting said samples into a training set and a test set, with each of said sets containing samples of said normal network traffic and of said known attacks;
using a model generating algorithm to generate a variety of models for distinguishing between said normal network traffic and said known attacks in said training set;
obfuscating a portion of samples of said known attacks in said training set to create obfuscated attack samples;
adding said obfuscated attack samples to said test set to form an enhanced test set;
performing statistical analysis on performance of said models with said enhanced test set to determine intrusion detection capability error;
selecting one of said models that optimizes a desired model parameter as an optimal model for attack classifying between said normal network traffic and classified cyber attacks consisting of said known attacks, said obfuscated attack samples, and unknown attacks that are unrelated to said training set;
installing said optimal model in said sensor;
determining model fit using live data from real-time network traffic on said computer network, to determine more accurately whether and when to react to intrusion detection alert logs.
2. A process according to claim 1, further comprising:
determining model fit of said sensor by applying anomaly detection, measuring similarity between live network traffic and said training set, and determining model overfit;
assigning thresholds for said anomaly detection, said measuring similarity between live network traffic and said training set, and said determining model overfit, and aggregating results of whether said sensor is above or below said thresholds to identify model fit in real time on a scale of model fits; and
activating model retraining based on said scale of model fits.
3. A process according to claim 1, wherein said selecting step is performed using a desired model parameter selected from the group consisting of intrusion detection capability error and test error.
4. A system, comprising:
a network of computers; and
a machine-learning based sensor deployed in said network, wherein said machine-learning based sensor is trained by a process comprising:
obtaining samples of normal network traffic from said network;
providing samples of cyber attacks from said network or from a repository of cyber attacks, wherein said samples of cyber attacks constitute known attacks;
sample classifying each of said samples as either said normal network traffic or said known attacks, to create ground truths for said samples;
splitting said samples into a training set and a test set, with each of said sets containing samples of said normal network traffic and of said known attacks;
using a model generating algorithm to generate a variety of models for distinguishing between said normal network traffic and said known attacks in said training set;
obfuscating a portion of samples of said known attacks in said training set to create obfuscated attack samples;
adding said obfuscated attack samples to said test set to form an enhanced test set;
performing statistical analysis on performance of said models with said enhanced test set to determine intrusion detection capability error;
selecting one of said models that optimizes a desired model parameter as an optimal model for attack classifying between said normal network traffic and classified cyber attacks consisting of said known attacks, said obfuscated attack samples, and unknown attacks that are unrelated to said training set;
installing said optimal model in said sensor;
determining model fit using live data from real-time network traffic on said computer network, to determine more accurately whether and when to react to intrusion detection alert logs.
5. A system according to claim 4, wherein said selecting step is performed using a desired model parameter selected from the group consisting of intrusion detection capability error and test error.
6. A computer program product stored in a computer readable medium for protecting a computer network, comprising:
a machine learning based sensor that is trained by a process comprising:
obtaining samples of normal network traffic from said network;
providing samples of cyber attacks from said network or from a repository of cyber attacks, wherein said samples of cyber attacks constitute known attacks;
sample classifying each of said samples as either said normal network traffic or said known attacks, to create ground truths for said samples;
splitting said samples into a training set and a test set, with each of said sets containing samples of said normal network traffic and of said known attacks;
using a model generating algorithm to generate a variety of models for distinguishing between said normal network traffic and said known attacks in said training set;
obfuscating a portion of samples of said known attacks in said training set to create obfuscated attack samples;
adding said obfuscated attack samples to said test set to form an enhanced test set;
performing statistical analysis on performance of said models with said enhanced test set to determine intrusion detection capability error;
selecting one of said models that optimizes a desired model parameter as an optimal model for attack classifying between said normal network traffic and classified cyber attacks consisting of said known attacks, said obfuscated attack samples, and unknown attacks that are unrelated to said training set;
installing said optimal model in said sensor; and
determining model fit using live data from real-time network traffic on said computer network, to determine more accurately whether and when to react to intrusion detection alert logs.
7. A computer program product according to claim 6, wherein said selecting step is performed using a desired model parameter selected from the group consisting of intrusion detection capability error and test error.
8. A process for improving a machine learning sensor, comprising:
obtaining samples of normal data from a set of data containing normal and abnormal data;
providing samples of abnormal data from said set or from a repository of abnormal data, wherein said samples of abnormal data constitute known abnormal data;
sample classifying each of said samples as either said normal data or said known abnormal data, to create ground truths for said samples;
splitting said samples into a training set and a test set, with each of said sets containing samples of said normal data and of said known abnormal data;
using a model generating algorithm to generate a variety of models for distinguishing between said normal data and said known abnormal data in said training set;
obfuscating a portion of samples of said known abnormal data in said training set to create obfuscated abnormal data samples;
adding said obfuscated abnormal data samples to said test set to form an enhanced test set;
performing statistical analysis on performance of said models with said enhanced test set to determine intrusion detection capability error;
selecting one of said models that optimizes a desired model parameter as an optimal model for abnormal classifying between said normal data and classified abnormal data consisting of said known abnormal data, said obfuscated abnormal data samples, and unknown abnormal data that is unrelated to said training set;
determining model fit using live data from real-time network traffic on said computer network;
installing said optimal model in said sensor;
wherein said set of data containing normal and abnormal data is selected from the group consisting of network traffic, DNA sequences, a composer's, author's or artist's collected works, and financial transactions; and
wherein said sensor is used for a function selected from the group consisting of detecting cyber attacks in said network traffic, detecting obfuscated DNA sequences in said DNA sequences, detecting a falsely attributed work from among said collected works, or detecting obfuscated transactions in said financial transactions.
9. A system, comprising:
a computer; and
a machine-learning based sensor deployed in said computer, wherein said machine-learning based sensor has been trained by a process comprising:
obtaining samples of normal data from a set of data containing normal and abnormal data;
providing samples of abnormal data from said set or from a repository of abnormal data, wherein said samples of abnormal data constitute known abnormal data;
sample classifying each of said samples as either said normal data or said known abnormal data, to create ground truths for said samples;
splitting said samples into a training set and a test set, with each of said sets containing samples of said normal data and of said known abnormal data;
using a model generating algorithm to generate a variety of models for distinguishing between said normal data and said known abnormal data in said training set;
obfuscating a portion of samples of said known abnormal data in said training set to create obfuscated abnormal data samples;
adding said obfuscated abnormal data samples to said test set to form an enhanced test set;
performing statistical analysis on performance of said models with said enhanced test set to determine intrusion detection capability error;
selecting one of said models that optimizes a desired model parameter as an optimal model for abnormal classifying between said normal data and classified abnormal data consisting of said known abnormal data, said obfuscated abnormal data samples, and unknown abnormal data that is unrelated to said training set;
installing said optimal model in said computer;
wherein said set of data containing normal and abnormal data is selected from the group consisting of network traffic, DNA sequences, a composer's, author's or artist's collected works, and financial transactions; and
wherein said sensor is used for a function selected from the group consisting of detecting cyber attacks in said network traffic, detecting obfuscated DNA sequences in said DNA sequences, detecting a falsely attributed work from among said collected works, or detecting obfuscated transactions in said financial transactions.
10. A computer program product stored in a computer readable medium that, when executed by a computer, provides said computer with a machine learning based sensor that has been trained by a process comprising:
obtaining samples of normal data from a set of data containing normal and abnormal data;
providing samples of abnormal data from said set or from a repository of abnormal data, wherein said samples of abnormal data constitute known abnormal data;
sample classifying each of said samples as either said normal data or said abnormal data, to create ground truths for said samples;
splitting said samples into a training set and a test set, with each of said sets containing samples of said normal data and of said known abnormal data;
using a model generating algorithm to generate a variety of models for distinguishing between said normal data and said known abnormal data in said training set;
obfuscating a portion of samples of said known abnormal data in said training set to create obfuscated abnormal data samples;
adding said obfuscated abnormal data samples to said test set to form an enhanced test set;
performing statistical analysis on performance of said models with said enhanced test set to determine intrusion detection capability error;
selecting one of said models that optimizes a desired model parameter as an optimal model for distinguishing between said normal data and classified abnormal data consisting of said known abnormal data, said obfuscated abnormal data samples, and unknown abnormal data that is unrelated to said training set;
wherein said set of data containing normal and abnormal data is selected from the group consisting of network traffic, DNA sequences, a composer's, author's or artist's collected works, and financial transactions; and
wherein said sensor is used for a function selected from the group consisting of detecting cyber attacks in said network traffic, detecting obfuscated DNA sequences in said DNA sequences, detecting a falsely attributed work from among said collected works, or detecting obfuscated transactions in said financial transactions.
11. A process according to claim 8, wherein said selecting step is performed using a desired model parameter selected from the group consisting of intrusion detection capability error and test error.
12. A system according to claim 9, wherein said selecting step is performed using a desired model parameter selected from the group consisting of intrusion detection capability error and test error.
13. A computer program product according to claim 10, wherein said selecting step is performed using a desired model parameter selected from the group consisting of intrusion detection capability error and test error.
US16/945,697 2016-12-08 2020-07-31 Machine Learning Model Evaluation Abandoned US20200356904A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/945,697 US20200356904A1 (en) 2016-12-08 2020-07-31 Machine Learning Model Evaluation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/373,425 US10733530B2 (en) 2016-12-08 2016-12-08 Machine learning model evaluation in cyber defense
US16/945,697 US20200356904A1 (en) 2016-12-08 2020-07-31 Machine Learning Model Evaluation

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/373,425 Continuation US10733530B2 (en) 2016-12-08 2016-12-08 Machine learning model evaluation in cyber defense

Publications (1)

Publication Number Publication Date
US20200356904A1 true US20200356904A1 (en) 2020-11-12

Family

ID=62489443

Family Applications (3)

Application Number Title Priority Date Filing Date
US15/373,425 Active 2039-01-20 US10733530B2 (en) 2016-12-08 2016-12-08 Machine learning model evaluation in cyber defense
US16/937,568 Pending US20200364620A1 (en) 2016-12-08 2020-07-23 Machine Learning Model Evaluation in Cyber Defense
US16/945,697 Abandoned US20200356904A1 (en) 2016-12-08 2020-07-31 Machine Learning Model Evaluation

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US15/373,425 Active 2039-01-20 US10733530B2 (en) 2016-12-08 2016-12-08 Machine learning model evaluation in cyber defense
US16/937,568 Pending US20200364620A1 (en) 2016-12-08 2020-07-23 Machine Learning Model Evaluation in Cyber Defense

Country Status (1)

Country Link
US (3) US10733530B2 (en)

Families Citing this family (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106603531A (en) * 2016-12-15 2017-04-26 中国科学院沈阳自动化研究所 Automatic establishing method of intrusion detection model based on industrial control network and apparatus thereof
CN108875327A (en) * 2018-05-28 2018-11-23 阿里巴巴集团控股有限公司 One seed nucleus body method and apparatus
CN109299741B (en) * 2018-06-15 2022-03-04 北京理工大学 Network attack type identification method based on multi-layer detection
KR102035796B1 (en) * 2018-07-26 2019-10-24 주식회사 딥핑소스 Method, system and non-transitory computer-readable recording medium for processing data to be anonymized
US11258809B2 (en) * 2018-07-26 2022-02-22 Wallarm, Inc. Targeted attack detection system
CN109255234B (en) * 2018-08-15 2023-03-24 腾讯科技(深圳)有限公司 Processing method, device, medium and electronic equipment of machine learning model
US11520900B2 (en) * 2018-08-22 2022-12-06 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for a text mining approach for predicting exploitation of vulnerabilities
CN108965340B (en) * 2018-09-25 2020-05-05 网御安全技术(深圳)有限公司 Industrial control system intrusion detection method and system
CN111181897A (en) * 2018-11-13 2020-05-19 中移(杭州)信息技术有限公司 Attack detection model training method, attack detection method and system
TWI709922B (en) * 2018-12-21 2020-11-11 財團法人工業技術研究院 A model-based machine learning system
KR20210099564A (en) * 2018-12-31 2021-08-12 인텔 코포레이션 Security system using artificial intelligence
CN109871855B (en) * 2019-02-26 2022-09-20 中南大学 Self-adaptive deep multi-core learning method
US11669779B2 (en) * 2019-04-05 2023-06-06 Zscaler, Inc. Prudent ensemble models in machine learning with high precision for use in network security
KR20200142374A (en) * 2019-06-12 2020-12-22 삼성전자주식회사 Method for selecting artificial intelligience model based on input data and disaply apparatus for performing the same method thereof
CN110287447A (en) * 2019-06-18 2019-09-27 浙江工业大学 A kind of networking multi-shaft motion control system sine attack detection method based on one-class support vector machines
KR20210010184A (en) * 2019-07-19 2021-01-27 한국전자통신연구원 Appartus and method for abnormal situation detection
CN110458209B (en) * 2019-07-24 2021-12-28 东莞理工学院 Attack evasion method and device for integrated tree classifier
US10621379B1 (en) 2019-10-24 2020-04-14 Deeping Source Inc. Method for training and testing adaption network corresponding to obfuscation network capable of processing data to be concealed for privacy, and training device and testing device using the same
US11893111B2 (en) * 2019-11-26 2024-02-06 Harman International Industries, Incorporated Defending machine learning systems from adversarial attacks
US11503061B1 (en) 2020-02-03 2022-11-15 Rapid7, Inc. Automatic evalution of remediation plans using exploitability risk modeling
US11470106B1 (en) 2020-02-03 2022-10-11 Rapid7, Inc. Exploitability risk model for assessing risk of cyberattacks
CN111343165B (en) * 2020-02-16 2022-08-05 重庆邮电大学 Network intrusion detection method and system based on BIRCH and SMOTE
CN111556014B (en) * 2020-03-24 2022-07-15 华东电力试验研究院有限公司 Network attack intrusion detection method adopting full-text index
WO2021225262A1 (en) * 2020-05-07 2021-11-11 Samsung Electronics Co., Ltd. Neural architecture search based optimized dnn model generation for execution of tasks in electronic device
CN111614665A (en) * 2020-05-20 2020-09-01 重庆邮电大学 Intrusion detection method based on deep residual hash network
CN111537893A (en) * 2020-05-27 2020-08-14 中国科学院上海高等研究院 Method and system for evaluating operation safety of lithium ion battery module and electronic equipment
JP2022021203A (en) * 2020-07-21 2022-02-02 富士通株式会社 Learning program, learning device and learning method
CN114070899B (en) * 2020-07-27 2023-05-12 深信服科技股份有限公司 Message detection method, device and readable storage medium
CN111935127B (en) * 2020-08-05 2023-06-27 无锡航天江南数据系统科技有限公司 Malicious behavior detection, identification and security encryption device in cloud computing
CN112422531A (en) * 2020-11-05 2021-02-26 博智安全科技股份有限公司 CNN and XGboost-based network traffic abnormal behavior detection method
CN112491820B (en) * 2020-11-12 2022-07-29 新华三技术有限公司 Abnormity detection method, device and equipment
CN112583820B (en) * 2020-12-09 2022-06-17 南方电网科学研究院有限责任公司 Power attack testing system based on attack topology
US11924169B1 (en) 2021-01-29 2024-03-05 Joinesty, Inc. Configuring a system for selectively obfuscating data transmitted between servers and end-user devices
CN112968891B (en) * 2021-02-19 2022-07-08 山东英信计算机技术有限公司 Network attack defense method and device and computer readable storage medium
CN113206824B (en) * 2021-03-23 2022-06-24 中国科学院信息工程研究所 Dynamic network abnormal attack detection method and device, electronic equipment and storage medium
CN113034001B (en) * 2021-03-24 2022-03-22 西南石油大学 Evaluation data processing method and system based on underground engineering parameters
CN115225294A (en) * 2021-04-16 2022-10-21 深信服科技股份有限公司 Confusion script collection method, device, equipment and medium
CN113242240B (en) * 2021-05-10 2022-07-01 北京交通大学 Method and device capable of detecting DDoS attacks of multiple types of application layers
CN113973008B (en) * 2021-09-28 2023-06-02 佳源科技股份有限公司 Detection system, method, equipment and medium based on mimicry technology and machine learning
CN115296851A (en) * 2022-07-06 2022-11-04 国网山西省电力公司信息通信分公司 Network intrusion detection method based on mutual information and gray wolf promotion algorithm
CN115580487B (en) * 2022-11-21 2023-04-07 深圳市华曦达科技股份有限公司 Method and device for constructing network abnormal flow detection model
CN115987687B (en) * 2023-03-17 2023-05-26 鹏城实验室 Network attack evidence obtaining method, device, equipment and storage medium
CN117336195B (en) * 2023-12-01 2024-02-06 中国西安卫星测控中心 Comprehensive performance evaluation method for intrusion detection model based on radar graph method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10356111B2 (en) * 2014-01-06 2019-07-16 Cisco Technology, Inc. Scheduling a network attack to train a machine learning model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050192824A1 (en) * 2003-07-25 2005-09-01 Enkata Technologies System and method for determining a behavior of a classifier for use with business data
US20140283052A1 (en) * 2013-03-14 2014-09-18 Eamon Hirata Jordan Heterogeneous sensors for network defense

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Harry Wechsler ("Cyberspace Security Using Adversarial Learning and Conformal Prediction", Intelligent Information Management, 7, 2015, pp. 195-222) (Year: 2015) *
Jemili et al. ("A Framework for an Adaptive Intrusion DetectioN System using Bayesian Network," 2007 IEEE Intelligence and se 2007, pp. 66-70) (Year: 2007) *
O’Reilly et al. ("IEEE Communications Surveys and Tutorials", Vol., 16, No. 3, 2014, pp. 1413-1432) (Year: 2014) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210374565A1 (en) * 2020-05-31 2021-12-02 International Business Machines Corporation Automated combination of predictions made by different prediction systems
US11704580B2 (en) * 2020-05-31 2023-07-18 International Business Machines Corporation Automated combination of predictions made by different prediction systems
TWI770992B (en) * 2021-05-07 2022-07-11 宏茂光電股份有限公司 Fitting method to prevent overfitting

Also Published As

Publication number Publication date
US10733530B2 (en) 2020-08-04
US20200364620A1 (en) 2020-11-19
US20180165597A1 (en) 2018-06-14

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION