US10749881B2 - Comparing unsupervised algorithms for anomaly detection - Google Patents
- Publication number: US10749881B2 (application US15/637,471)
- Authority: United States
- Prior art keywords: anomaly detection, data, quantile, detection algorithms, anomalies
- Legal status: Active, expires (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- H04L63/1425—Network security; traffic logging, e.g. anomaly detection
- G06F16/24578—Query processing with adaptation to user needs using ranking
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
- G06F18/22—Pattern recognition; matching criteria, e.g. proximity measures
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
- G06F9/06—Arrangements for program control using stored programs
- G06K9/6215
- G06K9/6277
Definitions
- the Internet of Things is a network of physical objects, or “things,” embedded within electronics, software, sensors, and connectivity to enable and achieve greater value and service by exchanging data with the manufacturer, operator, and/or other connected devices or systems.
- the IoT provides application gateways for data aggregation and distribution that are located between application servers and numerous devices. Because the amount of data in the IoT is very large and the data is unlabeled, it can be difficult to determine which data is anomalous.
- Implementations of the present disclosure include computer-implemented methods for ranking anomaly detection algorithms.
- actions include receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, generating a plurality of data distributions corresponding to the set of unlabeled data by using a plurality of anomaly detection algorithms, and ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions.
- Other implementations of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
- actions further include processing the set of unlabeled data to determine a set of univariate scores for each of the plurality of anomaly detection algorithms; actions further include normalizing the set of univariate scores for each of the plurality of anomaly detection algorithms; the second quantile can be based on the first quantile and a parameter, wherein the parameter is based on a width of a respective data distribution; the first quantile and the second quantile can be above 0.95; and actions further include comparing an anomaly score corresponding to a first ranked anomaly detection algorithm of the plurality of anomaly detection algorithms to an alert threshold.
- the present disclosure also provides a computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
- the present disclosure further provides a system for implementing the methods provided herein.
- the system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
- FIG. 1 depicts an example architecture that can be used to execute implementations of the present disclosure.
- FIG. 2 depicts an example architecture that can be used to execute implementations of the present disclosure.
- FIGS. 3A and 3B depict example graphical representations in accordance with implementations of the present disclosure.
- FIG. 4 depicts an example process that can be executed in accordance with implementations of the present disclosure.
- FIG. 5 is a schematic illustration of example computer systems that can be used to execute implementations of the present disclosure.
- Implementations of the present disclosure are generally directed to ranking anomaly detection algorithms. More particularly, implementations of the present disclosure are directed to identifying a best-matching algorithm for differentiating anomalies in IoT data.
- cloud platforms can store large amounts of unlabeled measurement data from numerous sensors. The IoT data can then be used to remotely control and manage the corresponding devices and/or to trigger object-related processes.
- Unlabeled IoT data can include normal and abnormal data that are not differentiated by any labels.
- IoT data anomalies are different from normal IoT data with respect to their features and are rare (e.g., less than 50%) in a dataset compared to normal instances. IoT data anomalies could affect associated IoT processes. Detection and removal of data anomalies can improve IoT processes.
- Implementations can include actions of receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, generating a plurality of data distributions corresponding to the set of unlabeled data by using a plurality of anomaly detection algorithms, and ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions.
- FIG. 1 depicts an example architecture 100 that can be used to execute implementations of the present disclosure.
- the example architecture 100 includes one or more client devices 102 , a server system 104 and a network 106 .
- the server system 104 includes one or more server devices 108 .
- a user 110 interacts with the client device 102 .
- the user 110 can include a user who interacts with an application that is hosted by the server system 104 , such as an application for ranking anomaly detection algorithms.
- the client device 102 can communicate with one or more of the server devices 108 over the network 106 .
- the client device 102 can include any appropriate type of computing device such as a desktop computer, a laptop computer, a handheld computer, a tablet computer, a personal digital assistant (PDA), a cellular telephone, a network appliance, a camera, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or an appropriate combination of any two or more of these devices or other data processing devices.
- the network 106 can include a large computer network, such as a local area network (LAN), a wide area network (WAN), the Internet, a cellular network, a telephone network (e.g., a public switched telephone network (PSTN)) or an appropriate combination thereof connecting any number of communication devices, mobile computing devices, fixed computing devices and server systems.
- each server device 108 includes at least one server and at least one data store.
- the server devices 108 are intended to represent various forms of servers including, but not limited to an IoT server, a web server, an application server, a proxy server, a network server, and/or a server pool.
- server systems accept requests for application services and provide such services to any number of client devices (e.g., the client device 102 ) over the network 106 .
- the server system 104 can be an IoT system configured to host a service for ranking anomaly detection algorithms (e.g., provided as one or more computer-executable programs executed by one or more computing devices).
- input data can be provided to the server system 104 (e.g., from an IoT device), and the server system can process the input data through the service for ranking anomaly detection algorithms and provide result data.
- the server system 104 can send the result data to the client device 102 over the network 106 for display to the user 110 .
- the example context includes ranking anomaly detection algorithms for IoT data.
- Example IoT data can include a metric that represents a series of values recorded by a sensor.
- metrics can include acoustical, optical, thermal, electrical, mechanical, chemical, biological, positional information and other various information that can be measured by sensors.
- the IoT data can include unlabeled data, such that anomalies in the data are not flagged or differentiated in any way from normal data before being processed with anomaly detection algorithms.
- Anomaly detection algorithms can include statistical methods for monitoring dissimilarities between current and past sensor values of the recorded metrics to identify data anomalies. Some anomaly detection algorithms can be better than others depending on one or more characteristics of the datasets, such as the distribution of anomalies within the dataset. Ranking of anomaly detection algorithms can be applied to each dataset to identify the best anomaly detection algorithm for a particular time interval. For example, a first anomaly detection algorithm can be the best in identifying anomalies of a dataset measured by a sensor during a first time interval and a second anomaly detection algorithm, different from the first anomaly detection algorithm, can be the best in identifying anomalies of a dataset measured by the same sensor during a second time interval.
- the data amount in the IoT can be very large and the data is processed such that only a section of the metrics is analyzed by anomaly detection algorithms at a time. For example, metrics can be filtered based on one or more rules or truncated to a particular size based on a time interval. All the recorded metrics or a portion of the metrics (e.g., corresponding to a particular time interval) can be processed by multiple anomaly detection algorithms.
- the anomaly detection algorithms include statistical functions, such as principal component analysis (PCA)-based approaches, linear regression, neural network approaches and others.
- the data is displayed by anomaly detection algorithms as histograms that include quantile plots.
- a portion of the quantiles can be selected to rank the anomaly detection algorithms.
- the portion of the quantiles can be between approximately 0.97 and 0.999 in steps of 0.001.
- user 110 can modify the quantile range for one or more sensors.
- the users 110 can modify the quantile range based on knowledge about the expected anomaly rate, which can improve the performance of ranking the anomaly detection algorithms.
- the anomaly detection algorithms can be ranked using a cross-validation based approach or a distribution-based approach.
- the cross-validation based approach includes determining a k-fold cross validation on data, a regression analysis and a classification analysis.
- the regression analysis includes normalizing the scores using a hard box or quantiles as threshold and calculating the proportion of variance in data for each point relative to the total variance.
- the classification analysis includes using results from the algorithm trained on a fold as reference value for each fold, calculating discrete scores based on a selected threshold, and determining a classification matrix and a derived quantity per matrix.
- the distribution based approach is based on the assumption that algorithms that produce score distributions that are bi-modal or fat-tailed are better. Bi-modal distributions can be classified based on the ratio of normal to anomalous data. For example, a normal/anomaly ratio of approximately 50/50 can be classified as bad (or unrealistic) and larger ratios, such as 90/10, can be classified as good (or realistic).
- the distribution based approach can be configured to consider only unilateral tails (e.g., higher quantile tails).
- the distribution based approach can include a quantification method.
- the quantification method can include one or more criteria or a combination of the criteria.
- the quantification method can include a differentiation between one-sided and bi-modal distribution, an identification of benchmark anomalies and a clustering method.
- the identification of benchmark anomalies can include determining the distance between a first quantile and a second quantile.
- the first quantile and the second quantile can be provided by the user 110 or can be set to a reasonable range of quantiles (e.g., 0.95 and 0.99).
- the identification of benchmark anomalies can be based on one of the three quality measures M1, M2, and M3.
- the method based on quality measure M1 determines the distance between a first quantile q and a second quantile q − δ, which makes results constant for uniform distributions.
- M1 can be defined by:
- M1 = max over q ∈ Q of (quantile(s_A, q) − quantile(s_A, q − δ)), where δ could be a free parameter on the order of 1e−3 and s_A represents the set of normalized scores from the interval [0, 1] for a particular algorithm.
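As a concrete illustration, quality measure M1 can be sketched in Python. This is a minimal reading of the formula above, assuming NumPy's quantile function and the disclosed grid of quantiles between 0.97 and 0.999 in steps of 0.001; the function name and defaults are illustrative, not from the patent.

```python
import numpy as np

def m1(scores, q_grid=None, delta=1e-3):
    """Quality measure M1: the largest distance in score space between
    quantile q and quantile q - delta over a grid of high quantiles.
    A large distance suggests a well-separated (bi-modal or fat-tailed)
    score distribution, i.e. better differentiation of anomalies."""
    s = np.asarray(scores, dtype=float)  # normalized scores in [0, 1]
    if q_grid is None:
        # grid from the disclosure: 0.97 to 0.999 in steps of 0.001
        q_grid = np.arange(0.97, 0.999 + 1e-9, 0.001)
    return max(np.quantile(s, q) - np.quantile(s, q - delta) for q in q_grid)
```

For a uniform score distribution the per-step distances are nearly constant and small, while a bi-modal distribution produces one large jump at the boundary between the normal bulk and the anomaly cluster.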
- the method based on quality measure M2 determines the distance between a first quantile q and a second quantile q − δ, and weights the result by the number of scores s larger than quantile q.
- For example, the distance between quantile 0.99 and quantile 0.98 is calculated in score space, and the result is weighted by the number of scores larger than the 0.99 quantile.
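The worked example (distance between the 0.98 and 0.99 quantiles, weighted by the count of scores above the 0.99 quantile value) can be sketched as follows; the function name and defaults are illustrative assumptions.

```python
import numpy as np

def m2(scores, q=0.99, delta=0.01):
    """Quality measure M2: the distance in score space between quantile q
    and quantile q - delta, weighted by the number of scores larger than
    the value at quantile q."""
    s = np.asarray(scores, dtype=float)
    hi = np.quantile(s, q)
    lo = np.quantile(s, q - delta)
    # a well-separated anomaly cluster yields both a large distance
    # and a non-trivial count of scores above the q-th quantile value
    return (hi - lo) * int(np.sum(s > hi))
```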
- the method based on quality measure M3 determines how far one must go to find the same number of scores as in an interval above the current point.
- M3 can be defined as follows:
- the score s99, corresponding to the 99th quantile, can be determined from quantile(scores, 0.99).
- the parameter is a free parameter that can be set automatically based on the width of the distribution. It determines how far to go down in score space to find as many measurements with a score lower than s99 as in the selected interval above it.
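The description of M3 leaves several details open; the sketch below is one plausible reading: count the scores in a small interval just above the quantile score, then measure how far below that score one must go to collect the same number of scores. The interval width eps stands in for the free parameter and is set here from the distribution width, as the text suggests; all names and defaults are illustrative.

```python
import numpy as np

def m3(scores, q=0.99, eps=None):
    """Quality measure M3 (one plausible reading): how far below the
    q-th quantile score one must go to find as many scores as lie in a
    small interval just above it. A large distance means the region
    below the quantile score is sparse, i.e. the tail is well separated."""
    s = np.sort(np.asarray(scores, dtype=float))
    s_q = np.quantile(s, q)
    if eps is None:
        eps = 0.05 * (s[-1] - s[0])  # free parameter from distribution width
    n_above = int(np.sum((s >= s_q) & (s <= s_q + eps)))
    below = s[s < s_q]
    if n_above == 0 or len(below) < n_above:
        return 0.0  # degenerate case: nothing to compare against
    # distance down to the n_above-th score below s_q
    return float(s_q - below[-n_above])
```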
- FIG. 2 is a block diagram illustrating an example cloud IoT platform 200 for ranking anomaly detection algorithms.
- the example IoT platform 200 can be a cloud IoT platform, configured to collect data from numerous devices and sensors, to perform anomaly detection algorithms on the collected data, to process the data based on the results of the anomaly detection algorithms and to store the data in the cloud (e.g., scalable server system).
- the system 200 can include one or more customer device entities 202 , a device integration component 204 , a cloud 206 , a process integration component 208 , and on premise entity 210 .
- Each customer device entity 202 includes a device 212 and a cloud connector 214 .
- the device 212 can be a physical object including, or attached to one or more sensors.
- the sensors can be a part of the device 212 or external objects that use the device 212 as a hub.
- a metric can represent a series of values recorded by a sensor. For example, metrics can include temperature, humidity, wind, speed, geographic coordinates, sound, etc.
- the data transmitted by the customer device entities 202 to the cloud 206 can include the metrics.
- the cloud connector 214 can integrate the device 212 to the cloud 206 using the device integration component 204 .
- the cloud 206 supports process integration 208 of processes associated with the devices 204 with an on premise entity 210 that includes systems 216 and a common metadata repository 218 .
- the cloud 206 includes IoT applications 220 .
- the IoT applications 220 can be executed on a database cloud platform 222 .
- the IoT applications 220 can use database cloud IoT services 224 to communicate with the devices 204 , and can use database cloud integration services 226 to communicate with the on premise entities 210 .
- the IoT applications 220 can include an application for ranking anomaly detection algorithms that can be used to process the data received from devices 204 .
- the cloud 206 includes a database big data platform 228 that can serve as a platform for the IoT applications 220 and includes a data processing component 230 and a streaming component 232 .
- the data processing component 230 can include in-memory engines 234 for executing instructions, an extended storage component 236 for storing data, and a Hadoop framework 238 that supports distributed storage and data processing.
- the results generated by the cloud 206 can be transmitted to the on premise entity 210 using the process integration component 208 .
- the on premise entity 210 can include a plurality of systems 216 that are associated with a common metadata repository 218 .
- the common metadata repository 218 can be based on a meta-model that supports visualization of results generated by ranking anomaly detection algorithms for one or more data sets.
- FIG. 3A depicts an example of a graphical representation 300 of a step of ranking anomaly detection algorithms for a single dataset.
- the data set can include a metric recorded by an IoT device, as described with reference to FIGS. 1 and 2 .
- the graphical representation 300 includes multiple examples of statistical distributions 302 , 304 , 306 , 308 , 310 .
- the examples of statistical distributions 302 , 304 , 306 , 308 , 310 can be histograms associated with a plurality of anomaly detection algorithms that display the single dataset as value 301 per count (or density) 303 .
- Some examples of anomaly detection algorithms include principal component analysis (PCA)-based approaches, linear regression, neural network approaches and others.
- the histograms can include quantile plots.
- a quantile comprises a portion of the dataset such that each portion contains the same amount of data.
- the quantiles correspond to percentiles, such that the dataset is displayed as a histogram formed of 100 parts of equal size.
- the anomalies are expected to be in the tail region of the examples of statistical distributions 302 , 304 , 306 , 308 , 310 .
- the examples of statistical distributions 302 , 304 , 306 , 308 , 310 illustrated in FIG. 3A indicate that some statistical methods are better than others at differentiating normal data from anomalies.
- statistical distribution 302 identifies a minimal portion of the data as potentially being abnormal and has a short tail.
- Statistical distribution 304 has a longer tail than statistical distribution 302 , indicating that a larger portion of the data, which is normally distributed, is potentially abnormal.
- statistical distributions 306 and 308 are bi-modal.
- the example statistical distribution 306 illustrates the anomalies as being normally distributed.
- the example statistical distribution 308 identifies a large majority of anomalies as having a constant value.
- the example statistical distribution 310 identifies the anomalies as being uniformly distributed.
- FIG. 3B depicts an example of a graphical representation 350 of another step of ranking anomaly detection algorithms for a single dataset.
- the data set corresponds to the dataset used for the graphical representation 300 , as described with reference to FIG. 3A .
- the graphical representation 350 includes multiple examples of quantifications of statistical distributions 312 , 314 , 316 , 318 , 320 .
- the examples of quantifications of statistical distributions 312 , 314 , 316 , 318 , 320 illustrate a plurality of anomaly scores 313 within a quantile interval 311 for previously determined statistical distributions, such as example statistical distributions 302 , 304 , 306 , 308 , 310 described with reference to FIG. 3A .
- the anomaly scores 313 can be determined using any of the measures M1, M2, and M3 described with reference to FIG. 1 .
- the quantifications of statistical distributions 312 , 314 , 316 , 318 , 320 can be compared with each other to identify the best anomaly detection algorithm for the analyzed dataset according to one or more classification criteria.
- the quantification of statistical distributions 312 and 314 present similar profiles. Quantitative comparison between the quantification of statistical distributions 312 and 314 indicates that statistical distribution 314 is better at identifying anomalies than statistical distribution 312 . In particular, the average, the maximum, and/or total value of the quantification of statistical distribution 314 is higher than the average, the maximum, and/or total value of the quantification of statistical distribution 312 .
- the higher the differentiation between anomalies and normal data points in a dataset, the better the anomaly detection algorithm.
- Differentiation can be defined by assigning high scores to the former group and comparably low scores to the latter group.
- the quantifications of statistical distributions 316 and 318 each include a peak, which indicates that the statistical distributions 316 and 318 are better than statistical distributions 312 and 314 .
- the quantification of statistical distribution 318 includes the highest peak, which indicates that statistical distribution 318 is the best anomaly detection algorithm from the analyzed statistical distributions for the selected dataset.
- FIG. 4 depicts an example process 400 that can be provided by one or more computer-executable programs executed using one or more computing devices, as described with reference to FIGS. 1-3 .
- the example process 400 is executed to rank anomaly detection algorithms in accordance with implementations of the present disclosure.
- the process 400 can be based on a distribution-based approach, which is based on the assumption that anomaly detection algorithms that produce score distributions that are bi-modal/fat tailed are ‘better.’
- a set of unlabeled data is received by one or more processors from one or more sensors in a plurality of sensors of an internet of things (IoT) ( 402 ).
- the plurality of sensors can include a part of an IoT device or external objects attached to an IoT device.
- the data can be multi-variate (e.g., multiple different sensors can be used by an anomaly detection algorithm to determine anomaly scores).
- the data can include a metric, such as a series of values recorded by the sensor.
- the metrics can include incoming/outgoing data volume, temperature, humidity, wind, speed, geographic coordinates, sound, or any other values reflecting a functionality of an IoT device.
- the metrics are processed by a variety of anomaly detection algorithms ( 404 ).
- anomaly detection algorithms include principal component analysis (PCA)-based approaches, linear regression, neural network approaches and others.
- the processing can include using the anomaly detection algorithms to calculate scores from the sensor data.
- Per anomaly detection algorithm, metric processing results in one set of univariate scores.
- the scores can be normalized using normalization constants so that normalized scores of all anomaly detection algorithms are within preselected intervals (e.g., interval [0, 1]). Normalization constants may be calibrated for each sensor in the modeling stage or for a combination of sensors.
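A minimal min-max normalization sketch, assuming the normalization constants are calibrated lower and upper score bounds per sensor (names and defaults are illustrative):

```python
import numpy as np

def normalize_scores(scores, lo=None, hi=None):
    """Map raw algorithm scores into [0, 1] using min-max normalization.
    lo and hi are normalization constants that may be calibrated per
    sensor (or a combination of sensors) during modeling and reused at
    scoring time; by default they are taken from the data itself."""
    s = np.asarray(scores, dtype=float)
    lo = s.min() if lo is None else lo
    hi = s.max() if hi is None else hi
    if hi == lo:
        return np.zeros_like(s)  # degenerate case: all scores identical
    return np.clip((s - lo) / (hi - lo), 0.0, 1.0)
```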
- a plurality of data distributions corresponding to the set of unlabeled data is generated by using a plurality of anomaly detection algorithms ( 406 ).
- anomaly detection algorithms For example, normalized data can be used to generate the data distributions.
- Data distributions can be based on anomaly detection algorithms configured to illustrate dissimilarities in recorded sensor values to identify data anomalies.
- the data distributions can be displayed as histograms that include quantile plots. The distance between the quantiles can be selected in steps of predefined size (e.g., 0.001) that are constant between the anomaly detection algorithms.
- a first quantile and a second quantile of each of the plurality of data distributions are selected to determine a distance between them ( 408 ).
- the one or more processors can receive an input from a user indicating the first quantile and the second quantile for one or more sensors. For example, users can modify the distance based on knowledge about the expected anomaly rate, which can improve the performance of ranking the anomaly detection algorithms.
- the one or more processors can retrieve the selection of the first quantile and the second quantile from a database. The first quantile and the second quantile can be selected based on one or more conditions. One condition can include a requirement for the first quantile and the second quantile to be above a preset threshold (e.g., 0.95).
- the first quantile can be approximately 0.97 and the second quantile can be approximately 0.999.
- the distance between the first quantile and the second quantile can be calculated by defining the second quantile as the first quantile and a parameter, where the parameter indicates the width of the respective data distribution.
- the anomaly detection algorithms are ranked to determine how suitable each of the anomaly detection algorithms is for detecting anomalies in the data set ( 410 ).
- the anomaly detection algorithms can be ranked based on comparing the distances between the first quantile and the second quantile of the anomaly detection algorithms. For example, the best (first ranked) anomaly detection algorithm corresponds to the largest distance between the first quantile and the second quantile, and the worst anomaly detection algorithm corresponds to the smallest distance between the first quantile and the second quantile.
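Putting the ranking step together, the following sketch orders algorithms by the quantile distance of their normalized score distributions; the quantile choices and function names are illustrative assumptions.

```python
import numpy as np

def quantile_distance(scores, q1=0.97, q2=0.999):
    """Distance in score space between two high quantiles; a larger
    distance indicates a better-separated anomaly tail."""
    s = np.asarray(scores, dtype=float)
    return float(np.quantile(s, q2) - np.quantile(s, q1))

def rank_algorithms(score_sets, measure=quantile_distance):
    """Rank algorithms best-first: the algorithm whose score distribution
    has the largest quantile distance ranks first."""
    return sorted(score_sets, key=lambda name: measure(score_sets[name]),
                  reverse=True)
```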
- the process 400 can include determining an anomaly score for the first ranked anomaly detection algorithm of the plurality of anomaly detection algorithms.
- the anomaly score can include an amount of data identified as being anomalous by the first ranked anomaly detection algorithm.
- the process 400 can include comparing the anomaly score to an alert threshold. If the maximum anomaly score exceeds the alert threshold, an alert can be generated indicating the anomaly and the associated IoT device.
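The threshold comparison can be sketched as follows; the default threshold value is an illustrative assumption, not from the patent.

```python
import numpy as np

def should_alert(best_scores, alert_threshold=0.8):
    """Return True when the maximum anomaly score produced by the
    first-ranked algorithm exceeds the alert threshold, indicating an
    alert should be generated for the associated IoT device."""
    return bool(np.max(best_scores) > alert_threshold)
```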
- the process 400 can include removing the data anomaly based on the anomaly identification of the best anomaly detection algorithm before transmitting the data within the IoT domain for remotely controlling and managing the corresponding devices and/or to trigger object-related processes.
- the process 400 can include updating a setting (e.g., a software component) of a device and/or upgrading (e.g., replacing) an element (e.g., a hardware component) of the device that generated the data anomaly, to prevent future anomalies.
- Process 400 can be repeated for each data type (corresponding to each sensor) at particular time intervals, considering that the anomaly detection algorithm identified as the best match for one data set can differ from the best match for another data set (e.g., a data set measured at a different time or by a different sensor).
- the process 400 can be based on a cross-validation based approach.
- the cross-validation based approach includes performing a k-fold cross validation on the data, a regression analysis, and a classification analysis.
- the regression analysis includes normalizing the scores using a hard box or quantiles as thresholds and calculating the proportion of variance in the data for each point relative to the total variance.
- the classification analysis includes using results from the algorithm trained on a fold as reference values for each fold, calculating discrete scores based on a selected threshold, and determining a classification matrix and a derived quantity per matrix.
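One hedged sketch of the classification-analysis idea above: each fold's scores are thresholded at a quantile, the resulting discrete labels are compared against reference labels, and a derived quantity is averaged over folds. Using the full score set as the reference, a 0.97 quantile threshold, and accuracy as the derived quantity are all assumptions made for this example:

```python
import numpy as np

def kfold_agreement(scores, k=5, threshold_q=0.97):
    """Threshold each fold's scores at a quantile, compare the
    resulting discrete labels with reference labels derived from the
    full score set, and average a derived quantity (here: accuracy)
    over the k folds."""
    cutoff = np.quantile(scores, threshold_q)
    reference = scores > cutoff            # discrete reference labels
    folds = np.array_split(np.arange(len(scores)), k)
    per_fold = []
    for idx in folds:
        fold_cutoff = np.quantile(scores[idx], threshold_q)
        fold_labels = scores[idx] > fold_cutoff
        per_fold.append(np.mean(fold_labels == reference[idx]))
    return float(np.mean(per_fold))

rng = np.random.default_rng(2)
agreement = kfold_agreement(rng.exponential(1.0, 5000))
```

High agreement across folds indicates the algorithm's labeling is stable, which is one way such a derived quantity can feed into comparing detectors.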
- Implementations of the present disclosure provide one or more of the following example advantages.
- Methods for anomaly detection, particularly in the IoT space, can use un-labelled data.
- the use of un-labelled data makes it very difficult to compare the performance of different algorithms on a given dataset, and consequently, to choose the most suitable algorithm from a set of possible methods.
- Ranking anomaly detection algorithms for IoT data can provide an “automated mode” to identify a best matching anomaly detection algorithm for a particular data set. Automatic identification of the best anomaly detection algorithm could eliminate manual data analysis for anomaly detection.
- An IoT analytics platform based on open source software may be designed to greatly minimize the complexities of ingesting and processing massive amounts of data generated in IoT scenarios. Detection and removal of data anomalies can improve IoT processes and the functionality of one or more IoT devices that can depend on the received data.
- the system 500 can be used for the operations described in association with the implementations described herein.
- the system 500 may be included in any or all of the server components discussed herein.
- the system 500 includes a processor 510 , a memory 520 , a storage device 530 , and an input/output device 540 .
- the components 510 , 520 , 530 , 540 are interconnected using a system bus 550 .
- the processor 510 is capable of processing instructions for execution within the system 500 .
- the processor 510 is a single-threaded processor.
- the processor 510 is a multi-threaded processor.
- the processor 510 is capable of processing instructions stored in the memory 520 or on the storage device 530 to display graphical information for a user interface on the input/output device 540 .
- the memory 520 stores information within the system 500 .
- the memory 520 is a computer-readable medium.
- the memory 520 is a volatile memory unit.
- the memory 520 is a non-volatile memory unit.
- the storage device 530 is capable of providing mass storage for the system 500 .
- the storage device 530 is a computer-readable medium.
- the storage device 530 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device.
- the input/output device 540 provides input/output operations for the system 500 .
- the input/output device 540 includes a keyboard and/or pointing device.
- the input/output device 540 includes a display unit for displaying graphical user interfaces.
- the features described can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them.
- the apparatus can be implemented in a computer program product tangibly embodied in an information carrier (e.g., in a machine-readable storage device, for execution by a programmable processor), and method steps can be performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output.
- the described features can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device.
- a computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result.
- a computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- Elements of a computer can include a processor for executing instructions and one or more memories for storing instructions and data.
- a computer can also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks.
- Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- semiconductor memory devices such as EPROM, EEPROM, and flash memory devices
- magnetic disks such as internal hard disks and removable disks
- magneto-optical disks and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
- ASICs application-specific integrated circuits
- the features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.
- a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.
- the features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them.
- the components of the system can be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, for example, a LAN, a WAN, and the computers and networks forming the Internet.
- the computer system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a network, such as the described one.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Description
M2 = max_{q ∈ Q}((quantile(s_A, q) − quantile(s_A, q − ε)) × count(s > q))
s_0 = quantile(s_A, q) and N = count(s_A > s_0; s_A < s_0 + σ)
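For illustration, the M2 quantity in the description can be sketched in code. Two points are assumptions of this sketch, not statements of the patent: `count(s > q)` is read as the number of scores above the q-quantile of s_A, and the grid Q of quantile levels is invented:

```python
import numpy as np

def m2(scores, eps=0.029, q_grid=(0.95, 0.96, 0.97, 0.98, 0.99)):
    """Sketch of M2: over a grid Q of quantile levels, multiply the gap
    between the q- and (q - eps)-quantiles by the number of scores
    above the q-quantile, and take the maximum over the grid."""
    best = -np.inf
    for q in q_grid:
        gap = np.quantile(scores, q) - np.quantile(scores, q - eps)
        n_above = int(np.sum(scores > np.quantile(scores, q)))
        best = max(best, gap * n_above)
    return float(best)

rng = np.random.default_rng(3)
value = m2(rng.exponential(1.0, 10_000))
```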
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/637,471 US10749881B2 (en) | 2017-06-29 | 2017-06-29 | Comparing unsupervised algorithms for anomaly detection |
Publications (2)
Publication Number | Publication Date |
---|---|
US20190007432A1 US20190007432A1 (en) | 2019-01-03 |
US10749881B2 true US10749881B2 (en) | 2020-08-18 |
Family
ID=64739312
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/637,471 Active 2038-04-26 US10749881B2 (en) | 2017-06-29 | 2017-06-29 | Comparing unsupervised algorithms for anomaly detection |
Country Status (1)
Country | Link |
---|---|
US (1) | US10749881B2 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11055162B2 (en) * | 2018-10-31 | 2021-07-06 | Salesforce.Com, Inc. | Database system performance degradation detection |
WO2020167539A1 (en) * | 2019-02-05 | 2020-08-20 | Qomplx, Inc. | System and method for complex it process annotation, tracing, analysis, and simulation |
US11507563B2 (en) * | 2019-08-26 | 2022-11-22 | Kyndryl, Inc. | Unsupervised anomaly detection |
CN111694819A (en) * | 2020-04-27 | 2020-09-22 | 深圳华工能源技术有限公司 | Electric power abnormal data filtering method and device based on sub-bit distance algorithm |
CN112347078A (en) * | 2020-11-06 | 2021-02-09 | 国网江西省电力有限公司信息通信分公司 | Power distribution Internet of things data product construction method and device |
CN112383431A (en) * | 2020-11-13 | 2021-02-19 | 武汉虹旭信息技术有限责任公司 | Method and device for identifying data of internet of things in internet |
US11743273B2 (en) * | 2021-02-25 | 2023-08-29 | T-Mobile Usa, Inc. | Bot hunting system and method |
CN113687989B (en) * | 2021-08-09 | 2024-08-16 | 华东师范大学 | Internet of things data anomaly detection method and system based on server-free architecture |
Patent Citations (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150058982A1 (en) * | 2001-12-14 | 2015-02-26 | Eleazar Eskin | Methods of unsupervised anomaly detection using a geometric framework |
US9306966B2 (en) * | 2001-12-14 | 2016-04-05 | The Trustees Of Columbia University In The City Of New York | Methods of unsupervised anomaly detection using a geometric framework |
US20100179930A1 (en) | 2009-01-13 | 2010-07-15 | Eric Teller | Method and System for Developing Predictions from Disparate Data Sources Using Intelligent Processing |
US20110078106A1 (en) * | 2009-09-30 | 2011-03-31 | International Business Machines Corporation | Method and system for it resources performance analysis |
US20150178865A1 (en) | 2011-09-20 | 2015-06-25 | The Trustees Of Columbia University In The City Of New York | Total property optimization system for energy efficiency and smart buildings |
US20130079938A1 (en) | 2011-09-22 | 2013-03-28 | Sap Ag | Customer segmentation based on smart meter data |
US8660868B2 (en) | 2011-09-22 | 2014-02-25 | Sap Ag | Energy benchmarking analytics |
US20130110761A1 (en) * | 2011-10-31 | 2013-05-02 | Krishnamurthy Viswanathan | System and method for ranking anomalies |
US20180047071A1 (en) * | 2012-07-24 | 2018-02-15 | Ebay Inc. | System and methods for aggregating past and predicting future product ratings |
US20140200952A1 (en) | 2013-01-11 | 2014-07-17 | International Business Machines Corporation | Scalable rule logicalization for asset health prediction |
US20140365191A1 (en) | 2013-06-10 | 2014-12-11 | Abb Technology Ltd. | Industrial asset health model update |
US20160217384A1 (en) * | 2015-01-26 | 2016-07-28 | Sas Institute Inc. | Systems and methods for time series analysis techniques utilizing count data sets |
US20160350671A1 (en) | 2015-05-28 | 2016-12-01 | Predikto, Inc | Dynamically updated predictive modeling of systems and processes |
US20170011382A1 (en) * | 2015-07-10 | 2017-01-12 | Fair Isaac Corporation | Mobile attribute time-series profiling analytics |
US20190012351A1 (en) * | 2015-08-10 | 2019-01-10 | Hewlett Packard Enterprise Development Lp | Evaluating system behaviour |
US20170228660A1 (en) * | 2016-02-05 | 2017-08-10 | Nec Europe Ltd. | Scalable system and method for real-time predictions and anomaly detection |
US20170230392A1 (en) * | 2016-02-09 | 2017-08-10 | Darktrace Limited | Anomaly alert system for cyber threat detection |
US10200262B1 (en) * | 2016-07-08 | 2019-02-05 | Splunk Inc. | Continuous anomaly detection service |
US20180096261A1 (en) * | 2016-10-01 | 2018-04-05 | Intel Corporation | Unsupervised machine learning ensemble for anomaly detection |
US20180211176A1 (en) * | 2017-01-20 | 2018-07-26 | Alchemy IoT | Blended IoT Device Health Index |
US20180241762A1 (en) * | 2017-02-23 | 2018-08-23 | Cisco Technology, Inc. | Anomaly selection using distance metric-based diversity and relevance |
US20180248905A1 (en) * | 2017-02-24 | 2018-08-30 | Ciena Corporation | Systems and methods to detect abnormal behavior in networks |
US20180247220A1 (en) * | 2017-02-28 | 2018-08-30 | International Business Machines Corporation | Detecting data anomalies |
US20180374104A1 (en) | 2017-06-26 | 2018-12-27 | Sap Se | Automated learning of data aggregation for analytics |
Non-Patent Citations (1)
Title |
---|
Non-final office action issued in U.S. Appl. No. 15/633,401 dated Jan. 9, 2020, 23 pages. |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11227236B2 (en) * | 2020-04-15 | 2022-01-18 | SparkCognition, Inc. | Detection of deviation from an operating state of a device |
US11880750B2 (en) | 2020-04-15 | 2024-01-23 | SparkCognition, Inc. | Anomaly detection based on device vibration |
US11681284B2 (en) | 2021-08-04 | 2023-06-20 | Sap Se | Learning method and system for determining prediction horizon for machinery |
Also Published As
Publication number | Publication date |
---|---|
US20190007432A1 (en) | 2019-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10749881B2 (en) | Comparing unsupervised algorithms for anomaly detection | |
US11645581B2 (en) | Meaningfully explaining black-box machine learning models | |
US10353961B2 (en) | Systems and methods for conducting and terminating a technology-assisted review | |
US8966036B1 (en) | Method and system for website user account management based on event transition matrixes | |
US10191966B2 (en) | Enabling advanced analytics with large data sets | |
US20170018030A1 (en) | System and Method for Determining Credit Worthiness of a User | |
US10504028B1 (en) | Techniques to use machine learning for risk management | |
US11915311B2 (en) | User score model training and calculation | |
CN109993627B (en) | Recommendation method, recommendation model training device and storage medium | |
WO2019061664A1 (en) | Electronic device, user's internet surfing data-based product recommendation method, and storage medium | |
US20150356163A1 (en) | Methods and systems for analyzing datasets | |
CN114090601B (en) | Data screening method, device, equipment and storage medium | |
TW201928771A (en) | Method and device for classifying samples to be assessed | |
CN111708942B (en) | Multimedia resource pushing method, device, server and storage medium | |
CN110968802B (en) | Analysis method and analysis device for user characteristics and readable storage medium | |
US11461696B2 (en) | Efficacy measures for unsupervised learning in a cyber security environment | |
TW202111592A (en) | Learning model application system, learning model application method, and program | |
US20230118341A1 (en) | Inline validation of machine learning models | |
CN115659411A (en) | Method and device for data analysis | |
CN114202256A (en) | Architecture upgrading early warning method and device, intelligent terminal and readable storage medium | |
US20220172086A1 (en) | System and method for providing unsupervised model health monitoring | |
US11868860B1 (en) | Systems and methods for cohort-based predictions in clustered time-series data in order to detect significant rate-of-change events | |
CN111309706A (en) | Model training method and device, readable storage medium and electronic equipment | |
Bayram et al. | DQSOps: Data Quality Scoring Operations Framework for Data-Driven Applications | |
CN113312554B (en) | Method and device for evaluating recommendation system, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: SAP SE, GERMANY Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TAUSCHINSKY, ATREJU FLORIAN;MEUSEL, ROBERT;FRENDO, OLIVER;SIGNING DATES FROM 20170626 TO 20170628;REEL/FRAME:042867/0987 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |