US10749881B2 - Comparing unsupervised algorithms for anomaly detection - Google Patents


Info

Publication number
US10749881B2
US10749881B2
Authority
US
United States
Prior art keywords
anomaly detection
data
quantile
detection algorithms
anomalies
Prior art date
Legal status
Active, expires
Application number
US15/637,471
Other versions
US20190007432A1 (en)
Inventor
Atreju Florian Tauschinsky
Robert Meusel
Oliver Frendo
Current Assignee
SAP SE
Original Assignee
SAP SE
Application filed by SAP SE
Priority to US15/637,471
Assigned to SAP SE (assignors: Robert Meusel, Oliver Frendo, Atreju Florian Tauschinsky)
Publication of US20190007432A1
Application granted
Publication of US10749881B2

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2457 Query processing with adaptation to user needs
    • G06F 16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 Detecting local intrusion or implementing counter-measures
    • G06F 21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06K 9/6215
    • G06K 9/6277

Definitions

  • the Internet of Things is a network of physical objects, or “things,” embedded with electronics, software, sensors, and connectivity to enable and achieve greater value and service by exchanging data with the manufacturer, operator, and/or other connected devices or systems.
  • the IoT provides application gateways for data aggregation and distribution that are located between application servers and numerous devices. Because the amount of data in the IoT is very large and the data is unlabeled, it can be difficult to determine which data is anomalous.
  • Implementations of the present disclosure include computer-implemented methods for ranking anomaly detection algorithms.
  • actions include receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, generating a plurality of data distributions corresponding to the set of unlabeled data by using a plurality of anomaly detection algorithms, and ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions.
  • Other implementations of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
  • actions further include processing the set of unlabeled data to determine a set of univariate scores for each of the plurality of anomaly detection algorithms; actions further include normalizing the set of univariate scores for each of the plurality of anomaly detection algorithms; the second quantile can be based on the first quantile and a parameter, wherein the parameter is based on a width of a respective data distribution; the first quantile and the second quantile can be above 0.95; and actions further include comparing an anomaly score corresponding to a first ranked anomaly detection algorithm of the plurality of anomaly detection algorithms to an alert threshold.
  • the present disclosure also provides a computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
  • the present disclosure further provides a system for implementing the methods provided herein.
  • the system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
  • FIG. 1 depicts an example architecture that can be used to execute implementations of the present disclosure.
  • FIG. 2 depicts an example architecture that can be used to execute implementations of the present disclosure.
  • FIGS. 3A and 3B depict example graphical representations in accordance with implementations of the present disclosure.
  • FIG. 4 depicts an example process that can be executed in accordance with implementations of the present disclosure.
  • FIG. 5 is a schematic illustration of example computer systems that can be used to execute implementations of the present disclosure.
  • Implementations of the present disclosure are generally directed to ranking anomaly detection algorithms. More particularly, implementations of the present disclosure are directed to identifying the best-matching algorithm for differentiating anomalies in IoT data.
  • cloud platforms can store large amounts of unlabeled measurement data from numerous sensors. The IoT data can then be used to remotely control and manage the corresponding devices and/or to trigger object-related processes.
  • Unlabeled IoT data can include normal and abnormal data that are not differentiated by any labels.
  • IoT data anomalies are different from normal IoT data with respect to their features and are rare in a dataset compared to normal instances (e.g., less than 50% of the data). IoT data anomalies could affect associated IoT processes. Detection and removal of data anomalies can improve IoT processes.
  • Implementations can include actions of receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, generating a plurality of data distributions corresponding to the set of unlabeled data by using a plurality of anomaly detection algorithms, and ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions.
  • FIG. 1 depicts an example architecture 100 that can be used to execute implementations of the present disclosure.
  • the example architecture 100 includes one or more client devices 102 , a server system 104 and a network 106 .
  • the server system 104 includes one or more server devices 108 .
  • a user 110 interacts with the client device 102 .
  • the user 110 can be a person who interacts with an application that is hosted by the server system 104, such as an application for ranking anomaly detection algorithms.
  • the client device 102 can communicate with one or more of the server devices 108 over the network 106 .
  • the client device 102 can include any appropriate type of computing device such as a desktop computer, a laptop computer, a handheld computer, a tablet computer, a personal digital assistant (PDA), a cellular telephone, a network appliance, a camera, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or an appropriate combination of any two or more of these devices or other data processing devices.
  • the network 106 can include a large computer network, such as a local area network (LAN), a wide area network (WAN), the Internet, a cellular network, a telephone network (e.g., a public switched telephone network (PSTN)), or an appropriate combination thereof connecting any number of communication devices, mobile computing devices, fixed computing devices, and server systems.
  • each server device 108 includes at least one server and at least one data store.
  • the server devices 108 are intended to represent various forms of servers including, but not limited to an IoT server, a web server, an application server, a proxy server, a network server, and/or a server pool.
  • server systems accept requests for application services and provide such services to any number of client devices (e.g., the client device 102) over the network 106.
  • the server system 104 can be an IoT system configured to host a service for ranking anomaly detection algorithms (e.g., provided as one or more computer-executable programs executed by one or more computing devices).
  • input data can be provided to the server system 104 (e.g., from an IoT device), and the server system can process the input data through the service for ranking anomaly detection algorithms and provide result data.
  • the server system 104 can send the result data to the client device 102 over the network 106 for display to the user 110 .
  • the example context includes ranking anomaly detection algorithms for IoT data.
  • Example IoT data can include a metric that represents a series of values recorded by a sensor.
  • metrics can include acoustical, optical, thermal, electrical, mechanical, chemical, biological, positional information and other various information that can be measured by sensors.
  • the IoT data can include unlabeled data, such that anomalies in the data are not flagged or differentiated in any way from normal data before being processed with anomaly detection algorithms.
  • Anomaly detection algorithms can include statistical methods for monitoring dissimilarities between current and past sensor values of the recorded metrics to identify data anomalies. Some anomaly detection algorithms can be better than others depending on one or more characteristics of the datasets, such as the distribution of anomalies within the dataset. Ranking of anomaly detection algorithms can be applied to each dataset to identify the best anomaly detection algorithm for a particular time interval. For example, a first anomaly detection algorithm can be the best in identifying anomalies of a dataset measured by a sensor during a first time interval and a second anomaly detection algorithm, different from the first anomaly detection algorithm, can be the best in identifying anomalies of a dataset measured by the same sensor during a second time interval.
  • the data amount in the IoT can be very large and the data is processed such that only a section of the metrics is analyzed by anomaly detection algorithms at a time. For example, metrics can be filtered based on one or more rules or truncated to a particular size based on a time interval. All the recorded metrics or a portion of the metrics (e.g., corresponding to a particular time interval) can be processed by multiple anomaly detection algorithms.
  • the anomaly detection algorithms include statistical functions, such as principal component analysis (PCA)-based approaches, linear regression, neural network approaches and others.
  • the data processed by the anomaly detection algorithms can be displayed as histograms that include quantile plots.
  • a portion of the quantiles can be selected to rank the anomaly detection algorithms.
  • the portion of the quantiles can be between approximately 0.97 and 0.999 in steps of 0.001.
  • user 110 can modify the quantile range for one or more sensors.
  • the user 110 can modify the quantile range based on knowledge about the expected anomaly rate, which can improve the performance of ranking the anomaly detection algorithms.
  • the anomaly detection algorithms can be ranked using a cross-validation based approach or a distribution-based approach.
  • the cross-validation based approach includes determining a k-fold cross validation on data, a regression analysis and a classification analysis.
  • the regression analysis includes normalizing the scores using a hard box or quantiles as threshold and calculating the proportion of variance in data for each point relative to the total variance.
  • the classification analysis includes using results from the algorithm trained on a fold as reference value for each fold, calculating discrete scores based on a selected threshold, and determining a classification matrix and a derived quantity per matrix.
  • the distribution-based approach is based on the assumption that algorithms that produce score distributions that are bi-modal or fat-tailed are better. Bi-modal distributions can be classified based on the value of the normal/anomaly ratio. For example, a normal/anomaly ratio of approximately 50/50 can be classified as bad (or unrealistic), and larger ratios, such as 90/10, can be classified as good (or realistic).
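As an illustration of this heuristic, a minimal sketch follows; the 10% cutoff is an assumed, illustrative value, since the disclosure only contrasts a 50/50 split with a 90/10 split:

```python
def classify_score_split(anomaly_fraction):
    """Classify a score distribution's normal/anomaly split.
    A roughly 50/50 split is unrealistic because anomalies are rare;
    a large majority of normal points (e.g., 90/10) is realistic.
    The 0.10 cutoff is illustrative, not taken from the disclosure."""
    return "realistic" if anomaly_fraction <= 0.10 else "unrealistic"
```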
  • the distribution-based approach can be configured to consider only unilateral tails (e.g., higher quantile tails).
  • the distribution-based approach can include a quantification method.
  • the quantification method can include one or more criteria or a combination of the criteria.
  • the quantification method can include a differentiation between one-sided and bi-modal distribution, an identification of benchmark anomalies and a clustering method.
  • the identification of benchmark anomalies can include determining the distance between a first quantile and a second quantile.
  • the first quantile and the second quantile can be provided by the user 110 or can be set to a reasonable range of quantiles (e.g., 0.95 and 0.99).
  • the identification of benchmark anomalies can be based on one of three quality measures M1, M2, and M3.
  • the method based on quality measure M1 determines the distance between a first quantile q and a second quantile q−δ, which makes the results constant for uniform distributions.
  • M1 can be defined by:
  • M1 = max over q∈Q of (quantile(s_A, q) − quantile(s_A, q−δ)), where δ can be a free parameter on the order of 1e−3 and s_A represents the set of normalized scores from the interval [0, 1] for a particular algorithm.
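A minimal numpy sketch of M1 as defined above; the quantile grid Q is an assumption, following the 0.97 to 0.999 range in steps of 0.001 mentioned earlier:

```python
import numpy as np

def quality_m1(scores, quantiles=None, delta=1e-3):
    """M1 = max over q in Q of (quantile(s_A, q) - quantile(s_A, q - delta)).
    scores are the normalized scores s_A in [0, 1] for one algorithm.
    A large M1 means a large jump in score space near a high quantile,
    i.e., a clear gap between normal scores and anomaly scores."""
    s = np.asarray(scores, dtype=float)
    if quantiles is None:
        # assumed grid Q: 0.97 to 0.999 in steps of 0.001
        quantiles = np.arange(0.97, 0.9991, 0.001)
    return float(max(np.quantile(s, q) - np.quantile(s, q - delta)
                     for q in quantiles))
```

For uniformly distributed scores each quantile difference is roughly delta, so the measure stays near-constant, as intended; a bi-modal score distribution produces a much larger value.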
  • the method based on quality measure M2 determines the distance between a first quantile q and a second quantile q−δ, and weights the result by the number of scores s greater than quantile q.
  • the distance between quantile 0.99 and quantile 0.98 is calculated in score space, and the result is weighted by the number of scores greater than the 0.99 quantile.
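Under the same assumptions as for M1 (normalized scores in [0, 1]), M2 might be sketched as:

```python
import numpy as np

def quality_m2(scores, q=0.99, delta=0.01):
    """M2: the distance in score space between quantile q and quantile
    q - delta (e.g., 0.99 and 0.98), weighted by the number of scores
    greater than the q-quantile."""
    s = np.asarray(scores, dtype=float)
    gap = np.quantile(s, q) - np.quantile(s, q - delta)
    weight = int(np.sum(s > np.quantile(s, q)))
    return float(gap * weight)
```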
  • the method based on quality measure M3 determines how far one must go in score space to find the same number of scores as are contained in an interval above the current point.
  • M3 can be defined in terms of the score s99 corresponding to the 99th quantile, which can be determined from quantile(scores, 0.99).
  • the parameter δ is a free parameter that can be set automatically based on the width of the distribution. It is determined how far to go down in score space to find as many measurements with a score lower than s99 as are contained in the selected interval above s99.
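The description of M3 leaves room for interpretation; the sketch below implements one possible reading (count the measurements in a small interval just above s99, then measure how far below s99 one must descend to collect the same number). The fixed interval width eps stands in for the automatically set parameter:

```python
import numpy as np

def quality_m3(scores, q=0.99, eps=0.01):
    """One possible reading of M3: let s_q be the score at quantile q,
    count the measurements in the interval (s_q, s_q + eps] above it,
    then return how far below s_q one must go in score space to find
    the same number of measurements."""
    s = np.sort(np.asarray(scores, dtype=float))
    s_q = np.quantile(s, q)
    n_above = int(np.sum((s > s_q) & (s <= s_q + eps)))
    below = s[s <= s_q]
    if n_above == 0 or below.size < n_above:
        return 0.0
    # the n_above-th largest score at or below s_q marks the distance
    return float(s_q - below[-n_above])
```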
  • FIG. 2 is a block diagram illustrating an example cloud IoT platform 200 for ranking anomaly detection algorithms.
  • the example IoT platform 200 can be a cloud IoT platform, configured to collect data from numerous devices and sensors, to perform anomaly detection algorithms on the collected data, to process the data based on the results of the anomaly detection algorithms and to store the data in the cloud (e.g., scalable server system).
  • the system 200 can include one or more customer device entities 202 , a device integration component 204 , a cloud 206 , a process integration component 208 , and on premise entity 210 .
  • Each customer device entity 202 includes a device 212 and a cloud connector 214 .
  • the device 212 can be a physical object including, or attached to one or more sensors.
  • the sensors can be a part of the device 212 or external objects that use the device 212 as a hub.
  • a metric can represent a series of values recorded by a sensor. For example, metrics can include temperature, humidity, wind, speed, geographic coordinates, sound, etc.
  • the data transmitted by the customer device entities 202 to the cloud 206 can include the metrics.
  • the cloud connector 214 can integrate the device 212 to the cloud 206 using the device integration component 204 .
  • the cloud 206 supports process integration 208 of processes associated with the devices 204 with an on premise entity 210 that includes systems 216 and a common metadata repository 218 .
  • the cloud 206 includes IoT applications 220 .
  • the IoT applications 220 can be executed on a database cloud platform 222 .
  • the IoT applications 220 can use database cloud IoT services 224 to communicate with the devices 204 , and can use database cloud integration services 226 to communicate with the on premise entities 210 .
  • the IoT applications 220 can include an application for ranking anomaly detection algorithms that can be used to process the data received from devices 204 .
  • the cloud 206 includes a database big data platform 228 that can serve as a platform for the IoT applications 220 and includes a data processing component 230 and a streaming component 232 .
  • the data processing component 230 can include in-memory engines 234 for executing instructions, an extended storage component 236 for storing data, and a Hadoop framework 238 that supports distributed storage and data processing.
  • the results generated by the cloud 206 can be transmitted to the on premise entity 210 using the process integration component 208 .
  • the on premise entity 210 can include a plurality of systems 216 that are associated to a common metadata repository 218 .
  • the common metadata repository 218 can be based on a meta-model that supports visualization of results generated by ranking anomaly detection algorithms for one or more data sets.
  • FIG. 3A depicts an example of a graphical representation 300 of a step of ranking anomaly detection algorithms for a single dataset.
  • the data set can include a metric recorded by an IoT device, as described with reference to FIGS. 1 and 2 .
  • the graphical representation 300 includes multiple examples of statistical distributions 302 , 304 , 306 , 308 , 310 .
  • the examples of statistical distributions 302, 304, 306, 308, 310 can be histograms associated with a plurality of anomaly detection algorithms that display the single dataset as value 301 versus count (or density) 303.
  • Some examples of anomaly detection algorithms include principal component analysis (PCA)-based approaches, linear regression, neural network approaches and others.
  • the histograms can include quantile plots.
  • quantiles divide the dataset into portions such that each portion contains the same amount of data.
  • the quantiles correspond to percentiles, such that the dataset is displayed as a histogram formed of 100 parts of equal size.
  • the anomalies are expected to be in the tail region of the examples of statistical distributions 302 , 304 , 306 , 308 , 310 .
  • the examples of statistical distributions 302 , 304 , 306 , 308 , 310 illustrated in FIG. 3A indicate that some statistical methods are better than others at differentiating normal data from anomalies.
  • statistical distribution 302 identifies a minimal portion of the data as potentially being abnormal and has a short tail.
  • Statistical distribution 304 is normally distributed and has a longer tail than statistical distribution 302, indicating that a larger portion of the data is potentially abnormal.
  • statistical distributions 306 and 308 are bi-modal.
  • the example statistical distribution 306 illustrates the anomalies as being normally distributed.
  • the example statistical distribution 308 identifies a large majority of anomalies as having a constant value.
  • the example statistical distribution 310 identifies the anomalies as being uniformly distributed.
  • FIG. 3B depicts an example of a graphical representation 350 of another step of ranking anomaly detection algorithms for a single dataset.
  • the data set corresponds to the dataset used for the graphical representation 300 , as described with reference to FIG. 3A .
  • the graphical representation 350 includes multiple examples of quantifications of statistical distributions 312 , 314 , 316 , 318 , 320 .
  • the examples of quantifications of statistical distributions 312 , 314 , 316 , 318 , 320 illustrate a plurality of anomaly scores 313 within a quantile interval 311 for previously determined statistical distributions, such as example statistical distributions 302 , 304 , 306 , 308 , 310 described with reference to FIG. 3A .
  • the anomaly scores 313 can be determined using any of the measures M1, M2, and M3 described with reference to FIG. 1.
  • the quantifications of statistical distributions 312, 314, 316, 318, 320 can be compared with each other to identify the best anomaly detection algorithm for the analyzed dataset according to one or more classification criteria.
  • the quantification of statistical distributions 312 and 314 present similar profiles. Quantitative comparison between the quantification of statistical distributions 312 and 314 indicates that statistical distribution 314 is better at identifying anomalies than statistical distribution 312 . In particular, the average, the maximum, and/or total value of the quantification of statistical distribution 314 is higher than the average, the maximum, and/or total value of the quantification of statistical distribution 312 .
  • the higher the differentiation between anomalies and normal data points in a dataset, the better the anomaly detection algorithm.
  • Differentiation can be defined by assigning high scores to the anomalies and comparably low scores to the normal data points.
  • the quantifications of statistical distributions 316 and 318 each include a peak, which indicates that the statistical distributions 316 and 318 are better than statistical distributions 312 and 314 .
  • the quantification of statistical distribution 318 includes the highest peak, which indicates that statistical distribution 318 is the best anomaly detection algorithm from the analyzed statistical distributions for the selected dataset.
  • FIG. 4 depicts an example process 400 that can be provided by one or more computer-executable programs executed using one or more computing devices, as described with reference to FIGS. 1-3 .
  • the example process 400 is executed to rank anomaly detection algorithms in accordance with implementations of the present disclosure.
  • the process 400 can be based on a distribution-based approach, which is based on the assumption that anomaly detection algorithms that produce score distributions that are bi-modal/fat tailed are ‘better.’
  • a set of unlabeled data is received by one or more processors from one or more sensors in a plurality of sensors of an internet of things (IoT) ( 402 ).
  • the plurality of sensors can be part of an IoT device or can be external objects attached to an IoT device.
  • the data can be multi-variate (e.g., multiple different sensors can be used by an anomaly detection algorithm to determine anomaly scores).
  • the data can include a metric, such as a series of values recorded by the sensor.
  • the metrics can include incoming/outgoing data volume, temperature, humidity, wind, speed, geographic coordinates, sound, or any other values reflecting a functionality of an IoT device.
  • the metrics are processed by a variety of anomaly detection algorithms ( 404 ).
  • anomaly detection algorithms include principal component analysis (PCA)-based approaches, linear regression, neural network approaches and others.
  • the processing can include using the anomaly detection algorithms to calculate scores from the sensor data.
  • Processing the metrics with each anomaly detection algorithm results in one set of univariate scores per algorithm.
  • the scores can be normalized using normalization constants so that normalized scores of all anomaly detection algorithms are within preselected intervals (e.g., interval [0, 1]). Normalization constants may be calibrated for each sensor in the modeling stage or for a combination of sensors.
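A minimal min-max normalization sketch; the constants lo and hi stand in for per-sensor calibration values from a modeling stage, and the names are illustrative:

```python
import numpy as np

def normalize_scores(scores, lo=None, hi=None):
    """Map raw anomaly scores into the preselected interval [0, 1].
    lo and hi act as normalization constants; when omitted they are
    taken from the data itself rather than a prior modeling stage."""
    s = np.asarray(scores, dtype=float)
    lo = float(s.min()) if lo is None else lo
    hi = float(s.max()) if hi is None else hi
    if hi == lo:
        return np.zeros_like(s)  # degenerate case: all scores identical
    return np.clip((s - lo) / (hi - lo), 0.0, 1.0)
```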
  • a plurality of data distributions corresponding to the set of unlabeled data is generated by using a plurality of anomaly detection algorithms ( 406 ).
  • For example, normalized data can be used to generate the data distributions.
  • Data distributions can be based on anomaly detection algorithms configured to illustrate dissimilarities in recorded sensor values to identify data anomalies.
  • the data distributions can be displayed as histograms that include quantile plots. The distance between the quantiles can be selected in steps of predefined size (e.g., 0.001) that are constant between the anomaly detection algorithms.
  • a first quantile and a second quantile of each of the plurality of data distributions are selected to determine a distance between them ( 408 ).
  • the one or more processors can receive an input from a user indicating the first quantile and the second quantile for one or more sensors. For example, users can modify the distance based on knowledge about the expected anomaly rate, which can improve the performance of ranking the anomaly detection algorithms.
  • the one or more processors can retrieve the selection of the first quantile and the second quantile from a database. The first quantile and the second quantile can be selected based on one or more conditions. One condition can include a requirement for the first quantile and the second quantile to be above a preset threshold (e.g., 0.95).
  • the first quantile can be approximately 0.97 and the second quantile can be approximately 0.999.
  • the distance between the first quantile and the second quantile can be calculated by defining the second quantile as the first quantile and a parameter, where the parameter indicates the width of the respective data distribution.
  • the anomaly detection algorithms are ranked to determine how suitable each of the anomaly detection algorithms is for detecting anomalies in the data set ( 410 ).
  • the anomaly detection algorithms can be ranked based on comparing the distances between the first quantile and the second quantile of the anomaly detection algorithms. For example, the best (first ranked) anomaly detection algorithm corresponds to the largest distance between the first quantile and the second quantile, and the worst anomaly detection algorithm corresponds to the smallest distance between the first quantile and the second quantile.
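The ranking step can be sketched as follows; the quantile pair 0.97/0.999 follows the range mentioned earlier, and the algorithm names are placeholders:

```python
import numpy as np

def rank_algorithms(score_sets, q1=0.97, q2=0.999):
    """Rank anomaly detection algorithms by the distance between two high
    quantiles of their normalized score distributions; the largest
    distance ranks first, the smallest ranks last.
    score_sets maps an algorithm name to its set of univariate scores."""
    distances = {name: float(np.quantile(s, q2) - np.quantile(s, q1))
                 for name, s in score_sets.items()}
    return sorted(distances, key=distances.get, reverse=True)
```

For example, an algorithm whose scores are bi-modal (a clear gap before the anomaly tail) ranks above one whose scores are uniformly distributed.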
  • the process 400 can include determining an anomaly score for the first ranked anomaly detection algorithm of the plurality of anomaly detection algorithms.
  • the anomaly score can include an amount of data identified as being anomalous by the first ranked anomaly detection algorithm.
  • the process 400 can include comparing the anomaly score to an alert threshold. If the maximum anomaly score exceeds the alert threshold, an alert can be generated indicating the anomaly and the associated IoT device.
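A sketch of the alert comparison; both threshold values are illustrative assumptions, since the disclosure does not fix them:

```python
import numpy as np

def check_alert(scores, score_threshold=0.95, alert_threshold=0.05):
    """Take the anomaly score of the first-ranked algorithm as the fraction
    of measurements it scores above score_threshold, and raise an alert
    when that fraction exceeds alert_threshold."""
    s = np.asarray(scores, dtype=float)
    anomaly_score = float(np.mean(s > score_threshold))
    return anomaly_score, anomaly_score > alert_threshold
```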
  • the process 400 can include removing the data anomaly based on the anomaly identification of the best anomaly detection algorithm before transmitting the data within the IoT domain for remotely controlling and managing the corresponding devices and/or to trigger object-related processes.
  • the process 400 can include updating a setting (e.g., software component) of a device and/or upgrading (e.g., replacing) an element (e.g., hardware component) of the device that generated data anomaly to prevent future anomalies.
  • Process 400 can be repeated for each data type (corresponding to each sensor) at particular time intervals considering that an anomaly detection algorithm identified as the best match for a data set can be different from the best match for another data set (e.g., a data set measured at a different time or by a different sensor).
  • the process 400 can be based on a cross-validation based approach.
  • the cross-validation based approach includes determining a k-fold cross validation on data, a regression analysis and a classification analysis.
  • the regression analysis includes normalizing the scores using a hard box or quantiles as threshold and calculating the proportion of variance in data for each point relative to the total variance.
  • the classification analysis includes using results from the algorithm trained on a fold as reference value for each fold, calculating discrete scores based on a selected threshold, and determining a classification matrix and a derived quantity per matrix.
  • Implementations of the present disclosure provide one or more of the following example advantages.
  • Methods for anomaly detection, particularly in the IoT space, can use un-labelled data.
  • the use of un-labelled data makes it very difficult to compare the performance of different algorithms on a given dataset, and consequently, to choose the most suitable algorithm from a set of possible methods.
  • Ranking anomaly detection algorithms for IoT data can provide an “automated mode” to identify a best matching anomaly detection algorithm for a particular data set. Automatic identification of the best anomaly detection algorithm could eliminate manual data analysis for anomaly detection.
  • An IoT analytics platform based on open source software may be designed to greatly minimize the complexities of ingesting and processing massive amounts of data generated in IoT scenarios. Detection and removal of data anomalies can improve IoT processes and the functionality of one or more IoT devices that can depend on the received data.
  • the system 500 can be used for the operations described in association with the implementations described herein.
  • the system 500 may be included in any or all of the server components discussed herein.
  • the system 500 includes a processor 510 , a memory 520 , a storage device 530 , and an input/output device 540 .
  • the components 510 , 520 , 530 , 540 are interconnected using a system bus 550 .
  • the processor 510 is capable of processing instructions for execution within the system 500 .
  • the processor 510 is a single-threaded processor.
  • the processor 510 is a multi-threaded processor.
  • the processor 510 is capable of processing instructions stored in the memory 520 or on the storage device 530 to display graphical information for a user interface on the input/output device 540 .
  • the memory 520 stores information within the system 500 .
  • the memory 520 is a computer-readable medium.
  • the memory 520 is a volatile memory unit.
  • the memory 520 is a non-volatile memory unit.
  • the storage device 530 is capable of providing mass storage for the system 500 .
  • the storage device 530 is a computer-readable medium.
  • the storage device 530 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device.
  • the input/output device 540 provides input/output operations for the system 500 .
  • the input/output device 540 includes a keyboard and/or pointing device.
  • the input/output device 540 includes a display unit for displaying graphical user interfaces.
  • the features described can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them.
  • the apparatus can be implemented in a computer program product tangibly embodied in an information carrier (e.g., in a machine-readable storage device, for execution by a programmable processor), and method steps can be performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output.
  • the described features can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device.
  • a computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result.
  • a computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer.
  • a processor will receive instructions and data from a read-only memory or a random access memory or both.
  • Elements of a computer can include a processor for executing instructions and one or more memories for storing instructions and data.
  • a computer can also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks.
  • Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
  • the features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.
  • the features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them.
  • the components of the system can be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, for example, a LAN, a WAN, and the computers and networks forming the Internet.
  • the computer system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a network, such as the described one.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
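The ranking and alert steps enumerated above can be sketched as follows. This is a minimal illustration assuming NumPy; the algorithm names, the quantile pair (0.95, 0.99), and the alert threshold are example values chosen here, not values prescribed by the disclosure:

```python
import numpy as np

def rank_and_alert(score_sets, q_lo=0.95, q_hi=0.99, alert_threshold=0.9):
    """Rank anomaly detection algorithms by the distance between two high
    quantiles of their normalized score distributions, then compare the
    top-ranked algorithm's maximum anomaly score to an alert threshold.

    score_sets maps algorithm names to arrays of normalized scores in [0, 1].
    """
    # Larger quantile distance -> better separation of anomalies -> higher rank.
    distances = {name: float(np.quantile(s, q_hi) - np.quantile(s, q_lo))
                 for name, s in score_sets.items()}
    ranking = sorted(distances, key=distances.get, reverse=True)
    best = ranking[0]
    # Alert if the best algorithm's maximum anomaly score exceeds the threshold.
    alert = float(np.max(score_sets[best])) > alert_threshold
    return ranking, alert
```

A bi-modal score set with a well-separated anomaly mode produces a large distance between the two quantiles and therefore ranks first.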

Abstract

Methods, systems, and computer-readable storage media for ranking anomaly detection algorithms, including operations of receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, generating a plurality of data distributions corresponding to the set of unlabeled data by using a plurality of anomaly detection algorithms, and ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions.

Description

BACKGROUND
The Internet of Things (IoT) is a network of physical objects, or “things,” embedded within electronics, software, sensors, and connectivity to enable and achieve greater value and service by exchanging data with the manufacturer, operator, and/or other connected devices or systems. The IoT provides application gateways for data aggregation and distribution that are located between application servers and numerous devices. Because the data amount in the IoT is very large and unlabeled, it can be difficult to determine data that is anomalous.
SUMMARY
Implementations of the present disclosure include computer-implemented methods for ranking anomaly detection algorithms. In some implementations, actions include receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, generating a plurality of data distributions corresponding to the set of unlabeled data by using a plurality of anomaly detection algorithms, and ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions. Other implementations of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
These and other implementations can each optionally include one or more of the following features: actions further include processing the set of unlabeled data to determine a set of univariate scores for each of the plurality of anomaly detection algorithms; actions further include normalizing the set of univariate scores for each of the plurality of anomaly detection algorithms; the second quantile can be based on the first quantile and a parameter, wherein the parameter is based on a width of a respective data distribution; the first quantile and the second quantile can be above 0.95; and actions further include comparing an anomaly score corresponding to a first ranked anomaly detection algorithm of the plurality of anomaly detection algorithms to an alert threshold.
The present disclosure also provides a computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
The present disclosure further provides a system for implementing the methods provided herein. The system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
It is appreciated that methods in accordance with the present disclosure can include any combination of the aspects and features described herein. That is, methods in accordance with the present disclosure are not limited to the combinations of aspects and features specifically described herein, but also include any combination of the aspects and features provided.
The details of one or more implementations of the present disclosure are set forth in the accompanying drawings and the description below. Other features and advantages of the present disclosure will be apparent from the description and drawings, and from the claims.
DESCRIPTION OF DRAWINGS
FIG. 1 depicts an example architecture that can be used to execute implementations of the present disclosure.
FIG. 2 depicts an example architecture that can be used to execute implementations of the present disclosure.
FIGS. 3A and 3B depict example graphical representations in accordance with implementations of the present disclosure.
FIG. 4 depicts an example process that can be executed in accordance with implementations of the present disclosure.
FIG. 5 is a schematic illustration of example computer systems that can be used to execute implementations of the present disclosure.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
Implementations of the present disclosure are generally directed to ranking anomaly detection algorithms. More particularly, implementations of the present disclosure are directed to identifying a best matching algorithm for differentiating anomalies in IoT data. In the context of the IoT, cloud platforms can store large amounts of unlabeled measurement data from numerous sensors. The IoT data can then be used to remotely control and manage the corresponding devices and/or to trigger object-related processes. Unlabeled IoT data can include normal and abnormal data that are not differentiated by any labels. IoT data anomalies differ from normal IoT data with respect to their features and are rare in a dataset (e.g., less than 50%) compared to normal instances. IoT data anomalies can affect associated IoT processes. Detection and removal of data anomalies can improve IoT processes.
Implementations can include actions of receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, generating a plurality of data distributions corresponding to the set of unlabeled data by using a plurality of anomaly detection algorithms, and ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions.
FIG. 1 depicts an example architecture 100 that can be used to execute implementations of the present disclosure. In the depicted example, the example architecture 100 includes one or more client devices 102, a server system 104 and a network 106. The server system 104 includes one or more server devices 108. In the depicted example, a user 110 interacts with the client device 102. In an example context, the user 110 can include a user, who interacts with an application that is hosted by the server system 104, such as an application for ranking anomaly detection algorithms.
In some examples, the client device 102 can communicate with one or more of the server devices 108 over the network 106. In some examples, the client device 102 can include any appropriate type of computing device such as a desktop computer, a laptop computer, a handheld computer, a tablet computer, a personal digital assistant (PDA), a cellular telephone, a network appliance, a camera, a smart phone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, or an appropriate combination of any two or more of these devices or other data processing devices.
In some implementations, the network 106 can include a large computer network, such as a local area network (LAN), a wide area network (WAN), the Internet, a cellular network, a telephone network (e.g., PSTN) or an appropriate combination thereof connecting any number of communication devices, mobile computing devices, fixed computing devices and server systems.
In some implementations, each server device 108 includes at least one server and at least one data store. In the example of FIG. 1, the server devices 108 are intended to represent various forms of servers including, but not limited to, an IoT server, a web server, an application server, a proxy server, a network server, and/or a server pool. In general, server systems accept requests for application services and provide such services to any number of client devices (e.g., the client device 102) over the network 106.
In accordance with implementations of the present disclosure, the server system 104 can be an IoT system configured to host a service for ranking anomaly detection algorithms (e.g., provided as one or more computer-executable programs executed by one or more computing devices). For example, input data can be provided to the server system 104 (e.g., from an IoT device), and the server system can process the input data through the service for ranking anomaly detection algorithms and provide result data. For example, the server system 104 can send the result data to the client device 102 over the network 106 for display to the user 110.
Implementations of the present disclosure are described in detail herein with reference to an example context. The example context includes ranking anomaly detection algorithms for IoT data. Example IoT data can include a metric that represents a series of values recorded by a sensor. For example, metrics can include acoustical, optical, thermal, electrical, mechanical, chemical, biological, positional information and other various information that can be measured by sensors. The IoT data can include unlabeled data, such that anomalies in the data are not flagged or differentiated in any way from normal data before being processed with anomaly detection algorithms.
Anomaly detection algorithms can include statistical methods for monitoring dissimilarities between current and past sensor values of the recorded metrics to identify data anomalies. Some anomaly detection algorithms can be better than others depending on one or more characteristics of the datasets, such as the distribution of anomalies within the dataset. Ranking of anomaly detection algorithms can be applied to each dataset to identify the best anomaly detection algorithm for a particular time interval. For example, a first anomaly detection algorithm can be the best in identifying anomalies of a dataset measured by a sensor during a first time interval and a second anomaly detection algorithm, different from the first anomaly detection algorithm, can be the best in identifying anomalies of a dataset measured by the same sensor during a second time interval.
In some implementations, the data amount in the IoT can be very large and the data is processed such that only a section of the metrics is analyzed by anomaly detection algorithms at a time. For example, metrics can be filtered based on one or more rules or truncated to a particular size based on a time interval. All the recorded metrics or a portion of the metrics (e.g., corresponding to a particular time interval) can be processed by multiple anomaly detection algorithms. The anomaly detection algorithms include statistical functions, such as principal component analysis (PCA)-based approaches, linear regression, neural network approaches and others.
The data is displayed by anomaly detection algorithms as histograms that include quantile plots. A portion of the quantiles can be selected to rank the anomaly detection algorithms. For typical IoT anomaly detection workloads, the portion of the quantiles can be between approximately 0.97 and 0.999 in steps of 0.001. In some implementations, the user 110 can modify the quantile range for one or more sensors. For example, the user 110 can modify the quantile range based on knowledge about the expected anomaly rate, which can improve the performance of ranking the anomaly detection algorithms.
The anomaly detection algorithms can be ranked using a cross-validation based approach or a distribution-based approach. The cross-validation based approach includes determining a k-fold cross validation on data, a regression analysis and a classification analysis. The regression analysis includes normalizing the scores using a hard box or quantiles as threshold and calculating the proportion of variance in data for each point relative to the total variance. The classification analysis includes using results from the algorithm trained on a fold as reference value for each fold, calculating discrete scores based on a selected threshold, and determining a classification matrix and a derived quantity per matrix.
The distribution-based approach is based on the assumption that algorithms that produce score distributions that are bi-modal or fat tailed are better. Bi-modal distributions can be classified based on the value of the normal/anomaly ratio. For example, a normal/anomaly ratio of approximately 50/50 can be classified as bad (or unrealistic), and larger ratios, such as 90/10, can be classified as good (or realistic). The distribution-based approach can be configured to consider only one-sided tails (e.g., higher quantile tails).
The distribution based approach can include a quantification method. The quantification method can include one or more criteria or a combination of the criteria. For example, the quantification method can include a differentiation between one-sided and bi-modal distribution, an identification of benchmark anomalies and a clustering method. The identification of benchmark anomalies can include determining the distance between a first quantile and a second quantile. In some implementations, the first quantile and the second quantile can be provided by the user 110 or can be set to a reasonable range of quantiles (e.g., 0.95 and 0.99).
The identification of benchmark anomalies can be based on one of the three quality measures M1, M2, and M3.
The method based on quality measure M1 determines the distance between a first quantile, q, and a second quantile, q − ε, which makes the results constant for uniform distributions. M1 can be defined by:
M1 = max_{q∈Q}(quantile(s_A, q) − quantile(s_A, q − ε)), where ε can be a free parameter on the order of 1e−3 and s_A represents the set of normalized scores from the interval [0, 1] for a particular algorithm.
For example, when looking at quantile 0.99, the quantification score (e.g., quality metric) is the distance, in score space, between the score corresponding to quantile 0.99 and the score corresponding to quantile 0.98 = 0.99 − ε.
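A minimal NumPy sketch of M1, assuming normalized scores and the quantile grid of approximately 0.97 to 0.999 in steps of 0.001 described above (the function name is illustrative):

```python
import numpy as np

def m1_score(scores, quantiles=None, eps=1e-3):
    """Quality measure M1: the largest distance, in score space, between
    quantile q and quantile q - eps over a grid of high quantiles."""
    if quantiles is None:
        quantiles = np.arange(0.97, 0.9991, 0.001)  # grid from the text
    gaps = np.quantile(scores, quantiles) - np.quantile(scores, quantiles - eps)
    return float(gaps.max())
```

Consistent with the definition, a uniform score distribution yields a constant gap of about eps, while a bi-modal distribution with a separated anomaly mode yields a much larger maximum gap.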
The method based on quality measure M2 determines the distance between a first quantile, q, and a second quantile, q − ε, and weights the result by the number of scores s bigger than the quantile q. M2 can be defined by:
M2 = max_{q∈Q}((quantile(s_A, q) − quantile(s_A, q − ε)) * count(s > q))
For example, when looking at quantile 0.99, the distance between quantile 0.99 and quantile 0.98 is calculated in score space, and the result is weighted by the number of scores bigger than 0.99.
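Under the same assumptions, M2 can be sketched by weighting each quantile distance. This sketch reads count(s > q) as the number of scores above the quantile level q itself, which is well defined because the scores are normalized to [0, 1]:

```python
import numpy as np

def m2_score(scores, quantiles=None, eps=1e-3):
    """Quality measure M2: quantile distance weighted by the number of
    normalized scores above the quantile level q."""
    scores = np.asarray(scores)
    if quantiles is None:
        quantiles = np.arange(0.97, 0.9991, 0.001)
    gaps = np.quantile(scores, quantiles) - np.quantile(scores, quantiles - eps)
    # Weight each gap by how many scores exceed the quantile level q.
    weights = np.array([(scores > q).sum() for q in quantiles])
    return float((gaps * weights).max())
```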
The method based on quality measure M3 determines how far one must go back in score space to find the same number of scores as lie in an interval above the current point. M3 can be defined by:
M3 = max_{q∈Q}(s_0 − quantile(s_A, q − N/|s_A|)), where |s_A| denotes the number of scores, σ can be determined from the standard deviation of s_A or pre-determined by the user, and s_0 and N are defined as:
s_0 = quantile(s_A, q) and N = count(s_A > s_0; s_A < s_0 + σ)
For example, when looking at quantile 0.99, it is determined how many scores N lie in the interval s_99 < score < s_99 + σ. The score s_99, corresponding to the 99th quantile, can be determined from quantile(scores, 0.99). The parameter σ is a free parameter that can be set automatically based on the width of the distribution. The measure then determines how far to go back in score space to find N measurements with a score lower than s_99.
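A sketch of M3 under the reading above, assuming NumPy; the function name and the fallback of deriving σ from the standard deviation are illustrative choices:

```python
import numpy as np

def m3_score(scores, quantiles=None, sigma=None):
    """Quality measure M3: distance back in score space needed to find as
    many scores as lie in the interval (s0, s0 + sigma) above quantile q."""
    scores = np.asarray(scores)
    if quantiles is None:
        quantiles = np.arange(0.97, 0.9991, 0.001)
    if sigma is None:
        sigma = float(scores.std())  # sigma may also be user-provided
    n = len(scores)
    best = 0.0
    for q in quantiles:
        s0 = np.quantile(scores, q)
        # N: scores in the interval just above the current point s0.
        count = int(((scores > s0) & (scores < s0 + sigma)).sum())
        q_back = max(q - count / n, 0.0)
        best = max(best, float(s0 - np.quantile(scores, q_back)))
    return best
```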
FIG. 2 is a block diagram illustrating an example cloud IoT platform 200 for ranking anomaly detection algorithms. The example IoT platform 200 can be a cloud IoT platform, configured to collect data from numerous devices and sensors, to perform anomaly detection algorithms on the collected data, to process the data based on the results of the anomaly detection algorithms and to store the data in the cloud (e.g., scalable server system).
The system 200 can include one or more customer device entities 202, a device integration component 204, a cloud 206, a process integration component 208, and on premise entity 210. Each customer device entity 202 includes a device 212 and a cloud connector 214. The device 212 can be a physical object including, or attached to one or more sensors. The sensors can be a part of the device 212 or external objects that use the device 212 as a hub. A metric can represent a series of values recorded by a sensor. For example, metrics can include temperature, humidity, wind, speed, geographic coordinates, sound, etc. The data transmitted by the customer device entities 202 to the cloud 206 can include the metrics. The cloud connector 214 can integrate the device 212 to the cloud 206 using the device integration component 204. The cloud 206 supports process integration 208 of processes associated with the devices 204 with an on premise entity 210 that includes systems 216 and a common metadata repository 218.
The cloud 206 includes IoT applications 220. The IoT applications 220 can be executed on a database cloud platform 222. The IoT applications 220 can use database cloud IoT services 224 to communicate with the devices 204, and can use database cloud integration services 226 to communicate with the on premise entities 210. The IoT applications 220 can include an application for ranking anomaly detection algorithms that can be used to process the data received from devices 204.
The cloud 206 includes a database big data platform 228 that can serve as a platform for the IoT applications 220 and includes a data processing component 230 and a streaming component 232. The data processing component 230 can include in-memory engines 234 for executing instructions, an extended storage component 236 for storing data, and a Hadoop framework 238 that supports distributed storage and data processing. The results generated by the cloud 206 can be transmitted to the on premise entity 210 using the process integration component 208. The on premise entity 210 can include a plurality of systems 216 that are associated to a common metadata repository 218. The common metadata repository 218 can be based on a meta-model that supports visualization of results generated by ranking anomaly detection algorithms for one or more data sets.
FIG. 3A depicts an example of a graphical representation 300 of a step of ranking anomaly detection algorithms for a single dataset. The data set can include a metric recorded by an IoT device, as described with reference to FIGS. 1 and 2. The graphical representation 300 includes multiple examples of statistical distributions 302, 304, 306, 308, 310.
The examples of statistical distributions 302, 304, 306, 308, 310 can be histograms associated to a plurality of anomaly detection algorithms that display the single dataset as value 301 per count (or density) 303. Some examples of anomaly detection algorithms include principal component analysis (PCA)-based approaches, linear regression, neural network approaches and others.
The histograms can include quantile plots. The quantiles divide the dataset into portions such that each portion contains the same amount of data. In the example of FIG. 3A, the quantiles correspond to percentiles, such that the dataset is displayed as a histogram formed of 100 parts of equal size. The anomalies are expected to be in the tail region of the examples of statistical distributions 302, 304, 306, 308, 310. The examples of statistical distributions 302, 304, 306, 308, 310 illustrated in FIG. 3A indicate that some statistical methods are better than others at differentiating normal data from anomalies. For example, statistical distribution 302 identifies a minimal portion of the data as potentially being abnormal and has a short tail. Statistical distribution 304 has a longer tail than statistical distribution 302, identifying a larger, approximately normally distributed portion of the data as potentially being abnormal. As another example, statistical distributions 306 and 308 are bi-modal. The example statistical distribution 306 illustrates the anomalies as being normally distributed. The example statistical distribution 308 identifies a large majority of anomalies as having a constant value. The example statistical distribution 310 identifies the anomalies as being uniformly distributed.
FIG. 3B depicts an example of a graphical representation 350 of another step of ranking anomaly detection algorithms for a single dataset. The data set corresponds to the dataset used for the graphical representation 300, as described with reference to FIG. 3A. The graphical representation 350 includes multiple examples of quantifications of statistical distributions 312, 314, 316, 318, 320.
The examples of quantifications of statistical distributions 312, 314, 316, 318, 320 illustrate a plurality of anomaly scores 313 within a quantile interval 311 for previously determined statistical distributions, such as example statistical distributions 302, 304, 306, 308, 310 described with reference to FIG. 3A. The anomaly scores 313 can be determined using any of the measures M1, M2, and M3 described with reference to FIG. 1. The quantifications of statistical distributions 312, 314, 316, 318, 320 can be compared with each other to identify the best anomaly detection algorithm for the analyzed dataset according to one or more classification criteria.
According to one classification criterion, the higher the overall quantification scores, the better the anomaly detection algorithm. In the illustrated example of FIG. 3B, the quantifications of statistical distributions 312 and 314 present similar profiles. Quantitative comparison between the quantifications of statistical distributions 312 and 314 indicates that statistical distribution 314 is better at identifying anomalies than statistical distribution 312. In particular, the average, the maximum, and/or the total value of the quantification of statistical distribution 314 is higher than that of the quantification of statistical distribution 312.
According to another classification criterion, the higher the differentiation between anomalies and normal data points in a dataset, the better the anomaly detection algorithm. Differentiation can be defined by assigning high scores to the first group and comparably low scores to the latter group. In the illustrated example of FIG. 3B, the quantifications of statistical distributions 316 and 318 each include a peak, which indicates that the statistical distributions 316 and 318 are better than statistical distributions 312 and 314. The quantification of statistical distribution 318 includes the highest peak, which indicates that statistical distribution 318 corresponds to the best anomaly detection algorithm from the analyzed statistical distributions for the selected dataset.
FIG. 4 depicts an example process 400 that can be provided by one or more computer-executable programs executed using one or more computing devices, as described with reference to FIGS. 1-3. In some implementations, the example process 400 is executed to rank anomaly detection algorithms in accordance with implementations of the present disclosure. In some implementations, the process 400 can be based on a distribution-based approach, which is based on the assumption that anomaly detection algorithms that produce score distributions that are bi-modal/fat tailed are ‘better.’
A set of unlabeled data is received by one or more processors from one or more sensors in a plurality of sensors of an internet of things (IoT) (402). The sensors can be part of an IoT device or external objects attached to an IoT device. The data can be multi-variate (e.g., multiple different sensors can be used by an anomaly detection algorithm to determine anomaly scores). The data can include a metric, such as a series of values recorded by a sensor. The metrics can include incoming/outgoing data volume, temperature, humidity, wind speed, geographic coordinates, sound, or any other values reflecting a functionality of an IoT device.
The metrics are processed by a variety of anomaly detection algorithms (404). Examples of anomaly detection algorithms include principal component analysis (PCA)-based approaches, linear regression, neural network approaches, and others. The processing can include using the anomaly detection algorithms to calculate scores from the sensor data. Per anomaly detection algorithm, metric processing results in one set of univariate scores. The scores can be normalized using normalization constants so that the normalized scores of all anomaly detection algorithms fall within a preselected interval (e.g., the interval [0, 1]). Normalization constants may be calibrated for each sensor in the modeling stage, or for a combination of sensors.
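The normalization step above can be sketched as a simple min-max scaling; the function name and the choice of minimum/maximum as the normalization constants are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np

def normalize_scores(scores, lo=None, hi=None):
    """Scale raw anomaly scores into the preselected interval [0, 1].

    `lo` and `hi` play the role of the normalization constants; when
    omitted they are calibrated from the scores themselves, as might
    happen per sensor in a modeling stage.
    """
    scores = np.asarray(scores, dtype=float)
    lo = scores.min() if lo is None else lo
    hi = scores.max() if hi is None else hi
    return (scores - lo) / (hi - lo)
```

For example, `normalize_scores([0, 5, 10])` maps the raw scores to `[0.0, 0.5, 1.0]`, so score sets produced by different algorithms become comparable.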
A plurality of data distributions corresponding to the set of unlabeled data is generated by using a plurality of anomaly detection algorithms (406). For example, the normalized data can be used to generate the data distributions. Each data distribution reflects how the corresponding anomaly detection algorithm scores dissimilarities in the recorded sensor values to identify data anomalies. The data distributions can be displayed as histograms that include quantile plots. The quantiles can be selected in steps of a predefined size (e.g., 0.001) that is constant across the anomaly detection algorithms.
A first quantile and a second quantile of each of the plurality of data distributions are selected to determine a distance between them (408). In some implementations, the one or more processors can receive an input from a user indicating the first quantile and the second quantile for one or more sensors. For example, users can modify the distance based on knowledge about the expected anomaly rate, which can improve the performance of ranking the anomaly detection algorithms. In some implementations, the one or more processors can retrieve the selection of the first quantile and the second quantile from a database. The first quantile and the second quantile can be selected based on one or more conditions. One condition can include a requirement for the first quantile and the second quantile to be above a preset threshold (e.g., 0.95). For example, the first quantile can be approximately 0.97 and the second quantile can be approximately 0.999. The distance between the first quantile and the second quantile can be calculated by defining the second quantile as the sum of the first quantile and a parameter, where the parameter indicates the width of the respective data distribution.
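As a hedged sketch of step 408, the quantile distance for a single score distribution can be computed as follows; the defaults 0.97 and 0.029 mirror the example quantiles above, while the function name is illustrative:

```python
import numpy as np

def quantile_distance(scores, q1=0.97, width=0.029):
    """Distance between two upper quantiles of a score distribution.

    The second quantile is defined as the sum of the first quantile and
    a parameter (`width`), e.g. 0.97 + 0.029 = 0.999, so that both
    quantiles stay above the example preset threshold of 0.95.
    """
    q2 = q1 + width
    v1, v2 = np.quantile(scores, [q1, q2])
    return float(v2 - v1)
```

A large distance in score space suggests the algorithm's score distribution has a pronounced upper tail, i.e., it spreads anomalies away from normal points.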
The anomaly detection algorithms are ranked to determine how suitable each of the anomaly detection algorithms is for detecting anomalies in the data set (410). The anomaly detection algorithms can be ranked by comparing the distances between the first quantile and the second quantile across the anomaly detection algorithms. For example, the best (first ranked) anomaly detection algorithm corresponds to the largest distance between the first quantile and the second quantile, and the worst (last ranked) anomaly detection algorithm corresponds to the smallest distance between the first quantile and the second quantile.
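The ranking in step 410 can be sketched as follows, assuming each algorithm's normalized scores on the same unlabeled data set are collected in a dictionary; the names and quantile defaults are illustrative:

```python
import numpy as np

def rank_by_quantile_distance(score_sets, q1=0.97, q2=0.999):
    """Rank anomaly detection algorithms best-first by the distance
    between the first and second quantiles of their score distributions.
    """
    distances = {name: float(np.quantile(s, q2) - np.quantile(s, q1))
                 for name, s in score_sets.items()}
    return sorted(distances, key=distances.get, reverse=True)
```

An algorithm whose scores have a fat upper tail (large spread between the 0.97 and 0.999 quantiles) ranks above one whose scores are tightly clustered.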
In some implementations, the process 400 can include determining an anomaly score for the first ranked anomaly detection algorithm of the plurality of anomaly detection algorithms. The anomaly score can include an amount of data identified as being anomalous by the first ranked anomaly detection algorithm. The process 400 can include comparing the anomaly score to an alert threshold. If the maximum anomaly score exceeds the alert threshold, an alert can be generated indicating the anomaly and the associated IoT device. In some implementations, the process 400 can include removing the data anomaly, based on the anomaly identification of the best anomaly detection algorithm, before transmitting the data within the IoT domain for remotely controlling and managing the corresponding devices and/or triggering object-related processes. In some implementations, the process 400 can include updating a setting (e.g., a software component) of a device and/or upgrading (e.g., replacing) an element (e.g., a hardware component) of the device that generated the data anomaly to prevent future anomalies. Process 400 can be repeated for each data type (corresponding to each sensor) at particular time intervals, considering that the anomaly detection algorithm identified as the best match for one data set can differ from the best match for another data set (e.g., a data set measured at a different time or by a different sensor).
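A minimal sketch of the alerting step, assuming the anomaly score is the fraction of points the first ranked algorithm scores above a threshold; both threshold values and the function name are illustrative assumptions:

```python
def alert_if_anomalous(scores, score_threshold=0.95, alert_threshold=0.01):
    """Compare the anomaly score of the first ranked algorithm to an
    alert threshold: compute the fraction of data identified as
    anomalous and signal an alert when it exceeds the threshold.
    """
    anomaly_score = sum(1 for s in scores if s > score_threshold) / len(scores)
    status = "alert" if anomaly_score > alert_threshold else "ok"
    return status, anomaly_score
```

In a full implementation, an "alert" result would carry the identifier of the associated IoT device so the anomalous data can be removed or the device reconfigured.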
In some implementations, the process 400 can be based on a cross-validation based approach. The cross-validation based approach includes determining a k-fold cross validation on the data, a regression analysis, and a classification analysis. The regression analysis includes normalizing the scores using a hard box or quantiles as thresholds and calculating the proportion of variance in the data for each point relative to the total variance. The classification analysis includes using results from the algorithm trained on a fold as reference values for each fold, calculating discrete scores based on a selected threshold, and determining a classification matrix and a derived quantity per matrix.
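The classification analysis of the cross-validation based approach can be sketched as follows; the choice of accuracy as the derived quantity, the threshold value, and all names are assumptions for illustration:

```python
import numpy as np

def fold_classification_agreement(reference_scores, fold_scores, threshold=0.95):
    """Discretize scores at a selected threshold, treat the results from
    the algorithm trained on a fold as reference values, build the 2x2
    classification matrix, and return a derived quantity (accuracy here).
    """
    ref = np.asarray(reference_scores) > threshold
    pred = np.asarray(fold_scores) > threshold
    tp = np.sum(ref & pred)      # both runs flag the point as anomalous
    tn = np.sum(~ref & ~pred)    # both runs flag the point as normal
    return float((tp + tn) / ref.size)
```

Averaging such a quantity over all k folds gives one score per algorithm that can be used for ranking, analogous to the quantile distance in the distribution-based approach.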
Implementations of the present disclosure provide one or more of the following example advantages. Methods for anomaly detection, particularly in the IoT space, often must operate on unlabeled data. The use of unlabeled data makes it very difficult to compare the performance of different algorithms on a given dataset and, consequently, to choose the most suitable algorithm from a set of possible methods. Ranking anomaly detection algorithms for IoT data can provide an "automated mode" to identify the best matching anomaly detection algorithm for a particular data set. Automatic identification of the best anomaly detection algorithm can eliminate manual data analysis for anomaly detection. An IoT analytics platform based on open source software may be designed to greatly reduce the complexity of ingesting and processing the massive amounts of data generated in IoT scenarios. Detection and removal of data anomalies can improve IoT processes and the functionality of the one or more IoT devices that depend on the received data.
Referring now to FIG. 5, a schematic diagram of an example computing system 500 is provided. The system 500 can be used for the operations described in association with the implementations described herein. For example, the system 500 may be included in any or all of the server components discussed herein. The system 500 includes a processor 510, a memory 520, a storage device 530, and an input/output device 540. The components 510, 520, 530, 540 are interconnected using a system bus 550. The processor 510 is capable of processing instructions for execution within the system 500. In one implementation, the processor 510 is a single-threaded processor. In another implementation, the processor 510 is a multi-threaded processor. The processor 510 is capable of processing instructions stored in the memory 520 or on the storage device 530 to display graphical information for a user interface on the input/output device 540.
The memory 520 stores information within the system 500. In one implementation, the memory 520 is a computer-readable medium. In one implementation, the memory 520 is a volatile memory unit. In another implementation, the memory 520 is a non-volatile memory unit. The storage device 530 is capable of providing mass storage for the system 500. In one implementation, the storage device 530 is a computer-readable medium. In various different implementations, the storage device 530 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device. The input/output device 540 provides input/output operations for the system 500. In one implementation, the input/output device 540 includes a keyboard and/or pointing device. In another implementation, the input/output device 540 includes a display unit for displaying graphical user interfaces.
The features described can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The apparatus can be implemented in a computer program product tangibly embodied in an information carrier (e.g., in a machine-readable storage device, for execution by a programmable processor), and method steps can be performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output. The described features can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Elements of a computer can include a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer can also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
To provide for interaction with a user, the features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.
The features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system can be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, for example, a LAN, a WAN, and the computers and networks forming the Internet.
The computer system can include clients and servers. A client and server are generally remote from each other and typically interact through a network, such as the described one. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.
A number of implementations of the present disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the present disclosure. Accordingly, other implementations are within the scope of the following claims.

Claims (20)

What is claimed is:
1. A computer-implemented method for ranking anomaly detection algorithms, the method being executed by one or more processors and comprising:
receiving, by the one or more processors, a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, the set of unlabeled data comprising anomalies that are undifferentiated from normal data before being processed by any one of a plurality of anomaly detection algorithms;
generating, by the one or more processors, a plurality of data distributions corresponding to the set of unlabeled data by using the plurality of anomaly detection algorithms;
ranking, by the one or more processors, the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions, wherein a higher ranking anomaly detection algorithm provides a first differentiation between the anomalies and the normal data in the set of unlabeled data that is higher than a second differentiation between the anomalies and the normal data in the set of unlabeled data provided by a lower ranking anomaly detection algorithm; and
triggering, by the one or more processors, a modification of a setting of a system from where the anomalies were generated to correct the anomalies detected by the higher ranking anomaly detection algorithm.
2. The method of claim 1, further comprising processing the set of unlabeled data to determine a set of univariate scores for each of the plurality of anomaly detection algorithms.
3. The method of claim 2, further comprising normalizing the set of univariate scores for each of the plurality of anomaly detection algorithms.
4. The method of claim 1, wherein the second quantile is based on the first quantile and a parameter.
5. The method of claim 4, wherein the parameter is based on a width of a respective data distribution.
6. The method of claim 1, wherein the first quantile and the second quantile are above 0.95.
7. The method of claim 1, further comprising comparing an anomaly score corresponding to a first ranked anomaly detection algorithm of the plurality of anomaly detection algorithms to an alert threshold.
8. A non-transitory computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations for ranking anomaly detection algorithms, the operations comprising:
receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, the set of unlabeled data comprising anomalies that are undifferentiated from normal data before being processed by any one of a plurality of anomaly detection algorithms;
generating a plurality of data distributions corresponding to the set of unlabeled data by using the plurality of anomaly detection algorithms;
ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions, wherein a higher ranking anomaly detection algorithm provides a first differentiation between the anomalies and the normal data in the set of unlabeled data that is higher than a second differentiation between the anomalies and the normal data in the set of unlabeled data provided by a lower ranking anomaly detection algorithm; and
triggering a modification of a setting of a system from where the anomalies were generated to correct the anomalies detected by the higher ranking anomaly detection algorithm.
9. The non-transitory computer-readable storage medium of claim 8, further comprising processing the set of unlabeled data to determine a set of univariate scores for each of the plurality of anomaly detection algorithms.
10. The non-transitory computer-readable storage medium of claim 9, further comprising normalizing the set of univariate scores for each of the plurality of anomaly detection algorithms.
11. The non-transitory computer-readable storage medium of claim 8, wherein the second quantile is based on the first quantile and a parameter.
12. The non-transitory computer-readable storage medium of claim 11, wherein the parameter is based on a width of a respective data distribution.
13. The non-transitory computer-readable storage medium of claim 8, wherein the first quantile and the second quantile are above 0.95.
14. The non-transitory computer-readable storage medium of claim 8, further comprising comparing an anomaly score corresponding to a first ranked anomaly detection algorithm of the plurality of anomaly detection algorithms to an alert threshold.
15. A system, comprising:
a computing device; and
a computer-readable storage device coupled to the computing device and having instructions stored thereon which, when executed by the computing device, cause the computing device to perform operations for ranking anomaly detection algorithms, the operations comprising:
receiving a set of unlabeled data from one or more sensors in a plurality of sensors of an internet of things, the set of unlabeled data comprising anomalies that are undifferentiated from normal data before being processed by any one of a plurality of anomaly detection algorithms;
generating a plurality of data distributions corresponding to the set of unlabeled data by using the plurality of anomaly detection algorithms;
ranking the plurality of anomaly detection algorithms relative to the set of unlabeled data based on a distance between a first quantile and a second quantile of each of the plurality of data distributions, wherein a higher ranking anomaly detection algorithm provides a first differentiation between the anomalies and the normal data in the set of unlabeled data that is higher than a second differentiation between the anomalies and the normal data in the set of unlabeled data provided by a lower ranking anomaly detection algorithm; and
triggering a modification of a setting of a system from where the anomalies were generated to correct the anomalies detected by the higher ranking anomaly detection algorithm.
16. The system of claim 15, further comprising processing the set of unlabeled data to determine a set of univariate scores for each of the plurality of anomaly detection algorithms.
17. The system of claim 16, further comprising normalizing the set of univariate scores for each of the plurality of anomaly detection algorithms.
18. The system of claim 15, wherein the second quantile is based on the first quantile and a parameter.
19. The system of claim 18, wherein the parameter is based on a width of a respective data distribution.
20. The system of claim 15, wherein the first quantile and the second quantile are above 0.95.
US15/637,471 2017-06-29 2017-06-29 Comparing unsupervised algorithms for anomaly detection Active 2038-04-26 US10749881B2 (en)

Publications (2)

Publication Number Publication Date
US20190007432A1 US20190007432A1 (en) 2019-01-03
US10749881B2 true US10749881B2 (en) 2020-08-18




