US20140379301A1 - Systems and methods for data-driven anomaly detection - Google Patents

Systems and methods for data-driven anomaly detection

Info

Publication number
US20140379301A1
Authority
US
United States
Prior art keywords
data
region
interest
control limit
outside
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US14/220,050
Other versions
US10552511B2 (en)
Inventor
Lokendra Shastri
K. Antony Arokia Durai Raj
Balasubramanian Kanagasabapathi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Infosys Ltd
Original Assignee
Infosys Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Infosys Ltd
Publication of US20140379301A1
Assigned to Infosys Limited. Assignment of assignors' interest (see document for details). Assignors: Kanagasabapathi, Balasubramanian; Raj, Kolandaiswamy Antony Arokia Durai; Shastri, Lokendra
Application granted
Publication of US10552511B2
Legal status: Active
Adjusted expiration

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01M: TESTING STATIC OR DYNAMIC BALANCE OF MACHINES OR STRUCTURES; TESTING OF STRUCTURES OR APPARATUS, NOT OTHERWISE PROVIDED FOR
    • G01M99/00: Subject matter not provided for in other groups of this subclass
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B23/00: Testing or monitoring of control systems or parts thereof
    • G05B23/02: Electric testing or monitoring
    • G05B23/0205: Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults
    • G05B23/0218: Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults characterised by the fault detection method dealing with either existing or incipient faults
    • G05B23/0224: Process history based detection method, e.g. whereby history implies the availability of large amounts of data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/2433: Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models


Definitions

  • the present disclosure relates generally to anomaly detection, and in particular, to a system and method for detecting at least one abnormal event in a system from data associated with functioning of the system.
  • the present technique can overcome the limitations mentioned above by using statistical models, data mining techniques and heuristic search methods to detect anomalies in a system. This technique is automatic and can be used to monitor the automated system in real time and it reduces the number of false positives.
  • a method for data-driven anomaly detection includes identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm.
  • the data within the region of interest is mapped with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent normal operating condition of the system.
  • at least one abnormal event is detected by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • the method includes identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm.
  • Reference data are classified into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest.
  • a control limit is determined for each of the one or more groups by analyzing the reference data.
  • the data within the region of interest are mapped with the one or more groups. Then, it is determined if the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups. Finally, at least one abnormal event is detected by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • a system for data-driven anomaly detection includes a region of interest identification module, a mapping module, a data analysis module and an abnormal event detection module.
  • the region of interest identification module is configured to identify a region of interest from the data based on dimensionality reduction technique and change point detection algorithm.
  • the mapping module is configured to map the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of the system, wherein the reference data represent normal operating condition of a system.
  • the data analysis module is configured to determine whether the data within the region of interest is outside of a predefined control limit of the corresponding mapped group of the one or more predefined groups and the abnormal event detection module is configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • the system includes a region of interest identification module, a reference data classification module, a control limit determination module, a mapping module, a data analysis module and an abnormal event detection module.
  • the region of interest identification module is configured to identify a region of interest from the data based on dimensionality reduction technique and change point detection algorithm.
  • the reference data classification module is configured to classify reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest.
  • the control limit determination module is configured to determine a control limit for each of the one or more groups by analyzing the reference data.
  • the mapping module is configured to map the data within the region of interest with the one or more groups.
  • the data analysis module is configured to determine whether the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups and the abnormal event detection module is configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • a computer-readable storage medium for data-driven anomaly detection which is not a signal stores computer executable instructions for identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm, mapping the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent normal operating condition of the system, determining whether the data within the region of interest is outside of a predefined control limit of the corresponding mapped group of the one or more predefined groups and detecting at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • the computer-readable storage medium which is not a signal stores computer executable instructions for identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm, classifying reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest, determining a control limit for each of the one or more groups by analyzing the reference data, mapping the data within the region of interest with the one or more groups, determining whether the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups and detecting at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • FIG. 1 is a computer architecture diagram illustrating a computing system capable of implementing the embodiments presented herein.
  • FIG. 2 is a flowchart illustrating a method for data-driven anomaly detection when two data sets, namely reference data and test data, are available, in accordance with an embodiment of the present invention.
  • FIG. 3 is a flowchart illustrating a method for data-driven anomaly detection when only test data is available, in accordance with an embodiment of the present invention.
  • FIG. 4 is a plot of the measure used to identify the region of interest, in accordance with an embodiment of the present invention.
  • FIG. 5 is a block diagram illustrating a system for data-driven anomaly detection when two data sets, namely reference data and test data, are available, in accordance with an embodiment of the present invention.
  • FIG. 6 is a block diagram illustrating a system for data-driven anomaly detection when only test data is available, in accordance with an embodiment of the present invention.
  • Exemplary embodiments of the present invention provide a system and method for data-driven anomaly detection. This involves identifying a region of interest in the data based on a dimensionality reduction technique and a change point detection algorithm. If no reference data (that is, data representing the normal operating condition of a system) is available, then the reference data is obtained from the test data itself. In this case, the region outside the region of interest acts as the reference data. The data within the region of interest are mapped with one or more groups of reference data, wherein the one or more groups represent one or more modes of operation of the system. Each of the one or more groups has a control limit defined. If the data within the region of interest is outside the control limit of the corresponding mapped group, this indicates an anomaly. The abnormal event is then detected by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • FIG. 1 illustrates a generalized example of a suitable computing environment 100 in which all embodiments, techniques, and technologies of this invention may be implemented.
  • the computing environment 100 is not intended to suggest any limitation as to scope of use or functionality of the technology, as the technology may be implemented in diverse general-purpose or special-purpose computing environments.
  • the disclosed technology may be implemented using a computing device (e.g., a server, desktop, laptop, hand-held device, mobile device, PDA, etc.) comprising a processing unit, memory, and storage storing computer-executable instructions implementing the service level management technologies described herein.
  • the disclosed technology may also be implemented with other computer system configurations, including hand held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, a collection of client/server systems, and the like.
  • the computing environment 100 includes at least one central processing unit 102 and memory 104 .
  • the central processing unit 102 executes computer-executable instructions. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power and as such, multiple processors can be running simultaneously.
  • the memory 104 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two.
  • the memory 104 stores software 116 that can implement the technologies described herein.
  • a computing environment may have additional features.
  • the computing environment 100 includes storage 108 , one or more input devices 110 , one or more output devices 112 , and one or more communication connections 114 .
  • An interconnection mechanism such as a bus, a controller, or a network, interconnects the components of the computing environment 100 .
  • operating system software provides an operating environment for other software executing in the computing environment 100 , and coordinates activities of the components of the computing environment 100 .
  • FIG. 2 is a flowchart, illustrating a method for data-driven anomaly detection if two data sets namely reference data and test data are available, in accordance with an embodiment of the present invention.
  • the reference data represent the normal condition of the system and test data represent abnormal condition of the system.
  • the abnormal event satisfies the following condition:
  • the term “test data” is to be construed as the “data” mentioned in the claims.
  • a region of interest from the test data is identified based on dimensionality reduction technique and change point detection algorithm, as in step 202 .
  • the data or test data can be obtained from sensors attached to the system or from manual observation of the functioning of the system. In a preferred embodiment, the data is obtained from the sensors and this will be taken into consideration for describing the present technique but this is only for understanding purpose and does not intend to limit the scope of the disclosure.
  • the data can be stored and extracted from a database or can be obtained from sensors directly in real time. The data may be preprocessed to remove incomplete and irrelevant data.
  • the preprocessing step may include one or more of the following: removing an entire column of sensor readings if that column is entirely zero in both the reference data and the test data; removing the entire row of sensor readings at a time instance if that row is entirely zero; removing an entire column of sensor readings if that column is the same in both the reference data and the test data; and removing columns which are linearly dependent on, or correlated with, other columns.
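The column and row filters described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `drop_uninformative` is hypothetical, "the same in both data sets" is read here as "constant across both data sets" (an assumption), and the removal of linearly dependent or correlated columns is left out.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are time instances, columns are sensors


def drop_uninformative(reference: Matrix, test: Matrix) -> Tuple[Matrix, Matrix, List[int]]:
    """Drop sensor columns that are constant (including all zero) across both
    data sets, then drop time instances whose remaining readings are all zero.

    Assumption: "same in both data sets" is interpreted as "constant".
    Returns the filtered reference data, filtered test data, and kept column indices.
    """
    n_cols = len(test[0])
    keep = []
    for j in range(n_cols):
        values = [row[j] for row in reference] + [row[j] for row in test]
        if len(set(values)) > 1:  # an all-zero column is a special case of constant
            keep.append(j)

    def project(data: Matrix) -> Matrix:
        rows = [[row[j] for j in keep] for row in data]
        return [r for r in rows if any(v != 0 for v in r)]

    return project(reference), project(test), keep
```

Returning the kept column indices lets later steps report which physical sensors survived the filtering.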
  • the dimensionality reduction technique may include but is not limited to the $T^2$ statistic. The $T^2$ statistic for the $i$-th sampling time instance is calculated as follows:
  • $T_i^2 = (S_j(i) - m_j)^{\top} \, S^{-1} \, (S_j(i) - m_j)$
  • where $S_j(i)$ is the sensor reading for the $j$-th sensor at time instance $i$;
  • $m_j$ is the mean of the sensor values over time for the $j$-th sensor;
  • $S^{-1}$ is the inverse of the standard covariance matrix computed using successive differences.
  • $I$ is the number of sampling time periods, and the rest of the notation shall be construed as mentioned above.
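As a rough sketch of the $T^2$ computation, the snippet below evaluates the quadratic form for every time instance. The text does not reproduce the covariance formula, so the code assumes the commonly used successive-difference estimator, $S = \frac{1}{2(I-1)} \sum_{i=1}^{I-1} d_i d_i^{\top}$ with $d_i$ the difference between consecutive readings; that choice and all function names are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]  # rows are time instances, columns are sensors


def successive_difference_cov(x: Matrix) -> Matrix:
    """Assumed estimator: S = sum_i d_i d_i^T / (2 (I - 1)), d_i = x[i+1] - x[i]."""
    n, p = len(x), len(x[0])
    s = [[0.0] * p for _ in range(p)]
    for i in range(n - 1):
        d = [x[i + 1][j] - x[i][j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return [[v / (2.0 * (n - 1)) for v in row] for row in s]


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def t_squared(x: Matrix) -> List[float]:
    """T^2 value for each time instance: (x_i - mean)^T S^-1 (x_i - mean)."""
    n, p = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(p)]
    s = successive_difference_cov(x)
    out = []
    for row in x:
        d = [row[j] - mean[j] for j in range(p)]
        y = solve(s, d)  # y = S^-1 d, computed without forming the inverse
        out.append(sum(dj * yj for dj, yj in zip(d, y)))
    return out
```

Solving the linear system instead of inverting $S$ explicitly is a numerical convenience; the resulting values are the same quadratic form.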
  • the $T^2$ chart is plotted with the $T^2$ statistic on the y-axis against sampling time on the x-axis, and the region of interest is identified using Lavielle's change-point detection algorithm.
  • FIG. 4 is a plot of measure to identify the region of interest, in accordance with an embodiment of the present invention and 402 in the figure shows the region of interest.
  • the region of interest can be calculated from multi-modal pattern by using a statistic which is based on cumulative sums of differences from the mean. Steady increase in the obtained statistic indicates that the T-square statistic values are above the overall mean. Steady decrease in the obtained statistic indicates that the T-square statistic values are below the mean.
  • the change in the pattern will be indicated by abrupt changes in the slope.
  • the slope is computed for all pairs of peak and trough.
  • the pair whose slope is farthest from the mean of all slopes will contain the region of interest. Hence, the region of interest is determined by finding the pair of peak and trough whose slope is farthest from the mean of all slopes.
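The cumulative-sum reasoning above can be sketched in a few lines. This is a simplified illustration and not Lavielle's algorithm: it assumes that "pairs of peak and trough" means consecutive turning points of the cumulative-sum series (with the endpoints included), and the function names are hypothetical.

```python
from typing import List, Tuple


def cusum(values: List[float]) -> List[float]:
    """Cumulative sums of differences from the overall mean."""
    mean = sum(values) / len(values)
    out, total = [], 0.0
    for v in values:
        total += v - mean
        out.append(total)
    return out


def turning_points(series: List[float]) -> List[int]:
    """Indices of both endpoints plus every local peak or trough."""
    idx = [0]
    for i in range(1, len(series) - 1):
        left = series[i] - series[i - 1]
        right = series[i + 1] - series[i]
        if left * right < 0:  # slope changes sign at i
            idx.append(i)
    idx.append(len(series) - 1)
    return idx


def region_of_interest(t2: List[float]) -> Tuple[int, int]:
    """Return (start, end) indices of the segment whose slope is farthest
    from the mean slope. Assumes at least two samples."""
    c = cusum(t2)
    pts = turning_points(c)
    segments = list(zip(pts[:-1], pts[1:]))
    slopes = [(c[b] - c[a]) / (b - a) for a, b in segments]
    mean_slope = sum(slopes) / len(slopes)
    best = max(range(len(slopes)), key=lambda k: abs(slopes[k] - mean_slope))
    return segments[best]
```

In this sketch the returned start index is the turning point itself, so the shifted samples begin one position after it.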
  • the data within the region of interest is mapped with one or more predefined groups of reference data, as in step 204 .
  • the reference data can be obtained in the same way as the test data is collected.
  • the reference data can be preprocessed like the test data and may be normalized by using any normalizing measure which may include but is not limited to mean and relative proportion.
  • an appropriate clustering technique is used based on the type of reference data to classify the reference data into different groups, wherein the said different groups represent one or more modes of operation of the system.
  • two types of clustering approaches are used, namely partition-based clustering and hierarchical clustering. The clustering algorithm is selected based on the type of reference data.
  • k-means and expectation-maximization clustering algorithms are used if the data is continuous; if the data is categorical, robust clustering for categorical attributes is applied.
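For the continuous case, a bare-bones k-means (Lloyd's iterations) is enough to show how reference data could be split into operating modes. This is a sketch under simplifying assumptions: the centroids are initialized from the first `k` points and the iteration count is fixed, neither of which comes from the source.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def kmeans(points: Matrix, k: int, iterations: int = 20) -> Tuple[List[int], Matrix]:
    """Plain Lloyd's algorithm. Assumption: naive initialization from the
    first k points; a real system would use a more careful initialization."""
    centroids = [p[:] for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Each resulting cluster stands for one mode of operation, and its centroid serves as the mean vector of that group.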
  • a control limit is determined for each group of the reference data based on the type of the reference data.
  • the central tendency and dispersion of each of the groups of reference data is measured. Measurement of central tendency and dispersion may include but are not limited to mean and standard deviation of each group of reference data.
  • the overall mean vector is the mean sensor values over time for each sensor. The mean value can be calculated as follows:
  • the standard deviation can be calculated as follows:
  • the upper control limit and lower control limit of each group of the reference data can be obtained from the said mean and standard deviation.
  • the upper control limit can be calculated as:
  • the lower control limit can be determined as:
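The text states only that the limits are obtained from the mean and standard deviation of each group. One common construction, assumed here purely for illustration, is mean ± k standard deviations per sensor; the multiplier `k = 3` and the use of the sample standard deviation are assumptions, not values taken from the source.

```python
from math import sqrt
from typing import List, Tuple

Matrix = List[List[float]]


def control_limits(group: Matrix, k: float = 3.0) -> List[Tuple[float, float]]:
    """Per-sensor (lower, upper) limits for one group of reference data.

    Assumptions: limits are mean -/+ k * sample standard deviation, k = 3.
    """
    n, p = len(group), len(group[0])
    limits = []
    for j in range(p):
        col = [row[j] for row in group]
        mean = sum(col) / n
        std = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1)) if n > 1 else 0.0
        limits.append((mean - k * std, mean + k * std))
    return limits
```

The function is called once per group, so each mode of operation gets its own set of limits.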
  • the mapping in step 204 is done based on the closeness between the mean vector of each group of the reference data and the actual sensor reading of the test data.
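A minimal sketch of this mapping step follows. The source speaks only of "closeness", so the squared Euclidean distance used here is an assumption, as is the function name.

```python
from typing import List


def nearest_group(reading: List[float], group_means: List[List[float]]) -> int:
    """Index of the group whose mean vector is closest to the reading.

    Assumption: closeness is measured by squared Euclidean distance.
    """
    def dist2(mean: List[float]) -> float:
        return sum((r - m) ** 2 for r, m in zip(reading, mean))

    return min(range(len(group_means)), key=lambda g: dist2(group_means[g]))
```

For example, with group means `[[0, 0], [10, 10]]`, a reading of `[9, 8]` maps to group 1.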
  • in step 206 of FIG. 2, it is determined whether the data or sensor readings within the region of interest of the test data fall outside of the said control limit.
  • the sensor reading is flagged with −1, 0, or 1 as follows: if the sensor reading lies below the lower control limit, the corresponding value is flagged with −1; if the sensor reading lies within the control limits, the corresponding value is flagged with 0; and if the sensor reading lies above the upper control limit, the corresponding value is flagged with 1.
  • the resulting value is called the flagged value.
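The three-way flagging rule translates directly into code. The sketch below assumes the limits arrive as one (lower, upper) pair per sensor for the group the reading was mapped to; the function names are illustrative.

```python
from typing import List, Tuple


def flag_reading(value: float, lower: float, upper: float) -> int:
    """-1 below the lower limit, 1 above the upper limit, 0 otherwise."""
    if value < lower:
        return -1
    if value > upper:
        return 1
    return 0


def flag_row(reading: List[float], limits: List[Tuple[float, float]]) -> List[int]:
    """Flag every sensor of one time instance against its own limits."""
    return [flag_reading(v, lo, hi) for v, (lo, hi) in zip(reading, limits)]
```

Applying `flag_row` to every time instance yields the matrix of flagged values that the heuristic step scans.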
  • the flagging is also performed for the reference data set. Thereafter, the at least one abnormal event is detected by applying a heuristic algorithm, as in step 208.
  • the data can be preprocessed before applying heuristic algorithm.
  • the heuristic algorithm uses the flagged values to determine the anomaly or abnormal event.
  • the flagged values are scanned to check whether candidate sensor(s) with different combinations of sensor state (i.e., {−1, 0, +1}) result in an abnormal run in the region of interest of the test data. If it is an abnormal run (i.e., a run longer than N time instances that occurs only once), then the candidate sensor(s) along with the respective sensor state can be referred to as a pattern.
  • the reference data is scanned to check if the said pattern is also present in the reference data. If the pattern exists, then it is a normal event and can be discarded; otherwise the pattern is stored as a candidate pattern, along with the start and end times of the abnormal event.
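The scan for abnormal runs can be sketched as below. To keep the example short it considers only single-sensor patterns with state −1 or +1, whereas the text allows combinations of sensors and states; that restriction, the dictionary layout, and the function names are assumptions.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Run = Tuple[int, int, int]  # (state, start index, end index), end inclusive


def runs(flags: List[int]) -> List[Run]:
    """Split a flag sequence into maximal runs of equal consecutive values."""
    out, pos = [], 0
    for state, grp in groupby(flags):
        length = len(list(grp))
        out.append((state, pos, pos + length - 1))
        pos += length
    return out


def long_runs(flags: List[int], state: int, min_len: int) -> List[Run]:
    """Runs of `state` that last longer than `min_len` time instances."""
    return [r for r in runs(flags) if r[0] == state and r[2] - r[1] + 1 > min_len]


def abnormal_runs(test_flags: Dict[str, List[int]],
                  ref_flags: Dict[str, List[int]],
                  min_len: int) -> List[Tuple[str, int, int, int]]:
    """Return (sensor, state, start, end) for each candidate pattern."""
    found = []
    for sensor, flags in test_flags.items():
        for state in (-1, 1):
            candidates = long_runs(flags, state, min_len)
            if len(candidates) != 1:
                continue  # no long run, or the run occurs more than once
            if long_runs(ref_flags.get(sensor, []), state, min_len):
                continue  # same pattern seen in reference data: normal event
            _, start, end = candidates[0]
            found.append((sensor, state, start, end))
    return found
```

The start and end indices returned with each pattern correspond to the start and end times of the abnormal event.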
  • the sensors associated with the abnormal event are also identified by using the flagged values.
  • the flagged values in the reference data are also analyzed to determine the longest streak of out-of-control-limit events or abnormal events. The total number of time instances in the longest streak is considered as the earliest detection time. If the candidate pattern has more than one abnormal sensor, then the latest start time among the abnormal run sensors is used as the earliest detection time. If the pattern appears in the reference data set, then the run lengths of the several runs (which do not satisfy the constraints of an abnormal event) for that candidate pattern are identified.
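A small sketch of the longest-streak computation on one sensor's reference flags is shown below. It assumes that a streak is any unbroken sequence of non-zero flags, regardless of sign; the source does not say whether −1 and +1 flags may be mixed within one streak.

```python
from typing import List


def longest_out_of_control_streak(flags: List[int]) -> int:
    """Length of the longest run of consecutive non-zero flags.

    Assumption: -1 and +1 flags both count and may be mixed in one streak.
    """
    best = current = 0
    for f in flags:
        current = current + 1 if f != 0 else 0
        best = max(best, current)
    return best
```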
  • the earliest detection time is offset by a time period defined as a function of run length denoted by ⁇ (r) or ⁇ (m) as follows:
  • r is the run length for the candidate pattern in the reference data
  • is the mean value of run length for the candidate pattern in the reference data
  • R is the total number of runs for the candidate pattern in the reference data; ⁇ is a constant; ⁇ is the standard deviation among run length of candidate pattern in the reference data:
  • $r_g$ is the run length of the $g$-th run for the candidate pattern in the reference data.
  • the offset time can be computed as a function of maximum run length if the cost of false alarming is not defined.
  • the function ⁇ (m) is defined as given below:
  • m is the maximum run length for the candidate pattern in the reference data
  • is the factor of safety
  • $r_g$ is the run length of the $g$-th run for the candidate pattern in the reference data.
  • an alert about the anomaly is generated for the users as early as possible, and the results are consolidated to prepare both a brief and a detailed summary of the anomaly in the system.
  • FIG. 3 is a flowchart, illustrating a method for data-driven anomaly detection if only test data is available, in accordance with an embodiment of the present invention.
  • a region of interest from the test data is identified based on dimensionality reduction technique and change point detection algorithm, as in step 302 .
  • the term “test data” is to be construed as “data” mentioned in the claims.
  • the data or test data can be obtained from sensors attached to the system or from manual observation of the functioning of the system. In a preferred embodiment, the data is obtained from the sensors and this will be taken into consideration for describing the present technique but this is only for understanding purpose and does not intend to limit the scope of the disclosure.
  • the data can be stored and extracted from a database or can be obtained from sensors directly in real time.
  • the data may be preprocessed to remove incomplete and irrelevant data.
  • the detail about the preprocessing of data is described hereinabove.
  • the dimensionality reduction technique may include but is not limited to the $T^2$ statistic.
  • the details of computing the $T^2$ statistic and identifying the region of interest using the change point algorithm are described hereinabove.
  • the reference data can be obtained by removing the data points corresponding to the region of interest (RoI) from the test data.
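When only test data is available, the split into region of interest and reference data is a simple slicing operation. The sketch assumes the region of interest is one contiguous block given by inclusive start and end indices; the function name is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_by_roi(data: Matrix, start: int, end: int) -> Tuple[Matrix, Matrix]:
    """Return (roi, reference), where reference is every row outside
    the inclusive index range [start, end]."""
    roi = data[start:end + 1]
    reference = data[:start] + data[end + 1:]
    return roi, reference
```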
  • the reference data is classified based on the appropriate clustering technique into different groups, wherein the different groups represent different modes of operation of the system, as in step 304 .
  • the details about the clustering technique are provided herein above.
  • a control limit for each of the said groups is determined, as in step 306 .
  • the details about the control limit determination are described herein above.
  • the data within the region of interest are mapped with the said groups of the reference data based on the closeness between the mean vector of each group of the reference data and the actual sensor reading of the test data, as in step 308 .
  • the sensor reading is flagged with −1, 0, or 1 as follows: if the sensor reading lies below the lower control limit, the corresponding value is flagged with −1; if the sensor reading lies within the control limits, the corresponding value is flagged with 0; and if the sensor reading lies above the upper control limit, the corresponding value is flagged with 1.
  • the resulting value is called the flagged value.
  • the flagging is also performed for the reference data set. Thereafter, the at least one abnormal event is detected by applying a heuristic algorithm, as in step 312.
  • the sensors associated with the abnormal event are also identified by using the flagged values.
  • the flagged values in the reference data are also analyzed to determine the earliest detection time of the abnormal event which is described in great detail herein above.
  • An alert about the anomaly is also generated for the users as early as possible, and the results are consolidated to prepare both a brief and a detailed summary of the anomaly in the system.
  • FIG. 5 is a block diagram illustrating a system for data-driven anomaly detection if two data sets namely reference data and test data are available, in accordance with an embodiment of the present invention.
  • the system includes region of interest identification module 502 , mapping module 504 , data analysis module 506 , abnormal event detection module 508 , earliest time detection module 510 , alert generation module 512 and Rule Engine Database 514 .
  • the region of interest identification module 502 is configured to identify a region of interest from the data based on dimensionality reduction technique and change point detection algorithm.
  • the dimensionality reduction technique may include but is not limited to the $T^2$ statistic. The details of computing the $T^2$ statistic and identifying the region of interest using the change point algorithm are described hereinabove.
  • the term “test data” is to be construed as the “data” mentioned in the claims.
  • the data or test data can be obtained from sensors attached to the system or from manual observation of the functioning of the system.
  • in a preferred embodiment, the data is obtained from the sensors, and this is assumed in describing the present technique; this is only for ease of understanding and is not intended to limit the scope of the disclosure.
  • the data can be stored and extracted from a database or can be obtained from sensors directly in real time.
  • the data may be preprocessed to remove incomplete and irrelevant data. The detail about the preprocessing of data is described hereinabove.
  • the mapping module 504 is configured to map the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent normal operating condition of the system.
  • An appropriate clustering technique is used based on the type of reference data to classify the reference data into different groups which is described in great detail herein above.
  • a control limit is determined for each group of the reference data based on the type of the reference data.
  • the data analysis module 506 is configured to determine whether the data within the region of interest is outside of the predefined control limit of the corresponding mapped group. The sensor reading is flagged with −1, 0, or 1 as described in detail hereinabove.
  • the abnormal event detection module 508 is configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit. The details of the abnormal event detection method are described hereinabove. In accordance with an embodiment of the present disclosure, the sensors associated with the abnormal event are also identified by using the flagged values.
  • the earliest time detection module 510 is configured to determine an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit. The computing method of earliest detection time is described herein above in detail.
  • the alert generation module 512 is configured to generate an alert on the occurrence of the at least one abnormal event.
  • the Rule Engine Database 514 is configured to store predefined rules and rules given as an input by the user. The rules are segregated based on the different patterns exhibited by the sensors using the reference data. These rules are essential to identify the anomalies as well as to determine the earliest detection time.
  • FIG. 6 is a block diagram illustrating a system for data-driven anomaly detection if only test data is available, in accordance with an embodiment of the present invention.
  • the system includes region of interest identification module 602 , reference data classification module 604 , control limit determination module 606 , mapping module 608 , data analysis module 610 , abnormal event detection module 612 , earliest time detection module 614 , alert generation module 616 and Rule Engine Database 618 .
  • the region of interest identification module 602 is configured to identify a region of interest from the data based on dimensionality reduction technique and change point detection algorithm.
  • the dimensionality reduction technique may include but is not limited to the $T^2$ statistic. The details of computing the $T^2$ statistic and identifying the region of interest using the change point algorithm are described hereinabove.
  • the term “test data” is to be construed as the “data” mentioned in the claims.
  • the data or test data can be obtained from sensors attached to the system or from manual observation of the functioning of the system.
  • in a preferred embodiment, the data is obtained from the sensors, and this is assumed in describing the present technique; this is only for ease of understanding and is not intended to limit the scope of the disclosure.
  • the data can be stored and extracted from a database or can be obtained from sensors directly in real time.
  • the data may be preprocessed to remove incomplete and irrelevant data. The detail about the preprocessing of data is described hereinabove.
  • the reference data classification module 604 is configured to classify reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest. The reference data is classified based on the appropriate clustering technique into different groups and the details about the clustering technique are provided herein above.
  • the control limit determination module 606 is configured to determine a control limit for each of the one or more groups by analyzing the reference data. The details about the control limit determination are described herein above.
  • the mapping module 608 is configured to map the data within the region of interest with the one or more groups based on the closeness between the mean vector of each group of the reference data and the actual sensor reading of the test data.
  • the data analysis module 610 is configured to determine whether the data within the region of interest is outside the control limit of the mapped group.
  • The sensor reading is flagged with −1, 0, or 1 as described above.
  • The abnormal event detection module 612 is configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit. The details of the abnormal event detection method are described hereinabove. In accordance with an embodiment of the present disclosure, the sensors associated with the abnormal event are also identified by using the flagged values.
  • the earliest time detection module 614 is configured to determine an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit. The computing method of earliest detection time is described herein above in detail.
  • the alert generation module 616 is configured to generate an alert on the occurrence of the at least one abnormal event.
  • the Rule Engine Database 618 is configured to store predefined rules and rules given as an input by the user. The rules are segregated based on the different patterns exhibited by the sensors using the reference data. These rules are essential to identify the anomalies as well as to determine the earliest detection time.
  • One or more computer-readable media can comprise computer-executable instructions causing a computing system (e.g., comprising one or more processors coupled to memory) (e.g., computing environment 100 or the like) to perform any of the methods described herein.
  • Examples of such computer-readable or processor-readable media include magnetic media, optical media, and memory (e.g., volatile or non-volatile memory, including solid state drives or the like).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Testing And Monitoring For Control Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)

Abstract

The technique relates to a system and method for data-driven anomaly detection. The technique involves identifying a region of interest from the data based on a dimensionality reduction technique and a change point detection algorithm. Reference data can be obtained separately or from the test data itself, wherein the reference data represent the normal operating condition of a system. The reference data are classified into different groups representing different modes of operation of the system. A control limit is determined for each of the different groups. The data within the region of interest are mapped to the different groups of the reference data, and it is determined whether the mapped data fall outside the control limit of the mapped group. Finally, at least one abnormal event is detected by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.

Description

    FIELD
  • The present disclosure relates generally to anomaly detection, and in particular, to a system and method for detecting at least one abnormal event in a system from data associated with functioning of the system.
  • BACKGROUND
  • Most industrial systems are automated in order to operate efficiently. Monitoring the state of the system in real time is essential for the smooth functioning of automated systems. This monitoring can be done manually, or multiple sensors may be employed to record readings about the state of the system at various instances of time, which results in a very large amount of data. These sensor readings or manually monitored readings are analyzed to detect anomalies in the system. At present, the data analysis is carried out either manually or semi-automatically. An anomaly is detected by using a probability model, three sigma models, regression models, time series models, a covariance matrix or the QR decomposition method. However, these models have limitations. The existing methods use only statistical methods to detect anomalies, which may report a large number of false positives. The existing methods are also developed mostly for real-valued sensor readings and not for other data types (e.g. categorical).
  • SUMMARY
  • The present technique can overcome the limitations mentioned above by using statistical models, data mining techniques and heuristic search methods to detect anomalies in a system. This technique is automatic and can be used to monitor the automated system in real time and it reduces the number of false positives.
  • According to an embodiment, a method for data-driven anomaly detection is disclosed. The method includes identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm. The data within the region of interest is mapped with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent normal operating condition of the system. Thereafter, it is determined if the data within the region of interest is outside of a predefined control limit of the corresponding mapped group of the one or more predefined groups. Finally, at least one abnormal event is detected by applying a heuristic algorithm on the data within the region of interest which are outside the control limit. In an alternate embodiment, the method includes identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm. Reference data are classified into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest. A control limit is determined for each of the one or more groups by analyzing the reference data. The data within the region of interest are mapped with the one or more groups. Then, it is determined if the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups. Finally, at least one abnormal event is detected by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • In an additional embodiment, a system for data-driven anomaly detection is disclosed. The system includes a region of interest identification module, a mapping module, a data analysis module and an abnormal event detection module. The region of interest identification module is configured to identify a region of interest from the data based on dimensionality reduction technique and change point detection algorithm. The mapping module is configured to map the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of the system, wherein the reference data represent normal operating condition of a system. The data analysis module is configured to determine whether the data within the region of interest is outside of a predefined control limit of the corresponding mapped group of the one or more predefined groups and the abnormal event detection module is configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit. In an alternate embodiment, the system includes a region of interest identification module, a reference data classification module, a control limit determination module, a mapping module, a data analysis module and an abnormal event detection module. The region of interest identification module is configured to identify a region of interest from the data based on dimensionality reduction technique and change point detection algorithm. The reference data classification module is configured to classify reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest. The control limit determination module is configured to determine a control limit for each of the one or more groups by analyzing the reference data. 
The mapping module is configured to map the data within the region of interest with the one or more groups. The data analysis module is configured to determine whether the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups and the abnormal event detection module is configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • In another embodiment, a computer-readable storage medium for data-driven anomaly detection is disclosed. The computer-readable storage medium which is not a signal stores computer executable instructions for identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm, mapping the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent normal operating condition of the system, determining whether the data within the region of interest is outside of a predefined control limit of the corresponding mapped group of the one or more predefined groups and detecting at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit. In an alternate embodiment, the computer-readable storage medium which is not a signal stores computer executable instructions for identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm, classifying reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest, determining a control limit for each of the one or more groups by analyzing the reference data, mapping the data within the region of interest with the one or more groups, determining whether the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups and detecting at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • DRAWINGS
  • Various embodiments of the invention will, hereinafter, be described in conjunction with the appended drawings provided to illustrate, and not to limit the invention, wherein like designations denote like elements, and in which:
  • FIG. 1 is a computer architecture diagram illustrating a computing system capable of implementing the embodiments presented herein.
  • FIG. 2 is a flowchart, illustrating a method for data-driven anomaly detection if two data sets namely reference data and test data are available, in accordance with an embodiment of the present invention.
  • FIG. 3 is a flowchart, illustrating a method for data-driven anomaly detection if only test data is available, in accordance with an embodiment of the present invention.
  • FIG. 4 is a plot of measure to identify the region of interest, in accordance with an embodiment of the present invention.
  • FIG. 5 is a block diagram illustrating a system for data-driven anomaly detection if two data sets namely reference data and test data are available, in accordance with an embodiment of the present invention.
  • FIG. 6 is a block diagram illustrating a system for data-driven anomaly detection if only test data is available, in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION
  • The foregoing has broadly outlined the features and technical advantages of the present disclosure in order that the detailed description of the disclosure that follows may be better understood. Additional features and advantages of the disclosure will be described hereinafter which form the subject of the claims of the disclosure. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the disclosure as set forth in the appended claims. The novel features which are believed to be characteristic of the disclosure, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present disclosure.
  • Exemplary embodiments of the present invention provide a system and method for data-driven anomaly detection. This involves identifying a region of interest in the data based on dimensionality reduction technique and change point detection algorithm. If no reference data is available, wherein the reference data represent normal operating condition of a system, then the reference data is obtained from the test data itself. In this case, the region outside the region of interest acts as the reference data. The data within the region of interest are mapped with one or more groups of reference data, wherein the one or more groups represent one or more modes of operation of the system. Each of the one or more groups has a control limit defined. If it is determined that the data within the region of interest is outside of the control limit of the corresponding mapped group then it indicates the anomaly. The abnormal event is then detected by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
  • FIG. 1 illustrates a generalized example of a suitable computing environment 100 in which all embodiments, techniques, and technologies of this invention may be implemented. The computing environment 100 is not intended to suggest any limitation as to scope of use or functionality of the technology, as the technology may be implemented in diverse general-purpose or special-purpose computing environments.
  • For example, the disclosed technology may be implemented using a computing device (e.g., a server, desktop, laptop, hand-held device, mobile device, PDA, etc.) comprising a processing unit, memory, and storage storing computer-executable instructions implementing the service level management technologies described herein. The disclosed technology may also be implemented with other computer system configurations, including hand held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, a collection of client/server systems, and the like.
  • With reference to FIG. 1, the computing environment 100 includes at least one central processing unit 102 and memory 104. The central processing unit 102 executes computer-executable instructions. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power and as such, multiple processors can be running simultaneously. The memory 104 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two. The memory 104 stores software 116 that can implement the technologies described herein. A computing environment may have additional features. For example, the computing environment 100 includes storage 108, one or more input devices 110, one or more output devices 112, and one or more communication connections 114. An interconnection mechanism (not shown) such as a bus, a controller, or a network, interconnects the components of the computing environment 100. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 100, and coordinates activities of the components of the computing environment 100.
  • FIG. 2 is a flowchart, illustrating a method for data-driven anomaly detection if two data sets namely reference data and test data are available, in accordance with an embodiment of the present invention. In various embodiments of the present disclosure the reference data represent the normal condition of the system and test data represent abnormal condition of the system. In this disclosure the abnormal event satisfies the following condition:
      • a) the length of the abnormal run is >N data points (continuous);
      • b) an abnormal run occurs once and only once in the test data file;
      • c) the abnormal run does not occur in the reference data file.
  • In all the embodiments of the present disclosure, the term “test data” is to be construed as the “data” mentioned in the claims. A region of interest is identified from the test data based on a dimensionality reduction technique and a change point detection algorithm, as in step 202. The data or test data can be obtained from sensors attached to the system or from manual observation of the functioning of the system. In a preferred embodiment, the data is obtained from the sensors, and this case is used for describing the present technique; this is only for ease of understanding and is not intended to limit the scope of the disclosure. The data can be stored in and extracted from a database or can be obtained from the sensors directly in real time. The data may be preprocessed to remove incomplete and irrelevant data. The preprocessing step may include one or more of the following: (i) removing an entire column of sensor readings if the readings of that column are entirely zero in both the reference data and the test data; (ii) removing the entire row of sensor readings at a time instance if the readings of that row are entirely zero; (iii) removing an entire column of sensor readings if the readings of that column are the same in both the reference data and the test data; and (iv) removing columns that are linearly dependent on or correlated with other columns. The dimensionality reduction technique may include, but is not limited to, the T² statistic. The T² statistic for the ith sampling time instance is calculated as follows:

  • $T_i^2 = \left(S_j(i) - m_j\right)^{\top} S^{-1} \left(S_j(i) - m_j\right)$
  • Where,
  • $S_j(i)$ is the sensor reading for the $j$th sensor at time instance $i$;
    $m_j$ is the mean of sensor values over time for the $j$th sensor;
    $S^{-1}$ is the inverse of the standard covariance matrix computed using successive differences.
  • Covariance between sensor j1 and j2 is given by,
  • $S_{j_1 j_2} = \frac{1}{2(I-1)} \sum_{i=1}^{I} \left(S_{j_1}(i) - S_{j_1}(i-1)\right)\left(S_{j_2}(i) - S_{j_2}(i-1)\right)$
  • Where,
  • I is the number of sampling time period and rest of the notations shall be construed as mentioned above.
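The T² computation above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the successive differences start at the second time instance (the first one has no predecessor), and the product with $S^{-1}$ is obtained by solving a linear system rather than forming the inverse explicitly.

```python
from typing import List

Matrix = List[List[float]]


def successive_difference_covariance(readings: Matrix) -> Matrix:
    """Covariance estimate built from differences between consecutive rows.

    readings[i][j] is the reading of sensor j at time instance i.
    """
    n_times, n_sensors = len(readings), len(readings[0])
    cov = [[0.0] * n_sensors for _ in range(n_sensors)]
    for i in range(1, n_times):  # assumption: start at the second instance
        diff = [readings[i][j] - readings[i - 1][j] for j in range(n_sensors)]
        for a in range(n_sensors):
            for b in range(n_sensors):
                cov[a][b] += diff[a] * diff[b]
    scale = 2.0 * (n_times - 1)
    return [[value / scale for value in row] for row in cov]


def solve(matrix: Matrix, rhs: List[float]) -> List[float]:
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def t2_statistics(readings: Matrix) -> List[float]:
    """Return one T-squared value per time instance."""
    n_times, n_sensors = len(readings), len(readings[0])
    means = [sum(row[j] for row in readings) / n_times for j in range(n_sensors)]
    cov = successive_difference_covariance(readings)
    result = []
    for row in readings:
        dev = [row[j] - means[j] for j in range(n_sensors)]
        weighted = solve(cov, dev)  # equals S^{-1} (S(i) - m)
        result.append(sum(d * w for d, w in zip(dev, weighted)))
    return result
```

The preprocessing step that removes linearly dependent columns matters here: with dependent sensors the covariance matrix is singular and the sketch raises an error instead of returning a value.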
  • After computing the T² statistic for each time instance i, the T² chart (y-axis) is plotted against sampling time (x-axis), and the region of interest is identified using Lavielle's change-point detection algorithm. FIG. 4 is a plot of the measure used to identify the region of interest, in accordance with an embodiment of the present invention; reference numeral 402 in the figure shows the region of interest. In an alternate embodiment, the region of interest can be calculated from a multi-modal pattern by using a statistic based on cumulative sums of differences from the mean. A steady increase in this statistic indicates that the T² statistic values are above the overall mean, and a steady decrease indicates that they are below the mean. A change in the pattern is indicated by an abrupt change in the slope. The slope is then computed for all pairs of peak and trough. The pair whose slope is farthest from the mean of all slopes contains the region of interest. Hence, the region of interest is determined by finding the pair of peak and trough whose slope is farthest from the mean of all slopes.
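The cumulative-sum alternative can be illustrated with a short sketch. It is not Lavielle's algorithm, and it makes simplifying assumptions that the text does not state: peaks and troughs are taken as the points where the cumulative sum changes direction, the two end points are included, and only consecutive turning points are paired. All function names are hypothetical.

```python
from typing import List, Tuple


def cusum(values: List[float]) -> List[float]:
    """Cumulative sums of differences from the overall mean."""
    mean = sum(values) / len(values)
    total, out = 0.0, []
    for v in values:
        total += v - mean
        out.append(total)
    return out


def turning_points(series: List[float]) -> List[int]:
    """Indices where the series changes direction, plus both end points."""
    points = [0]
    for k in range(1, len(series) - 1):
        before = series[k] - series[k - 1]
        after = series[k + 1] - series[k]
        if before * after < 0:
            points.append(k)
    points.append(len(series) - 1)
    return points


def region_of_interest(t2_values: List[float]) -> Tuple[int, int]:
    """Return the (start, end) indices of the segment with the most unusual slope."""
    series = cusum(t2_values)
    points = turning_points(series)
    segments = list(zip(points, points[1:]))
    slopes = [(series[e] - series[s]) / (e - s) for s, e in segments]
    mean_slope = sum(slopes) / len(slopes)
    best = max(range(len(slopes)), key=lambda k: abs(slopes[k] - mean_slope))
    return segments[best]
```

For a T² series that is low, then high for a few instances, then low again, the cumulative sum falls, rises steeply and falls again, and the rising segment is the one returned.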
  • Referring back to FIG. 2, the data within the region of interest is mapped to one or more predefined groups of reference data, as in step 204. The reference data can be obtained in the same way as the test data is collected. The reference data can be preprocessed like the test data and may be normalized by using any normalizing measure, which may include but is not limited to the mean and relative proportion. Then, an appropriate clustering technique is selected based on the type of reference data to classify the reference data into different groups, wherein the different groups represent one or more modes of operation of the system. In a preferred embodiment, two types of clustering approaches are used, namely partition-based clustering and hierarchical clustering. For example, the k-means and expectation maximization clustering algorithms are used if the data is continuous, and robust clustering for categorical attributes is applied if the data is categorical. A control limit is determined for each group of the reference data based on the type of the reference data. In a preferred embodiment, the central tendency and dispersion of each of the groups of reference data are measured. Measures of central tendency and dispersion may include, but are not limited to, the mean and standard deviation of each group of reference data. The overall mean vector consists of the mean sensor value over time for each sensor. The mean value can be calculated as follows:
  • $m_j = \frac{\sum_{i=1}^{I} S_j(i)}{I} \quad \forall j$
  • Where,
  • $m_j$ is the mean of sensor values over time for the $j$th sensor;
    $i$ = index of the sampling time;
    $I$ = number of sampling time periods;
    $S_j(i)$ = sensor reading for the $j$th sensor at time instance $i$;
    $j$ = index of the sensor number.
  • The standard deviation can be calculated as follows:
  • $d_j = \sqrt{\frac{\sum_{i=1}^{I} \left(m_j - S_j(i)\right)^2}{I - 1}} \quad \forall j$
  • Where,
  • $d_j$ is the standard deviation of sensor values over time for the $j$th sensor;
    $m_j$ is the mean of sensor values over time for the $j$th sensor;
    $i$ = index of the sampling time;
    $I$ = number of sampling time periods;
    $S_j(i)$ = sensor reading for the $j$th sensor at time instance $i$.
  • The upper control limit and lower control limit of each group of the reference data can be obtained from the said mean and standard deviation. The upper control limit can be calculated as:

  • $u_j = m_j + b \times d_j$
  • Where,
  • $u_j$ is the upper control limit for sensor $j$;
    $m_j$ is the mean of sensor values over time for the $j$th sensor;
    $d_j$ is the standard deviation of sensor values over time for the $j$th sensor;
    $b$ is a constant, e.g. $b$ = 0.5, 1, 1.5, 2, 2.5, 3 and so on.
  • Similarly, the lower control limit can be determined as:

  • $l_j = m_j - b \times d_j$
  • Where,
  • $l_j$ is the lower control limit for sensor $j$;
    $m_j$ is the mean of sensor values over time for the $j$th sensor;
    $d_j$ is the standard deviation of sensor values over time for the $j$th sensor;
    $b$ is a constant, e.g. $b$ = 0.5, 1, 1.5, 2, 2.5, 3 and so on.
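As an illustration of the control limit formulas above, the following sketch computes the lower and upper limits per sensor for one group of reference readings. The function name and the default value of b are assumptions made for the example.

```python
from math import sqrt
from typing import List, Tuple


def control_limits(group: List[List[float]], b: float = 3.0) -> List[Tuple[float, float]]:
    """Return (lower, upper) control limits per sensor for one reference group.

    group[i][j] is the reading of sensor j at time instance i; at least two
    time instances are needed for the sample standard deviation.
    """
    n_times = len(group)
    limits = []
    for j in range(len(group[0])):
        column = [row[j] for row in group]
        mean = sum(column) / n_times
        std = sqrt(sum((mean - x) ** 2 for x in column) / (n_times - 1))
        limits.append((mean - b * std, mean + b * std))
    return limits
```

A larger b widens the band and so flags fewer readings; a sensor that never varies within a group gets identical lower and upper limits.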
  • The mapping in step 204 is done based on the closeness between the mean vector of each group of the reference data and the actual sensor reading of the test data.
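The mapping step can be sketched as a nearest-mean assignment. The text only speaks of closeness, so the Euclidean distance used here is an assumption, as are the function names.

```python
from math import dist
from typing import List


def mean_vector(group: List[List[float]]) -> List[float]:
    """Mean reading per sensor over all time instances of one group."""
    n_times = len(group)
    return [sum(row[j] for row in group) / n_times for j in range(len(group[0]))]


def map_to_group(reading: List[float], means: List[List[float]]) -> int:
    """Index of the group whose mean vector is closest to the reading.

    Closeness is measured with the Euclidean distance (an assumption).
    """
    return min(range(len(means)), key=lambda g: dist(reading, means[g]))
```

Each test reading in the region of interest is then compared against the control limits of the group returned by `map_to_group`.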
  • In step 206 of FIG. 2, it is determined whether the data or sensor readings within the region of interest of the test data fall outside the said control limit. Each sensor reading is flagged with −1, 0, or 1 as follows: if the sensor reading lies below the lower control limit, the corresponding value is flagged with −1; if the sensor reading lies within the control limits, the corresponding value is flagged with 0; and if the sensor reading lies above the upper control limit, the corresponding value is flagged with 1. The resulting value is called the flagged value. The flagging is likewise performed for the reference data set. Thereafter, the at least one abnormal event is detected by applying a heuristic algorithm, as in step 208. The data can be preprocessed before applying the heuristic algorithm. The heuristic algorithm uses the flagged values to determine the anomaly or abnormal event. The flagged values are scanned to check whether candidate sensor(s) with different combinations of sensor state (i.e. {−1, 0, +1}) result in an abnormal run in the region of interest of the test data. If it is an abnormal run (a run longer than N time instances that occurs only once), then the candidate sensor(s) together with the respective sensor states are referred to as a pattern. Then, the reference data is scanned to check whether the said pattern is also present in the reference data. If the pattern exists there, it is a normal event and can be discarded; otherwise the pattern is stored as a candidate pattern together with the start and end times of the abnormal event. The sensors associated with the abnormal event are also identified by using the flagged values. The flagged values in the reference data are also analyzed to determine the longest streak of out-of-control-limit events or abnormal events. The total number of time instances in the longest streak is considered as the earliest detection time. 
If the candidate pattern has more than one abnormal sensor, then the latest start time among the abnormal run sensors is used as the earliest detection time. If the pattern appears in the reference data set then the run length of several runs (which does not satisfy the constraints of abnormal event) for that candidate pattern is identified. The earliest detection time is offset by a time period defined as a function of run length denoted by ƒ(r) or ƒ(m) as follows:

  • $f(r) = \beta + \tau \times \delta$
  • Where,
  • $r$ is the run length for the candidate pattern in the reference data;
    $\beta$ is the mean value of run length for the candidate pattern in the reference data:
  • $\beta = \frac{\sum_{g=1}^{R} r_g}{R}$
  • $R$ is the total number of runs for the candidate pattern in the reference data;
    $\tau$ is a constant;
    $\delta$ is the standard deviation among run lengths of the candidate pattern in the reference data:
  • $\delta = \sqrt{\frac{\sum_{g=1}^{R} \left(\beta - r_g\right)^2}{R - 1}}$
  • $r_g$ is the run length of the $g$th run for the candidate pattern in the reference data.
  • The offset time can be computed as a function of maximum run length if the cost of false alarming is not defined. The function ƒ(m) is defined as given below:

  • $f(m) = \alpha \times \max_g(r_g)$
  • Where,
  • $m$ is the maximum run length for the candidate pattern in the reference data;
    $\alpha$ is the factor of safety;
    $r_g$ is the run length of the $g$th run for the candidate pattern in the reference data.
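The two offset functions can be written out directly from the formulas above. This is a small sketch with hypothetical function names; the values of τ and α are left to the caller because the text does not fix them.

```python
from math import sqrt
from typing import List


def offset_from_spread(run_lengths: List[int], tau: float) -> float:
    """f(r) = beta + tau * delta, from the run lengths seen in the reference data."""
    count = len(run_lengths)
    if count < 2:
        raise ValueError("at least two runs are needed for the standard deviation")
    beta = sum(run_lengths) / count
    delta = sqrt(sum((beta - r) ** 2 for r in run_lengths) / (count - 1))
    return beta + tau * delta


def offset_from_max(run_lengths: List[int], alpha: float) -> float:
    """f(m) = alpha * max(r_g), used when the cost of a false alarm is not defined."""
    return alpha * max(run_lengths)
```

For run lengths of 2, 4 and 6 time instances, the mean is 4 and the sample standard deviation is 2, so τ = 1 gives an offset of 6, while a factor of safety α = 1.5 applied to the maximum run length gives 9.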
  • According to an embodiment of the present disclosure, an alert is generated for the users about the anomaly at the earliest and the results are consolidated to prepare a brief and detailed summary about the anomaly in the system.
  • FIG. 3 is a flowchart illustrating a method for data-driven anomaly detection if only test data is available, in accordance with an embodiment of the present invention. A region of interest is identified from the test data based on a dimensionality reduction technique and a change point detection algorithm, as in step 302. In all the embodiments of the present disclosure, the term “test data” is to be construed as the “data” mentioned in the claims. The data or test data can be obtained from sensors attached to the system or from manual observation of the functioning of the system. In a preferred embodiment, the data is obtained from the sensors, and this case is used for describing the present technique; this is only for ease of understanding and is not intended to limit the scope of the disclosure. The data can be stored in and extracted from a database or can be obtained from the sensors directly in real time. The data may be preprocessed to remove incomplete and irrelevant data. The details of the preprocessing are described hereinabove. The dimensionality reduction technique may include, but is not limited to, the T² statistic. The details of computing the T² statistic and of identifying the region of interest using the change point detection algorithm are described hereinabove. The reference data can be obtained by removing the data points corresponding to the region of interest (RoI) from the test data. The reference data is classified into different groups based on an appropriate clustering technique, wherein the different groups represent different modes of operation of the system, as in step 304. The details of the clustering technique are provided hereinabove. A control limit is determined for each of the said groups, as in step 306. The details of the control limit determination are described hereinabove. 
The data within the region of interest are mapped to the said groups of the reference data based on the closeness between the mean vector of each group of the reference data and the actual sensor reading of the test data, as in step 308. In step 310, it is determined whether the data or sensor readings within the region of interest of the test data fall outside the said control limit. Each sensor reading is flagged with −1, 0, or 1 as follows: if the sensor reading lies below the lower control limit, the corresponding value is flagged with −1; if the sensor reading lies within the control limits, the corresponding value is flagged with 0; and if the sensor reading lies above the upper control limit, the corresponding value is flagged with 1. The resulting value is called the flagged value. The flagging is likewise performed for the reference data set. Thereafter, the at least one abnormal event is detected by applying a heuristic algorithm, as in step 312. The detection step is described in detail hereinabove. In accordance with an embodiment of the present disclosure, the sensors associated with the abnormal event are also identified by using the flagged values. The flagged values in the reference data are also analyzed to determine the earliest detection time of the abnormal event, as described in detail hereinabove. An alert about the anomaly is also generated for the users at the earliest, and the results are consolidated to prepare a brief and a detailed summary of the anomaly in the system.
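The flagging rule and the abnormal-run check can be sketched as follows. This is a simplified illustration for a single sensor and a single sensor state, not the full heuristic search over sensor combinations; the function names and the parameter `n` (standing for the threshold N) are assumptions.

```python
from typing import List, Optional, Tuple


def flag(reading: float, lower: float, upper: float) -> int:
    """Return -1 below the lower limit, 1 above the upper limit, else 0."""
    if reading < lower:
        return -1
    if reading > upper:
        return 1
    return 0


def runs(flags: List[int], state: int) -> List[Tuple[int, int]]:
    """Inclusive (start, end) index pairs of consecutive entries equal to state."""
    found, start = [], None
    for k, value in enumerate(flags):
        if value == state and start is None:
            start = k
        elif value != state and start is not None:
            found.append((start, k - 1))
            start = None
    if start is not None:
        found.append((start, len(flags) - 1))
    return found


def abnormal_run(test_flags: List[int], reference_flags: List[int],
                 state: int, n: int) -> Optional[Tuple[int, int]]:
    """Return the single run longer than n in the test flags, if it is abnormal.

    A run counts as abnormal when exactly one run longer than n exists in the
    test flags and no such run exists in the reference flags.
    """
    def long_runs(flags: List[int]) -> List[Tuple[int, int]]:
        return [(s, e) for s, e in runs(flags, state) if e - s + 1 > n]

    candidates = long_runs(test_flags)
    if len(candidates) == 1 and not long_runs(reference_flags):
        return candidates[0]
    return None
```

The returned index pair gives the start and end time instances of the abnormal event for that sensor; shorter runs found in the reference flags are the ones whose lengths feed the offset functions described earlier.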
  • FIG. 5 is a block diagram illustrating a system for data-driven anomaly detection if two data sets, namely reference data and test data, are available, in accordance with an embodiment of the present invention. The system includes a region of interest identification module 502, a mapping module 504, a data analysis module 506, an abnormal event detection module 508, an earliest time detection module 510, an alert generation module 512 and a Rule Engine Database 514. The region of interest identification module 502 is configured to identify a region of interest from the data based on a dimensionality reduction technique and a change point detection algorithm. The dimensionality reduction technique may include, but is not limited to, the T2 statistic. The computation of the T2 statistic and the identification of the region of interest using the change point algorithm are described in detail herein above. In all the embodiments of the present disclosure, the term “test data” is to be construed as the “data” mentioned in the claims. The data or test data can be obtained from sensors attached to the system or from manual observation of the functioning of the system. In a preferred embodiment, the data is obtained from the sensors, and this case is used to describe the present technique; it is for ease of understanding only and is not intended to limit the scope of the disclosure. The data can be stored in and extracted from a database or can be obtained directly from the sensors in real time. The data may be preprocessed to remove incomplete and irrelevant data. The preprocessing of the data is described herein above. The mapping module 504 is configured to map the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent the normal operating condition of the system.
An appropriate clustering technique is selected based on the type of reference data to classify the reference data into different groups, as described in detail herein above. As described above with reference to FIG. 2, a control limit is determined for each group of the reference data based on the type of the reference data. The data analysis module 506 is configured to determine whether the data within the region of interest is outside of the predefined control limit of the corresponding mapped group. Each sensor reading is flagged with −1, 0, or 1 as described in detail herein above. The abnormal event detection module 508 is configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit. The abnormal event detection method is described herein above. In accordance with an embodiment of the present disclosure, the sensors associated with the abnormal event are also identified by using the flagged values. The earliest time detection module 510 is configured to determine an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit. The method of computing the earliest detection time is described in detail herein above. The alert generation module 512 is configured to generate an alert on the occurrence of the at least one abnormal event. The Rule Engine Database 514 is configured to store predefined rules and rules given as input by the user. The rules are segregated based on the different patterns exhibited by the sensors using the reference data. These rules are essential to identify the anomalies as well as to determine the earliest detection time.
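As a non-limiting sketch of how the region of interest identification module 502 might operate, the following Python example reduces multivariate sensor readings to a single T2 score per time step and then locates one change point in that score. The standard Hotelling formula, the least-squares search for a single change point (a simple stand-in, not the change point algorithm of the disclosure), the function names and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def hotelling_t2(data, mean, cov):
    """T2 score of every row of `data` relative to a mean vector and covariance matrix."""
    centered = np.asarray(data, dtype=float) - mean
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a near-singular covariance
    # Row-wise quadratic form: centered_i^T @ inv_cov @ centered_i
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

def single_change_point(series):
    """Index that best splits `series` into two segments with different means."""
    series = np.asarray(series, dtype=float)
    best_index, best_cost = None, np.inf
    for k in range(1, len(series)):
        left, right = series[:k], series[k:]
        # Least-squares cost: spread of each segment around its own mean
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index

# Synthetic two-sensor data: the reference set is assumed to show normal operation.
t = np.arange(100) * 0.7
reference = np.column_stack([np.sin(t), np.cos(t)])
test = np.column_stack([np.sin(t[:80]), np.cos(t[:80])])
test[60:] += 4.0  # injected shift from row 60 onward

t2 = hotelling_t2(test, reference.mean(axis=0), np.cov(reference, rowvar=False))
roi_start = single_change_point(t2)
print("region of interest assumed to start at row", roi_start)
```

In this sketch the region of interest is taken to run from the detected change point to the end of the test data; a region bounded on both sides would need a search for more than one change point.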
  • FIG. 6 is a block diagram illustrating a system for data-driven anomaly detection if only test data is available, in accordance with an embodiment of the present invention. The system includes a region of interest identification module 602, a reference data classification module 604, a control limit determination module 606, a mapping module 608, a data analysis module 610, an abnormal event detection module 612, an earliest time detection module 614, an alert generation module 616 and a Rule Engine Database 618. The region of interest identification module 602 is configured to identify a region of interest from the data based on a dimensionality reduction technique and a change point detection algorithm. The dimensionality reduction technique may include, but is not limited to, the T2 statistic. The computation of the T2 statistic and the identification of the region of interest using the change point algorithm are described in detail herein above. In all the embodiments of the present disclosure, the term “test data” is to be construed as the “data” mentioned in the claims. The data or test data can be obtained from sensors attached to the system or from manual observation of the functioning of the system. In a preferred embodiment, the data is obtained from the sensors, and this case is used to describe the present technique; it is for ease of understanding only and is not intended to limit the scope of the disclosure. The data can be stored in and extracted from a database or can be obtained directly from the sensors in real time. The data may be preprocessed to remove incomplete and irrelevant data. The preprocessing of the data is described herein above. The reference data classification module 604 is configured to classify reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest.
The reference data is classified into different groups based on an appropriate clustering technique; the clustering technique is described herein above. The control limit determination module 606 is configured to determine a control limit for each of the one or more groups by analyzing the reference data. The determination of the control limit is described herein above. The mapping module 608 is configured to map the data within the region of interest with the one or more groups based on the closeness between the mean vector of each group of the reference data and the actual sensor reading of the test data. The data analysis module 610 is configured to determine whether the data within the region of interest is outside the control limit of the mapped group. Each sensor reading is flagged with −1, 0, or 1 as described above. The abnormal event detection module 612 is configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit. The abnormal event detection method is described herein above. In accordance with an embodiment of the present disclosure, the sensors associated with the abnormal event are also identified by using the flagged values. The earliest time detection module 614 is configured to determine an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit. The method of computing the earliest detection time is described in detail herein above. The alert generation module 616 is configured to generate an alert on the occurrence of the at least one abnormal event. The Rule Engine Database 618 is configured to store predefined rules and rules given as input by the user. The rules are segregated based on the different patterns exhibited by the sensors using the reference data.
These rules are essential to identify the anomalies as well as to determine the earliest detection time.
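For illustration, the cooperation of the control limit determination module 606, the mapping module 608 and the data analysis module 610 may be sketched in Python as below. The group labels are assumed to have been produced already by a clustering step; the use of the group mean plus or minus three standard deviations as the control limit, the Euclidean distance as the measure of closeness, the function names and the sample values are all hypothetical choices that are not specified in the passage above.

```python
import numpy as np

def group_statistics(reference, labels, width=3.0):
    """Mean vector and (lower, upper) control limits for each group of reference data."""
    stats = {}
    for group in np.unique(labels):
        rows = reference[labels == group]
        mean, std = rows.mean(axis=0), rows.std(axis=0, ddof=1)
        # Assumed control limit: mean +/- width * standard deviation, per sensor
        stats[int(group)] = (mean, mean - width * std, mean + width * std)
    return stats

def map_to_groups(readings, stats):
    """Assign each reading to the group whose mean vector is nearest (Euclidean distance)."""
    groups = sorted(stats)
    means = np.stack([stats[g][0] for g in groups])
    distances = np.linalg.norm(readings[:, None, :] - means[None, :, :], axis=2)
    return np.array(groups)[distances.argmin(axis=1)]

# Hypothetical reference data for two sensors, already clustered into two modes.
reference = np.array([
    [1.0, 10.0], [1.2, 10.4], [0.8, 9.6],    # mode 0
    [5.0, 20.0], [5.3, 20.6], [4.7, 19.4],   # mode 1
])
labels = np.array([0, 0, 0, 1, 1, 1])
stats = group_statistics(reference, labels)

# Hypothetical readings from within the region of interest.
roi = np.array([[1.1, 10.2], [5.1, 23.0], [0.9, 8.0]])
mapped = map_to_groups(roi, stats)
lower = np.stack([stats[int(g)][1] for g in mapped])
upper = np.stack([stats[int(g)][2] for g in mapped])
outside = (roi < lower) | (roi > upper)   # True where a reading violates its group's limits
print(mapped, outside)
```

Mapping each reading to a group before testing it means that a reading is judged against the limits of the operating mode it most resembles, rather than against a single limit for the whole system.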
  • One or more computer-readable media (e.g., storage media) or one or more processor-readable media (e.g., storage media) can comprise computer-executable instructions causing a computing system (e.g., comprising one or more processors coupled to memory) (e.g., computing environment 100 or the like) to perform any of the methods described herein. Examples of such computer-readable or processor-readable media include magnetic media, optical media, and memory (e.g., volatile or non-volatile memory, including solid state drives or the like).
  • The above mentioned description is presented to enable a person of ordinary skill in the art to make and use the invention and is provided in the context of the requirement for obtaining a patent. Various modifications to the preferred embodiment will be readily apparent to those skilled in the art and the generic principles of the present invention may be applied to other embodiments, and some features of the present invention may be used without the corresponding use of other features. Accordingly, the present invention is not intended to be limited to the embodiment shown but is to be accorded the widest scope consistent with the principles and features described herein.

Claims (33)

What is claimed is:
1. A computer-implemented method for data-driven anomaly detection, the said method comprising:
identifying, by a processor, a region of interest from the data based on a dimensionality reduction technique and a change point detection algorithm;
mapping, by the processor, the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent normal operating condition of the system;
determining, by the processor, whether the data within the region of interest is outside of a predefined control limit of the corresponding mapped group of the one or more predefined groups; and
detecting, by the processor, at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
2. The method as claimed in claim 1 further comprising:
determining an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit.
3. The method as claimed in claim 1, wherein the data is captured from one or more sensors.
4. The method as claimed in claim 3, wherein at least one of the one or more sensors is identified which corresponds to the data within the region of interest and falls outside of the control limit by applying the heuristic algorithm.
5. The method as claimed in claim 1, wherein the dimensionality reduction technique includes T2 statistic.
6. The method as claimed in claim 1, wherein the one or more groups are classified based on one or more clustering algorithms.
7. The method as claimed in claim 1 further comprising:
generating an alert on the occurrence of the at least one abnormal event.
8. A computer-implemented method for data-driven anomaly detection, the said method comprising:
identifying, by a processor, a region of interest from the data based on a dimensionality reduction technique and a change point detection algorithm;
classifying, by the processor, reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest;
determining, by the processor, a control limit for each of the one or more groups by analyzing the reference data;
mapping, by the processor, the data within the region of interest with the one or more groups;
determining, by the processor, whether the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups; and
detecting, by the processor, at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
9. The method as claimed in claim 8, further comprising:
detecting an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit.
10. The method as claimed in claim 8, wherein the data is captured from one or more sensors.
11. The method as claimed in claim 10, wherein at least one of the one or more sensors is identified which corresponds to the data within the region of interest and falls outside of the control limit by applying the heuristic algorithm.
12. The method as claimed in claim 8, wherein the dimensionality reduction technique includes T2 statistic.
13. The method as claimed in claim 8, wherein the one or more groups are classified based on one or more clustering algorithms.
14. The method as claimed in claim 8 further comprising:
generating an alert on the occurrence of the at least one abnormal event.
15. A system for data-driven anomaly detection, comprising:
a processor in operable communication with a processor-readable storage medium, the processor-readable storage medium containing one or more programming instructions whereby the processor is configured to implement:
a region of interest identification module configured to identify a region of interest from the data based on dimensionality reduction technique and change point detection algorithm;
a mapping module configured to map the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent normal operating condition of the system;
a data analysis module configured to determine whether the data within the region of interest is outside of a predefined control limit of the corresponding mapped group of the one or more predefined groups; and
an abnormal event detection module configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
16. The system as claimed in claim 15 further comprising:
an earliest time detection module configured to determine an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit.
17. The system as claimed in claim 15, wherein the data is captured from one or more sensors.
18. The system as claimed in claim 17, wherein at least one of the one or more sensors is identified which corresponds to the data within the region of interest and falls outside of the control limit by applying the heuristic algorithm.
19. The system as claimed in claim 15, wherein the dimensionality reduction technique includes T2 statistic.
20. The system as claimed in claim 15, wherein the one or more groups are classified based on one or more clustering algorithms.
21. The system as claimed in claim 15 further comprising:
an alert generation module configured to generate an alert on the occurrence of the at least one abnormal event.
22. A system for data-driven anomaly detection, comprising:
a processor in operable communication with a processor-readable storage medium, the processor-readable storage medium containing one or more programming instructions whereby the processor is configured to implement:
a region of interest identification module configured to identify a region of interest from the data based on dimensionality reduction technique and change point detection algorithm;
a reference data classification module configured to classify reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest;
a control limit determination module configured to determine a control limit for each of the one or more groups by analyzing the reference data;
a mapping module configured to map the data within the region of interest with the one or more groups;
a data analysis module configured to determine whether the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups; and
an abnormal event detection module configured to detect at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
23. The system as claimed in claim 22 further comprising:
an earliest time detection module configured to determine an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit.
24. The system as claimed in claim 22, wherein the data is captured from one or more sensors.
25. The system as claimed in claim 24, wherein at least one of the one or more sensors is identified which corresponds to the data within the region of interest and falls outside of the control limit by applying the heuristic algorithm.
26. The system as claimed in claim 22, wherein the dimensionality reduction technique includes T2 statistic.
27. The system as claimed in claim 22 further comprising:
an alert generation module configured to generate an alert on the occurrence of the at least one abnormal event.
28. A computer-readable storage medium, that is not a signal, having computer-executable instructions stored thereon for data-driven anomaly detection, the said instructions comprising:
instructions for identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm;
instructions for mapping the data within the region of interest with one or more predefined groups of reference data representing one or more modes of operation of a system, wherein the reference data represent normal operating condition of the system;
instructions for determining whether the data within the region of interest is outside of a predefined control limit of the corresponding mapped group of the one or more predefined groups; and
instructions for detecting at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
29. The computer-readable storage medium as claimed in claim 28 further comprising:
instructions for determining an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit.
30. The computer-readable storage medium as claimed in claim 28 further comprising:
instructions for generating an alert on the occurrence of the at least one abnormal event.
31. A computer-readable storage medium, that is not a signal, having computer executable instructions stored thereon for data-driven anomaly detection, the said instructions comprising:
instructions for identifying a region of interest from the data based on dimensionality reduction technique and change point detection algorithm;
instructions for classifying reference data into one or more groups representing one or more modes of operation of a system, wherein the reference data is obtained from a region outside the region of interest;
instructions for determining a control limit for each of the one or more groups by analyzing the reference data;
instructions for mapping the data within the region of interest with the one or more groups;
instructions for determining whether the data within the region of interest is outside the control limit of the mapped group of the one or more predefined groups; and
instructions for detecting at least one abnormal event by applying a heuristic algorithm on the data within the region of interest which are outside the control limit.
32. The computer-readable storage medium as claimed in claim 31 further comprising:
instructions for detecting an earliest detection time of the at least one abnormal event by identifying a pattern of abnormality in the data within the region of interest which are outside the control limit.
33. The computer-readable storage medium as claimed in claim 31 further comprising:
instructions for generating an alert on the occurrence of the at least one abnormal event.
US14/220,050 2013-06-24 2014-03-19 Systems and methods for data-driven anomaly detection Active 2036-08-23 US10552511B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2720CH2013 2013-06-24
IN2720/CHE/2013 2013-06-24

Publications (2)

Publication Number Publication Date
US20140379301A1 (en) 2014-12-25
US10552511B2 (en) 2020-02-04

Family

ID=52111590

Family Applications (1)

Application Number Title Priority Date Filing Date
US14/220,050 Active 2036-08-23 US10552511B2 (en) 2013-06-24 2014-03-19 Systems and methods for data-driven anomaly detection

Country Status (1)

Country Link
US (1) US10552511B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200014740A1 (en) * 2018-07-06 2020-01-09 Avigilon Corporation Tile stream selection for mobile bandwith optimization

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120063641A1 (en) * 2009-04-01 2012-03-15 Curtin University Of Technology Systems and methods for detecting anomalies from data
US8588484B2 (en) * 2007-06-22 2013-11-19 Warwick Warp Limited Fingerprint matching method and apparatus
US20140032506A1 (en) * 2012-06-12 2014-01-30 Quality Attributes Software, Inc. System and methods for real-time detection, correction, and transformation of time series data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5566092A (en) * 1993-12-30 1996-10-15 Caterpillar Inc. Machine fault diagnostics system and method
JP2004309998A (en) 2003-02-18 2004-11-04 Nec Corp Probabilistic distribution estimation apparatus, abnormal behavior detection device, probabilistic distribution estimation method, and abnormal behavior detection method
US20070289013A1 (en) 2006-06-08 2007-12-13 Keng Leng Albert Lim Method and system for anomaly detection using a collective set of unsupervised machine-learning algorithms
CA2615161A1 (en) 2006-12-21 2008-06-21 Aquatic Informatics Inc. Automated validation using probabilistic parity space
US7696866B2 (en) * 2007-06-28 2010-04-13 Microsoft Corporation Learning and reasoning about the context-sensitive reliability of sensors
US8626889B2 (en) 2008-02-22 2014-01-07 Hewlett-Packard Development Company, L.P. Detecting anomalies in a sensor-networked environment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A novel changepoint detection algorithm, Allen B. Downey, December 5, 2008 *
Non-linear dimensionality reduction techniques for unsupervised feature extraction, S. De Backer, A. Naud, P. Scheunders, Vision Lab, Department of Physics, University of Antwerp, Groenenborgerlaan 171, 2020 Antwerpen, Belgium. Received 17 July 1997; revised 30 January 1998 *
Q-statistic and T2-statistic PCA-based measures for damage assessment in structures, L.E. Mujica, J. Rodellar, A. Fernández, A. Güemes. First published November 23, 2010 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150219530A1 (en) * 2013-12-23 2015-08-06 Exxonmobil Research And Engineering Company Systems and methods for event detection and diagnosis
US10148680B1 (en) * 2015-06-15 2018-12-04 ThetaRay Ltd. System and method for anomaly detection in dynamically evolving data using hybrid decomposition
US10419470B1 (en) * 2015-06-15 2019-09-17 Thetaray Ltd System and method for anomaly detection in dynamically evolving data using hybrid decomposition
US10798118B1 (en) * 2015-06-15 2020-10-06 ThetaRay Ltd. System and method for anomaly detection in dynamically evolving data using hybrid decomposition
US10812515B1 (en) * 2015-06-15 2020-10-20 ThetaRay Ltd. System and method for anomaly detection in dynamically evolving data using hybrid decomposition
US11593245B2 (en) * 2017-05-22 2023-02-28 Siemens Energy Global GmbH & Co. KG System, device and method for frozen period detection in sensor datasets
US11949703B2 (en) * 2019-05-01 2024-04-02 Oracle International Corporation Systems and methods for multivariate anomaly detection in software monitoring
WO2021038079A1 (en) * 2019-08-29 2021-03-04 Wago Verwaltungsgesellschaft Mbh Method and device for analyzing a sequential process
CN114341755A (en) * 2019-08-29 2022-04-12 Wago管理有限责任公司 Method and device for analyzing a process
WO2021184727A1 (en) * 2020-03-19 2021-09-23 平安科技(深圳)有限公司 Data abnormality detection method and apparatus, electronic device and storage medium
CN113139610A (en) * 2021-04-29 2021-07-20 国网河北省电力有限公司电力科学研究院 Abnormity detection method and device for transformer monitoring data
CN115618766A (en) * 2022-11-08 2023-01-17 中国航发四川燃气涡轮研究院 Algorithm capable of eliminating dead pixels of aero-engine flow passage test data in real time

Also Published As

Publication number Publication date
US10552511B2 (en) 2020-02-04

Legal Events

Date Code Title Description
AS Assignment

Owner name: INFOSYS LIMITED, INDIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SHASTRI, LOKENDRA;RAJ, KOLANDAISWAMY ANTONY AROKIA DURAI;KANAGASABAPATHI, BALASUBRAMANIAN;SIGNING DATES FROM 20141225 TO 20141229;REEL/FRAME:035207/0302

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: AWAITING TC RESP., ISSUE FEE NOT PAID

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4