US20080010330A1 - Method and system for detecting difference between plural observed results - Google Patents

Method and system for detecting difference between plural observed results

Info

Publication number
US20080010330A1
US20080010330A1 (Application US11/768,967)
Authority
US
United States
Prior art keywords
time series
series data
data set
cumulative probability
probability distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/768,967
Inventor
Tsuyoshi Ide
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors interest (see document for details). Assignors: IDE, TSUYOSHI
Publication of US20080010330A1
Priority to US12/264,556 (US7849124B2)
Current legal status: Abandoned

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B23/00 - Testing or monitoring of control systems or parts thereof
    • G05B23/02 - Electric testing or monitoring
    • G05B23/0205 - Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults
    • G05B23/0218 - Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults characterised by the fault detection method dealing with either existing or incipient faults
    • G05B23/0224 - Process history based detection method, e.g. whereby history implies the availability of large amounts of data
    • G05B23/024 - Quantitative history assessment, e.g. mathematical relationships between available data; Functions therefor; Principal component analysis [PCA]; Partial least square [PLS]; Statistical classifiers, e.g. Bayesian networks, linear regression or correlation analysis; Neural networks

Definitions

  • the present invention relates to a method and system for detecting a difference between a set of observed results, and in particular, the present invention relates to a method and system for statistically analyzing the difference between time series data sets.
  • Detecting and analyzing anomalies of dynamic systems is an important technical challenge in various areas of the manufacturing industry. For example, anomaly detection in a production line has been of particular importance, and many statistical techniques have been developed for quality control purposes. However, most of the traditional statistical quality control techniques are based on a strong assumption of a multivariate normal distribution. Unless the system of interest is relatively static and stationary, the distribution of the data is in general far from normal. This is especially the case in the analysis of automobiles, where the system is highly dynamic and the definition of the normal state is not apparent. As a result, the utility of such traditional approaches is quite limited in many cases.
  • a typical approach to anomaly detection and analysis is limit-check or its variant, where an observed value is compared to a threshold (or reference) value that has been predetermined using some algorithm.
  • a rule-based system is often implemented, which enables, at least in principle, making a decision on a detected fault, based on a rule that “if a certain kind of observed value is larger than a predetermined reference value, a user is informed of an occurrence of anomaly”.
  • the trend of a variable can be greatly changed over time. Thus it is difficult to determine the reference value of a variable for detecting anomalies.
  • test experiments are performed on a certain round basis.
  • one experimental round can be one lap of a test course. This experimental round is referred to as a run.
  • when an automobile goes round the test course n times, observed values of n runs, that is, n time series data sets for each kind of observed value, are obtained.
  • the time series data sets in individual runs may differ from one another to a greater or lesser extent. In conventional techniques, it is hard to handle such fluctuations in experimental conditions, so that the substantial status of a diagnosis target cannot be appropriately characterized in many cases.
  • the present invention provides a method for analyzing time series data for each observation variable of a plurality of observation variables through execution of a program by a processor of an information processing apparatus that comprises a display unit, said method comprising performing for each observation variable:
  • an initial first time series data set consisting of a first plurality of first observed values over a first period of time, followed by generating an initial first cumulative probability distribution from the initial first time series data set, said initial first time series data set being designated as a previous first time series data set;
  • an initial second time series data set consisting of a second plurality of second observed or computed values over a second period of time, followed by generating an initial second cumulative probability distribution from the initial second time series data set, said initial second time series data set being designated as a previous second time series data set;
  • an initial score as a function of an absolute value of a difference between respective cumulative probability values of the generated initial first cumulative probability distribution and the generated initial second cumulative probability distribution, said initial score being designated as a previous score;
  • performing operations on the previous first time series data set including: generating a next first time series data set by adding newly-obtained first observed values at first additional times to the previous first time series data set, followed by generating a next first cumulative probability distribution from the next first time series data set;
  • performing operations on the previous second time series data set including: generating a next second time series data set by adding newly-obtained second observed or computed values at second additional times to the previous second time series data set, followed by generating a next second cumulative probability distribution from the next second time series data set;
  • the present invention provides a method for analyzing time series data for each observation variable of a plurality of observation variables through execution of a program by a processor of an information processing apparatus that comprises a display unit, said method comprising, for each observation variable, processing time series data for an ordered sequence of time periods 1, 2, . . . , J such that J is at least 3,
  • processing for time period 1 comprises acquiring a new time series data set 1 having N 1 observed values for the time period 1 and generating a first cumulative probability distribution (Pcomb,1) from the new time series data set 1, and
  • FIG. 1 shows a functional structure of a diagnosis system.
  • FIG. 2 shows a flowchart of a process in which the diagnosis system creates a detection result of anomaly.
  • FIG. 3 is a conceptual diagram of a process in which a first time series data set to be diagnosed is compared with a second time series data set indicating a normal status.
  • FIG. 4 shows an example of a process in which a score is calculated on the basis of an original time series data set instead of a replicated data set.
  • FIG. 5 shows an example of a calculation process of a score D.
  • FIG. 6 shows a process in which scores D converge with an increase in the number of repetition times of sampling with replacement.
  • FIG. 7 shows an example of a screen on which a display unit displays a detection result.
  • FIG. 8 shows a flowchart of a process of displaying a detection result of anomaly in a modified example of this embodiment.
  • FIG. 9 shows an example of a hardware configuration of an information processing apparatus functioning as the diagnosis system in this embodiment or the modified example.
  • FIG. 1 shows a functional configuration of a diagnosis system 10 .
  • the diagnosis system 10 includes an acquisition unit 100 , a normal-status database (DB) 105 , a replication unit 110 , a calculation unit 120 , a judgment unit 130 and a display unit 140 .
  • the acquisition unit 100 acquires a first time series data set from outside for each observation variable of a plurality of observation variables, and acquires a second time series data set based on data in the normal-status DB 105 .
  • the first time series data set includes a plurality of observed values obtained by observing a certain observation target over time, and is to be used for detecting anomalies of the observation target.
  • the second time series data set for each observation variable includes a plurality of observed values obtained by observing another observation target over time, during a period different from that of the first time series data set.
  • the normal-status DB 105 stores at least one time series data set obtained as the result of an observation of the observation target in its normal status.
  • the acquisition unit 100 may acquire any one of the time series data sets as the second time series data set, or may acquire, as the second time series data set, the time series data set obtained for the observation period of the time length closest to that of the observation period of the first time series data set. Alternatively, the acquisition unit 100 may create a new time series data set, and acquire the new data set as the second time series data set.
  • the new time series data set is the one minimizing the sum of the squares of the inter-vector distances between the new time series data set and each time series data set obtained from the observation target in a normal status.
  • the replication unit 110 creates a first replicated data set that is a replica of the first time series data set, by repeatedly sampling a plurality of observed values with replacement from the first time series data set.
  • the replication unit 110 may create the first replicated data set by repeatedly sampling, with replacement, a number of observed values greater than the number of observed values included in the first time series data set.
  • the replication unit 110 may create a second replicated data set that is a replica of the second time series data set, by repeatedly sampling a plurality of observed values with replacement from the second time series data set.
  • alternatively, the replication unit 110 may use the first time series data set without change as the first replicated data set, and may create the second replicated data set by repeatedly sampling a plurality of observed values with replacement from the second time series data set.
  • the calculation unit 120 calculates a score for each observation variable, said score indicating the difference between the first time series data set and the second time series data set, on the basis of the difference between a probability value in the cumulative probability distribution of the first replicated data set, and a probability value in the cumulative probability distribution of the second replicated data set, with respect to the same kind of observed values. Each kind of observed value pertains to a different observation variable of the plurality of observation variables. Then, the calculation unit 120 outputs the score for each observation variable to the display unit 140 .
  • the second replicated data set needs to include only at least a part of the second time series data set.
  • the second replicated data set may be the second time series data set as it is, or may be a replicated data set that the replication unit 110 creates by repeatedly sampling observed values with replacement from the second time series data set.
  • any kind of score may be employed as long as the score is based on the difference between a probability value in the cumulative probability distribution of the first replicated data set, and a probability value in the cumulative probability distribution of the second replicated data set, with respect to the observed values for the same observation variable. Accordingly, the score may be the maximum value of the difference between the cumulative probability values of the observed values for each observation variable, or may be a value based on the mean or the sum of the absolute value of the difference between the cumulative probability values of the observed values for each observation variable.
  • the score thus calculated indicates the difference between the first time series data set and the second time series data set for each observation variable.
  • this score indicates a degree of anomaly occurring in an observation variable.
  • the judgment unit 130 judges whether or not the score has converged on a predetermined value as a result of repeated sampling with replacement. Specifically, in a case where the replication unit 110 sequentially creates a plurality of replicated data sets of each of the first time series data set and the second time series data set with an increase of the number of times of sampling, the updated replicated data sets are sequentially supplied to the calculation unit 120 . In this case, the calculation unit 120 recalculates the score every time the replicated data is updated. Then, the judgment unit 130 judges whether or not the difference between a previous score and a current score is equal to or less than a predetermined reference value.
  • the previous score is based on the difference between the probability value in the cumulative probability distribution of a previously-created first replicated data set, and the probability value in the cumulative probability distribution of a previously-created second replicated data set, with respect to the same observation variable.
  • the current score is based on the difference between the probability value in the cumulative probability distribution of a currently-created first replicated data set, and the probability value in the cumulative probability distribution of a currently-created second replicated data set, with respect to the same observation variable.
  • the judgment unit 130 outputs the current score to the display unit 140 .
  • the calculation unit 120 may output a calculated score when the number of times of sampling reaches a specified number of times equal to at least 2 .
  • the acquisition unit 100 acquires a first time series data set and a second time series data set corresponding to each observation variable.
  • the replication unit 110 creates a first replicated data set and a second replicated data set corresponding to each observation variable.
  • the calculation unit 120 calculates a score corresponding to each observation variable.
  • the display unit 140 displays information indicating the observation variables and the respective scores, in association with each other, in descending order of the calculated scores. This allows a user to easily recognize a place with a high probability that an anomaly has occurred, or a combination of places where anomalies occur.
  • the calculation unit 120 may calculate an index indicating a relationship among plural kinds of observation variables, and the display unit 140 may display the index together with the above information and score.
  • FIG. 2 shows a flowchart of processing in which the diagnosis system 10 creates a result of anomaly detection.
  • the acquisition unit 100 acquires the first time series data set and the second time series data set (S 200 ).
  • FIG. 3 shows an outline of a method for acquiring the second time series data set.
  • FIG. 3 is a conceptual diagram of a process in which a first time series data set to be diagnosed is compared with second time series data sets each representing a normal status of an observation variable.
  • the normal-status DB 105 stores four time series data sets each representing the normal status as shown to the right upper side in FIG. 3 , it is desirable to create one second time series data set by using these four sets, and then to compare the first time series data set with the second time series data set thus created.
  • the acquisition unit 100 may create a new time series data set that minimizes the sum of the squares of the inter-vector distances between the new time series data set and each of these four time series data sets. The following describes a specific example of this creating method.
  • suppose that the normal-status DB 105 stores m time series for each observation variable i, and that each time series is expressed as a column vector x_i^(r), where r is an integer from 1 to m specifying a different experimental run, and the integer i indicates the i-th observation variable.
  • Each time series x_i^(r) includes p different time points, so that x_i^(r) is a p-dimensional vector. Note that the numbers of observed values included in the respective time series data sets often differ from one another for each observation variable. This results from differences in the observation period or the observation intervals.
  • the acquisition unit 100 executes preprocessing so that each time series data set includes exactly p observed values. Specifically, the acquisition unit 100 interpolates a missing observed value on the basis of the observed values obtained before and after it, or excludes unnecessary observed values, as in the sketch below.
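The patent does not specify how a missing value is interpolated or how a series is brought to exactly p points; the sketch below assumes simple linear interpolation between the neighboring observed values and resampling onto a grid of p evenly spaced time points. The function and variable names are illustrative, not from the patent.

```python
import numpy as np

def preprocess_to_p_points(times, values, p):
    """Resample one run onto p evenly spaced time points.

    Missing values (NaN) are filled by linear interpolation between the
    observed values obtained before and after the gap (an assumption;
    the text only says neighboring observations are used).
    Assumes `times` is increasing.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    keep = ~np.isnan(values)                     # drop the missing entries
    grid = np.linspace(times[0], times[-1], p)   # common p-point time grid
    return np.interp(grid, times[keep], values[keep])
```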
  • the present invention finds a representative vector u_i having p values, wherein u_i maintains the essential features of the m time series (m>1).
  • the set of m time series for the i-th variable, each representing the normal status, is expressed as the p-by-m matrix of Equation 1, H_i = [ x_i^(1), x_i^(2), . . . , x_i^(m) ], whose r-th column is x_i^(r).
  • a time series data set is created as the second time series data set, namely the p-dimensional vector u_i.
  • u_i^T is the transpose of u_i.
  • One natural principle for finding u_i is to minimize the sum-of-squares error between the extracted representative vector u_i and the original m time series. Since approximating x_i^(r) by u_i amounts to projecting it onto the one-dimensional space spanned by u_i, with projection coefficient (u_i^T x_i^(r)), the sum-of-squares error function is defined as Equation 2: E(u_i) = sum over r = 1, ..., m of || x_i^(r) - (u_i^T x_i^(r)) u_i ||^2.
  • the second time series data set to be determined is the vector u_i that minimizes the above sum of squares.
  • Equation 3 shows the derivation process: minimizing E(u_i) under the normalization constraint u_i^T u_i = 1 is equivalent to maximizing u_i^T H_i H_i^T u_i, where H_i^T is the transpose of H_i.
  • Equation 3 is thus reduced to the eigen equation of Equation 4, H_i H_i^T u_i = λ u_i.
  • the representative p-dimensional vector u_i for observation variable i is found as the eigenvector corresponding to the maximum eigenvalue λ.
  • By using singular value decomposition (SVD), the acquisition unit 100 can efficiently solve the eigenvalue problem of Equation 4, and thereby can efficiently create a new second time series data set that minimizes the sum of the squares of the differences between the second time series data set and each of the time series data sets stored in the normal-status DB 105, as illustrated in the sketch below.
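A minimal sketch of this construction, assuming the representative series u_i is taken as the left singular vector of H_i associated with the largest singular value (equivalently, the eigenvector of H_i H_i^T with the maximum eigenvalue); the names are illustrative.

```python
import numpy as np

def representative_series(runs):
    """Extract a representative vector u_i from m normal-status runs.

    `runs` is a list of m arrays, each of length p (one run per array).
    Returns the p-dimensional vector minimizing the sum-of-squares error
    of Equation 2, i.e. the leading left singular vector of the p-by-m
    matrix H_i of Equation 1.
    """
    H = np.column_stack(runs)                    # p-by-m matrix H_i
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    u = U[:, 0]                                  # eigenvector of H_i H_i^T, max eigenvalue
    # SVD fixes u only up to sign; orient it to resemble the average run.
    if np.dot(u, H.mean(axis=1)) < 0:
        u = -u
    return u
```

The leading left singular vector of H_i is exactly the eigenvector of the p-by-p matrix H_i H_i^T with the largest eigenvalue, so an explicit eigen decomposition of that matrix is not needed.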
  • the replication unit 110 creates the first replicated data set and the second replicated data set by repeatedly sampling with replacement a predetermined number of times (S 210 ).
  • this predetermined number of times may be equal to the number of observed values, or a predetermined integral multiple of the number of observed values, such as 2 times that number.
  • for a method of creating a replicated data set by repeated sampling with replacement, refer to A. C. Davison and D. V. Hinkley, “Bootstrap Methods and Their Application,” Cambridge University Press (Oct. 28, 1997).
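A minimal sketch of the resampling step, assuming a plain bootstrap draw with replacement as in the Davison and Hinkley reference; the 2x default for the replicated size is only an example of the integral multiple mentioned above.

```python
import numpy as np

def replicate(data, factor=2, rng=None):
    """Create a replicated data set by repeated sampling with replacement.

    Draws `factor` times as many values as the original time series data
    set contains, uniformly at random and with replacement.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    return rng.choice(data, size=factor * data.size, replace=True)
```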
  • the calculation unit 120 calculates a score indicating the difference between the first time series data set and the second time series data set on the basis of the difference between the cumulative probability value in the cumulative probability distribution of the first time series data set, and the cumulative probability value in the cumulative probability distribution of the second time series data set with respect to the same observation variable (S 220 ).
  • the cumulative probability distribution may be calculated by any method known by a person of ordinary skill in the art.
  • the cumulative probability distribution may be calculated for an observation variable by expressing the first or second time series data set as a frequency distribution in the form of a probability density function in the observation variable, followed by integrating the probability density function from the lowest value of the observation variable to successively higher values, to obtain the cumulative probability distribution normalized to a range of 0 to 1.
  • the calculation unit 120 may exclude a predetermined range of observed values by assuming that the probability distribution in the predetermined range is not likely to vary, and then may calculate the difference between the probability value in the cumulative probability distribution of the first time series data set, and the probability value in the cumulative probability distribution of the second time series data set with respect to the same observation variable. For example, in a case where it is obvious that values for a semi-discrete variable within a certain range are continuous values, the calculation unit 120 may exclude the values within the range, and then may calculate the score.
  • This calculated score may be a score known as a Kolmogorov-Smirnov (KS) statistic (hereinafter, referred to as a score D).
  • The KS statistic indicates the maximum absolute value of the difference between the probability value in the cumulative probability distribution of the first replicated data set and the probability value in the cumulative probability distribution of the second replicated data set, with respect to each of the observed values.
  • the KS statistic takes a value within a range of 0 to 1, i.e., takes 0 if no difference exists, and takes 1 if the absolute value of the difference is the maximum. Equation 5 defines the score D as D = max over x of | P_N(x) - P_T(x) |, where the subscripts N and T represent the normal and the target data, respectively, and P_N(x) and P_T(x) are the corresponding cumulative probability values at observed value x.
  • by using the KS statistic, it is possible to appropriately recognize a substantial status variation even when the observation period is changed due to a change in the observation environment, when an observed value becomes a predetermined multiple of its usual value due to a change in certain conditions, or when a value is observed with a predetermined offset added to (or subtracted from) it. A sketch of the score D computation follows.
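A minimal sketch of the score D computation, assuming empirical cumulative distributions built from the two (replicated) data sets and compared on the pooled set of observed values; the helper names are illustrative.

```python
import numpy as np

def ecdf(sample, grid):
    """Cumulative probability of `sample` evaluated at each grid value."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def score_d(target, normal):
    """Score D of Equation 5: the maximum absolute difference between the
    two cumulative probability distributions (the KS statistic), ranging
    from 0 (identical) to 1 (maximally different)."""
    grid = np.union1d(target, normal)            # all observed values
    return float(np.max(np.abs(ecdf(target, grid) - ecdf(normal, grid))))
```

As noted above, the mean or the sum of the absolute differences over the observed values could be substituted for the maximum without changing the rest of the procedure.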
  • FIG. 4 shows an example of the process for calculating a score D on the basis of an original time series data set instead of a replicated data set.
  • the left-upper part of FIG. 4 shows a graph indicating a cumulative probability distribution of a second time series data set in a solid line.
  • the left-lower part of FIG. 4 shows a graph indicating a cumulative probability distribution of a first time series data set in a dotted line. In these cumulative probability distributions, observed values vary discontinuously around the central part of the graph.
  • the probability that an observed value obtained from an observation target takes a certain value is very high, and the probability that the observed value takes a different value is very low, although such different values may occasionally be observed.
  • the observed value is called a semi-discrete observed value
  • a variable to which the semi-discrete observed value is assigned is called a semi-discrete variable.
  • a large number of semi-discrete observed values are observed from an observation target such as an automobile. For example, only discrete values can be inherently obtained from the gear position and the shift position. Moreover, like an accelerator opening degree, there is also an observed value that is more likely to take 0 (when not pressing down on the accelerator) or 100 (when pressing down on the accelerator completely) in light of usual usage thereof. In addition, there is a case where observed values that are inherently continuous values become semi-discrete due to limited resolution. An example of this case is that the performance of a thermometer only allows a temperature to be measured with accuracy of 1° C. Smoothing and resampling processing or inclusion of noise may cause such semi-discrete variables not to be completely discrete.
  • although the automobile is illustrated as the observation target in this embodiment, this embodiment can also be effectively applied to any observation target other than the automobile, as long as such semi-discrete observed values are observed from it.
  • the diagnosis system 10 may observe an economic index such as the gross domestic product, or the price of a stock or a bond. In this case, it is possible to detect an anomaly occurring in an organization, such as a nation state or a company, which is the observation target.
  • the right part of FIG. 4 shows a graph in which the graph (a solid line) of the cumulative probability distribution of the second time series data set overlaps the graph (a dotted line) of the cumulative probability distribution of the first time series data set.
  • the lines of the first and second time series data sets are almost identical.
  • in this embodiment, the calculation unit 120 calculates the score D, that is, the KS statistic, indicating the difference between the first replicated data set and the second replicated data set, instead of between the first time series data set and the second time series data set.
  • Each of the replicated data sets is obtained by repeatedly sampling observed values with replacement from the corresponding time series data set.
  • FIG. 5 shows an example of the process for calculating a score D.
  • the calculation unit 120 creates a probability distribution of a first replicated data set, and a probability distribution of a second replicated data set. If each of the replicated data sets is expressed as changes of observed values with the passage of time, probability distributions are each expressed as a set of probability values each corresponding to an observed value, and each indicating the probability of obtaining the observed value. Then, the calculation unit 120 creates a cumulative probability distribution of each of the probability distributions. A cumulative probability distribution is expressed as a cumulative value of the probability that each observation variable takes on a value less than or equal to a certain value.
  • the calculation unit 120 calculates, as the score D, the maximum value of the difference between a probability value in the cumulative probability distribution of the first replicated data set, and a probability value in the cumulative probability distribution of the second replicated data set, with respect to the same observation variable.
  • the judgment unit 130 calculates the difference between scores, one of which has been created in S 220 in the previous execution of S 210 to S 240 , and the other of which has been created in S 220 in the current execution of S 210 to S 240 (S 230 ). Then, on condition that the calculated difference is not equal to or less than a predetermined reference value (S 240 : NO) which is a specified tolerance, the judgment unit 130 causes the processing to return to S 210 .
  • the replication unit 110 repeatedly performs sampling of observed values with replacement by increasing the number of times of repetition, and creates a new replicated data set by adding newly sampled observed values to the replicated data set created in the previous execution. This processing of S 210 to S 240 is repeated until the difference calculated in S 230 becomes equal to or less than the reference value.
  • On condition that the difference between the scores converges to the reference value or below (S 240 : YES), the calculation unit 120 outputs the currently-calculated score to the display unit 140 (S 250 ). After that, the display unit 140 displays the calculated score as a detection result of the difference between the first time series data set and the second time series data set (S 260 ). Alternatively, on condition that the score exceeds a predetermined threshold value (for example, 0.4), the display unit 140 may notify a user of the observation variable corresponding to the score. The foregoing processing may be repeatedly performed for each observation variable of the plurality of observation variables, and the display unit 140 may display a list of detection results for all of the observation variables. The loop over S 210 to S 240 is sketched below.
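A minimal sketch of the S 210 to S 240 loop, assuming the replicated data sets are grown by one multiple of the original data per iteration and the loop stops when successive scores D differ by at most a specified tolerance; the names and defaults are assumptions, and `score_d` repeats the illustrative helper sketched earlier so the block is self-contained.

```python
import numpy as np

def _ecdf(sample, grid):
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def score_d(target, normal):
    """Score D: maximum absolute difference of the two empirical CDFs."""
    grid = np.union1d(target, normal)
    return float(np.max(np.abs(_ecdf(target, grid) - _ecdf(normal, grid))))

def converged_score(target, normal, tol=0.01, max_rounds=100, rng=None):
    """Repeat sampling with replacement, enlarging both replicated data
    sets, until the change between successive scores D is within `tol`."""
    rng = np.random.default_rng() if rng is None else rng
    target = np.asarray(target, dtype=float)
    normal = np.asarray(normal, dtype=float)
    rep_t = rng.choice(target, size=target.size, replace=True)
    rep_n = rng.choice(normal, size=normal.size, replace=True)
    prev = score_d(rep_t, rep_n)
    for _ in range(max_rounds):
        # S 210: add newly sampled observed values to the previous replicas.
        rep_t = np.concatenate([rep_t, rng.choice(target, size=target.size, replace=True)])
        rep_n = np.concatenate([rep_n, rng.choice(normal, size=normal.size, replace=True)])
        curr = score_d(rep_t, rep_n)             # S 220: recalculate the score
        if abs(curr - prev) <= tol:              # S 230 / S 240: convergence check
            return curr                          # S 250: output the converged score
        prev = curr
    return prev
```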
  • FIG. 6 shows a process in which scores D converge as the number of times of repeated sampling with replacement increases.
  • in FIG. 6 , the x-axis shows the kinds of observed values, the z-axis shows the number of times of repeated sampling with replacement, expressed as a multiple of the number of observed values in the first time series data set and/or the second time series data set, and the y-axis shows the score D.
  • the number of repetitions of the processing shown in S 210 to S 240 is on the order of several tens of times the number of observed values.
  • the replication unit 110 may, in advance, repeatedly perform sampling with replacement approximately several tens of times the number of observed values, and then the calculation unit 120 may calculate the score D only once by using the replicated data sets thus created.
  • a sufficiently reliable score can be calculated by repeatedly sampling with replacement approximately several tens of times the number of observed values.
  • FIG. 7 shows an example of a screen 700 on which a detection result is displayed by the display unit 140 .
  • the display unit 140 displays information indicating the plural observation variables and the scores, such as KS statistics, calculated for the respective observation variables, in descending order of the calculated scores.
  • the observed values having the larger scores are displayed from the left-hand side in descending order.
  • the display unit 140 shows a KS statistic of an observed value by using the length of a bar 710 in the bar chart, in association with an identification number 8 indicating the observation variable.
  • the display unit 140 shows a KS statistic of an observed value by using the length of a bar 720 in the bar chart, in association with an identification number 12 indicating the observation variable.
  • the KS statistics become smaller in order of the identification numbers 8, 12, 55, 13, 2, 60, . . . , from the left side to the right side in the graph.
  • FIG. 8 shows a flowchart of a real-time process for displaying a detection result of anomaly in a modified example of this embodiment.
  • scores such as the foregoing KS statistics
  • the acquisition unit 100 sequentially observes the observation target, and sequentially adds a time series data set including newly-obtained observed values to a time series data set including previously-obtained observed values. Thereby, the acquisition unit 100 acquires a new first time series data set (S 800 ) with a fixed window of length N1. In other words, the acquisition unit 100 acquires a new first time series data set including N1 data points in every execution of S 800 . Then, the replication unit 110 creates a replicated data set of the time series data set including only the newly-obtained observed values, out of the acquired first time series data set (S 810 ).
  • the calculation unit 120 creates a cumulative probability distribution based on the replicated data set of the first time series data set, by combining a cumulative probability distribution based on the replicated data set of the time series data set including the newly-obtained observed value, with a cumulative probability distribution based on a replicated data set of the time series data set including the previously-obtained observed values (S 820 ).
  • This combining is carried out according to a ratio of the number of newly-obtained observed values to the number of previously-obtained observed values.
  • a cumulative probability value of an observation variable is calculated as a weighted average, using the numbers of newly-obtained and previously-obtained observed values as the respective weighting factors.
  • suppose, for example, that the previously-calculated cumulative probability value and the newly-calculated cumulative probability value of a certain observation variable are 0.4 and 0.5, respectively, and that the numbers of previously-obtained observed values and newly-obtained observed values are 9000 and 1000, respectively.
  • the weighted average is then calculated according to the 9:1 ratio between the numbers of observed values, yielding the combined probability value 0.9 × 0.4 + 0.1 × 0.5 = 0.41.
  • the resulting score S is written as S = βS2 + (1 - β)S1, where β is a real number between 0 and 1, representing a discounting factor.
  • N 2 is the window size (the number of data points) of the previously observed data.
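A minimal sketch of the combining step in S 820, assuming the combined cumulative probability values are the count-weighted average of the previously-obtained and newly-obtained cumulative probability values evaluated at the same observed values, as in the 9000/1000 example above; the function name is illustrative.

```python
def combine_cdf(prev_cdf, new_cdf, n_prev, n_new):
    """Combine cumulative probability values according to the ratio of
    previously-obtained to newly-obtained observed values.

    `prev_cdf` and `new_cdf` map the same observed values to cumulative
    probabilities; the result is their weighted average.
    """
    beta = n_new / (n_prev + n_new)              # weight given to the new data
    return {x: (1.0 - beta) * prev_cdf[x] + beta * new_cdf[x] for x in prev_cdf}

# Worked example from the text: 9000 previous points at 0.4, 1000 new points at 0.5.
print(combine_cdf({"v": 0.4}, {"v": 0.5}, 9000, 1000))   # prints approximately 0.41
```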
  • the calculation unit 120 calculates a score such as a KS statistic indicating the difference between the combined cumulative probability distribution, and the previously-created cumulative probability distribution of the second replicated data set (S 830 ).
  • the calculated score is displayed as a detection result of the difference between the first time series data set and the second time series data set (S 840 ). If the diagnosis system 10 continues to obtain observed values, the diagnosis system 10 causes the processing to return to S 800 , and subsequently updates the score.
  • FIG. 8 describes a method for analyzing time series data for each observation variable of a plurality of observation variables through execution of a program by a processor of an information processing apparatus that comprises a display unit.
  • the method comprises, for each observation variable, processing time series data for an ordered sequence of time periods 1, 2, . . . , J such that J is at least 3.
  • Processing the time series data for time period 1 comprises acquiring a new time series data set 1 having N 1 observed values for the time period 1 and generating a first cumulative probability distribution (Pcomb,1) from the new time series data set 1.
  • In one embodiment, N 1 =1. In another embodiment, N 1 >1.
  • Processing the time series data for the time periods 1, 2, . . . , J is performed in real time.
  • FIG. 9 shows an example of a hardware configuration of an information processing apparatus 500 functioning as the diagnosis system 10 in this embodiment or the modified example.
  • the information processing apparatus 500 includes a CPU peripheral unit, an input/output unit and a legacy input/output unit.
  • the CPU peripheral unit includes a CPU 1000 , a RAM 1020 and a graphics controller 1075 , all of which are mutually connected to one another via a host controller 1082 .
  • the input/output unit includes a communication interface 1030 , a hard disk drive 1040 and a CD-ROM drive 1060 , all of which are connected to the host controller 1082 via an input/output controller 1084 .
  • the legacy input/output unit includes a ROM 1010 , a flexible disk drive 1050 and an input/output chip 1070 , all of which are connected to the input/output controller 1084 .
  • the host controller 1082 mutually connects the RAM 1020 to the CPU 1000 and the graphics controller 1075 , both of which access the RAM 1020 at a high transfer rate.
  • the CPU 1000 is operated based on programs stored in the ROM 1010 and the RAM 1020 , and controls each of the components.
  • the graphics controller 1075 obtains image data generated by the CPU 1000 or the like in a frame buffer provided in the RAM 1020 , and causes the obtained image data to be displayed on a display device 1080 .
  • the graphics controller 1075 may internally include a frame buffer in which the image data generated by the CPU 1000 or the like is stored.
  • the input/output controller 1084 connects the host controller 1082 to the communication interface 1030 , the hard disk drive 1040 and the CD-ROM drive 1060 , all of which are high-speed input/output devices.
  • the communication interface 1030 communicates with an external device via a network.
  • in the hard disk drive 1040, programs and data to be used by the information processing apparatus 500 are stored.
  • the CD-ROM drive 1060 reads a program or data from a CD-ROM 1095 , and provides the read-out program or data to the RAM 1020 or the hard disk 1040 .
  • the input/output controller 1084 is connected to relatively low-speed input/output devices such as the ROM 1010 , the flexible disk drive 1050 and the input/output chip 1070 .
  • in the ROM 1010, stored are programs such as a boot program executed by the CPU 1000 at start-up of the information processing apparatus 500 and a program depending on the hardware of the information processing apparatus 500.
  • the flexible disk drive 1050 reads a program or data from a flexible disk 1090 , and provides the read-out program or data to the RAM 1020 or the hard disk drive 1040 via the input/output chip 1070 .
  • the input/output chip 1070 is connected to the flexible disk drive 1050 and various kinds of input/output devices, for example, through a parallel port, a serial port, a keyboard port, a mouse port and the like.
  • a program to be provided to the information processing apparatus 500 is supplied by a user in a state of being stored in a storage medium such as the flexible disk 1090, the CD-ROM 1095 or an IC card.
  • the program is read from the storage medium via the input/output chip 1070 and/or the input/output controller 1084 , and is installed and executed on the information processing apparatus 500 .
  • An operation that the program causes the information processing apparatus 500 or the like to execute, is identical to the operation of the diagnosis system 10 described by referring to FIGS. 1 to 8 . Therefore, the description thereof is omitted here.
  • the program described above may be stored in an external storage medium.
  • as the storage medium, any one of the following media may be used in addition to the flexible disk 1090 and the CD-ROM 1095: an optical storage medium such as a DVD or a PD; a magneto-optical storage medium such as an MD; a tape medium; or a semiconductor memory such as an IC card.
  • the program may be provided to the information processing apparatus 500 via a network, by using, as a storage medium, a storage device such as a hard disk and a RAM, provided in a server system connected to a private communication network or the internet.
  • a storage medium with the program stored thereon is a computer program product.
  • with the diagnosis system 10 , it is possible to detect an anomaly occurring in an observation target by exhaustively using plural kinds of observed values, without having full knowledge of the behavior of the observed values in advance. Moreover, a score indicating an anomaly is calculated by repeatedly sampling observed values with replacement from a time series data set. As a result, the substantial status of the observation target can be diagnosed while excluding slight differences between observed values caused by differences in the observation environment. This makes it possible to achieve an anomaly detection method that functions effectively in place of the conventional limit check, or functions complementarily in addition to the limit check.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Testing Or Calibration Of Command Recording Devices (AREA)
  • Complex Calculations (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

A method and system for analyzing time series data. In an embodiment, a loop is executed and terminated upon a specified maximum number of iterations of the loop being performed or upon a difference between scores in successive iterations of the loop not being greater than a specified tolerance, wherein the score in each iteration is calculated as a function of an absolute value of a difference between respective cumulative probability values of first and second cumulative probability distributions, which are generated from first and second time series data sets, respectively. In an embodiment, time series data is processed in a sequence of time periods, wherein a combined cumulative probability distribution is generated in each time period by combining a cumulative probability distribution of new time series data with previously combined cumulative probability distribution data according to a ratio of the number of new to previous observed values.

Description

    FIELD OF THE INVENTION
  • The present invention relates to a method and system for detecting a difference between a set of observed results, and in particular, the present invention relates to a method and system for statistically analyzing the difference between time series data sets.
  • BACKGROUND OF THE INVENTION
  • Detecting and analyzing anomalies of dynamic systems is an important technical challenge in various areas of the manufacturing industry. For example, anomaly detection in a production line has been of particular importance, and many statistical techniques have been developed for quality control purposes. However, most of the traditional statistical quality control techniques are based on a strong assumption of a multivariate normal distribution. Unless the system of interest is relatively static and stationary, the distribution of the data is in general far from normal. This is especially the case in the analysis of automobiles, where the system is highly dynamic and the definition of the normal state is not apparent. As a result, the utility of such traditional approaches is quite limited in many cases.
  • The following points may be considered on anomaly detection and analysis of cars:
    • 1. From each component of an automobile, hundreds of time series data are observed.
    • 2. The types of observed time series data can be various; for example, the values could be discrete in some variable, and be continuous in another.
    • 3. The intervals of observations (or sampling interval) can be also various depending on the types of observed values.
    • 4. The knowledge of individual engineers may be incomplete; they may not always make a valid decision based on experimental data.
  • Heretofore, a typical approach to anomaly detection and analysis has been limit-check or a variant of it, where an observed value is compared to a threshold (or reference) value that has been predetermined using some algorithm. Based on limit-check, a rule-based system is often implemented, which enables, at least in principle, making a decision on a detected fault, based on a rule such as “if a certain kind of observed value is larger than a predetermined reference value, a user is informed of an occurrence of anomaly”. However, in highly dynamic systems such as an automobile, the trend of a variable can change greatly over time. Thus it is difficult to determine the reference value of a variable for detecting anomalies. While experienced engineers may be able to make a decision on the state of the system based on such complicated numerical data, it is unrealistic to assume that enough manpower of experienced engineers is available in every phase and place of anomaly detection. In addition, the knowledge of experienced engineers is often hard to translate into specific mathematical rules used in the limit-check routine. To summarize, the applicability of limit-check in combination with partial human knowledge is seriously limited in general. Accordingly, if there is an anomaly detection method that works more effectively than limit-check, or that functions complementarily in addition to limit-check, the time and effort for an anomaly diagnosis will be greatly reduced.
  • Generally, test experiments are performed on a certain round basis. For example, in the case of an automobile, one experimental round can be one lap of a test course. This experimental round is referred to as a run. When an automobile goes round the test course n times, observed values of n runs, that is, n time series data sets for each kind of observed value, are obtained. In general, it is difficult to make the test conditions in all the runs exactly the same, since the complexity of the system is too high to completely control their values. The time series data sets in individual runs may therefore differ from one another to a greater or lesser extent. In conventional techniques, it is hard to handle such fluctuations in experimental conditions, so that the substantial status of a diagnosis target cannot be appropriately characterized in many cases.
  • In addition, the tendency of variations in observed values is greatly different among the types of observed values. Moreover, since the number of variables of the system is very large, considering all combinations of the variables is computationally prohibitive.
  • SUMMARY OF THE INVENTION
  • The present invention provides a method for analyzing time series data for each observation variable of a plurality of observation variables through execution of a program by a processor of an information processing apparatus that comprises a display unit, said method comprising performing for each observation variable:
  • acquiring an initial first time series data set consisting of a first plurality of first observed values over a first period of time, followed by generating an initial first cumulative probability distribution from the initial first time series data set, said initial first time series data set being designated as a previous first time series data set;
  • acquiring an initial second time series data set consisting of a second plurality of second observed or computed values over a second period of time, followed by generating an initial second cumulative probability distribution from the initial second time series data set, said initial second time series data set being designated as a previous second time series data set;
  • calculating an initial score as a function of an absolute value of a difference between respective cumulative probability values of the generated initial first cumulative probability distribution and the generated initial second cumulative probability distribution, said initial score being designated as a previous score;
  • performing operations on the previous first time series data set, including: generating a next first time series data set by adding newly-obtained first observed values at first additional times to the previous first time series data set, followed by generating a next first cumulative probability distribution from the next first time series data set;
  • performing operations on the previous second time series data set, including: generating a next second time series data set by adding newly-obtained second observed or computed values at second additional times to the previous second time series data set, followed by generating a next second cumulative probability distribution from the next second time series data set;
  • calculating a next score as said function of an absolute value of a difference between respective cumulative probability values of the generated next first cumulative probability distribution and the generated next second cumulative probability distribution;
  • ascertaining whether a condition exists,
      • wherein the condition is that the next score has been calculated a specified maximum number of times equal to at least 1 or an absolute value of a difference between the next score and the previous score is not greater than a specified tolerance, and
      • wherein if said ascertaining ascertains that the condition exists then outputting the next score as a detection result to the display unit, otherwise setting the previous first time series data set equal to the next first time series data set, setting the previous second time series data set equal to the next second time series data set, setting the previous score equal to the next score, and again executing said performing operations on the previous first time series data set, said performing operations on the previous second time series data set, said calculating the next score, and said ascertaining.
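A minimal sketch of the loop defined by the steps above, assuming empirical cumulative distributions and the maximum absolute difference as the score function (a mean or sum of absolute differences could be used instead); all names are illustrative, and the batches stand in for the newly-obtained observed values added at the additional times.

```python
import numpy as np

def ecdf_at(sample, grid):
    """Empirical cumulative probability of `sample` at each grid value."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def score(first, second):
    """Score as a function of the absolute difference between the two
    cumulative probability distributions (here, the maximum)."""
    grid = np.union1d(first, second)
    return float(np.max(np.abs(ecdf_at(first, grid) - ecdf_at(second, grid))))

def analyze(first_batches, second_batches, tol=0.01):
    """Run the claimed loop; `first_batches[0]` and `second_batches[0]` are
    the initial data sets, and later entries are the newly-obtained values
    added in each iteration. Stops when the batches run out (the specified
    maximum number of times) or when successive scores differ by no more
    than `tol` (the specified tolerance)."""
    first = np.asarray(first_batches[0], dtype=float)
    second = np.asarray(second_batches[0], dtype=float)
    prev = score(first, second)                   # initial score
    for new_f, new_s in zip(first_batches[1:], second_batches[1:]):
        first = np.concatenate([first, new_f])    # next first time series data set
        second = np.concatenate([second, new_s])  # next second time series data set
        nxt = score(first, second)                # next score
        if abs(nxt - prev) <= tol:
            return nxt                            # output as the detection result
        prev = nxt
    return prev
```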
  • The present invention provides a method for analyzing time series data for each observation variable of a plurality of observation variables through execution of a program by a processor of an information processing apparatus that comprises a display unit, said method comprising, for each observation variable, processing time series data for an ordered sequence of time periods 1, 2, . . . , J such that J is at least 3,
  • wherein said processing for time period 1 comprises acquiring a new time series data set 1 having N1 observed values for the time period 1 and generating a first cumulative probability distribution (Pcomb,1) from the new time series data set 1, and
  • wherein said processing for time period j (j=2, 3, . . . , J) comprises the steps of:
  • (a) acquiring a new time series data set j having N1 observed values for the time period j;
  • (b) generating a new cumulative probability distribution (Pnew,j) from the new time series data set j;
  • (c) generating a combined cumulative probability distribution (Pcomb,j) equal to β*Pnew,j+(1−β)*Pcomb,j−1, wherein β=N2/(N1+N2), and wherein N2=(j−1)*N1;
  • (d) computing a score equal to the difference of Pcomb,j and Pcomb,j−1; and
  • (e) outputting the score as a detection result to the display unit.
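A minimal sketch that transcribes steps (a) through (e) above, with β = N2/(N1+N2) and N2 = (j-1)*N1 exactly as stated. As assumptions of this sketch, each cumulative probability distribution is represented as an array evaluated on a fixed grid of observed values, and the scalar score in step (d) is taken as the maximum absolute difference between Pcomb,j and Pcomb,j−1, in line with the KS-style scores used elsewhere in the description.

```python
import numpy as np

def ecdf_on_grid(sample, grid):
    """Cumulative probability distribution of `sample` on a fixed grid."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def process_periods(batches, grid):
    """Process time periods 1, 2, ..., J; `batches[j-1]` holds the N1
    observed values acquired in period j. Yields a score for each j >= 2."""
    n1 = len(batches[0])
    p_comb = ecdf_on_grid(batches[0], grid)          # Pcomb,1
    for j in range(2, len(batches) + 1):
        p_new = ecdf_on_grid(batches[j - 1], grid)   # (a), (b): Pnew,j
        n2 = (j - 1) * n1
        beta = n2 / (n1 + n2)
        p_prev = p_comb
        p_comb = beta * p_new + (1.0 - beta) * p_prev      # (c): Pcomb,j
        score = float(np.max(np.abs(p_comb - p_prev)))     # (d): difference of Pcomb,j and Pcomb,j-1
        yield j, score                                     # (e): output as a detection result
```

In practice the grid could be fixed in advance, for example as the sorted set of values seen in the normal-status data, so that successive distributions remain comparable.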
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • For a more complete understanding of the present invention and the advantage thereof, reference is now made to the following description taken in conjunction with the accompanying drawings.
  • FIG. 1 shows a functional structure of a diagnosis system.
  • FIG. 2 shows a flowchart of a process in which the diagnosis system creates a detection result of anomaly.
  • FIG. 3 is a conceptual diagram of a process in which a first time series data set to be diagnosed is compared with a second time series data set indicating a normal status.
  • FIG. 4 shows an example of a process in which a score is calculated on the basis of an original time series data set instead of a replicated data set.
  • FIG. 5 shows an example of a calculation process of a score D.
  • FIG. 6 shows a process in which scores D converge with an increase in the number of repetition times of sampling with replacement.
  • FIG. 7 shows an example of a screen on which a display unit displays a detection result.
  • FIG. 8 shows a flowchart of a process of displaying a detection result of anomaly in a modified example of this embodiment.
  • FIG. 9 shows an example of a hardware configuration of an information processing apparatus functioning as the diagnosis system in this embodiment or the modified example.
  • DETAILED DESCRIPTION OF THE INVENTION
  • Hereinafter, the present invention will be described by using an embodiment of the present invention. However, the following embodiment does not limit the present invention according to the scope of the claims, and not all combinations of the features described in the embodiment are necessarily required for the solving means of the invention.
  • FIG. 1 shows a functional configuration of a diagnosis system 10. The diagnosis system 10 includes an acquisition unit 100, a normal-status database (DB) 105, a replication unit 110, a calculation unit 120, a judgment unit 130 and a display unit 140. The acquisition unit 100 acquires a first time series data set from outside for each observation variable of a plurality of observation variables, and acquires a second time series data set based on data in the normal-status DB 105. The first time series data set includes a plurality of observed values obtained by observing a certain observation target over time, and is to be used for detecting anomalies of the observation target. The second time series data set for each observation variable includes a plurality of observed values obtained by observing another observation target over time on a period different from the first data set. The normal-status DB 105 stores at least one time series data set obtained as the result of an observation of the observation target in its normal status.
  • In a case where the normal-status DB 105 stores a plurality of time series data sets, the acquisition unit 100 may acquire any one of the time series data sets as the second time series data set, or may acquire, as the second time series data set, the time series data set obtained for the observation period of the time length closest to that of the observation period of the first time series data set. Alternatively, the acquisition unit 100 may create a new time series data set, and acquire the new data set as the second time series data set. To be more precise, in a case where a time series data set is regarded as a vector of observed values, the new time series data set is the one minimizing the sum of the squares of the inter-vector distances between the new time series data set and each time series data set obtained from the observation target in a normal status.
  • The replication unit 110 creates a first replicated data set that is a replica of the first time series data set, by repeatedly sampling a plurality of observed values with replacement from the first time series data set. The replication unit 110 may create the first replicated data set by repeatedly sampling, with replacement, a number of observed values greater than the number of observed values included in the first time series data set. Then, the replication unit 110 may create a second replicated data set that is a replica of the second time series data set, by repeatedly sampling a plurality of observed values with replacement from the second time series data set. Instead, the replication unit 110 may use the first time series data set without change as the first replicated data set, and may create the second replicated data set by repeatedly sampling a plurality of observed values with replacement from the second time series data set.
  • The calculation unit 120 calculates a score for each observation variable, said score indicating the difference between the first time series data set and the second time series data set, on the basis of the difference between a probability value in the cumulative probability distribution of the first replicated data set, and a probability value in the cumulative probability distribution of the second replicated data set, with respect to the same kind of observed values. Each kind of observed value pertains to a different observation variable of the plurality of observation variables. Then, the calculation unit 120 outputs the score for each observation variable to the display unit 140. The second replicated data set needs to include only at least a part of the second time series data set. The second replicated data set may be the second time series data set as it is, or may be a replicated data set that the replication unit 110 creates by repeatedly sampling observed values with replacement from the second time series data set. In addition, any kind of score may be employed as long as the score is based on the difference between a probability value in the cumulative probability distribution of the first replicated data set, and a probability value in the cumulative probability distribution of the second replicated data set, with respect to the observed values for the same observation variable. Accordingly, the score may be the maximum value of the difference between the cumulative probability values of the observed values for each observation variable, or may be a value based on the mean or the sum of the absolute value of the difference between the cumulative probability values of the observed values for each observation variable. The score thus calculated indicates the difference between the first time series data set and the second time series data set for each observation variable. In a case where the second time series data set is a time series data set representing data in a normal status, this score indicates a degree of anomaly occurring in an observation variable.
  • The judgment unit 130 judges whether or not the score has converged on a predetermined value as a result of repeated sampling with replacement. Specifically, in a case where the replication unit 110 sequentially creates a plurality of replicated data sets of each of the first time series data set and the second time series data set with an increase of the number of times of sampling, the updated replicated data sets are sequentially supplied to the calculation unit 120. In this case, the calculation unit 120 recalculates the score every time the replicated data is updated. Then, the judgment unit 130 judges whether or not the difference between a previous score and a current score is equal to or less than a predetermined reference value. Here, the previous score is based on the difference between the probability value in the cumulative probability distribution of a previously-created first replicated data set, and the probability value in the cumulative probability distribution of a previously-created second replicated data set, with respect to the same observation variable. Then, the current score is based on the difference between the probability value in the cumulative probability distribution of a currently-created first replicated data set, and the probability value in the cumulative probability distribution of a currently-created second replicated data set, with respect to the same observation variable. On condition that this difference between the scores is equal to or less than the reference value, the judgment unit 130 outputs the current score to the display unit 140. Instead, regardless of the result of a convergence judgment, the calculation unit 120 may output a calculated score when the number of times of sampling reaches a specified number of times equal to at least 2.
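  • A possible sketch of the convergence judgment described above, assuming the hypothetical helper create_replicated_data_set from the previous sketch and a ks_score function sketched later in this description; for brevity it re-draws the replicated data sets each round, whereas the embodiment may instead grow them by adding newly sampled values.

```python
import numpy as np

def score_until_convergence(first_ts, second_ts, reference=1e-3,
                            max_rounds=20, rng=None):
    """Recompute the score on progressively larger replicated data sets
    and stop once successive scores differ by no more than `reference`."""
    rng = np.random.default_rng() if rng is None else rng
    previous = None
    n = max(len(first_ts), len(second_ts))  # initial number of resamples
    for _ in range(max_rounds):
        first_rep = create_replicated_data_set(first_ts, n, rng)
        second_rep = create_replicated_data_set(second_ts, n, rng)
        current = ks_score(first_rep, second_rep)
        if previous is not None and abs(current - previous) <= reference:
            return current          # the score has converged
        previous = current
        n *= 2                      # increase the number of times of sampling
    return previous                 # latest score if the loop limit is hit
```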
  • All the foregoing units perform the same processing as the above for each observation variable. Specifically, the acquisition unit 100 acquires a first time series data set and a second time series data set corresponding to each observation variable. The replication unit 110 creates a first replicated data set and a second replicated data set corresponding to each observation variable. In addition, the calculation unit 120 calculates a score corresponding to each observation variable. Then, the display unit 140 displays information indicating the observation variables and the respective scores in association with each other, in descending order of the calculated scores. This allows a user to easily recognize a place with a strong probability that an anomaly occurs, or a combination of places where anomalies occur. In order to further facilitate the anomaly detection, the calculation unit 120 may calculate an index indicating a relationship among plural kinds of observation variables, and the display unit 140 may display the index together with the above information and score. For a method for calculating such an index, see Tsuyoshi Ide and Keisuke Inoue, “Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations,” in Proceedings of the 2005 SIAM International Conference on Data Mining, Apr. 21-23, 2005, pp. 571-576.
  • FIG. 2 shows a flowchart of processing in which the diagnosis system 10 creates a result of anomaly detection. The acquisition unit 100 acquires the first time series data set and the second time series data set (S200). FIG. 3 shows an outline of a method for acquiring the second time series data set.
  • FIG. 3 is a conceptual diagram of a process in which a first time series data set to be diagnosed is compared with second time series data sets each representing a normal status of an observation variable. In a case where the normal-status DB 105 stores four time series data sets each representing the normal status, as shown in the upper right of FIG. 3, it is desirable to create one second time series data set by using these four sets, and then to compare the first time series data set with the second time series data set thus created. For example, in a case where a time series data set is regarded as a vector of observed values, the acquisition unit 100 may create a new time series data set that minimizes the sum of the squares of the inter-vector distances between the new time series data set and each of these four time series data sets. The following describes a specific example of this creating method.
  • In order to explain the creating method in a generalized manner, it is assumed that the normal-status DB 105 stores m time series for each observation variable i, and that each time series is expressed as a column vector x_i^(r), where r is an integer value from 1 to m specifying a different experimental run, and the integer i indicates the i-th observation variable. Each time series x_i^(r) contains p different time points, so that x_i^(r) is a p-dimensional vector. Note that the numbers of observed values included in the respective time series data sets often differ from one another for each observation variable. This results from differences in the observation period or the observation intervals. In this embodiment, the acquisition unit 100 therefore executes preprocessing so that each time series data set includes exactly p observed values. Specifically, the acquisition unit 100 interpolates a missing observed value on the basis of the observed values obtained before and after it, or excludes an unnecessary observed value. One possible form of this preprocessing is sketched below.
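  • The sketch below assumes linear interpolation onto a grid of p equally spaced time points; the helper name resample_to_p_points and the use of numpy.interp are illustrative assumptions, and the time stamps are assumed to be sorted in ascending order.

```python
import numpy as np

def resample_to_p_points(times, values, p):
    """Interpolate an irregularly observed time series onto p equally
    spaced time points so that every data set has the same length p.

    Assumes `times` is sorted in ascending order."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    grid = np.linspace(times[0], times[-1], p)   # common observation grid
    # Missing interior values are filled from the observations before and
    # after them; points outside the observed range take the end values.
    return np.interp(grid, times, values)
```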
  • Instead of handling each of the m time series separately, the present invention finds a representative vector u_i having p values therein, wherein u_i maintains the essential features of the m time series (m>1). In accordance with the above definition, the set of m time series for the i-th observation variable, each representing the normal status, is expressed as the p-by-m matrix shown in Equation 1.

  • H_i = [x_i^(1), . . . , x_i^(m)]  Equation 1
  • A time series data set created in this way serves as the second time series data set, namely the p-dimensional vector u_i. Here, a normalization condition u_i^T u_i = 1 is assumed, where the superscript T denotes transposition; thus u_i^T is the transpose of u_i. One natural principle for finding u_i is to minimize the sum-of-squares error between the extracted representative vector u_i and the original m time series. Since the projection of x_i^(r) onto the one-dimensional space spanned by u_i amounts to (u_i^T x_i^(r)) u_i, the sum-of-squares error function is defined in the following Equation 2:
  • E(u_i) = Σ_{r=1}^{m} ‖ (u_i^T x_i^(r)) u_i − x_i^(r) ‖^2  Equation 2
  • The second time series data set to be determined is the u_i that minimizes the above sum of squares. The following Equation 3 shows the derivation process.
  • u_i = argmin_u Σ_{r=1}^{m} ‖ (u^T x_i^(r)) u − x_i^(r) ‖^2 = argmin_u Σ_{r=1}^{m} [ −(u^T x_i^(r))^2 + (x_i^(r))^T x_i^(r) ] = argmax_u [ u^T ( Σ_{r=1}^{m} x_i^(r) (x_i^(r))^T ) u ] = argmax_u [ u^T H_i H_i^T u ]  Equation 3
  • When the constraint u_i^T u_i = 1 is incorporated into the above equation by using a Lagrange multiplier λ, this problem becomes the eigenvalue problem shown in Equation 4. What is to be determined is the eigenvector u_i corresponding to the maximum eigenvalue λ. In short, the procedure is exactly to find the leading left singular vector of the matrix H_i. For a more detailed description of the singular value decomposition, refer, for example, to the Japanese translation of “Linear Algebra and Its Applications,” written by Gilbert Strang, published by Sangyo-Tosho, 1978.
  • In addition, such a feature extraction method is called the Karhunen-Loeve transform or the like (for example, refer to Ishii Kenichiro et al., “Pattern Ninshiki (Pattern Recognition),” Ohmusha, 1998). Equation 3 now reduces to the eigenvalue equation

  • H_i H_i^T u_i = λ u_i  Equation 4
  • wherein H_i^T is the transposed matrix of H_i. The representative p-dimensional vector u_i for observation variable i is found as the eigenvector corresponding to the maximum eigenvalue λ. Mathematically, this is the same as performing the singular value decomposition (SVD) of H_i and taking the leading left singular vector. When the size of the matrix is large, several techniques to reduce the computational cost are available. As a result, it is known in the field of data mining that the SVD can be executed at the same calculation cost as that of other indexing methods such as the discrete Fourier transform, the wavelet transform, and the piecewise aggregate approximation (for example, refer to E. Keogh, K. Chakrabarti, M. J. Pazzani, S. Mehrotra, “Dimensionality reduction for fast similarity search in large time-series databases,” Knowledge and Information Systems, 3 (2001) 263-286). In other words, by executing the SVD according to such an algorithm, the acquisition unit 100 can efficiently solve the eigenvalue problem shown in Equation 4, and thereby can efficiently create a new second time series data set that minimizes the sum of the squares of the differences between the second time series data set and each of the time series data sets stored in the normal-status DB 105.
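  • The following sketch, under the assumption of Python with NumPy, obtains the representative vector u_i as the leading left singular vector of H_i, which is mathematically equivalent to solving the eigenvalue problem of Equation 4; the sign-fixing step is a heuristic added for illustration only.

```python
import numpy as np

def representative_series(H):
    """Return the p-dimensional representative vector u for a p-by-m
    matrix H = [x(1), ..., x(m)] of m normal-status time series.

    u is the left singular vector of H for the largest singular value,
    i.e. the eigenvector of H H^T with the maximum eigenvalue."""
    H = np.asarray(H, dtype=float)
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    u = U[:, 0]            # leading left singular vector, with u^T u = 1
    if u.sum() < 0:        # heuristic sign fix: the SVD leaves the sign free
        u = -u
    return u
```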
  • Returning to FIG. 2, the replication unit 110 creates the first replicated data set and the second replicated data set by repeatedly sampling with replacement a predetermined number of times (S210). For example, this predetermined number of times may be equal to the number of observed values, or to a predetermined integer multiple thereof, such as twice the number of observed values. For a detailed description of a method for creating a replicated data set by repeated sampling with replacement, refer to A. C. Davison, D. V. Hinkley, “Bootstrap Methods and Their Application,” Cambridge University Press (Oct. 28, 1997).
  • After calculating the cumulative probability distribution of the first time series data set and the cumulative probability distribution of the second time series data set, the calculation unit 120 calculates a score indicating the difference between the first time series data set and the second time series data set on the basis of the difference between the cumulative probability value in the cumulative probability distribution of the first time series data set, and the cumulative probability value in the cumulative probability distribution of the second time series data set, with respect to the same observation variable (S220). For each observation variable, the cumulative probability distribution may be calculated by any method known to a person of ordinary skill in the art. For example, the cumulative probability distribution may be calculated for an observation variable by expressing the first or second time series data set as a frequency distribution in the form of a probability density function in the observation variable, and then integrating the probability density function from the lowest value of the observation variable to successively higher values to obtain the cumulative probability distribution normalized to a range of 0 to 1.
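  • One way to realize this step is through the empirical cumulative distribution function, which rises by 1/n at each observed value and is normalized to the range 0 to 1. The helpers below are an illustrative sketch, not the only possible construction, and their names are assumptions.

```python
import numpy as np

def empirical_cdf(sample):
    """Return (x, F), where F[k] is the fraction of observed values that
    are less than or equal to x[k]; F is normalized to the range 0..1."""
    x = np.sort(np.asarray(sample, dtype=float))
    F = np.arange(1, len(x) + 1) / len(x)
    return x, F

def cdf_at(x, F, points):
    """Evaluate the step-function CDF defined by (x, F) at `points`."""
    points = np.asarray(points, dtype=float)
    # side="right" counts how many observed values are <= each point
    idx = np.searchsorted(x, points, side="right")
    result = np.zeros(len(points))
    result[idx > 0] = F[idx[idx > 0] - 1]
    return result
```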
  • For the purpose of increasing efficiency in the calculation processing, the calculation unit 120 may exclude a predetermined range of observed values by assuming that the probability distribution in the predetermined range is not likely to vary, and then may calculate the difference between the probability value in the cumulative probability distribution of the first time series data set, and the probability value in the cumulative probability distribution of the second time series data set with respect to the same observation variable. For example, in a case where it is obvious that values for a semi-discrete variable within a certain range are continuous values, the calculation unit 120 may exclude the values within the range, and then may calculate the score.
  • This calculated score may be a score known as a Kolmogorov-Smirnov (KS) statistic (hereinafter referred to as a score D). To be more precise, the KS statistic indicates the maximum absolute value of the difference between the probability value in the cumulative probability distribution of the first replicated data set, and the probability value in the cumulative probability distribution of the second replicated data set, with respect to each of the observed values. The KS statistic takes a value within a range of 0 to 1: it takes 0 if no difference exists, and takes 1 when the difference between the two distributions is at its largest possible value. Equation 5 shows the definition of the score D, where the subscripts N and T represent the normal and the target data, respectively. By using the KS statistic, it is possible to appropriately recognize a substantial status variation even when the observation period is changed due to a change in the observation environment, when an observed value becomes a predetermined multiple of its usual value due to a change in certain conditions, or when a value is observed with a predetermined offset added to (or subtracted from) it.
  • D_i = max_x | F_N(x) − F_T(x) |  Equation 5
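  • A sketch of Equation 5, evaluating both empirical cumulative distributions over the union of their sample points and taking the maximum absolute difference; it relies on the hypothetical empirical_cdf and cdf_at helpers sketched earlier.

```python
import numpy as np

def ks_score(first_rep, second_rep):
    """Kolmogorov-Smirnov statistic D = max_x |F_N(x) - F_T(x)| between
    two (replicated) data sets; D lies in the range 0..1."""
    xT, FT = empirical_cdf(first_rep)     # target (diagnosed) data
    xN, FN = empirical_cdf(second_rep)    # normal (reference) data
    grid = np.union1d(xN, xT)             # every jump point of either CDF
    return float(np.max(np.abs(cdf_at(xN, FN, grid) - cdf_at(xT, FT, grid))))
```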
  • FIG. 4 shows an example of the process for calculating a score D on the basis of the original time series data sets instead of replicated data sets. By referring to FIG. 4, a problem with a comparison that uses the original time series data sets instead of the replicated data sets is pointed out, and the effectiveness of using the replicated data sets is explained. The upper-left part of FIG. 4 shows a graph indicating the cumulative probability distribution of a second time series data set as a solid line. The lower-left part of FIG. 4 shows a graph indicating the cumulative probability distribution of a first time series data set as a dotted line. In these cumulative probability distributions, the observed values vary discontinuously around the central part of the graph. Specifically, the probability that an observed value obtained from the observation target takes a certain value is very high, and the probability that the observed value takes a different value is very low, though it does happen occasionally. When observed values concentrate on a certain value in this manner, the observed value is herein called a semi-discrete observed value, and a variable to which such a semi-discrete observed value is assigned is called a semi-discrete variable.
  • A large number of semi-discrete observed values are observed from an observation target such as an automobile. For example, only discrete values can be inherently obtained from the gear position and the shift position. Moreover, like an accelerator opening degree, there are observed values that are more likely to take 0 (when the accelerator is not pressed) or 100 (when the accelerator is pressed completely) under usual usage. In addition, there are cases where observed values that are inherently continuous become semi-discrete due to limited resolution; an example is a thermometer whose performance only allows a temperature to be measured with an accuracy of 1° C. Smoothing, resampling processing, or the inclusion of noise may cause such semi-discrete variables not to be completely discrete.
  • It should be noted that, although the automobile is illustrated as the observation target in this embodiment, this embodiment can also be applied effectively to any observation target other than the automobile as long as such semi-discrete observed values are observed from the observation target. For instance, as the observed value, the diagnosis system 10 may observe an economic index such as gross domestic product, or the price of a stock or a bond. In this case, it is possible to detect an anomaly occurring in an organization that is the observation target, such as a nation state or a company.
  • The right part of FIG. 4 shows a graph in which the graph (a solid line) of the cumulative probability distribution of the second time series data set overlaps the graph (a dotted line) of the cumulative probability distribution of the first time series data set. The lines of the first and second time series data sets are almost identical. However, as shown in the central part of the graph, there is a slight difference in the observed values that are observed as discrete values. Since such a difference often results from the foregoing noise, the difference should not be detected as a substantial difference between the first time series data set and the second time series data set. Nevertheless, when the difference between the cumulative probability distribution of the first time series data set, and the cumulative probability distribution of the second time series data set is evaluated by using the score D that is the KS statistic, the difference becomes very significant.
  • In contrast, in this embodiment, the calculation unit 120 calculates the KS statistic indicating the difference between the first replicated data set and the second replicated data set instead of between the first time series data set and the second time series data set. Each of the replicated data sets is obtained by repeatedly sampling observed values with replacement from the corresponding time series data set. When the number of times of resampling is set sufficiently large, the cumulative probability distribution is smoothed, and the portions of effectively infinite gradient are excluded. As a result, only the substantial features of the probability distribution are captured.
  • FIG. 5 shows an example of the process for calculating a score D. The calculation unit 120 creates a probability distribution of the first replicated data set and a probability distribution of the second replicated data set. While each replicated data set is expressed as changes of observed values with the passage of time, each probability distribution is expressed as a set of probability values, one for each observed value, indicating the probability of obtaining that observed value. Then, the calculation unit 120 creates a cumulative probability distribution from each of the probability distributions. A cumulative probability distribution expresses, for each value, the cumulative probability that the observation variable takes on a value less than or equal to that value. Thereafter, the calculation unit 120 calculates, as the score D, the maximum value of the difference between a probability value in the cumulative probability distribution of the first replicated data set and a probability value in the cumulative probability distribution of the second replicated data set, with respect to the same observation variable.
  • Returning to FIG. 2, the judgment unit 130 calculates the difference between two scores, one created in S220 in the previous execution of S210 to S240, and the other created in S220 in the current execution of S210 to S240 (S230). Then, on condition that the calculated difference is greater than a predetermined reference value, which is a specified tolerance (S240: NO), the judgment unit 130 causes the processing to return to S210. To be more precise, the replication unit 110 repeatedly performs sampling of observed values with replacement while increasing the number of times of repetition, and creates a new replicated data set by adding newly sampled observed values to the replicated data set created in the previous execution. This processing of S210 to S240 is repeated until the difference calculated in S230 becomes equal to or less than the reference value.
  • On condition that the difference between the scores converges to the reference value or below (S240: YES), the calculation unit 120 outputs the currently-calculated score to the display unit 140 (S250). After that, the display unit 140 displays the calculated score as a detection result of the difference between the first time series data set and the second time series data set (S260). Alternatively, on condition that the score exceeds a predetermined threshold value (for example, 0.4), the display unit 140 may notify a user of the observation variable corresponding to the score. The foregoing processing may be repeatedly performed for each observation variable of the plurality of observation variables, and the display unit 140 may display a list of the detection results of all of the observation variables.
  • FIG. 6 shows how the scores D converge as the number of times of repeated sampling with replacement increases. In the graph shown in FIG. 6, the x-axis shows the kinds of observed values, the z-axis shows the number of times of repeated sampling with replacement, expressed as a multiple of the number of observed values in the first time series data set and/or the second time series data set, and the y-axis shows the score D. Referring to FIG. 6, it is found that the score D of any observation variable hardly varies once the number of repetitions exceeds roughly ten times the number of observed values in the first time series data set and the second time series data set. Accordingly, it is understood that the number of repetitions of the processing shown in S210 to S240 is on the order of several tens of times the number of observed values. Alternatively, instead of the processing shown in FIG. 5, the replication unit 110 may repeatedly perform sampling with replacement approximately several tens of times the number of observed values in advance, and then the calculation unit 120 may calculate the score D only once by using the replicated data sets thus created. As shown in FIG. 6, it is understood that a sufficiently reliable score can be calculated by repeatedly sampling with replacement approximately several tens of times the number of observed values.
  • FIG. 7 shows an example of a screen 700 on which a detection result is displayed by the display unit 140. On the screen 700, the display unit 140 displays information indicating plural observation variables and the scores, such as KS statistics, calculated for the respective observation variables, in descending order of the calculated score. In the bar chart of FIG. 7, the observed values having the larger scores are displayed from the left-hand side in descending order. To be more precise, the display unit 140 shows the KS statistic of an observed value by using the length of a bar 710 in the bar chart, in association with an identification number 8 indicating the observation variable. Moreover, next to this bar on the right side, the display unit 140 shows the KS statistic of an observed value by using the length of a bar 720 in the bar chart, in association with an identification number 12 indicating the observation variable. The KS statistics become smaller in the order of the identification numbers 8, 12, 55, 13, 2, 60, . . . , from the left side to the right side of the graph. Thus, with the display of the display unit 140, a user can recognize the places where there is a strong possibility that an anomaly has occurred, and the descending order of those possibilities. As a result, an automobile engineer can properly identify which components should be preferentially diagnosed in detail, and in what order of priority, and can then work on a countermeasure against the anomaly more efficiently than otherwise.
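  • A simple text rendering in the spirit of the display in FIG. 7, assuming the per-variable scores are already available in a dictionary; the identification numbers and score values in the usage comment are placeholders, not measured results.

```python
def show_detection_result(scores, threshold=0.4, width=40):
    """Print observation variables and their scores as a text bar chart,
    in descending order of the score."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for var_id, d in ordered:
        bar = "#" * int(round(d * width))
        flag = "  <- exceeds threshold" if d > threshold else ""
        print(f"variable {var_id:>4}: {d:0.3f} {bar}{flag}")

# Hypothetical usage with placeholder scores:
# show_detection_result({8: 0.62, 12: 0.55, 55: 0.41, 13: 0.30, 2: 0.22, 60: 0.18})
```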
  • FIG. 8 shows a flowchart of a real-time process for displaying a detection result of anomaly in a modified example of this embodiment. In this modified example, descriptions will be provided for an example of a process in which scores, such as the foregoing KS statistics, are dynamically updated by using observed values sequentially obtained while an observation target is being observed. In this modified example, even in a case where a new score is calculated every time an observed value is added to a time series data set, the new score is not calculated by using all the observed values from the beginning in every execution. Instead, the modified example aims to calculate the new score more efficiently by using a calculation result in the previous execution.
  • The acquisition unit 100 sequentially observes the observation target, and sequentially adds a time series data set including newly-obtained observed values to a time series data set including previously-obtained observed values. Thereby, the acquisition unit 100 acquires a new first time series data set (S800) with a fixed window of length N1. In other words, the acquisition unit 100 acquires a new first time series data set including N1 data points in every execution of S800. Then, the replication unit 110 creates a replicated data set of the time series data set including only the newly-obtained observed values, out of the acquired first time series data set (S810).
  • Note that there is a trade-off between the time resolution and the precision of the score. Specifically, with a smaller window size, the time resolution of the real-time anomaly detection system gets better, while the statistical errors in estimating the cumulative probability distribution get larger, since the number of data points within the window gets smaller. To handle this trade-off, the calculation unit 120 creates a cumulative probability distribution based on the replicated data set of the first time series data set, by combining a cumulative probability distribution based on the replicated data set of the time series data set including the newly-obtained observed values with a cumulative probability distribution based on a replicated data set of the time series data set including the previously-obtained observed values (S820).
  • This combining is carried out according to the ratio of the number of newly-obtained observed values to the number of previously-obtained observed values. For example, a cumulative probability value of an observation variable is calculated as a weighted average, using the numbers of newly-obtained and previously-obtained observed values as the respective weighting factors. One example is provided hereafter. Suppose that the previously-calculated cumulative probability value and the newly-calculated cumulative probability value of a certain observation variable are 0.4 and 0.5, respectively, and that the numbers of previously-obtained observed values and newly-obtained observed values are 9000 and 1000, respectively. In this case, the weighted average is computed according to the 9:1 ratio between the numbers of observed values, and thereby the probability value 0.41 is obtained. Put formally, if the values computed from the previously observed data and the newly observed data are S2 and S1, respectively, the resulting value S is written as βS2 + (1−β)S1, where β is a real number between 0 and 1 representing a discounting factor. One natural choice for the discounting factor is β = N2/(N1+N2), where N2 is the window size (the number of data points) of the previously observed data.
  • Thereafter, the calculation unit 120 calculates a score such as a KS statistic indicating the difference between the combined cumulative probability distribution, and the previously-created cumulative probability distribution of the second replicated data set (S830). The calculated score is displayed as a detection result of the difference between the first time series data set and the second time series data set (S840). If the diagnosis system 10 continues to obtain observed values, the diagnosis system 10 causes the processing to return to S800, and subsequently updates the score.
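  • A sketch of the real-time update of S800 to S840, assuming a stream of windows of newly-obtained observed values, the hypothetical empirical_cdf and cdf_at helpers sketched earlier, and the discounting factor β = N2/(N1+N2) described above; evaluating the distributions on a fixed grid of points is a simplification.

```python
import numpy as np

def realtime_scores(windows, normal_data, grid):
    """Yield a KS-like score after each new window of observed values.

    `windows` is an iterable of arrays of newly-obtained observed values,
    `normal_data` is the normal-status (second) data set, and `grid` is a
    fixed set of points at which the distributions are evaluated."""
    xN, FN = empirical_cdf(normal_data)
    F_normal = cdf_at(xN, FN, grid)
    F_comb = None
    n_prev = 0                              # N2: previously-obtained count
    for window in windows:
        xT, FT = empirical_cdf(window)
        F_new = cdf_at(xT, FT, grid)        # distribution of the new window only
        if F_comb is None:
            F_comb = F_new                  # first period: nothing to combine
        else:
            beta = n_prev / (n_prev + len(window))      # discounting factor
            F_comb = beta * F_comb + (1.0 - beta) * F_new
        n_prev += len(window)
        yield float(np.max(np.abs(F_normal - F_comb)))  # score for this step
```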
  • According to the foregoing modified example, it is possible to recognize the status of an observation target in real time during the observation thereof. In addition, the score indicating the difference can be calculated quickly by reusing the previous calculation result.
  • The preceding discussion of FIG. 8 describes a method for analyzing time series data for each observation variable of a plurality of observation variables through execution of a program by a processor of an information processing apparatus that comprises a display unit. The method comprises, for each observation variable, processing time series data for an ordered sequence of time periods 1, 2, . . . , J such that J is at least 3.
  • Processing the time series data for time period 1 comprises acquiring a new time series data set 1 having N1 observed values for the time period 1 and generating a first cumulative probability distribution (Pcomb,1) from the new time series data set 1.
  • Processing the time series data for time period j (j=2, 3, . . . , J) comprises the steps of:
  • (a) acquiring a new time series data set j having N1 observed values for the time period j;
  • (b) generating a new cumulative probability distribution (Pnew,j) from the new time series data set j;
  • (c) generating a combined cumulative probability distribution (Pcomb,j) equal to β*Pnew,j+(1−β)*Pcomb,j−1, wherein β=N2/(N1+N2), and wherein N2=(j−1)*N1;
  • (d) computing a score of Pcomb,j-Pcomb,j−1; and
  • (e) outputting the score as a detection result to the display unit.
  • In one embodiment, N1=1. In another embodiment, N1>1.
  • Processing the time series data for the time periods 1, 2, . . . , J is performed in real time.
  • FIG. 9 shows an example of a hardware configuration of an information processing apparatus 500 functioning as the diagnosis system 10 in this embodiment or the modified example. The information processing apparatus 500 includes a CPU peripheral unit, an input/output unit and a legacy input/output unit. The CPU peripheral unit includes a CPU 1000, a RAM 1020 and a graphics controller 1075, all of which are mutually connected to one another via a host controller 1082. The input/output unit includes a communication interface 1030, a hard disk drive 1040 and a CD-ROM drive 1060, all of which are connected to the host controller 1082 via an input/output controller 1084. The legacy input/output unit includes a ROM 1010, a flexible disk drive 1050 and an input/output chip 1070, all of which are connected to the input/output controller 1084.
  • The host controller 1082 mutually connects the RAM 1020 to the CPU 1000 and the graphics controller 1075, both of which access the RAM 1020 at a high transfer rate. The CPU 1000 is operated based on programs stored in the ROM 1010 and the RAM 1020, and controls each of the components. The graphics controller 1075 obtains image data generated by the CPU 1000 or the like in a frame buffer provided in the RAM 1020, and causes the obtained image data to be displayed on a display device 1080. In place of this, the graphics controller 1075 may internally include a frame buffer in which the image data generated by the CPU 1000 or the like is stored.
  • The input/output controller 1084 connects the host controller 1082 to the communication interface 1030, the hard disk drive 1040 and the CD-ROM drive 1060, all of which are high-speed input/output devices. The communication interface 1030 communicates with an external device via a network. In the hard disk drive 1040, programs and data to be used by the information processing apparatus 500 are stored. The CD-ROM drive 1060 reads a program or data from a CD-ROM 1095, and provides the read-out program or data to the RAM 1020 or the hard disk 1040.
  • Moreover, the input/output controller 1084 is connected to relatively low-speed input/output devices such as the ROM 1010, the flexible disk drive 1050 and the input/output chip 1070. In the ROM 1010, stored are programs such as a boot program executed by the CPU 1000 at a start-up time of the information processing apparatus 500 and a program depending on hardware of the information processing apparatus 500. The flexible disk drive 1050 reads a program or data from a flexible disk 1090, and provides the read-out program or data to the RAM 1020 or the hard disk drive 1040 via the input/output chip 1070. The input/output chip 1070 is connected to the flexible disk drive 1050 and various kinds of input/output devices, for example, through a parallel port, a serial port, a keyboard port, a mouse port and the like.
  • A program to be provided to the information processing apparatus 500 is provided by a user with the program stored in a storage medium such as the flexible disk 1090, the CD-ROM 1095, or an IC card. The program is read from the storage medium via the input/output chip 1070 and/or the input/output controller 1084, and is installed and executed on the information processing apparatus 500. The operation that the program causes the information processing apparatus 500 or the like to execute is identical to the operation of the diagnosis system 10 described by referring to FIGS. 1 to 8. Therefore, the description thereof is omitted here.
  • The program described above may be stored in an external storage medium. As the storage medium, any one of the following media may be used, in addition to the flexible disk 1090 and the CD-ROM 1095: an optical storage medium such as a DVD or a PD; a magneto-optical storage medium such as an MD; a tape medium; or a semiconductor memory such as an IC card. Alternatively, the program may be provided to the information processing apparatus 500 via a network by using, as a storage medium, a storage device such as a hard disk or a RAM provided in a server system connected to a private communication network or the Internet. A storage medium with the program stored thereon is a computer program product.
  • As has been described hereinabove for this embodiment, the diagnosis system 10 makes it possible to detect an anomaly occurring in an observation target by exhaustively using plural kinds of observed values, without having full advance knowledge of the behavior of those observed values. Moreover, a score indicating the anomaly is calculated by repeatedly sampling observed values with replacement from a time series data set. As a result, the substantial status of the observation target can be diagnosed while excluding slight differences between observed values caused by differences in the observation environment. This makes it possible to achieve an anomaly detection method that functions effectively in place of the conventional limit check, or that functions complementarily in addition to the limit check.
  • According to the present invention, it is possible to detect anomalies occurring in an observation target with high accuracy.
  • Hereinabove, the present invention has been described by using the embodiment. However, the technical scope of the present invention is not limited to the above-described embodiment. It is obvious to one skilled in the art that various modifications and improvements may be made to the embodiment. Moreover, it is also obvious from the scope of the present invention that thus modified and improved embodiments are included in the technical scope of the present invention.

Claims (20)

1. A method for analyzing time series data for each observation variable of a plurality of observation variables through execution of a program by a processor of an information processing apparatus that comprises a display unit, said method comprising performing for each observation variable:
acquiring an initial first time series data set consisting of a first plurality of first observed values over a first period of time, followed by generating an initial first cumulative probability distribution from the initial first time series data set, said initial first time series data set being designated as a previous first time series data set;
acquiring an initial second time series data set consisting of a second plurality of second observed or computed values over a second period of time, followed by generating an initial second cumulative probability distribution from the initial second time series data set, said initial second time series data set being designated as a previous second time series data set;
calculating an initial score as a function of an absolute value of a difference between respective cumulative probability values of the generated initial first cumulative probability distribution and the generated initial second cumulative probability distribution, said initial score being designated as a previous score;
performing operations on the previous first time series data set, including: generating a next first time series data set by adding newly-obtained first observed values at first additional times to the previous first time series data set, followed by generating a next first cumulative probability distribution from the next first time series data set;
performing operations on the previous second time series data set, including: generating a next second time series data set by adding newly-obtained second observed or computed values at second additional times to the previous second time series data set, followed by generating a next second cumulative probability distribution from the next second time series data set;
calculating a next score as said function of an absolute value of a difference between respective cumulative probability values of the generated next first cumulative probability distribution and the generated next second cumulative probability distribution;
ascertaining whether a condition exists,
wherein the condition is that the next score has been calculated a specified maximum number of times equal to at least 1 or an absolute value of a difference between the next score and the previous score is not greater than a specified tolerance, and
wherein if said ascertaining ascertains that the condition exists then outputting the next score as a detection result to the display unit, otherwise setting the previous first time series data set equal to the next first time series data set, setting the previous second time series data set equal to the next second time series data set, setting the previous score equal to the next score, and again executing said performing operations on the previous first time series data set, said performing operations on the previous second time series data set, said calculating the next score, and said ascertaining.
2. The method of claim 1, wherein upon the next score having been calculated only once, said ascertaining ascertains that the condition does not exist.
3. The method of claim 1, wherein upon the next score having been calculated a fewer number of times than the specified maximum number of times, said ascertaining ascertains that the condition exists.
4. The method of claim 1, wherein upon the next score having been calculated the specified maximum number of times and the absolute value of the difference between the next score and the previous score being greater than the specified tolerance, said ascertaining ascertains that the condition exists.
5. The method of claim 1, wherein the specified maximum number of times is 1, and wherein the number of newly-obtained first observed values and the number of newly-obtained second observed values each comprise more than a 10-fold higher number than both the number of first observed values of the first plurality of first observed values and the number of second observed values of the second plurality of second observed values.
6. The method of claim 1,
wherein said function of the absolute value of the difference between respective cumulative probability values of the generated initial first cumulative probability distribution and the generated initial second cumulative probability distribution is a maximum value of the absolute value of the difference between the respective cumulative probability values of the generated initial first cumulative probability distribution and the generated initial second cumulative probability distribution, and
wherein said function of the absolute value of the difference between respective cumulative probability values of the generated next first cumulative probability distribution and the generated next second cumulative probability distribution is a maximum value of the difference between the respective cumulative probability values of the generated next first cumulative probability distribution and the generated next second cumulative probability distribution.
7. The method of claim 1,
wherein said function of the absolute value of the difference between respective cumulative probability values of the generated initial first cumulative probability distribution and the generated initial second cumulative probability distribution is an average value of or a summation of the absolute value of the difference between the respective cumulative probability values of the generated initial first cumulative probability distribution and the generated initial second cumulative probability distribution, and
wherein said function of the absolute value of the difference between respective cumulative probability values of the generated next first cumulative probability distribution and the generated next second cumulative probability distribution is an average value of or a summation of the difference between the respective cumulative probability values of the generated next first cumulative probability distribution and the generated next second cumulative probability distribution.
8. The method of claim 1, wherein the second plurality of second observed or computed values of the initial second time series data set is a vector u representing m time series, wherein m is at least 2, and wherein said acquiring the initial second time series data set comprises:
providing the m time series, wherein time series r of the m time series is denoted as a p-dimensional vector x(r) (r=1, 2, . . . , m) of p observed values respectively corresponding to p different times, and wherein p is at least 2; and
computing the vector u as a vector that minimizes a sum of squares over r from r=1 to r=m of inter-vector distances between u and x(r).
9. The method of claim 8, wherein said computing the vector u comprises solving an eigenvalue equation (HHT)u=λu for the maximum eigenvalue λ and its associated eigenvector u, wherein H is a p-by-m matrix expressed as H=[x(1), x(2), . . . , x(m)] and wherein HT is a transposed matrix of H.
10. The method of claim 1, wherein the first period of time is unequal to the second period of time, and wherein the number of first observed values of the first plurality of first observed values is unequal to the number of second observed values of the second plurality of second observed values.
11. The method of claim 1, wherein the method comprises displaying on the display unit a bar chart of the detection result versus observation variable, wherein the observation variable is ordered on the bar chart in descending order of the detection result.
12. The method of claim 1, wherein the method comprises displaying on the display unit a three-dimensional bar chart in which the next score is plotted as a function of observation variable and number of calculations of the next score.
13. A computer program product, comprising a computer usable storage medium having a computer readable program stored thereon, wherein the program when executed on a processor of the information processing apparatus performs the method of claim 1.
14. A system comprising a processor and a computer readable memory unit coupled to the processor, said memory unit containing a program that when executed by the processor implements the method of claim 1, wherein the system comprises the information processing apparatus, and wherein the information processing apparatus comprises the processor and the computer readable memory unit.
15. A method for analyzing time series data for each observation variable of a plurality of observation variables through execution of a program by a processor of an information processing apparatus that comprises a display unit, said method comprising, for each observation variable, processing time series data for an ordered sequence of time periods 1, 2, . . . , J such that J is at least 3,
wherein said processing for time period 1 comprises acquiring a new time series data set 1 having N1 observed values for the time period 1 and generating a first cumulative probability distribution (Pcomb,1) from the new time series data set 1, and
wherein said processing for time period j (j=2, 3, . . . , J) comprises the steps of:
(a) acquiring a new time series data set j having N1 observed values for the time period j;
(b) generating a new cumulative probability distribution (Pnew,j) from the new time series data set j;
(c) generating a combined cumulative probability distribution (Pcomb,j) equal to β*Pnew,j+(1−β)*Pcomb,j−1, wherein β=N2/(N1+N2), and wherein N2=(j−1)*N1;
(d) computing a score equal to the difference of Pcomb,j and Pcomb,j−1; and
(e) outputting the score as a detection result to the display unit.
16. The method of claim 15, wherein N1=1.
17. The method of claim 15, wherein N1>1.
18. The method of claim 15, wherein said processing the time series data for the time periods 1, 2, . . . , J is performed in real time.
19. A computer program product, comprising a computer usable storage medium having a computer readable program stored thereon, wherein the program when executed on a processor of the information processing apparatus performs the method of claim 15.
20. A system comprising a processor and a computer readable memory unit coupled to the processor, said memory unit containing a program that when executed by the processor implements the method of claim 15, wherein the system comprises the information processing apparatus, and wherein the information processing apparatus comprises the processor and the computer readable memory unit.
US11/768,967 2006-07-10 2007-06-27 Method and system for detecting difference between plural observed results Abandoned US20080010330A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/264,556 US7849124B2 (en) 2006-07-10 2008-11-04 Method and system for detecting difference between plural observed results

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2006-189301 2006-07-10
JP2006189301A JP4201027B2 (en) 2006-07-10 2006-07-10 System and method for detecting differences between multiple observations

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/264,556 Continuation US7849124B2 (en) 2006-07-10 2008-11-04 Method and system for detecting difference between plural observed results

Publications (1)

Publication Number Publication Date
US20080010330A1 true US20080010330A1 (en) 2008-01-10

Family

ID=38920272

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/768,967 Abandoned US20080010330A1 (en) 2006-07-10 2007-06-27 Method and system for detecting difference between plural observed results
US12/264,556 Expired - Fee Related US7849124B2 (en) 2006-07-10 2008-11-04 Method and system for detecting difference between plural observed results

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/264,556 Expired - Fee Related US7849124B2 (en) 2006-07-10 2008-11-04 Method and system for detecting difference between plural observed results

Country Status (2)

Country Link
US (2) US20080010330A1 (en)
JP (1) JP4201027B2 (en)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7770072B2 (en) * 2007-01-16 2010-08-03 Xerox Corporation Method and system for analyzing time series data
US7774655B2 (en) * 2007-03-16 2010-08-10 Xerox Corporation Method and system for retrieving time series data
JP5434562B2 (en) * 2009-12-18 2014-03-05 富士通株式会社 Operation management program, operation management apparatus, and operation management method
JP5573743B2 (en) 2011-03-22 2014-08-20 トヨタ自動車株式会社 Vehicle data analysis device, vehicle data analysis method, and failure diagnosis device
JP5839970B2 (en) 2011-12-05 2016-01-06 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method, apparatus and computer program for calculating risk evaluation value of event series
KR20130091098A (en) * 2012-02-07 2013-08-16 한국전자통신연구원 Apparatus and method for transmitting node discovery message
CN105027088B (en) * 2013-02-18 2018-07-24 日本电气株式会社 Network analysis equipment and systematic analytic method
US9635050B2 (en) * 2014-07-23 2017-04-25 Cisco Technology, Inc. Distributed supervised architecture for traffic segregation under attack
US20190163680A1 (en) * 2016-06-08 2019-05-30 Nec Corporation System analysis device, system analysis method, and program recording medium
DE102021005214A1 (en) 2021-10-19 2022-01-05 Daimler Ag Process of transition analysis for the detection of anomalies in vehicles


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US346593A (en) * 1886-08-03 dibble

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5046179A (en) * 1989-03-14 1991-09-03 Matsushita Electric Industrial Co., Ltd. Correlation computing device for image signal
US5257364A (en) * 1990-09-12 1993-10-26 Nec Research Institute, Inc. Method for generating a correlated sequence of variates with desired marginal distribution for testing a model of a communications system
US5574641A (en) * 1993-01-06 1996-11-12 Mitsubishi Jidosha Kogyo Kabushiki Kaisha Apparatus and method for improving the awareness of vehicle drivers
US6408321B1 (en) * 1999-03-24 2002-06-18 International Business Machines Corporation Method and apparatus for mapping components of descriptor vectors to a space that discriminates between groups
US7346593B2 (en) * 2002-07-17 2008-03-18 Nec Corporation Autoregressive model learning device for time-series data and a device to detect outlier and change point using the same
US20050050129A1 (en) * 2003-05-08 2005-03-03 Mitsubishi Research Institute, Inc. Method of estimating a penetrance and evaluating a relationship between diplotype configuration and phenotype using genotype data and phenotype data
US20050283511A1 (en) * 2003-09-09 2005-12-22 Wei Fan Cross-feature analysis
US7162489B2 (en) * 2004-06-23 2007-01-09 Microsoft Corporation Anomaly detection in data perspectives
US20060101402A1 (en) * 2004-10-15 2006-05-11 Miller William L Method and systems for anomaly detection
US20070005297A1 (en) * 2005-06-30 2007-01-04 Oracle International Corporation Automatic determination of high significance alert thresholds for system performance metrics using an exponentially tailed model

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110178615A1 (en) * 2009-04-23 2011-07-21 Universitat Pompeu Fabra Method for calculating measures of similarity between time signals
US8718803B2 (en) 2009-04-23 2014-05-06 Universitat Pompeu Fabra Method for calculating measures of similarity between time signals
ES2354330A1 (en) * 2009-04-23 2011-03-14 Universitat Pompeu Fabra Method for calculating measures of similarity between time signals
US20160321128A1 (en) * 2011-03-23 2016-11-03 Nec Corporation Operations management system, operations management method and program thereof
US10430268B2 (en) * 2011-03-23 2019-10-01 Nec Display Solutions, Ltd. Operations management system, operations management method and program thereof
US20170017698A1 (en) * 2011-11-16 2017-01-19 Ptc Inc. Method for analyzing time series activity streams and devices thereof
US9965527B2 (en) * 2011-11-16 2018-05-08 Ptc Inc. Method for analyzing time series activity streams and devices thereof
JP2013206915A (en) * 2012-03-27 2013-10-07 Seiko Epson Corp Method of manufacturing semiconductor device and monitoring device
US10366078B2 (en) * 2013-11-27 2019-07-30 The Regents Of The University Of California Data reduction methods, systems, and devices
US10719380B2 (en) * 2014-12-22 2020-07-21 Nec Corporation Operation management apparatus, operation management method, and storage medium
US10108978B2 (en) * 2015-03-31 2018-10-23 Adobe Systems Incorporated Methods and systems for collaborated change point detection in time series
US20160292196A1 (en) * 2015-03-31 2016-10-06 Adobe Systems Incorporated Methods and Systems for Collaborated Change Point Detection in Time Series
US20170178038A1 (en) * 2015-12-22 2017-06-22 International Business Machines Corporation Discovering linkages between changes and incidents in information technology systems
US11151499B2 (en) * 2015-12-22 2021-10-19 International Business Machines Corporation Discovering linkages between changes and incidents in information technology systems
US11989014B2 (en) * 2019-12-17 2024-05-21 Kabushiki Kaisha Toshiba State estimation apparatus, method, and non-transitory computer readable medium
CN113052455A (en) * 2021-03-22 2021-06-29 广西电网有限责任公司 Method and device for fine evaluation of watershed runoff hydrological conditions

Also Published As

Publication number Publication date
JP4201027B2 (en) 2008-12-24
JP2008014912A (en) 2008-01-24
US7849124B2 (en) 2010-12-07
US20090132626A1 (en) 2009-05-21

Similar Documents

Publication Publication Date Title
US7849124B2 (en) Method and system for detecting difference between plural observed results
US11624731B2 (en) Method for predicting remaining life of numerical control machine tool
US9805002B2 (en) Anomaly detection method, program, and system
US11669080B2 (en) Abnormality detection device, abnormality detection method, and program
US20160110651A1 (en) Method of Sequential Kernel Regression Modeling for Forecasting and Prognostics
US10496730B2 (en) Factor analysis device, factor analysis method, and factor analysis program
US11692910B2 (en) Abnormality diagnostic device, abnormality diagnostic method, and program
JP2015170121A (en) Abnormality diagnosis device and program
US20180267503A1 (en) System and method for extracting principal time series data
CN110308713A (en) A kind of industrial process failure identification variables method based on k neighbour reconstruct
US20080021897A1 (en) Techniques for detection of multi-dimensional clusters in arbitrary subspaces of high-dimensional data
Kang Statistical quality control using image intelligence: A sparse learning approach
CN110648763A (en) Method and apparatus for tumor assessment using artificial intelligence for spectral analysis
Zhou et al. Automatic feature selection for unsupervised clustering of cycle-based signals in manufacturing processes
Ebadi et al. Phase I analysis of high-dimensional processes in the presence of outliers
Sena et al. Multivariate statistical analysis and chemometrics
Bashir et al. Matlab-based graphical user interface for IOT sensor measurements subject to outlier
CN116028858A (en) Pre-labeled self-supervision neural network learning heart beat classification method and system
Peng et al. Performance monitoring of non-gaussian chemical processes with modes-switching using globality-locality preserving projection
CN114611606A (en) Fault detection method based on nuclear hybrid space projection
CN114879612A (en) Blast furnace iron-making process monitoring method based on Local-DBKSSA
Melissa et al. Partial correlation metric based classifier for food product characterization
Zhang et al. Improved locally linear embedding based method for nonlinear system fault detection
JP2018151913A (en) Information processing system, information processing method, and program
US11755622B1 (en) Multivariate data processing

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:IDE, TSUYOSHI;REEL/FRAME:019484/0725

Effective date: 20070627

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO PAY ISSUE FEE