WO2012157154A1

WO2012157154A1 - Outlier detecting apparatus, outlier detecting method, and vehicle trouble diagnosis system

Info

Publication number: WO2012157154A1
Application number: PCT/JP2012/001315
Authority: WO
Inventors: Takuro KUTSUNA; Shuichi Sato
Original assignee: Kabushiki Kaisha Toyota Chuo Kenkyusho
Priority date: 2011-05-17
Filing date: 2012-02-27
Publication date: 2012-11-22
Also published as: JP2012256311A; CN103493075A; CN103493075B; JP5533894B2

Abstract

The present invention provides an outlier detecting apparatus and the like that assist or execute detection of an outlier within a practical time without performing a parameter tuning operation for a nonlinear data set. An outlier detecting apparatus (1) converts each of a plurality of pieces of data included in a data set into a bit sequence for each dimension, and establishes an observation region for the data set on the basis of the bit sequence. Then, the outlier detecting apparatus (1) determines a piece of target data one by one from the plurality of pieces of data included in the data set, and calculates the degree of deviation of the piece of target data on the basis of data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region.

Description

OUTLIER DETECTING APPARATUS, OUTLIER DETECTING METHOD, AND VEHICLE TROUBLE DIAGNOSIS SYSTEM

The present invention relates to an outlier detecting apparatus and the like that assist or execute detection of an outlier from a data set including a plurality of pieces of data each having one or more dimensions.

An outlier detection problem is considered to be a problem for finding, as an outlier, data belonging to a low data density region from a given data set. Application examples of techniques for solving the outlier detection problem include, for example, processing for removing noise data contained in a data set (preprocessing for data screening), processing for detecting a customer who conducts unusual transactions from a data set of credit transactions, processing for detecting a defective from a data set of products in a product line, and the like.

As techniques for solving the outlier detection problem, for example, Mahalanobis distance, one-class support vector machine (hereinafter, abbreviated as "OC-SVM"), and local outlier factor (hereinafter, abbreviated as "LOF") are known.

NPL 1 describes the Mahalanobis distance. In NPL 1, the center of mass (average) of the entire given data set and a covariance matrix are calculated, the distance to the normalized center of mass from each piece of data is calculated using the covariance matrix, and data with a large distance is regarded as an outlier.

In the Mahalanobis distance, it is assumed that a data set conforms to multivariate normal distribution. In the case where a data set cannot be described using the multivariate normal distribution, that is, in the case where a data set is nonlinear, an appropriate outlier cannot be detected.
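The Mahalanobis distance calculation outlined above can be sketched as follows for the two-dimensional case. This is only an illustrative sketch and not the procedure of NPL 1 itself: the function name mahalanobis_2d, the use of the sample covariance, and the restriction to two dimensions (so that the covariance matrix can be inverted by hand) are all assumptions made here.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the centre of mass.

    Hypothetical helper: uses the sample covariance (divisor n - 1) and
    assumes the covariance matrix is invertible.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix.
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form with the inverse covariance [[syy, -sxy], [-sxy, sxx]] / det.
        squared = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(squared))
    return distances

points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
distances = mahalanobis_2d(points)
farthest = max(range(len(points)), key=lambda i: distances[i])
```

In this toy data set the point (5, 5) receives the largest distance and would be the candidate outlier. Because the distance is measured from a single center of mass, a data set that forms, for example, a ring would not be handled appropriately, which is the limitation stated above.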

NPL 2 describes the OC-SVM. In NPL 2, a received data set is mapped into a high-order feature space F by nonlinear mapping, and from among hyperplanes each separating a mapped data group from the origin, the hyperplane that is farthest away from the origin is selected. In the case where the OC-SVM is employed for solving the outlier detection problem, the hyperplane is determined in such a manner that a certain percentage of data is allowed to lie on the origin side of the hyperplane, and data lying on the origin side is regarded as an outlier.

In the OC-SVM, by solving a convex optimization problem for which the solution can be easily found, hyperplanes can be obtained. Furthermore, since the OC-SVM employs nonlinear mapping, the OC-SVM is applicable to a nonlinear data set.

NPL 3 describes the LOF. In NPL 3, the average of distances from data x to k pieces of data adjacent to the data x is calculated as the k-nearest distance. Then, the value obtained by dividing the k-nearest distance of the data x by the average of k-nearest distances of k pieces of adjacent data is calculated as an LOF of the data x. As is clear from the processing mentioned above, the LOF exhibits a larger value as the difference between the k-nearest distance of the data x and the average of the k-nearest distances of the k pieces of adjacent data (that is, the value obtained by subtracting the average of the k-nearest distances of the k pieces of adjacent data from the k-nearest distance of the data x) increases. Accordingly, data with a large LOF is regarded as an outlier.
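The calculation described in the preceding paragraph can be sketched as follows. The sketch follows the simplified description given here (the ratio of average neighbor distances) rather than the full definition in NPL 3; the function names and the toy data set are assumptions introduced for illustration.

```python
import math

def nearest_neighbours(points, i, k):
    """Indices of the k pieces of data closest to points[i]."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def k_nearest_distance(points, i, k):
    """Average distance from points[i] to its k nearest pieces of data."""
    neighbours = nearest_neighbours(points, i, k)
    return sum(math.dist(points[i], points[j]) for j in neighbours) / k

def simplified_lof(points, i, k):
    """k-nearest distance of points[i] divided by the average k-nearest
    distance of its k neighbours (undefined if that average is zero)."""
    own = k_nearest_distance(points, i, k)
    neighbours = nearest_neighbours(points, i, k)
    average = sum(k_nearest_distance(points, j, k) for j in neighbours) / k
    return own / average

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
lof_inside = simplified_lof(points, 0, 2)   # a point inside the cluster
lof_isolated = simplified_lof(points, 4, 2)  # the isolated point
```

A point inside the cluster obtains a value of about 1, whereas the isolated point obtains a much larger value. Note that the result depends on the choice of k, and that every call compares one point with all the others, so evaluating all N pieces of data in this naive way requires on the order of N² distance calculations.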

The LOF is also applicable to a nonlinear data set.

However, the three examples of the related art described above have problems described below.

As described above, the Mahalanobis distance has a problem in that in the case of a nonlinear data set, an appropriate outlier cannot be detected.

The OC-SVM has an unsolved problem in that it is difficult to select appropriate nonlinear mapping. This results in a problem in that a parameter tuning operation is necessary in which a human being determines, by trial and error, parameters for determining nonlinear mapping.

Furthermore, in the OC-SVM, in the case of a large amount of data to be processed, a long time is required to solve the optimization problem. Letting the number of pieces of data be N, the order of the calculation amount in the OC-SVM is O(N³) unless some adjustment is made.

The LOF has an unsolved problem in which it is difficult to select appropriate k. This also results in a problem in that a parameter tuning operation is necessary, as in the OC-SVM.

Furthermore, the LOF requires a relatively high computational load. Letting the number of pieces of data be N, the order of the calculation amount in the LOF is O(N²) unless some adjustment is made.

NPL 1: Mahalanobis, P. C., On the Generalized Distance in Statistics, Proceedings of the National Institute of Science, 49-55, 1936
NPL 2: Scholkopf, B. et al., Estimating the Support of a High-Dimensional Distribution, Neural Computation, 7, 1443-1471, 2001
NPL 3: Breunig, M. M. et al., LOF: Identifying Density-Based Local Outliers, SIGMOD Conference, 93-104, 2000

The present invention has been designed in view of the above-described problems, and an object of the present invention is to provide an outlier detecting apparatus and the like that assist or execute detection of an outlier within a practical time without performing a parameter tuning operation for a nonlinear data set.

In order to achieve the above-described object, according to a first aspect of the present invention, there is provided an outlier detecting apparatus that assists or executes detection of an outlier from a data set including a plurality of pieces of data each having one or more dimensions including a controller that converts each of the plurality of pieces of data included in the data set into a bit sequence for each of the one or more dimensions, establishes an observation region for the data set on the basis of the bit sequence, determines a piece of target data one by one from the plurality of pieces of data included in the data set, and calculates the degree of deviation of the piece of target data on the basis of data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region.

According to the first aspect of the present invention, assist or execution of detection of an outlier can be performed within a practical time without performing a parameter tuning operation for a nonlinear data set.

Preferably, the controller in the first aspect of the present invention establishes the observation region as a binary decision diagram, defines, as a single-piece-of-data-removed local density, a value obtained by subtracting a density equivalent of a single piece of data from a local density of each node, and calculates the degree of deviation of the piece of target data on the basis of the single-piece-of-data-removed local density.

Thus, the order of the calculation amount in the first aspect of the present invention is at least represented by O(N x D), where N represents the number of pieces of data and D represents the number of nodes, and has a superiority over the OC-SVM or the LOF.

Preferably, the controller in the first aspect of the present invention hierarchically establishes the binary decision diagram by sorting a bit sequence group for dimensions of numeric attributes in the order from the most significant bit to the least significant bit, searches for a path representing the piece of target data in the binary decision diagram, and calculates the degree of deviation of the piece of target data on the basis of the single-piece-of-data-removed local density for a node whose level is changed.

Thus, even in a case where no information on the characteristics of the data set is provided in advance, an appropriate degree of deviation can be calculated.

For example, the controller in the first aspect of the present invention defines the maximum value, the median value, or the average value of some or all of the single-piece-of-data-removed local densities of nodes whose level is changed as the degree of deviation of the piece of target data.

For example, the controller in the first aspect of the present invention detects the outlier by comparing the degree of deviation with a threshold.

According to a second aspect of the present invention, there is provided an outlier detecting method for assisting or executing detection of an outlier from a data set including a plurality of pieces of data each having one or more dimensions including converting each of the plurality of pieces of data included in the data set into a bit sequence for each of the one or more dimensions, establishing an observation region for the data set on the basis of the bit sequence, determining a piece of target data from the plurality of pieces of data included in the data set and calculating the degree of deviation of the piece of target data on the basis of data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region.

According to a third aspect of the present invention, there is provided a vehicle trouble diagnosis system including an outlier detecting apparatus that assists or executes detection of an outlier from a data set including a plurality of pieces of data each having one or more dimensions; and a data collecting apparatus that collects vehicle data, wherein the outlier detecting apparatus includes a controller that detects an outlier by converting each of the plurality of pieces of data included in the data set into a bit sequence for each of the one or more dimensions, the vehicle data collected by the data collecting apparatus being defined as the data set, establishing an observation region for the data set on the basis of the bit sequence, determining a piece of target data one by one from the plurality of pieces of data included in the data set, calculating the degree of deviation of the piece of target data on the basis of data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region, and comparing the degree of deviation with a threshold.

According to the present invention, an outlier detecting apparatus and the like that assist or execute detection of an outlier within a practical time without performing a parameter tuning operation for a nonlinear data set can be provided.

Fig. 1 illustrates an example of the hardware configuration of an outlier detecting apparatus.
Fig. 2 is a flowchart illustrating in detail a process performed by the outlier detecting apparatus.
Fig. 3 is a diagram for explaining processing for converting a data set.
Fig. 4 illustrates a Karnaugh map.
Fig. 5 illustrates a binary decision diagram.
Fig. 6A is a diagram for explaining processing for calculating the number of minterms.
Fig. 6B is a diagram for explaining processing for calculating the number of minterms.
Fig. 7 illustrates the results of calculation of the number of minterms.
Fig. 8A is a diagram for explaining processing for calculating local density.
Fig. 8B is a diagram for explaining processing for calculating local density.
Fig. 9 illustrates the results of calculation of local density.
Fig. 10 illustrates the results of calculation of LOO density.
Fig. 11 illustrates a path representing a piece of target data in a binary decision diagram.
Fig. 12 illustrates a region representing a piece of target data in a Karnaugh map.
Fig. 13A is a diagram for explaining LOO density to be extracted.
Fig. 13B is a diagram for explaining LOO density to be extracted.
Fig. 13C is a diagram for explaining LOO density to be extracted.
Fig. 13D is a diagram for explaining LOO density to be extracted.
Fig. 14A illustrates a data set used in Example 1 in a first embodiment of the present invention and Comparative Examples.
Fig. 14B illustrates a data set used in Example 1 in the first embodiment of the present invention and Comparative Examples.
Fig. 15A illustrates the results of detection of outliers in Example 1.
Fig. 15B illustrates the results of detection of outliers in Example 1.
Fig. 16A illustrates the results of detection of outliers in Comparative Example 1.
Fig. 16B illustrates the results of detection of outliers in Comparative Example 1.
Fig. 17A illustrates the results of detection of outliers in Comparative Example 2.
Fig. 17B illustrates the results of detection of outliers in Comparative Example 2.
Fig. 18A illustrates the results of detection of outliers in Comparative Example 3.
Fig. 18B illustrates the results of detection of outliers in Comparative Example 3.
Fig. 19A illustrates the results of detection of outliers in Comparative Example 4.
Fig. 19B illustrates the results of detection of outliers in Comparative Example 4.
Fig. 20 illustrates an example of the configuration of a vehicle trouble diagnosis system according to a second embodiment of the present invention.
Fig. 21 is a flowchart illustrating a process performed by the vehicle trouble diagnosis system according to the second embodiment.
Fig. 22A illustrates the results of detection of an outlier in the second embodiment.
Fig. 22B illustrates the results of detection of an outlier in the second embodiment.
Fig. 22C illustrates the results of detection of an outlier in the second embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

In the embodiments of the present invention, an outlier detection problem for finding, as an outlier, data belonging to a low data density region from a given data set is solved.

First, a "data set" will be explained by taking a credit transaction as an example. For example, regarding a data set of credit transactions, the case where the combination of three types of information, the sex of a customer, the age of the customer, and the amount of money involved in a transaction, is given as a single piece of data, will be discussed. Then, the case where two pieces of data, x1 = (male, 25 years old, 10,000 yen) and x2 = (female, 30 years old, 20,000 yen), are given as a data set, will be discussed.

In the example mentioned above, a data set including "two" pieces of data each having "three" dimensions is given. The dimension of data is also called a variate (for example, a multivariate analysis means analysis of multidimensional data). In addition, the number of pieces of data is also called the number of samples.

Each dimension of data is either a category attribute or a numeric attribute. In the example mentioned above, the sex of a customer is a category attribute, and the age of a customer and the amount of money involved in a transaction are numeric attributes.

Other examples of a data set include a data set acquired by a vehicle-mounted device. In this case, the vehicle speed, the number of revolutions, ON/OFF of auto cruise control (ACC), and the like observed at a certain time are dimensions (variates). The vehicle speed and the number of revolutions are numeric attributes, and ON/OFF of ACC is a category attribute. A plurality of pieces of data observed at a plurality of times are given as a data set.

Hereinafter, an outlier detecting apparatus that assists or executes detection of an outlier from a data set including a plurality of pieces of data each having one or more dimensions will be explained. An outlier detecting apparatus according to an embodiment of the present invention is capable of assisting or executing detection of an outlier within a practical time without performing a parameter tuning operation for a nonlinear data set.

Fig. 1 illustrates an example of the hardware configuration of an outlier detecting apparatus. The hardware configuration illustrated in Fig. 1 is merely an example, and various configurations may be employed according to the use and purpose.

In an outlier detecting apparatus 1, a controller 11, a storing unit 12, a medium input/output unit 13, a communication controller 14, an input unit 15, a display unit 16, a peripheral device interface (I/F) unit 17, and the like are connected via a bus 18.

The controller 11 includes a central processing unit (CPU), a read only memory (ROM), a random access memory (RAM), and the like.

The CPU invokes programs, which are stored in the storing unit 12, the ROM, a recording medium, and the like, to a work memory area on the RAM and executes the programs. The CPU performs driving control of individual devices connected via the bus 18, and implements a process to be performed by the outlier detecting apparatus 1, which will be described later.

The ROM is a nonvolatile memory. The ROM permanently stores a boot program for the outlier detecting apparatus 1, a program for BIOS or the like, data, and the like.

The RAM is a volatile memory. The RAM temporarily stores programs, data, and the like loaded from the storing unit 12, the ROM, the recording medium, and the like. The RAM includes a work area to be used by the controller 11 for performing various types of processing.

The storing unit 12 is a hard disk drive (HDD). The storing unit 12 stores programs to be executed by the controller 11, data necessary for execution of the programs, an operating system (OS), and the like. Regarding the programs, a control program corresponding to the OS and an application program for causing the outlier detecting apparatus 1 to perform the process described later are stored in the storing unit 12.

Program codes of the programs mentioned above are read and transferred to the RAM by the controller 11 according to need, and then read by the CPU to be executed as various means.

The medium input/output unit 13 (a drive device) performs input and output of data. The medium input/output unit 13 includes, for example, a medium input/output device such as a compact disc (CD) drive (for a CD-ROM, a CD-R, a CD-RW, etc.), a digital versatile disk (DVD) drive (for a DVD-ROM, a DVD-R, a DVD-RW, etc.), and the like.

The communication controller 14 includes a communication control device, a communication port, and the like. The communication controller 14 is a communication interface that allows communication between the outlier detecting apparatus 1 and a network, and controls communication with an external computer via the network. The network may be wired or wireless.

The input unit 15 receives data. The input unit 15 includes an input device, such as, for example, a keyboard, a pointing device such as a mouse, a numeric keypad, and the like.

With the use of the input unit 15, an operating instruction, an action instruction, data input, and the like can be made for the outlier detecting apparatus 1.

The display unit 16 includes a display device such as a liquid crystal panel, a logic circuit (a video adaptor or the like) for implementing a video function of the outlier detecting apparatus 1 in conjunction with the display device, and the like.

The peripheral device I/F unit 17 is a port for connecting a peripheral device to the outlier detecting apparatus 1. The outlier detecting apparatus 1 performs transmission and reception of data to and from the peripheral device via the peripheral device I/F unit 17. The peripheral device I/F unit 17 includes a universal serial bus (USB), IEEE 1394, RS-232C, or the like. In the usual case, a plurality of peripheral device interfaces are provided. The outlier detecting apparatus 1 may be connected with peripheral devices in a wired or wireless manner.

The bus 18 is a path that allows transfer of control signals, data signals, and the like between devices.

The hardware configuration of the outlier detecting apparatus 1 has been described above. An apparatus implemented as the outlier detecting apparatus 1 is not limited to the example described above. For example, the outlier detecting apparatus 1 may be implemented as part of an automobile, a home electric appliance, a production line, or the like by installing a program for implementing the process described later to a vehicle-mounted device, a control device for the home electric appliance, a detecting device for detecting a defective in the production line, or the like. Furthermore, for example, the outlier detecting apparatus 1 may be implemented as a server apparatus including a plurality of computers.

Hereinafter, an example in which the outlier detecting apparatus 1 is implemented by a single computer will be explained.

Fig. 2 is a flowchart illustrating in detail a process performed by the outlier detecting apparatus 1. Hereinafter, a process for an example of a data set will be explained in detail with reference to Figs. 3 to 13 according to need.

As illustrated in Fig. 2, the controller 11 of the outlier detecting apparatus 1 receives a data set via input means (the medium input/output unit 13, the communication controller 14, the input unit 15, the peripheral device I/F unit 17, and the like) (step S1). The controller 11 may receive a data set stored as a file in the storing unit 12.

Fig. 3 is a diagram for explaining a process for converting a data set. Fig. 3 illustrates a data set 21. The data set 21 includes "nineteen" pieces of data each having "two" dimensions. The individual dimensions are numeric attributes and may take integers within a range between 0 and 7.

The controller 11 does not necessarily receive a normalized data set like the data set 21 illustrated in Fig. 3. The controller 11 may receive a raw data set and perform normalization processing for the received raw data set. For example, the controller 11 performs various types of processing for a raw data set so that the raw data set can take integers within a certain range.

In the case where some dimensions (variates) of a raw data set are numeric attributes, the controller 11 performs discretization of the raw data set by fine division to achieve digitization. For example, the controller 11 rounds off an actual value to an integer so that the value can be treated as an int type by a computer. In the case where the range of values to be taken is extremely narrow or wide, the controller 11 carries out multiplication with a proper coefficient so that the values can be evenly distributed throughout an assumed range. Furthermore, in the case where a plurality of pieces of data with different measures are mixed, the controller 11 performs normalization to have a mean of 0 and a variance of 1. In the case where distribution is extremely deviated, the controller 11 performs logarithmic transformation or the like.
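One possible form of the discretization described above can be sketched as follows. The sketch is an assumption for illustration only: the function name discretize, the linear scaling between the minimum and maximum values as the "proper coefficient", and the default of three bits are not taken from the embodiment.

```python
def discretize(values, n_bits=3):
    """Scale real values linearly onto the integers 0 .. 2**n_bits - 1.

    Hypothetical helper: min-max scaling is only one possible choice of
    coefficient. Python's round() rounds exact halves to the nearest even
    integer.
    """
    lowest, highest = min(values), max(values)
    top_level = (1 << n_bits) - 1
    if highest == lowest:
        # All values are identical, so a single level suffices.
        return [0 for _ in values]
    scale = top_level / (highest - lowest)
    return [round((value - lowest) * scale) for value in values]

levels = discretize([0.0, 10.0, 2.0], n_bits=3)
```

With three bits, each value is mapped to an integer within the range between 0 and 7, which matches the range of the data set 21 illustrated in Fig. 3. A logarithmic transformation or a normalization to a mean of 0 and a variance of 1 could be applied to the values before such a function is called.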

In addition, even regarding a dimension (variate) of a numeric attribute, in the case where values to be taken are in a narrow range, for example, in the case where values take integers only within a range between 0 and 3, the controller 11 may treat a value as a dimension (variate) of a category attribute. Furthermore, even regarding a dimension (variate) of a category attribute, in the case where the concept of some distance can be introduced to values to be taken, the controller 11 may treat a value as a dimension (variate) of a numeric attribute.

Referring back to the explanation with reference to Fig. 2, the controller 11 converts individual pieces of data to bit sequences for each dimension (variate) (step S2). A bit sequence 22a corresponding to data x1 and a bit sequence 22b corresponding to data x2 are illustrated in Fig. 3. For example, the bit sequence 22a for the data x1 = 6 represents (d1,d2,d3) = (1,1,0). For example, the bit sequence 22b for the data x2 = 2 represents (e1,e2,e3) = (0,1,0).
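The conversion in step S2 can be sketched as follows, assuming that each dimension has already been normalized to a non-negative integer that fits in a fixed number of bits. The function name to_bits is introduced here only for illustration.

```python
def to_bits(value, n_bits):
    """Return the n_bits-wide binary representation of value, MSB first."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in the given number of bits")
    return tuple((value >> shift) & 1 for shift in range(n_bits - 1, -1, -1))

bits_x1 = to_bits(6, 3)  # corresponds to (d1, d2, d3)
bits_x2 = to_bits(2, 3)  # corresponds to (e1, e2, e3)
```

For the values given above, the results are (1, 1, 0) for x1 = 6 and (0, 1, 0) for x2 = 2.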

The controller 11 sorts a bit sequence group of numeric attributes in the order from the most significant bit to the least significant bit (step S3). A sorted bit sequence 23 is illustrated in Fig. 3. For example, for the bit sequence 22a of (d1,d2,d3) = (1,1,0) and the bit sequence 22b of (e1,e2,e3) = (0,1,0), the sorted bit sequence 23 represents (d1,e1,d2,e2,d3,e3) = (1,0,1,1,0,0).
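The sorting in step S3 amounts to interleaving the bit sequences of the individual dimensions so that bits of equal significance become adjacent. A minimal sketch, assuming that all bit sequences have the same length (the function name interleave is hypothetical):

```python
def interleave(bit_sequences):
    """Arrange the bits of several equal-length sequences from MSB to LSB.

    zip() groups the bits of equal significance across the dimensions, and
    the groups are then concatenated in order of decreasing significance.
    """
    return tuple(bit for group in zip(*bit_sequences) for bit in group)

sorted_bits = interleave([(1, 1, 0), (0, 1, 0)])  # (d1,e1,d2,e2,d3,e3)
```

For the bit sequences 22a and 22b given above, the result is (1, 0, 1, 1, 0, 0), which agrees with the sorted bit sequence 23.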

Hereinafter, the left-most bit, such as d1 or e1, is referred to as the "most significant bit (MSB)", and the right-most bit, such as d3 or e3, is referred to as the "least significant bit (LSB)".

The sorting processing in step S3 is not necessarily performed. Since all the dimensions (variates) are treated in an equivalent manner in the sorting processing in step S3, in the case where some information on the characteristics of a data set is provided in advance, it might be better not to perform sorting. For example, in the case where it is clear that dimensions (variates) of the data x1 fully exhibit the characteristics of the data and dimensions (variates) of the data x2 change little and do not fully exhibit the characteristics of the data, it is recommended that the sorting processing in step S3 not be performed and that the data x1 and the data x2 not be treated equivalently.

The sorting processing in step S3 is effective when no information on the characteristics of a data set is provided in advance.

It is desirable that a bit sequence group of category attributes be distinguished from a bit sequence group of numeric attributes and the bit sequences of the category attributes be arranged to take precedence over the bit sequences of numeric attributes. For example, in the case where a data set includes data x3 of a category attribute as well as the data x1 and the data x2 of numeric attributes illustrated in Fig. 3, a bit sequence obtained by converting the data x3 of the category attribute is represented by (f1,f2,f3). In this case, it is desirable that the controller 11 perform sorting in the order of (f1,f2,f3,d1,e1,d2,e2,d3,e3).
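The ordering described above can be sketched as follows. The function name order_bits and the bit values chosen for (f1,f2,f3) are hypothetical and serve only to show the resulting arrangement.

```python
def order_bits(category_bits, numeric_bit_sequences):
    """Place the category-attribute bits first, followed by the
    numeric-attribute bits interleaved from MSB to LSB."""
    interleaved = tuple(
        bit for group in zip(*numeric_bit_sequences) for bit in group
    )
    return tuple(category_bits) + interleaved

# (f1,f2,f3) = (1,0,1) is an arbitrary example value for the data x3.
ordered = order_bits((1, 0, 1), [(1, 1, 0), (0, 1, 0)])
```

The result corresponds to the order (f1,f2,f3,d1,e1,d2,e2,d3,e3), that is, (1, 0, 1, 1, 0, 1, 1, 0, 0) for the example values.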

Category attributes and numeric attributes are distinguished from each other since, in general, the concept of distance cannot be introduced to values to be taken as values of category attributes and it is thus difficult to treat such values similarly to values of numeric attributes.

In the case where no information on the characteristics of a data set is provided in advance, the category attributes and the numeric attributes may be arranged irrespective of the precedence.

Referring back to the explanation with reference to Fig. 2, the controller 11 establishes a binary decision diagram (BDD) as an observation region F (step S4). The controller 11 may establish a Karnaugh map or the like as an observation region F, instead of the binary decision diagram. Either a binary decision diagram or a Karnaugh map is a data structure to be used for expressing a logic function. That is, the observation region F may represent a logic function.

As described later, in the case where a binary decision diagram is established as an observation region F, the outlier detecting apparatus 1 is capable of assisting or executing detection of an outlier within a practical time even if the number of pieces of data increases. Hereinafter, in order not to unnecessarily obscure the present invention, the case where the outlier detecting apparatus 1 establishes a binary decision diagram as an observation region F will be explained. Furthermore, in order to clearly explain a process performed by the outlier detecting apparatus 1, a Karnaugh map will also be illustrated.

Fig. 4 illustrates a Karnaugh map.

In a Karnaugh map 30a illustrated in Fig. 4, (d1,d2,d3) of the bit sequence 22a illustrated in Fig. 3 is arranged vertically and (e1,e2,e3) of the bit sequence 22b illustrated in Fig. 3 is arranged horizontally. One black square in Fig. 3 corresponds to one piece of data. Nineteen black squares are illustrated in the Karnaugh map 30a illustrated in Fig. 4.

Fig. 5 illustrates a binary decision diagram. A binary decision diagram 31 illustrated in Fig. 5 is established on the basis of the sorted bit sequence 23 illustrated in Fig. 3.

Since a binary decision diagram is represented in accordance with the arrangement of pointers in a computer, the amount of necessary storage capacity can be reduced. In the case of a reduced ordered binary decision diagram, a computing time substantially proportional to the size of the diagram is required for computation of logic functions. The size of a diagram corresponds to the number of nodes.

In the example illustrated in Fig. 5, nodes 32 having an elliptical shape and the like are illustrated. Individual bits of the sorted bit sequence 23 illustrated in Fig. 3 may be regarded as being Boolean variables (either true or false). For example, the first bit d1 corresponds to a node 32a.

An ordered binary decision diagram is defined as: (1) the total order relation among the nodes is defined; and (2) the order in which variables appear for all the paths from the top node to a constant node is consistent with the total order relation. In the example illustrated in Fig. 5, a top node (root node) 33 and a constant node 34 are illustrated. In the example illustrated in Fig. 5, the constant node represents "1" (true). Since the top node and the constant node are special, these nodes are referred to with reference signs different from the other ordinary nodes.

A reduced binary decision diagram is a binary decision diagram to which the following two simplification rules are applied as much as possible: (1) all the redundant nodes are deleted; and (2) all the equivalent nodes are shared.

Accordingly, the binary decision diagram illustrated in Fig. 5 is a reduced ordered binary decision diagram.
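The two simplification rules can be illustrated with the following sketch, which builds a reduced ordered binary decision diagram from a set of sorted bit sequences. It is a simplified illustration and not the diagram of Fig. 5: it uses two constant nodes ("1" and "0") instead of negative branches, represents a node as a tuple (variable index, "Then" child, "Else" child), and all function names are assumptions.

```python
ONE, ZERO = "1", "0"  # constant nodes (this sketch has no negative branches)

def build_robdd(bit_tuples, n_bits):
    """Build a reduced ordered BDD whose true assignments are bit_tuples."""
    unique = {}  # table used to share equivalent nodes

    def make_node(var, then_child, else_child):
        if then_child == else_child:
            return then_child  # rule (1): a redundant node is deleted
        key = (var, then_child, else_child)
        return unique.setdefault(key, key)  # rule (2): equivalent nodes are shared

    def build(var, subset):
        if not subset:
            return ZERO  # no piece of data lies in this region
        if var == n_bits:
            return ONE  # all bits are fixed and a piece of data exists
        then_child = build(var + 1, [b for b in subset if b[var] == 1])
        else_child = build(var + 1, [b for b in subset if b[var] == 0])
        return make_node(var, then_child, else_child)

    return build(0, list(set(bit_tuples))), unique

def contains(root, bits):
    """Follow the path selected by bits and report whether it ends at ONE."""
    node = root
    while isinstance(node, tuple):
        var, then_child, else_child = node
        node = then_child if bits[var] else else_child
    return node == ONE

root, table = build_robdd([(0, 0, 1), (1, 0, 1)], n_bits=3)
```

In this example the function does not depend on the first bit, so the node for the first bit is deleted as redundant and only two ordinary nodes remain in the table. Because the variables are always tested in the order of the sorted bit sequence, the diagram is ordered.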

The binary decision diagram illustrated in Fig. 5 employs three types of branches, "Then" branches represented as solid lines, "Else" branches represented as widely-spaced dotted lines, and "negative Else" branches represented as narrowly-spaced dotted lines including the sign "*". With the use of a "negative Else" branch, it takes a short period of time to carry out NOT operations. For example, an "Else" branch 35a is illustrated in Fig. 5.

Referring back to the explanation with reference to Fig. 2, the controller 11 calculates the number of minterms for each node in the binary decision diagram (step S5). Processing for calculating the number of minterms will be explained below with reference to Figs. 6A and 6B and Fig. 7.

Figs. 6A and 6B are diagrams for explaining the processing for calculating the number of minterms. Fig. 7 illustrates the results of calculation of the number of minterms.

A minterm is a product term including literals of all the Boolean variables in the case where a set of Boolean variables are given. For example, for the case where a set of Boolean variables is (a,b,c), "a!bc" is a minterm, but "a!b" is not a minterm. The expression "!b" represents negation of "b".
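The definition can be checked with a small sketch in which a product term is represented as a mapping from a variable name to True (the variable itself) or False (its negation). This representation and the function name is_minterm are assumptions made only for illustration.

```python
def is_minterm(term, variables):
    """A product term is a minterm if it has a literal for every variable."""
    return set(term) == set(variables)

variables = ("a", "b", "c")
term_full = {"a": True, "b": False, "c": True}  # a!bc
term_partial = {"a": True, "b": False}          # a!b

full_is_minterm = is_minterm(term_full, variables)
partial_is_minterm = is_minterm(term_partial, variables)
minterm_total = 2 ** len(variables)  # number of distinct minterms over (a,b,c)
```

The term "a!bc" contains a literal of every variable and is a minterm, whereas "a!b" lacks a literal of c and is not. For n Boolean variables there are 2ⁿ distinct minterms, which is 8 for the three variables here.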

The controller 11 calculates, for each node, the number P of minterms in the case where negative branches are passed through from the top node an even number of times and the number N of minterms in the case where negative branches are passed through from the top node an odd number of times.

First, the controller 11 calculates the number of minterms for a constant node. The number P of minterms for the constant node is 2ⁿ (n represents the number of Boolean variables, that is, the number of bits of the sorted bit sequence 23), and the number N of minterms of the constant node is "0". As illustrated in Fig. 3, since the number of bits of the sorted bit sequence 23 is "6", the number P of minterms for the constant node is 2⁶ = 64. Thus, for the constant node 34 illustrated in Fig. 7, the number P of minterms is 64 and the number N of minterms is 0.

Next, the controller 11 recursively calculates, by depth-first search, the number of minterms for individual nodes other than the constant node.

As illustrated in Figs. 6A and 6B, the controller 11 calculates the number of minterms for individual nodes for the case (a) where an "Else" branch is not a negative branch and the case (b) where an "Else" branch is a negative branch.

The case illustrated in Fig. 6A will now be explained. Referring to Fig. 6A, a node 32d is a node for which calculation is to be performed, the value P for a lower node 32b connected through a "Then" branch is "t_p" (known), the value N for the node 32b is "t_n" (known), the value P for a lower node 32c connected through an "Else" branch is "e_p" (known), and the value N for the node 32c is "e_n" (known). At this time, the controller 11 calculates the number of minterms for the node 32d in accordance with calculation results for the lower nodes 32b and 32c, using the equation: P = t_p/2 + e_p/2, N = t_n/2 + e_n/2.

The case illustrated in Fig. 6B will now be explained. Referring to Fig. 6B, a node 32g is a node for which calculation is to be performed, the value P for a lower node 32e connected through a "Then" branch is "t_p" (known), the value N for the node 32e is "t_n" (known), the value P for a lower node 32f connected through a "negative Else" branch is "e_p" (known), and the value N for the node 32f is "e_n" (known). At this time, the controller 11 calculates the number of minterms for the node 32g in accordance with calculation results for the lower nodes 32e and 32f, using the expression: P = t_p/2 + e_n/2, N = t_n/2 + e_p/2.

For example, in the case of a node 32h illustrated in Fig. 7, an "Else" branch through which the node 32h is connected to a lower node (= the constant node 34) is a negative branch. Thus, the number of minterms is calculated in the calculation method illustrated in Fig. 6B. That is, for the node 32h, P = 64/2 + 0/2 = 32, and N = 0/2 + 64/2 = 32, with the terms written in the order t_p/2 + e_n/2 and t_n/2 + e_p/2.

For example, in the case of a node 32i illustrated in Fig. 7, an "Else" branch through which the node 32i is connected to a lower node is not a negative branch. Thus, the number of minterms is calculated in the calculation method illustrated in Fig. 6A. That is, for the node 32i, P = 32/2 + 64/2 = 48, and N = 32/2 + 0/2 = 16.
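The recursion of Figs. 6A and 6B can be sketched in Python as follows. This is a minimal illustration, not the implementation of the controller 11: the `Node` class, the `count_minterms` function, and the memo dictionary are hypothetical names introduced here. The child structure assumed for the nodes 32h and 32i (32h with both branches leading to the constant node, 32i with a "Then" branch to 32h and an ordinary "Else" branch to the constant node) is inferred from the arithmetic in the text and is not read from Fig. 7.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    """Hypothetical BDD node; then_child is None for the constant node."""
    then_child: Optional["Node"] = None
    else_child: Optional["Node"] = None
    else_negated: bool = False  # True for a "negative Else" branch


CONSTANT = Node()  # stands in for the constant node 34


def count_minterms(node: Node, n_bits: int,
                   memo: Optional[Dict[int, Tuple[int, int]]] = None) -> Tuple[int, int]:
    """Return (P, N) for a node by depth-first recursion over its children."""
    if memo is None:
        memo = {}
    key = id(node)
    if key in memo:
        return memo[key]
    if node.then_child is None:
        # Constant node: P = 2^n, N = 0.
        result = (2 ** n_bits, 0)
    else:
        t_p, t_n = count_minterms(node.then_child, n_bits, memo)
        e_p, e_n = count_minterms(node.else_child, n_bits, memo)
        if node.else_negated:
            # Fig. 6B: the Else contribution swaps P and N.
            result = (t_p // 2 + e_n // 2, t_n // 2 + e_p // 2)
        else:
            # Fig. 6A: ordinary Else branch.
            result = (t_p // 2 + e_p // 2, t_n // 2 + e_n // 2)
    memo[key] = result
    return result


# Assumed structure, chosen to match the worked values in the text.
node_32h = Node(then_child=CONSTANT, else_child=CONSTANT, else_negated=True)
node_32i = Node(then_child=node_32h, else_child=CONSTANT, else_negated=False)

print(count_minterms(node_32h, 6))  # (32, 32)
print(count_minterms(node_32i, 6))  # (48, 16)
```

The memo keyed by node identity reflects the sharing of equivalent nodes in a reduced diagram, so each node is evaluated once even when several paths reach it.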

Referring back to the explanation with reference to Fig. 2, the controller 11 calculates the local density for each node of the binary decision diagram (step S6). Processing for calculating the local density will be explained below with reference to Figs. 8A and 8B and Fig. 9.

Figs. 8A and 8B are diagrams for explaining the processing for calculating the local density. Fig. 9 illustrates calculation results of the local density. Note that the values of P and N illustrated in Fig. 9 have meanings different from the values of P and N illustrated in Fig. 7.

Fig. 8A illustrates, using a Karnaugh map 30b for convenience, the local density of P-connection of a node 32j illustrated in Fig. 7. Fig. 8B illustrates, using a Karnaugh map 30c for convenience, the local density of P-connection of a node 32k illustrated in Fig. 7.

The "P-connection" represents a path from the top node to a target node in which negative branches are passed through an even number of times. The "N-connection" represents a path from the top node to a target node in which negative branches are passed through an odd number of times.

The node 32j will now be considered.

As is clear from Fig. 7, only a path passing through branches 35a and 35b in that order exists as a path from the top node 33 to the node 32j.

The branch 35a is an "Else" branch, which indicates that the Boolean variable d1 is "0". Similarly, the branch 35b is an "Else" branch, which indicates that the Boolean variable e1 is "0". The other Boolean variables d2, e2, d3, and e3 are "don't care" (a "don't care" variable can take a value of either "0" or "1"). A rectangular region 41a surrounded with dotted lines illustrated in Fig. 8A represents a region corresponding to the path, where the Boolean variable "d1" represents "0", the Boolean variable "e1" represents "0", and the other Boolean variables are "don't care".

The Karnaugh map 30b illustrated in Fig. 8A is obtained when the pattern of the rectangular region 41a in the Karnaugh map 30a is repeated four times. The local density of P-connection for the node 32j corresponds to the total density of the Karnaugh map 30b. That is, as illustrated in Fig. 9, the local density of P-connection for the node 32j is "0.25".

The node 32k will now be considered.

As is clear from Fig. 7, a first path passing through branches 35a, 35b, 35c, and 35d in that order and a second path passing through branches 35a, 35b, 35e, and 35f in that order exist as paths from the top node 33 to the node 32k.

Regarding the first path, the Boolean variable d1 is "0", the Boolean variable e1 is "0", the Boolean variable d2 is "1", the Boolean variable e2 is "0", and the other Boolean variables d3 and e3 are "don't care". A rectangular region 41b surrounded with dotted lines illustrated in Fig. 8B represents a region corresponding to the first path.

Regarding the second path, the Boolean variable d1 is "0", the Boolean variable e1 is "0", the Boolean variable d2 is "0", the Boolean variable e2 is "1", and the other Boolean variables d3 and e3 are "don't care". A rectangular region 41c surrounded with dotted lines illustrated in Fig. 8B represents a region corresponding to the second path.

The Karnaugh map 30c illustrated in Fig. 8B is obtained when the pattern of the rectangular region 41b (or 41c) in the Karnaugh map 30a is repeated sixteen times. The local density of P-connection for the node 32k corresponds to the total density of the Karnaugh map 30c. That is, as illustrated in Fig. 9, the local density of P-connection for the node 32k is "0.25".

In embodiments of the present invention, the controller 11 calculates the number of minterms for individual nodes in the binary decision diagram 31 in step S5. Thus, the controller 11 is capable of calculating the local density of P-connection for each node by dividing the number of minterms for the node of the binary decision diagram 31 by 2ⁿ, where n represents the number of bits of a bit sequence. That is, by using the number of minterms for individual nodes, the controller 11 does not need to execute processing for establishing the Karnaugh maps 30b and 30c illustrated in Figs. 8A and 8B.

For example, the local density of P-connection for the node 32j = the number of minterms for the node 32j/2ⁿ = 16/2⁶ = 0.25. For example, the local density of P-connection for the node 32k = the number of minterms for the node 32k/2ⁿ = 16/2⁶ = 0.25. The local densities for the other nodes can be obtained in a similar manner.

The local density of N-connection for each node is obtained by "subtracting the local density of P-connection of the node from 1".
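The two relations above can be written as a short Python sketch. The function name `local_densities` is hypothetical, and it is assumed here that "the number of minterms for the node" used in the division is the value P obtained in step S5.

```python
from typing import Tuple


def local_densities(p_minterms: int, n_bits: int) -> Tuple[float, float]:
    """Return (P-connection density, N-connection density) for one node."""
    p_density = p_minterms / 2 ** n_bits   # number of minterms divided by 2^n
    n_density = 1.0 - p_density            # N-connection is the complement
    return p_density, n_density


# Nodes 32j and 32k in the text both have 16 minterms with n = 6 bits.
print(local_densities(16, 6))  # (0.25, 0.75)
```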

Referring back to the explanation with reference to Fig. 2, the controller 11 calculates, as a single-piece-of-data-removed local density, a value obtained by subtracting the density equivalent of a single piece of data from the local density (step S7).

Hereinafter, the single-piece-of-data-removed local density will be abbreviated as "leave-one-out (LOO) density". Processing for calculating the LOO density will be explained with reference to Fig. 10.

Fig. 10 illustrates the calculation results of the LOO density. Note that the values of P and N illustrated in Fig. 10 have meanings different from the values of P and N illustrated in Fig. 7 or 9.

The LOO density is obtained by "subtracting the density equivalent of a single piece of data from the local density". The density equivalent for each node is defined as "the reciprocal of 2^{(L x M)}", which is summarized into the equation: the LOO density = the local density - the reciprocal of 2^{(L x M)}. Here, L is the level of the node, which is explained below, and M is the number of numeric attributes.

The controller 11 calculates the LOO density for individual nodes.

The level L will now be explained. As illustrated in Fig. 10, the constant node 34 is defined as "level 0". Nodes corresponding to d3 and e3, which are the "least significant bits (LSBs)", are defined as "level 1". Nodes corresponding to d2 and e2, which are bits in the next level, are defined as "level 2". Nodes corresponding to d1 and e1, which are the "most significant bits (MSBs)", are defined as "level 3". That is, the length K of a bit sequence used for representing dimensions (variates) of numeric attributes corresponds to the maximum value of the level L, and the level L takes integers within a range between 0 and K.

A calculation example of the LOO density will be explained with reference to Fig. 10.

For example, the LOO density of P-connection for the node 32j is obtained by the expression: 0.25 - 1/2^{(2 x 2)} = 0.25 - 1/16 = 3/16 ≈ 0.19.

For example, the LOO density of N-connection for the node 32j is obtained by the expression: 0.75 - 1/2^{(2 x 2)} = 0.75 - 1/16 = 11/16 ≈ 0.69.

For example, the LOO density of P-connection for the node 32k is obtained by the expression: 0.25 - 1/2^{(1 x 2)} = 0.25 - 1/4 = 0.

For example, the LOO density of N-connection for the node 32k is obtained by the expression: 0.75 - 1/2^{(1 x 2)} = 0.75 - 1/4 = 0.5.

The LOO density for the constant node 34 is calculated using the equation: the LOO density = max {0, the local density - the reciprocal of 2^{(L x M)}} in order to prevent the LOO density from taking a negative value. However, this is not essential. No problem exists with the LOO density taking a negative value in the present invention.
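The LOO density equation and the optional max{0, ...} guard can be sketched in Python as follows. The function name `loo_density` and the `clamp` flag are hypothetical; the worked values are those given above for the nodes 32j (level 2) and 32k (level 1) with M = 2.

```python
def loo_density(local_density: float, level: int, n_numeric: int,
                clamp: bool = False) -> float:
    """LOO density = local density - 1 / 2^(L x M).

    With clamp=True the result is max{0, ...}, as described for the constant node.
    """
    value = local_density - 1.0 / 2 ** (level * n_numeric)
    return max(0.0, value) if clamp else value


print(loo_density(0.25, level=2, n_numeric=2))  # 0.1875 (= 3/16), node 32j, P-connection
print(loo_density(0.75, level=2, n_numeric=2))  # 0.6875 (= 11/16), node 32j, N-connection
print(loo_density(0.25, level=1, n_numeric=2))  # 0.0, node 32k, P-connection
print(loo_density(0.75, level=1, n_numeric=2))  # 0.5, node 32k, N-connection
```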

Referring back to the explanation with reference to Fig. 2, the controller 11 calculates the degree of deviation of individual pieces of data on the basis of the LOO density (step S8). Processing for calculating the degree of deviation will be explained with reference to Figs. 11 and 12 and Figs. 13A to 13D.

Fig. 11 illustrates a path representing a piece of target data in the binary decision diagram. Fig. 12 illustrates a region representing a piece of target data in a Karnaugh map. Figs. 13A to 13D are diagrams for explaining LOO density to be extracted.

The controller 11 determines a piece of target data one by one from the data set and performs processing for the piece of target data. Hereinafter, an example in which (d1,e1,d2,e2,d3,e3) = (1,0,0,1,1,0) is determined to be a piece of target data x will be explained.

The controller 11 searches for a path representing a piece of target data x in the binary decision diagram, extracts the LOO density of a node whose level (layer) is changed, and calculates the degree of deviation of the piece of target data x on the basis of the extracted LOO density.

In the example illustrated in Fig. 11, the levels (layers) of nodes 32a, 32l, 32m, and 34 change.

Whether the LOO density of P-connection or the LOO density of N-connection is to be extracted is determined on the basis of the number of times negative branches are passed through from the top node. That is, the controller 11 extracts the LOO density of P-connection in the case where negative branches are passed through from the top node an even number of times and extracts the LOO density of N-connection in the case where negative branches are passed through from the top node an odd number of times.

For the node 32a, since one negative branch is passed through, the controller 11 extracts the LOO density of N-connection "0.28". For the node 32l, since one negative branch is passed through, the controller 11 extracts the LOO density of N-connection "0.38". For the node 32m, since one negative branch is passed through, the controller 11 extracts the LOO density of N-connection "0.25". For the constant node 34, since two negative branches are passed through, the controller 11 extracts the LOO density of P-connection "0".

Accordingly, the controller 11 extracts (0.28, 0.38, 0.25, 0).

The meaning of individual extracted LOO densities will be explained with reference to Figs. 13A to 13D.

As illustrated in Fig. 13A, the LOO density for level 0, that is, the LOO density of the constant node 34, corresponds to the data density in the case where the piece of target data x is removed when a rectangular region 41d, which is a single unit region (= a region occupied by the piece of target data x), is defined as the entire region.

As illustrated in Fig. 13B, the LOO density for level 1, that is, the LOO density of the node 32m, corresponds to the data density in the case where the piece of target data x is removed when a rectangular region 41e, which includes four unit regions, is defined as the entire region.

As illustrated in Fig. 13C, the LOO density for level 2, that is, the LOO density of the node 32l, corresponds to the data density in the case where the piece of target data x is removed when a rectangular region 41f, which includes sixteen unit regions, is defined as the entire region.

As illustrated in Fig. 13D, the LOO density for level 3, that is, the LOO density of the node 32a, corresponds to the data density in the case where the piece of target data x is removed when a rectangular region 41g, which includes sixty-four unit regions, is defined as the entire region.

As described above, the extracted LOO densities may be expressed as hierarchical local densities (HLDs).

For example, the controller 11 defines the maximum value of the extracted LOO densities (HLDs) as the degree of deviation of the piece of target data. In the example illustrated in Fig. 11, the controller 11 defines the maximum value "0.38" of the values (0.28, 0.38, 0.25, 0) as the degree of deviation of the piece of target data.

Furthermore, the controller 11 may define the average value or the median value of the extracted LOO densities, instead of the maximum value of the extracted LOO densities, as the degree of deviation of the piece of target data.

Furthermore, the controller 11 may calculate the degree of deviation on the basis of some of the extracted LOO densities, instead of all the extracted LOO densities.

For example, the controller 11 may calculate the degree of deviation on the basis of the LOO density for a node in a higher level (layer) among the extracted LOO densities. In the example illustrated in Fig. 11, the controller 11 may calculate the degree of deviation on the basis of the LOO densities (0.28, 0.38) for nodes in "level 3" and "level 2". Also in this case, the controller 11 can employ the maximum value, the average value, the median value, or the like for the degree of deviation of a piece of target data.
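The alternatives described above (maximum, average, or median, over all extracted LOO densities or only those for higher levels) can be sketched in Python as follows. The function name `degree_of_deviation` and its parameters `how` and `top_levels` are hypothetical, and the extracted values are assumed to be ordered from the highest level to the lowest, as in (0.28, 0.38, 0.25, 0).

```python
from statistics import mean, median
from typing import Optional, Sequence


def degree_of_deviation(hlds: Sequence[float], how: str = "max",
                        top_levels: Optional[int] = None) -> float:
    """Reduce the extracted LOO densities (HLDs) to one degree of deviation.

    hlds is assumed to be ordered from the highest level to the lowest.
    top_levels, if given, keeps only that many of the highest levels.
    """
    values = list(hlds)
    if top_levels is not None:
        values = values[:top_levels]
    reducers = {"max": max, "mean": mean, "median": median}
    return reducers[how](values)


hlds_x = (0.28, 0.38, 0.25, 0.0)  # values extracted for the target data x

print(degree_of_deviation(hlds_x))                # 0.38
print(degree_of_deviation(hlds_x, top_levels=2))  # 0.38, using levels 3 and 2 only
```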

In the foregoing explanations, an observation region is established as a binary decision diagram. However, the present invention can also be applied to a case where an observation region is established using a Karnaugh map or other data structures.

The controller 11 may determine a piece of target data one by one from a plurality of pieces of data included in a data set and calculate the degree of deviation of the target data on the basis of the data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region. LOO densities are examples of data densities of data adjacent to a piece of target data when a region corresponding to the piece of target data is removed from an observation region.
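As an illustration of an observation region held in a data structure other than a binary decision diagram, the following Python sketch computes hierarchical LOO densities directly from a set of occupied unit regions. It rests on an interpretation, not on a statement in the text: the local density at level L is taken to be the fraction of occupied unit regions in the block of 2^{(L x M)} unit regions containing the target, so that the LOO density equals (count - 1)/2^{(L x M)}. The function name and the sample cells are hypothetical; only the target (d, e) = (5, 2), which corresponds to (d1,e1,d2,e2,d3,e3) = (1,0,0,1,1,0), comes from the example above.

```python
from typing import Iterable, List, Tuple

Cell = Tuple[int, ...]


def hierarchical_loo_densities(cells: Iterable[Cell], target: Cell,
                               k_bits: int) -> List[float]:
    """Return LOO densities around target for levels k_bits, ..., 1, 0.

    Each cell is a tuple of integer coordinates in [0, 2**k_bits), one per
    numeric attribute. target is assumed to be one of the occupied cells.
    """
    occupied = set(cells)
    m = len(target)
    result = []
    for level in range(k_bits, -1, -1):
        # Cells in the same block share the upper (k_bits - level) bits
        # in every dimension.
        block = tuple(c >> level for c in target)
        count = sum(1 for cell in occupied
                    if tuple(c >> level for c in cell) == block)
        # Remove the target itself, then divide by the block size 2^(L x M).
        result.append((count - 1) / 2 ** (level * m))
    return result


# Hypothetical occupied unit regions; d = 0b101 = 5 and e = 0b010 = 2 for x.
sample_cells = {(5, 2), (4, 2), (4, 0), (6, 1), (0, 0)}
print(hierarchical_loo_densities(sample_cells, (5, 2), k_bits=3))
# [0.0625, 0.1875, 0.25, 0.0]
```

The scan over all occupied cells at every level is for clarity only; it does not reflect the computational cost of the binary decision diagram approach.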

Referring back to the explanation with reference to Fig. 2, the controller 11 detects an outlier by comparing the degree of deviation calculated in step S8 with a predetermined threshold (step S9). However, processing of step S9 is not necessarily performed. For example, the controller 11 may output a list of the degrees of deviation calculated in step S8 via output means (the medium input/output unit 13, the communication controller 14, the display unit 16, the peripheral device I/F unit 17, and the like). A user may view the list of the degrees of deviation and detect an outlier.

The processing step having the heaviest computational load among the processing steps illustrated in Fig. 2 is the processing for establishing a binary decision diagram in step S4. The order of the calculation amount for the processing for establishing a binary decision diagram is represented by O(N x D), where N represents the number of pieces of data in a data set and D represents the number of nodes in the binary decision diagram.

In the case where the number of dimensions of a data set is small, the number D of nodes is often not very large. Thus, in the case where the number of dimensions of a given data set is large, the number of dimensions may be decreased using a dimensional reduction method. With an appropriate dimensional reduction, the number of dimensions can be reduced without affecting results. In addition, the number D of nodes can be reduced by rounding off data of a numeric attribute to restrict the number of bits.

Accordingly, by performing appropriate preprocessing prior to the execution of the processing for establishing a binary decision diagram in step S4, the relationship: D << N can be achieved. That is, the order of the calculation amount can be regarded as O(N¹).

Furthermore, as is clear from the individual processing steps illustrated in Fig. 2, only the threshold used in step S9 is the parameter to be tuned by a user. The threshold used in step S9 is a parameter used for determining whether or not a value is an outlier but is not a parameter used for calculating the degree of deviation. That is, the outlier detecting apparatus 1 is capable of calculating the degree of deviation without performing a parameter tuning operation.

It is unnecessary to perform the processing of steps S1 to S8 again even when the threshold used in step S9 is changed, and the computational load in the processing of step S9 is negligibly small. Thus, the outlier detecting apparatus 1 is capable of assisting or executing detection of an outlier within a practical time.

As described above, in the embodiments of the present invention, the outlier detecting apparatus 1 converts individual pieces of data included in a data set into a bit sequence for each dimension, and establishes an observation region of the data set on the basis of the bit sequence. Then, the outlier detecting apparatus 1 determines each piece of target data one by one from the plurality of pieces of data included in the data set, and calculates the degree of deviation of the piece of target data on the basis of the data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region.

Accordingly, the outlier detecting apparatus 1 is capable of assisting or executing detection of an outlier within a practical time without performing a parameter tuning operation for a nonlinear data set.

First Embodiment

Hereinafter, in a first embodiment of the present invention, Example 1 and Comparative Examples will be explained with reference to Figs. 14 to 19.

Figs. 14A and 14B each illustrate a data set used in Example 1 and Comparative Examples. The data set illustrated in Figs. 14A and 14B is obtained by schematically plotting the light of the moon (Moon) and stars (Stars) in the night sky in a two-dimensional space. Hereinafter, the data set illustrated in Figs. 14A and 14B is called a MoonStar data set. The attributes of the MoonStar data set are: artificially generated data; the number M of dimensions is 2; 95% of the data is distributed within a crescent-shaped region and 5% of the data is distributed at random; and the number N of pieces of data is 1,000 or 5,000.

Fig. 14A illustrates a data set including 1,000 pieces of data (N = 1,000) and the data set illustrated in Fig. 14A is called a MoonStar 1000. Fig. 14B illustrates a data set including 5,000 pieces of data (N = 5,000) and the data set illustrated in Fig. 14B is called a MoonStar 5000. In Figs. 14A and 14B, circle marks represent individual pieces of data.

As described later, in order to achieve appropriate comparison of determination accuracy, in Example 1 and Comparative Examples, 5% of the data set was determined to be outliers (Stars). Furthermore, in order to achieve appropriate comparison of computing time, processing was performed using the same computer.

In Example 1, the outlier detecting apparatus 1 calculated the degree of deviation, and determined 5% of the data set having small values (the threshold used in step S9) to be outliers.
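The selection used in Example 1 can be sketched in Python as follows: the pieces of data with the smallest degrees of deviation, up to the stated fraction, are flagged. The function name `flag_lowest_fraction`, the rounding of the outlier count, and the sample scores are assumptions made for illustration.

```python
from typing import List, Sequence


def flag_lowest_fraction(scores: Sequence[float],
                         fraction: float = 0.05) -> List[bool]:
    """Flag the given fraction of data with the smallest degrees of deviation."""
    n_outliers = int(round(len(scores) * fraction))
    # Indices sorted by ascending score; ties keep their original order.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    flagged = set(order[:n_outliers])
    return [i in flagged for i in range(len(scores))]


# Hypothetical degrees of deviation for 20 pieces of data.
sample_scores = [0.4] * 19 + [0.01]
flags = flag_lowest_fraction(sample_scores)
print(flags.index(True), sum(flags))  # 19 1
```

Because the degrees of deviation are computed once in step S8, changing `fraction` only repeats this inexpensive selection.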

In Comparative Example 1, in the OC-SVM, a kernel parameter gamma for determining nonlinear mapping was set to "0.5" and a parameter v for specifying the proportion of outliers was set to "0.05".

In Comparative Example 2, in the OC-SVM, the kernel parameter gamma for determining nonlinear mapping was set to "2", and the parameter v for specifying the proportion of outliers was set to "0.05".

In Comparative Example 3, in the LOF, a parameter k was set to "10", and 5% of the data set having large values was determined to be outliers.

In Comparative Example 4, in the LOF, the parameter k was set to "100", and 5% of the data set having large values was determined to be outliers.

For all the calculations in the OC-SVM, an svm function in the library e1071 of the statistical computing language R was used. For all the calculations in the LOF, a lofactor function in the library dprep of the statistical computing language R was used.

Figs. 15A and 15B illustrate the detection results of outliers in Example 1. Fig. 15A illustrates the results for the MoonStar 1000. Fig. 15B illustrates the results for the MoonStar 5000. In Figs. 15A and 15B, cross marks represent data detected as outliers (Stars) and circle marks represent data determined to be the moon (Moon). For clearer visualization of the cross marks (outliers), the circle marks are illustrated in light gray. The same applies to Figs. 16A and 16B, Figs. 17A and 17B, Figs. 18A and 18B, and Figs. 19A and 19B.

Table 1 represents the determination results for the MoonStar 1000 in Example 1. Table 2 represents the determination results for the MoonStar 5000 in Example 1.

In each of Tables 1 and 2, among four cells, the value in the upper left-hand cell represents the number of results in which "Moon" was detected as "Moon". The value in the upper right-hand cell represents the number of results in which "Moon" was detected as "Star". The value in the lower left-hand cell represents the number of results in which "Star" was detected as "Moon". The value in the lower right-hand cell represents the number of results in which "Star" was detected as "Star". The sum of the values in the upper left-hand cell and the lower right-hand cell represents the number of accurate detection results. The sum of the values in the lower left-hand cell and the upper right-hand cell represents the number of inaccurate detection results. The same applies to Tables 3 to 10.

In Example 1, a computing time of 0.03 seconds was required for the MoonStar 1000, and a computing time of 0.17 seconds was required for the MoonStar 5000. That is, since an about 5.7-fold increase of the computing time was exhibited with respect to a 5-fold increase of the number of pieces of data, it can be said that the order of the computing time in Example 1 is O(N¹), where N represents the number of pieces of data.

Figs. 16A and 16B illustrate the detection results of outliers in Comparative Example 1. Table 3 illustrates the determination results for the MoonStar 1000 in Comparative Example 1. Table 4 illustrates the determination results for the MoonStar 5000 in Comparative Example 1.

In Comparative Example 1, a computing time of 0.07 seconds was required for the MoonStar 1000, and a computing time of 1.30 seconds was required for the MoonStar 5000. That is, an about 18.6-fold increase of the computing time was exhibited with respect to a 5-fold increase of the number of pieces of data. Thus, compared with Example 1, the increasing rate of the computing time with respect to the increase in the number of pieces of data is apparently high.

Figs. 17A and 17B illustrate the detection results of outliers in Comparative Example 2. Table 5 illustrates the determination results for the MoonStar 1000 in Comparative Example 2. Table 6 illustrates the determination results for the MoonStar 5000 in Comparative Example 2.

In Comparative Example 2, a computing time of 0.11 seconds was required for the MoonStar 1000, and a computing time of 1.69 seconds was required for the MoonStar 5000. That is, an about 15.4-fold increase of the computing time was exhibited with respect to a 5-fold increase of the number of pieces of data. Thus, compared with Example 1, the increasing rate of the computing time with respect to the increase in the number of pieces of data is apparently high.

Figs. 18A and 18B illustrate the detection results of outliers in Comparative Example 3. Table 7 illustrates the determination results for the MoonStar 1000 in Comparative Example 3. Table 8 illustrates the determination results for the MoonStar 5000 in Comparative Example 3.

In Comparative Example 3, a computing time of 2.05 seconds was required for the MoonStar 1000, and a computing time of 34.95 seconds was required for the MoonStar 5000. That is, an about 17.0-fold increase of the computing time was exhibited with respect to a 5-fold increase of the number of pieces of data. Thus, compared with Example 1, the increasing rate of the computing time with respect to the increase in the number of pieces of data is apparently high.

Figs. 19A and 19B illustrate the detection results of outliers in Comparative Example 4. Table 9 illustrates the determination results for the MoonStar 1000 in Comparative Example 4. Table 10 illustrates the determination results for the MoonStar 5000 in Comparative Example 4.

In Comparative Example 4, a computing time of 6.38 seconds was required for the MoonStar 1000, and a computing time of 150.47 seconds was required for the MoonStar 5000. That is, an about 23.6-fold increase of the computing time was exhibited with respect to a 5-fold increase of the number of pieces of data. Thus, compared with Example 1, the increasing rate of the computing time with respect to the increase in the number of pieces of data is apparently high.

As is clear from the above description, the outlier detecting apparatus 1 according to an embodiment of the present invention is superior to the OC-SVM and the LOF in computing time.
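The fold increases quoted above can be re-derived from the reported computing times, and a rough scaling exponent can be estimated from each pair of measurements, as in the following Python sketch. The function name `growth` and the dictionary labels are hypothetical. The exponent is a two-point estimate, log(ratio)/log(5), and should be read only as an indication.

```python
import math
from typing import Dict, Tuple

# Reported computing times in seconds: (MoonStar 1000, MoonStar 5000).
timings: Dict[str, Tuple[float, float]] = {
    "Example 1": (0.03, 0.17),
    "Comparative Example 1": (0.07, 1.30),
    "Comparative Example 2": (0.11, 1.69),
    "Comparative Example 3": (2.05, 34.95),
    "Comparative Example 4": (6.38, 150.47),
}


def growth(t_small: float, t_large: float,
           n_small: int = 1000, n_large: int = 5000) -> Tuple[float, float]:
    """Return (fold increase in time, exponent p such that time ~ N^p)."""
    ratio = t_large / t_small
    exponent = math.log(ratio) / math.log(n_large / n_small)
    return ratio, exponent


for name, (t1, t2) in timings.items():
    ratio, exponent = growth(t1, t2)
    print(f"{name}: {ratio:.1f}-fold, exponent {exponent:.2f}")
```

Under this estimate the exponent for Example 1 is close to 1, whereas the exponents for Comparative Examples 1 to 4 fall roughly between 1.7 and 2.0.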

The fact that the outlier detecting apparatus 1 according to an embodiment of the present invention is also superior to the OC-SVM and the LOF in calculation accuracy will be explained below.

In order to make a comparison between Example 1 and each of Comparative Examples 1 to 4 for calculation accuracy, Table 11 illustrates the area under the ROC curve (AUC) for each of Example 1 and Comparative Examples 1 to 4. The ROC stands for the receiver operating characteristic. The AUC takes a value within a range between 0 and 1, and a value nearer 1 exhibits a higher accuracy.
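For reference, the AUC can be computed with the standard rank-based (Mann-Whitney) formulation, as in the following Python sketch. How the AUC values of Table 11 were actually computed is not stated, so this is an assumption. The function name and the sample values are hypothetical, and a smaller degree of deviation is treated as more outlier-like, in line with Example 1.

```python
from typing import Sequence


def auc_low_score_is_outlier(scores: Sequence[float],
                             is_outlier: Sequence[bool]) -> float:
    """Probability that a random outlier scores lower than a random inlier.

    Ties count as one half. Both classes must be non-empty.
    """
    outliers = [s for s, y in zip(scores, is_outlier) if y]
    inliers = [s for s, y in zip(scores, is_outlier) if not y]
    wins = sum(1.0 if o < i else 0.5 if o == i else 0.0
               for o in outliers for i in inliers)
    return wins / (len(outliers) * len(inliers))


# Hypothetical scores: one outlier (0.3) ranks above one inlier (0.2).
print(auc_low_score_is_outlier([0.3, 0.1, 0.2, 0.4],
                               [True, True, False, False]))  # 0.75
```

The pairwise comparison is quadratic in the number of data and is used here only because it mirrors the definition directly.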

As illustrated in Table 11, in Example 1 of the present invention, the MoonStar 1000 and the MoonStar 5000 each exhibit an AUC of 0.9 or more and stably achieve high accuracy. In contrast, in the case of the OC-SVM, the MoonStar 1000 and the MoonStar 5000 each exhibit an AUC of 0.9 or less and cannot achieve an adequate accuracy. Furthermore, in the case of the LOF, the results are greatly affected by parameters, and a parameter tuning operation is thus essential.

Second Embodiment

Hereinafter, a second embodiment of the present invention will be described with reference to Figs. 20 and 21 and Figs. 22A to 22C. In the second embodiment, the outlier detecting apparatus 1 is applied to a vehicle trouble diagnosis system.

Fig. 20 illustrates an example of the configuration of the vehicle trouble diagnosis system according to the second embodiment. As illustrated in Fig. 20, a vehicle trouble diagnosis system 100 includes a vehicle system 101 for which trouble diagnosis is to be performed, a data collecting apparatus 102, and the outlier detecting apparatus 1.

The vehicle system 101 is mounted on a vehicle such as an automobile. A plurality of electronic control units (ECUs), a plurality of sensors, a plurality of actuators, and the like are connected via an on-vehicle network, in the vehicle system 101.

The data collecting apparatus 102 collects vehicle data of the vehicle in a travelling state. The data collecting apparatus 102 collects a signal flowing in the on-vehicle network, the state values of individual components (the ECUs, the sensors, the actuators, and the like), and the like as vehicle data. The data collecting apparatus 102 transmits the vehicle data to the outlier detecting apparatus 1, to which the data collecting apparatus 102 is connected in a wired or wireless manner.

For example, the data collecting apparatus 102 may be installed inside the vehicle system 101 or installed as an outside apparatus. For example, the data collecting apparatus 102 may be a computer (including the outlier detecting apparatus 1) installed in a place physically distant from the vehicle system 101. In the case where the data collecting apparatus 102 is installed in a place physically distant from the vehicle system 101, the vehicle system 101 transmits vehicle data to the data collecting apparatus 102 by wireless communication.

The outlier detecting apparatus 1 detects an outlier from vehicle data collected by the data collecting apparatus 102. An expert who makes a diagnosis of vehicle trouble analyzes in detail a detected outlier and checks whether an error has occurred in the vehicle.

Fig. 21 is a flowchart illustrating a process performed by the vehicle trouble diagnosis system according to the second embodiment. As illustrated in Fig. 21, the data collecting apparatus 102 collects vehicle data of the vehicle in a traveling state from the vehicle system 101 (step S11).

Then, the outlier detecting apparatus 1 detects an outlier from the vehicle data collected by the data collecting apparatus 102 in accordance with the flowchart illustrated in Fig. 2.

An expert who makes a diagnosis of vehicle trouble analyzes in detail an outlier detected by the outlier detecting apparatus 1. If an error to be handled has occurred (Yes in step S13), the expert urges a driver to restore the vehicle (step S14). Alternatively, the outlier detecting apparatus 1 may transmit via wireless communication a message indicating the occurrence of an error in the vehicle system 101 and the vehicle system 101 may output a warning using an output device (a display device, a sound output device, or the like).

Figs. 22A to 22C illustrate detection results of outliers in the second embodiment. Figs. 22A to 22C show results obtained when the outlier detecting apparatus 1 performed detection of outliers for a data set having three dimensions (three variates): the amount of accelerator operation, the number of engine revolutions, and a gear shift position.

The amount of accelerator operation and the number of engine revolutions are variates of numeric attributes. Since the gear shift position may take three values, "3rd", "2nd", and "Low", the gear shift position may be a variate of a category attribute or a variate of a numeric attribute.

In Figs. 22A to 22C, cross marks represent data detected as outliers, and circle marks represent data determined not to be outliers. In the examples illustrated in Figs. 22A to 22C, an outlier is detected when the gear shift position is "Low". This represents an error in which "the engine revolutions do not smoothly increase with respect to the amount of accelerator operation". As described above, when the outlier detecting apparatus 1 according to an embodiment of the present invention is applied to the vehicle trouble diagnosis system 100, a trouble diagnosis of a vehicle can be performed with high accuracy.

An outlier detecting apparatus and the like according to preferred embodiments of the present invention have been described above with reference to the drawings. However, the present invention is not limited to the foregoing embodiments. It is obvious that those skilled in the art can conceive various changes and modifications within the scope of the technical ideas disclosed in this application. It should be understood that those changes and modifications are within the technical scope of the present invention.

Reference Signs List

1 Outlier detecting apparatus
21 Data set
22a, 22b Bit sequence
23 Sorted bit sequence
30a-30c Karnaugh map
31 Binary decision diagram
32a-32m Node
33 Most significant node
34 Constant node
35a-35f Branch
41a-41g Rectangular region

Claims

1. An outlier detecting apparatus that assists or executes detection of an outlier from a data set including a plurality of pieces of data each having one or more dimensions, the apparatus comprising:
a controller that converts each of the plurality of pieces of data included in the data set into a bit sequence for each of the one or more dimensions, establishes an observation region for the data set on the basis of the bit sequence, determines a piece of target data one by one from the plurality of pieces of data included in the data set, and calculates the degree of deviation of the piece of target data on the basis of data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region.
2. The outlier detecting apparatus according to Claim 1, wherein the controller establishes the observation region as a binary decision diagram, defines, as a single-piece-of-data-removed local density, a value obtained by subtracting a density equivalent of a single piece of data from a local density of each node, and calculates the degree of deviation of the piece of target data on the basis of the single-piece-of-data-removed local density.
3. The outlier detecting apparatus according to Claim 2, wherein the controller hierarchically establishes the binary decision diagram by sorting a bit sequence group for dimensions of numeric attributes in the order from the most significant bit to the least significant bit, searches for a path representing the piece of target data in the binary decision diagram, and calculates the degree of deviation of the piece of target data on the basis of the single-piece-of-data-removed local density for a node whose level is changed.
4. The outlier detecting apparatus according to Claim 3, wherein the controller defines the maximum value, the median value, or the average value of some or all of the single-piece-of-data-removed local densities of nodes whose level is changed as the degree of deviation of the piece of target data.
5. The outlier detecting apparatus according to Claim 4, wherein the controller detects the outlier by comparing the degree of deviation with a threshold.
6. An outlier detecting method for assisting or executing detection of an outlier from a data set including a plurality of pieces of data each having one or more dimensions, the method comprising:
converting each of the plurality of pieces of data included in the data set into a bit sequence for each of the one or more dimensions;
establishing an observation region for the data set on the basis of the bit sequence;
determining a piece of target data from the plurality of pieces of data included in the data set; and
calculating the degree of deviation of the piece of target data on the basis of data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region.
7. A vehicle trouble diagnosis system comprising:
an outlier detecting apparatus that assists or executes detection of an outlier from a data set including a plurality of pieces of data each having one or more dimensions; and
a data collecting apparatus that collects vehicle data, wherein
the outlier detecting apparatus includes a controller that detects an outlier by converting each of the plurality of pieces of data included in the data set into a bit sequence for each of the one or more dimensions, the vehicle data collected by the data collecting apparatus being defined as the data set, establishing an observation region for the data set on the basis of the bit sequence, determining a piece of target data one by one from the plurality of pieces of data included in the data set, calculating the degree of deviation of the piece of target data on the basis of data densities of data adjacent to the piece of target data when a region corresponding to the piece of target data is removed from the observation region, and comparing the degree of deviation with a threshold.