CN101324904B - High-dimension index structure technique of equipment failure cases based on distance measurement

Info

Publication number: CN101324904B
Application number: CN2008101502616A
Authority: CN
Inventors: 刘弹; 徐光华; 张庆
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2008-07-04
Filing date: 2008-07-04
Publication date: 2010-08-11
Anticipated expiration: 2028-07-04
Also published as: CN101324904A

Abstract

The invention discloses a high-dimensional index structure technique for equipment failure cases based on distance measurement, and provides a novel high-dimensional data index structure. The index structure dynamically divides the space into a grid structure according to the aggregation properties of equipment state cases, so that spatially adjacent cases are stored together. At the same time, the relation between the query threshold and the distances from the query vector to two reference points is used to exclude large ranges of data that cannot belong to the result, which greatly improves query efficiency and reduces CPU consumption. The index structure is also insensitive to the data distribution, and thus lays a foundation for intelligent engineering applications of equipment fault diagnosis.

Description

A high-dimensional index structure technique for equipment failure cases based on distance metric

Technical field

The invention belongs to fields such as data mining and cluster analysis of plant equipment information, relates to similar-case retrieval techniques for equipment failure cases, and in particular relates to a high-dimensional index structure for equipment failure cases based on distance metric.

Background technology

With the widespread application of equipment monitoring and diagnosis techniques in enterprises, and in particular of remote monitoring and diagnosis techniques in large equipment manufacturing enterprises, equipment failure cases have accumulated in large numbers. Against this background, how to make full use of these cases to guide the diagnosis of unknown faults has attracted great interest. In essence, an equipment failure case can be regarded as a high-dimensional data item formed from various features, and a collection of equipment failure cases forms a high-dimensional database. The problem of using equipment failure cases can therefore be converted into a similarity query problem on the case library, namely k-nearest-neighbor query, and the index structure of the equipment failure cases becomes a key issue.

Many index structures and search algorithms have been proposed, for example BK-Tree, FQ-Tree and R-Tree, each with its own advantages and disadvantages. The disadvantages are concentrated in two aspects: 1) Query efficiency declines as the dimension increases. For example, current mainstream multi-dimensional index structures are reasonably efficient when the dimension is low, but struggle once the dimension becomes high, and even dimensionality reduction does not necessarily yield satisfactory results. This is because dimensionality reduction may lose useful information, and it is especially unsuitable when the correlation between feature vectors in the feature space is very small. 2) High computation cost. The number of distance calculations in a high-dimensional vector space increases markedly as the amount of data grows.

The feature dimension of an equipment failure case can reach into the hundreds. At the same time, equipment failure cases have a characteristic of their own: cases of the same fault exhibit a certain aggregation property in the high-dimensional space. A high-dimensional data index structure suited to equipment failure cases is therefore urgently needed to meet the demand for comprehensive use of such cases.

Summary of the invention

The objective of the invention is to overcome the above deficiencies of the prior art and to provide a high-dimensional index structure technique for equipment failure cases based on distance metric, that is, an index structure suited to k-nearest-neighbor queries on equipment failure cases. Based on the aggregation property of equipment failure cases, the index structure establishes effective rules to reduce the number of distance calculations, and on this basis uses a range query method to query the equipment failure cases, which can greatly improve query efficiency. The basic operating steps of the invention are as follows:

(1) The axes are organized according to a certain rule, and two reference points are selected on an axis according to a certain rule. The vector to be inserted is inserted at a specific position, and its distances to the two reference points are recorded. The purpose is that the distance between vectors in the high-dimensional space can be estimated from the distances between the vectors and the two reference points;

(2) During retrieval, the distances between the query vector and the reference points are calculated first. Because the reference points lie on an axis, the time needed to calculate the distances between the vector and multiple reference points depends only on the dimension;

(3) From the calculated distances between the query vector and the reference points, the possible start position and end position of the query result are obtained, and scanning the vector data stored at these positions yields the corresponding query result. Because the whole data file is no longer scanned, the disk access cost is significantly reduced and the query speed is improved.
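The remark in step (2) can be illustrated with a minimal Python sketch. It assumes Euclidean distance and a reference point that lies on axis i, so that only its i-th coordinate is non-zero; the function name `distances_to_axis_points` is hypothetical and does not come from the patent. Under those assumptions the off-axis part of the squared distance is computed once, and each further reference point on the same axis costs constant time.

```python
import math

def distances_to_axis_points(q, axis, positions):
    """Euclidean distances from q to reference points lying on one axis.

    A reference point on `axis` at position r has coordinates
    (0, ..., r, ..., 0), so
        d(q, R)^2 = sum_{j != axis} q_j^2 + (q_axis - r)^2.
    """
    # Sum of squares over every coordinate except `axis`: one O(M) pass.
    off_axis_sq = sum(v * v for j, v in enumerate(q) if j != axis)
    # Each reference point on this axis then needs only O(1) extra work.
    return [math.sqrt(off_axis_sq + (q[axis] - r) ** 2) for r in positions]
```

For example, `distances_to_axis_points([3.0, 4.0], 0, [0.0, 3.0])` evaluates the distances to the points (0, 0) and (3, 0) while summing the off-axis coordinates only once.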

The specific steps are as follows:

(1) The feature axes are sorted by the probability distribution density of the data on each axis. A feature axis with a high data probability distribution density whose number of divisions has not reached the value specified by parameter ε is preferentially selected to divide the data;

(2) The root node is set as the current node;

(3) If the root node is empty, an axis is selected according to step (1) and recorded in the form of an internal node;

(4) If the current node is an internal node, the set of partition structures of this internal node is traversed to find the partition structure that contains the data point to be inserted, i.e. R_L < x_i ≤ R_R, where i is the axis corresponding to the internal node and x_i is the projection onto axis i of the data point being inserted while the index structure is built;

(5) If R_L < x_i ≤ (R_L + R_R)/2, the current node is set to the left region pointer PLR of this partition structure; otherwise, the current node is set to the right region pointer PRR of this partition structure;

(6) If the current node is a leaf node, the number of data points in the leaf node is greater than or equal to m, and there are still axes available for division, the splitting method is called, the current node is set to the internal node to be added, and the procedure jumps to (2);

(7) The distances from the data point to the left and right reference points of the partition structure corresponding to the leaf node are calculated, the data point is inserted into the leaf node, and the procedure returns.

The splitting method is:

(1) The current partition structure is set to the partition structure that points to the leaf node;

(2) If (R_R - R_L) ≤ 1/ε in the current partition structure, jump to (3); otherwise, the current partition structure is divided into two: {R_L, (R_L + R_R)/2, PLR1, PLR2} and {(R_L + R_R)/2, R_R, PRR1, PRR2}, where the projection onto the axis of the current partition structure of the data points in the leaf nodes of the PLR1 subtree satisfies R_L < x_i ≤ (R_L + R_R)/4, that of the data points in the leaf nodes of the PLR2 subtree satisfies (R_L + R_R)/4 < x_i ≤ (R_L + R_R)/2, that of the data points in the leaf nodes of the PRR1 subtree satisfies (R_L + R_R)/2 < x_i ≤ 3(R_L + R_R)/4, and that of the data points in the leaf nodes of the PRR2 subtree satisfies 3(R_L + R_R)/4 < x_i ≤ (R_L + R_R); return;

(3) The axes are sorted by the distribution density of the data on each axis, and the first axis whose distribution density is less than that of the axis of the partition structure pointing to this leaf node is selected;

(4) An internal node is generated for the selected axis, and the partition structure that pointed to this leaf node is made to point to the generated internal node;

(5) The current node is set to the internal node generated in (4), and the data points of the leaf node are reinserted.
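The insertion and splitting steps above can be sketched together in Python. This is a simplified illustration under stated assumptions, not the patented implementation: coordinates are assumed to be normalized to [0, 1], the axis order `axes` is assumed to be already sorted by distribution density (how that density is estimated is not shown), distances are Euclidean, a coordinate of exactly 0 is assigned to the first partition, and a leaf is simply allowed to overflow once no axis remains. All class and function names are hypothetical.

```python
import math

class Leaf:
    def __init__(self):
        self.entries = []  # (point id, dist to left ref, dist to right ref)

class Partition:
    def __init__(self, r_left, r_right):
        self.r_left, self.r_right = r_left, r_right
        self.plr, self.prr = Leaf(), Leaf()  # left / right region pointers

class Internal:
    def __init__(self, axis_rank):
        self.axis_rank = axis_rank  # position in the density-sorted axis list
        self.partitions = [Partition(0.0, 1.0)]

def axis_dist(x, axis, r):
    """Distance from x to the reference point at position r on `axis`."""
    rest = sum(v * v for j, v in enumerate(x) if j != axis)
    return math.sqrt(rest + (x[axis] - r) ** 2)

class CaseIndex:
    def __init__(self, axes, eps=20, m=100):
        self.axes, self.eps, self.m = axes, eps, m
        self.data = []
        self.root = Internal(0)

    def insert(self, x):
        self.data.append(x)
        self._insert(self.root, len(self.data) - 1)

    def _insert(self, node, pid):
        x = self.data[pid]
        while True:
            axis = self.axes[node.axis_rank]
            # Step (4): first partition whose right bound covers x_i.
            k = next(i for i, p in enumerate(node.partitions)
                     if x[axis] <= p.r_right)
            p = node.partitions[k]
            # Step (5): left region up to the midpoint, right region after it.
            side = "plr" if x[axis] <= (p.r_left + p.r_right) / 2 else "prr"
            child = getattr(p, side)
            if isinstance(child, Internal):
                node = child
                continue
            if len(child.entries) >= self.m:  # step (6): leaf is full
                if p.r_right - p.r_left > 1.0 / self.eps:
                    self._halve(node, k)      # split the partition in two
                    continue
                if node.axis_rank + 1 < len(self.axes):
                    # Add the next axis and reinsert the leaf's points.
                    new_node = Internal(node.axis_rank + 1)
                    setattr(p, side, new_node)
                    for old_pid, _, _ in child.entries:
                        self._insert(new_node, old_pid)
                    node = new_node
                    continue
            # Step (7): store the point with its two reference distances.
            child.entries.append((pid, axis_dist(x, axis, p.r_left),
                                  axis_dist(x, axis, p.r_right)))
            return

    def _halve(self, node, k):
        p = node.partitions[k]
        mid = (p.r_left + p.r_right) / 2
        node.partitions[k:k + 1] = [Partition(p.r_left, mid),
                                    Partition(mid, p.r_right)]
        for pid, _, _ in p.plr.entries + p.prr.entries:
            self._insert(node, pid)
```

A partition is halved only while it is wider than 1/ε, so a partition that already has an internal node as a child is never halved; this keeps the reinsertion in `_halve` limited to leaf entries.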

The distances between the query vector and the reference points located on the axes are used to exclude parts of the space from the query result.

The number of divisions ε used when building the index structure takes a value in [15, 25].

The threshold m used when building the index structure takes a value in [80, 120].

In the case high-dimensional index structure based on range pruning proposed by the invention, the reference points used for distance estimation are fixed on the axes, which reduces the number of distance calculations involving reference points. The structure is also insensitive to the data distribution and has very high query efficiency.

Description of drawings

Fig. 1 is a schematic diagram of the data structure storage.

Fig. 2 is a schematic diagram of storing data points by adding an axis in the splitting method.

Fig. 3 is a schematic diagram of storing data points by splitting an intermediate node in the splitting method.

Fig. 4 compares the average query time of the index structure proposed by the invention with that of the sequential structure.

Fig. 5 compares the average query time of the index structure proposed by the invention with that of the R-Tree structure.

Fig. 6 shows the ratio of the number of distance calculations of the index structure proposed by the invention to the actual number of points.

The content of the invention is described in further detail below with reference to the drawings.

Embodiment

Referring to Fig. 1, an internal node records the division information for a certain feature axis and consists of multiple partition structures. A partition structure records the specific information of one division of the feature axis, including a pair of reference points and the left and right regions to which the reference points point; for example, (R1, R2), (R2, R3) and (R3, R4) in the figure each represent a pair of reference points on the axis. A pair of reference points on the axis, such as (R3, R4), divides the whole data space into two parts: a data point in the left region of the division is at a distance from R3 less than or equal to its distance from R4, and a data point in the right region is at a distance from R3 greater than its distance from R4. A leaf node records the index number of each data point in sequential storage and its distances to the left and right reference points.
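The three record types described for Fig. 1 can be written down as a small typed layout. The Python sketch below is only an illustration of the fields named in the text; the class names, field names and the `find_partition` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

@dataclass
class LeafNode:
    # Each entry: (index of the data point in sequential storage,
    #              distance to the left reference point,
    #              distance to the right reference point)
    entries: List[Tuple[int, float, float]] = field(default_factory=list)

@dataclass
class Partition:
    r_left: float                          # left reference point R_L on the axis
    r_right: float                         # right reference point R_R on the axis
    plr: Union["InternalNode", LeafNode]   # pointer to the left region
    prr: Union["InternalNode", LeafNode]   # pointer to the right region

@dataclass
class InternalNode:
    axis: int                              # feature axis divided by this node
    partitions: List[Partition] = field(default_factory=list)

    def find_partition(self, x_i: float) -> Optional[Partition]:
        """Return the partition with R_L < x_i <= R_R, or None."""
        for p in self.partitions:
            if p.r_left < x_i <= p.r_right:
                return p
        return None
```

With two partitions (0, 0.5] and (0.5, 1], a projection of 0.5 falls in the first partition and 0.75 in the second, matching the half-open intervals used in the text.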

Referring to Fig. 2, the condition for storing data points by adding an axis in the splitting method is: the number of data points stored in the leaf node equals the threshold m specified by the parameter, and the number of times axis i has been divided equals the division-count parameter ε.

Referring to Fig. 3, the condition for storing data points by splitting an intermediate node in the splitting method is: the number of data points stored in the leaf node equals the threshold m specified by the parameter, and the number of times axis i has been divided is less than the division-count parameter ε.

Referring to Fig. 4 and Fig. 5, performance comparison experiments were carried out between the equipment failure case index structure proposed by the invention, sequential scanning, and the R-Tree index structure. The experimental data set consists of 600,000 uniformly distributed 20-dimensional vectors. The parameters of the index structure of the invention are: number of divisions ε = 10 and threshold m = 100. The experiments first test how the retrieval time of the index structure varies as the number of data points increases: 100 query points are chosen at random, and the average query time consumed at a range query threshold of 0.1 is calculated.

Referring to Fig. 6: as can be seen from Fig. 4, the average query time of the index structure proposed by the invention is slightly higher than that of the sequential index structure. However, as Fig. 6 shows, when the effect of disk reads and writes is taken into account, the sequential index structure needs to read all data points into memory, whereas the structure proposed by the invention only needs to read the required data points, which can greatly shorten the time needed for disk I/O.

To further explain the index structure proposed by the invention, let the value range of a set Θ on axis i of an M-dimensional space be [Min(z_i), Max(z_i)]. Take the points D₀ = {z_j, j = 1, ..., M; z_j = 0 when j ≠ i, z_j = z_iMin when j = i} and D₁ = {z_j, j = 1, ..., M; z_j = 0 when j ≠ i, z_j = z_iMax when j = i}, and divide the set into two parts X₀ and X₁ such that the distance from any point in X₀ to D₀ is less than or equal to its distance to D₁, and the distance from any point in X₁ to D₀ is greater than its distance to D₁. Denote the distances from a point X in X₀ to D₀ and D₁ by λ₀ and λ₁, and the distances from a point Q to D₀ and D₁ by λ₂ and λ₃. Then the distance λ between X and Q satisfies λ ≥ |λ₂ - λ₃|/2.

Accordingly, the invention uses a partition structure to represent one division of an axis (feature quantity), of the form {R_L, R_R, PLR, PRR}, where R_L is the left reference point of this division, R_R is the right reference point of this division, PLR (left region pointer, Pointer of Left Region) is the pointer to the left region delimited by the left and right reference points (the left region consists of internal nodes or leaf nodes; the distance from the vector data in this region to the left reference point is less than the distance to the right reference point), and PRR (right region pointer, Pointer of Right Region) is the pointer to the right region delimited by the left and right reference points (the right region consists of internal nodes or leaf nodes; the distance from the vector data in this region to the left reference point is greater than or equal to the distance to the right reference point).

An internal node identifies the set of partition structures that divide a given axis together with the number of that axis, of the form (i, (R_L1, R_R1, PLR₁, PRR₁), (R_L2, R_R2, PLR₂, PRR₂), ..., (R_LL, R_RL, PLR_L, PRR_L)), where i is the axis corresponding to the internal node and the subscript L (L ≤ ε, where ε is the maximum number of divisions allowed for an axis) is the number of times the axis has been divided. In addition, R_Rj = R_L(j+1), R_L1 = 0 and R_RL = 1; that is, the ranges of the axis covered by the partition structures in the set do not overlap, and joined together they form the projection range of the data set on the axis.

A leaf node identifies a set of vector data indices together with the distances from each vector to the left and right reference points contained in the parent node.
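The bound λ ≥ |λ₂ - λ₃|/2 can be checked with the triangle inequality. The short derivation below is a sketch for the case λ₀ ≤ λ₁ (X lies in X₀) and λ₂ ≥ λ₃ (Q is at least as close to D₁ as to D₀); this case split is an assumption made for the illustration, since the text states the bound without conditions.

```latex
\begin{aligned}
\lambda &\ge \lambda_2 - \lambda_0
  && \text{(triangle inequality through } D_0\text{)}\\
\lambda &\ge \lambda_1 - \lambda_3
  && \text{(triangle inequality through } D_1\text{)}\\
2\lambda &\ge (\lambda_2 - \lambda_3) + (\lambda_1 - \lambda_0) \ge \lambda_2 - \lambda_3
  && \text{(since } \lambda_0 \le \lambda_1\text{)}\\
\lambda &\ge \frac{\lambda_2 - \lambda_3}{2} = \frac{|\lambda_2 - \lambda_3|}{2}
  && \text{(since } \lambda_2 \ge \lambda_3\text{)}
\end{aligned}
```

Read this way, a query point on the far side of a pair of reference points cannot be closer to any point of X₀ than half the gap between its own two reference distances.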

The specific steps for building the index structure proposed by the invention are as follows:

(2) For the data point to be inserted, the root node is set as the current node;

(3) If the root node is empty, an axis is selected according to (1) and recorded in the form of an internal node;

(4) If the current node is an internal node, the set of partition structures of this internal node is traversed to find the partition structure that contains the data point to be inserted, i.e. R_L < x_i ≤ R_R, where i is the axis corresponding to the internal node and x_i is the projection onto axis i of the data point being inserted while the index structure is built;

The specific splitting method is:

Embodiment:

For a query point Q = {x₁, ..., x_N} and a range query threshold T, the specific range query algorithm is as follows:

(1) The root node is set as the current node;

(2) If the current node is an internal node, the partition structure in the internal node that satisfies R_Lj < x_i ≤ R_Rj is found and set as the current partition structure;

(3) Starting from the current partition structure, the set of partition structures of the internal node is searched forward and backward. In the forward search, if T < |λ₂ - λ₃|/2 is satisfied (where λ₂ and λ₃ are the distances from Q to the left and right reference points), the forward search stops; in the backward search, if T < |λ₂ - λ₃|/2 is satisfied, the backward search stops. During the search, (4) and (5) are executed;

(4) If the distance from the query point to the left reference point is less than its distance to the right reference point, the current node is set to the left region pointer PLR of this partition structure, and (2) is repeated;

(5) If the distance from the query point to the left reference point is greater than or equal to its distance to the right reference point, the current node is set to the right region pointer PRR of this partition structure, and (2) is repeated;

(6) If the current node is a leaf node: for each data point in the leaf node, if its distances to the left and right reference points and the query point's distances to the left and right reference points all satisfy T ≥ |λ₂ - λ₃|/2, the distance is calculated; if the distance is less than the threshold T, the data point is added to the result set; otherwise, the data point is skipped.
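The leaf-level filtering in step (6) can be sketched in Python as follows. The sketch uses the plain triangle-inequality test |d(X, R) - d(Q, R)| ≤ T against each stored reference distance; that is an assumption about how the stored distances are meant to be compared, not the exact formula of the text. The function names and the tuple layout of `entries` are likewise hypothetical, and the partition-level forward and backward search of step (3) is not covered.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def filter_leaf(entries, data, q, q_left, q_right, T):
    """Return (ids within distance T of q, number of full distances computed).

    entries: (point id, dist to left ref, dist to right ref) tuples
    q_left, q_right: distances from q to the same two reference points
    """
    hits, computed = [], 0
    for pid, d_left, d_right in entries:
        # Triangle inequality: d(X, Q) >= |d(X, R) - d(Q, R)| for any R,
        # so a gap larger than T at either reference point rules X out
        # without touching the full vectors.
        if abs(d_left - q_left) > T or abs(d_right - q_right) > T:
            continue
        computed += 1
        if euclid(data[pid], q) < T:
            hits.append(pid)
    return hits, computed
```

Because only the two stored scalars are compared before the full distance is computed, the number of high-dimensional distance calculations can never exceed the number of entries in the leaf and is usually smaller.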

In the invention, the relevant parameters are determined according to the following principles:

1) Number of divisions ε: for the same number of divisions, different data distribution types have little influence on query time. As the number of divisions increases, query time tends to decrease; once the number of divisions reaches a certain level, however, further increases cause a slight increase in query time. This shows that, for a fixed number of data points, increasing the number of divisions achieves better range pruning and reduces the number of distance calculations. When the number of divisions reaches a certain value, the grid cells in the space are already small enough relative to the number of data points, so further divisions cannot reduce the number of distance calculations and instead increase the retrieval time spent on internal nodes, which increases the overall query time.

2) Threshold m: for the same number of data points and dimension, and under the same threshold, different data distribution types have little influence on query time. As the threshold m increases, query time tends to decrease. This shows that, for a fixed number of data points and a fixed number of divisions, increasing m achieves better single-point pruning and reduces the retrieval time on internal nodes. However, the larger the threshold m, the larger the grid cells produced by the index structure for a fixed number of data points, so range pruning becomes less effective and more distance calculations are needed.

In summary, the invention proposes a case high-dimensional index structure based on range pruning, and experimental analysis shows that this index structure is insensitive to the data distribution. The experimental results also show that the index structure of the invention has better retrieval performance than sequential scanning and the R-Tree index structure, and has very high query efficiency.

Claims

1. A similarity query method for high-dimensional equipment failure case data based on distance metric, characterized in that, based on the aggregation property of equipment failure cases, the distances between the query vector and reference points are used to exclude data in the equipment failure case vector space,

I. The high-dimensional index structure for equipment failure cases based on distance metric is built by the following specific steps:

(2) The root node is set as the current node;

(3) If the root node is empty, an axis is selected according to step (1) and recorded in the form of an internal node, and the procedure jumps to step (4); if the root node is not empty, the procedure jumps to step (4);

(4) If the current node is an internal node, the set of partition structures of this internal node is traversed to find the partition structure that contains the data point to be inserted, i.e. R_L < x_i ≤ R_R, where i is the axis corresponding to the internal node, x_i is the projection point on axis i of the data point being inserted while the index structure is built, R_L is the left reference point of this division, and R_R is the right reference point of this division; otherwise, the procedure jumps to step (5);

(6) If the current node is a leaf node, the number of data points in the current node is greater than or equal to m, where m is the specified threshold, and there are still axes available for division, the splitting method is called, the current node is set to the internal node to be added, and the procedure jumps to (2); otherwise, the procedure jumps to step (7);

(7) The distances from the data point to the left and right reference points of the partition structure corresponding to the leaf node are calculated, and the data point is inserted into the leaf node; a leaf node comprises a data set and the distances from the data to the left and right reference points contained in the parent node; the data insertion method ends;

The splitting method is:

a. The current partition structure is set to the partition structure that points to the leaf node;

b. If (R_R - R_L) ≤ 1/ε in the current partition structure, jump to step c; otherwise, the current partition structure is divided into two: {R_L, (R_L + R_R)/2, PLR1, PLR2} and {(R_L + R_R)/2, R_R, PRR1, PRR2}, where the projection onto the axis of the current partition structure of the data points in the leaf nodes of the PLR1 subtree satisfies R_L < x_i ≤ (R_L + R_R)/4, that of the data points in the leaf nodes of the PLR2 subtree satisfies (R_L + R_R)/4 < x_i ≤ (R_L + R_R)/2, that of the data points in the leaf nodes of the PRR1 subtree satisfies (R_L + R_R)/2 < x_i ≤ 3(R_L + R_R)/4, and that of the data points in the leaf nodes of the PRR2 subtree satisfies 3(R_L + R_R)/4 < x_i ≤ (R_L + R_R); the data insertion method then continues;

c. The axes are sorted by the distribution density of the data on each axis, and the first axis whose distribution density is less than that of the axis of the partition structure pointing to this leaf node is selected;

d. An internal node is generated for the selected axis, and the partition structure that pointed to this leaf node is made to point to the generated internal node;

e. The current node is set to the internal node generated in step d, and the data points of the leaf node are reinserted.

II. The use of the distances between the query vector and the reference points to exclude data in the equipment failure case vector space, for a query point Q = {x₁, ..., x_N} and a range query threshold T, comprises the following steps:

(1) The root node is set as the current node;

(2) If the current node is an internal node, the partition structure in the internal node that satisfies R_Lj < x_i ≤ R_Rj is found and set as the current partition structure;

(3) Starting from the current partition structure, the set of partition structures of the internal node is searched forward and backward; in the forward search, if T < |λ₂ - λ₃|/2 is satisfied, where λ₂ and λ₃ are the distances from Q to the left and right reference points, the forward search stops; in the backward search, if T < |λ₂ - λ₃|/2 is satisfied, the backward search stops; during the search, step (4) and step (5) are executed;

(4) If the distance from the query point to the left reference point is less than its distance to the right reference point, the current node is set to the left region pointer PLR of this partition structure, and step (2) is repeated; otherwise, the procedure jumps to step (5);

(5) If the distance from the query point to the left reference point is greater than or equal to its distance to the right reference point, the current node is set to the right region pointer PRR of this partition structure, and step (2) is repeated; otherwise, the procedure jumps to step (6);

2. The method for building a high-dimensional index structure for equipment failure cases based on distance metric according to claim 1, characterized in that the number of divisions ε used when building the index structure takes a value in [15, 25].

3. The method for building a high-dimensional index structure for equipment failure cases based on distance metric according to claim 1, characterized in that the threshold m used when building the index structure takes a value in [80, 120].