CN116429442A - Method and apparatus for determining a distance metric to determine a distance dimension of a heterogeneous data point

Info

Publication number: CN116429442A
Application number: CN202310028660.XA
Authority: CN
Inventors: K·格劳; M·沃尔勒
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2022-01-13
Filing date: 2023-01-09
Publication date: 2023-07-14
Also published as: US20230222181A1; DE102022200288A1; JP2023103189A; KR20230109561A

Abstract

The invention relates to a computer-implemented method for determining a distance measure for determining a distance to a data point having a heterogeneous parameter class, having the following steps: providing (S1) training data records which respectively assign data points to markers and which are divided into training data points of a training set and verification data points of a verification set; training (S2) a data-based system model (4) using the training set such that the system model (4) assigns model outputs to the data points, respectively; for each verification data point of the verification set, determining (S3, S4) a quality measure of the system model (4) and a distance value from a closest training data point of each of the parameter classes, wherein the distance value from the closest training data point is determined separately with respect to the relevant parameter class; determining (S5) a distance value of a maximum quality measure for each of the parameter classes; and determining a distance measure from the distance value of the maximum quality measure for each of the parameter classes.

Description

Method and apparatus for determining a distance metric to determine a distance dimension of a heterogeneous data point

Technical Field

The present invention relates to a method of evaluating data points using a distance metric, such as in nearest neighbor methods, particularly for anomaly identification and the like. More particularly, the present invention relates to determining a distance metric in the case of heterogeneous data points having multiple physical quantities.

Background

In the operation of technical systems, system states or state-change processes are generally evaluated. These system states or state-change processes are typically determined for a particular time step, point in time or time period by means of sensors or models and are provided as data points for further evaluation. Such data points can be evaluated, for example, by means of a physics-based or data-based evaluation model.

Another possibility for evaluating data points with respect to other reference data points consists, for example, in applying a nearest neighbor method, which is based on a distance measure for determining the distance dimension of the data point to be evaluated from one or more reference data points.

Disclosure of Invention

According to the present invention, a method for determining a distance measure for determining a distance of a data point having heterogeneous parameter classes according to claim 1 and a corresponding apparatus according to the further independent claim are provided.

Further embodiments are specified in the dependent claims.

According to a first aspect, there is provided a method for determining a distance measure to determine a distance to a data point having a heterogeneous parameter class, the method having the steps of:

-providing training data records, which respectively assign data points to markers and which are divided into training data points of a training set and verification data points of a verification set;

-training a data-based system model with the training set such that the system model assigns model outputs to data points, respectively;

-determining, for each verification data point of the verification set, a quality measure of the system model and a distance value from a closest training data point of each of the parameter classes, wherein the distance value from the closest training data point is determined separately with respect to the relevant parameter class;

-determining a distance value of a maximum quality measure for each of the parameter classes;

-determining the distance measure from the distance value of the maximum quality measure of each of the parameter classes.

Data points typically have values assigned to different parameter classes, i.e., parameters or physical quantities. Various methods use distance measures to evaluate data points with respect to other reference data points, in order to determine the position of the data point to be evaluated relative to those reference data points.

In the case of homogeneous parameters in the data points to be evaluated, Euclidean distances or distances in the form of L1 or L2 norms can generally be specified as distance measures. Conventional distance measures are not readily suitable, however, for data points in which the values of the underlying physical quantities lie in different value ranges. If, for example, the value range of one parameter in a data point differs significantly from that of another parameter, the distance is generally dominated by the physical quantity whose value range has the highest upper limit or the lowest lower limit. This typically leads to undesirable results in subsequent evaluations, such as when applying nearest neighbor methods.
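The dominance effect described above can be illustrated with a minimal sketch. The two parameters (a rail pressure in bar and a time parameter in ms) and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data points: (rail pressure in bar, time parameter in ms).
reference = np.array([1500.0, 0.40])
candidate_a = np.array([1510.0, 0.40])  # pressure changes by less than 1 %
candidate_b = np.array([1500.0, 0.80])  # time parameter doubles

# Plain (unweighted) Euclidean distances to the reference point.
dist_a = np.linalg.norm(candidate_a - reference)
dist_b = np.linalg.norm(candidate_b - reference)

# The small relative pressure change yields a far larger distance
# (about 10) than the doubled time parameter (about 0.4).
print(dist_a, dist_b)
```

With an unweighted norm, the parameter with the larger value range decides which point counts as "close", regardless of how relevant it is to the system behavior.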

For monitoring or regulating the technical system, physical measurement variables are detected by means of a sensor system, which represent the respective current state of the technical system. These measurement variables are detected by means of sensors, such as pressure sensors, temperature sensors, acceleration sensors, vibration sensors, radiation sensors, mass flow sensors, and cameras, lidar or radar sensors, which are read in predetermined sampling steps. Within the sampling step, the respective physical quantity, the respective model value, but also the time sequence of physical quantities or one or more camera-based image data or moving image data can be detected as a respective parameter class with respect to the detection range. The measurement data detected in this way for the individual parameter classes are usually combined into data points for further evaluation and processed further.

Thus, the data point may include a plurality of parameter classes that respectively correspond to a single physical quantity, a time series of physical quantities, image data of the imaging device, or moving image data of the imaging device.

One possible further processing step is evaluation by a nearest neighbor method, in which the distance of the data point to be evaluated from other reference points has to be determined. From the distance or distances thus obtained, a decision can be made as to whether an anomaly is present, in the case of anomaly detection, or as to the quality of the data point, for example when it is used as a training data point. The distance is usually determined by means of a distance measure, which is usually based on the Euclidean distance.

When different parameter classes, i.e. physical state parameters, physical change-process parameters and/or image data, are mapped in data points, the data points can have different formats, wherein the elements of the data points are defined in different value ranges. For example, the data points may include parameter classes in the form of time series data, image data, moving image data and individual scalar values of state parameters. For example, a data point x may have the following format:

$x = (a, b, y_1, \dots, y_n, z_1, \dots, z_m, B)$

where a and b are individual values corresponding to respective values of state parameters, $y_1 \dots y_n$ and $z_1 \dots z_m$ correspond to time series data of the corresponding physical quantities over time steps 1..n and 1..m, and B corresponds to a matrix of image points of the image data, where a, b, y, z and B each represent a parameter class.

These parameter classes can therefore each lie in different value ranges, so that the state parameter with the largest value range usually dominates when the distance is determined in a conventional manner.

Adjusting the distance measure, i.e. the rule for determining the distance, is not straightforward, since the influence of the respective state parameters on the system behavior is not known. The above method therefore proposes determining a distance measure that takes into account the effect of the corresponding state information on the system behavior. For this purpose, the method first provides a data-based system model that maps data points to corresponding measured or otherwise determined system parameters. That is, the system model is used to evaluate a technical system whose state at a particular time step, point in time or period of time is indicated by the data points.

That is, the system model provides a way to evaluate the distance function. On average, a data point close to the training data should yield a higher quality than one that is farther away. That is, a (weak) correlation between the distance function and the quality function may be expected for the trained system model. If the system model had not been trained with these data, no such correlation would be expected.

Training of the system model is based on the training set of the training data records, from which a verification set has previously been separated. The system model is trained until convergence, i.e., until a convergence criterion is met, meaning until the value of the quality function no longer changes significantly on average.

The validation set may be selected according to a common scheme, for example an approximately 60/20/20 split, in which 60% of the data points are training data points of the training set, 20% are validation data points of the validation set and 20% are test data points used to ultimately evaluate the trained quality function. However, other divisions are also conceivable. If a meta-parameter is present, it can also be taken into account (e.g. training data only from Munich and Stuttgart, verification data only from Magdeburg).

It may be provided that the quality measure is determined for the corresponding verification data record based on a difference between the model output of the system model and the marker of the associated verification data record.

The verification data records of the verification set are then used to determine the corresponding distance measure. For this purpose, a quality measure is determined for each verification data point by means of a predefined quality function. In the simplest case, the quality function may indicate the deviation between the model output of the system model at the verification data point and the system parameter specified by the relevant verification data record. Other conceivable quality measures are the loss of the trained model and the predicted softmax probability.
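A 60/20/20 division of this kind might be sketched as follows. The function name `split_indices`, the random shuffling and the seed are assumptions made for illustration; the text does not prescribe how the records are assigned to the sets.

```python
import numpy as np

def split_indices(n_records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return disjoint index arrays for training, validation and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_records)  # assumed: random assignment
    n_train = int(round(fractions[0] * n_records))
    n_val = int(round(fractions[1] * n_records))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]      # remainder forms the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

A split by meta-parameter, such as by recording location, would replace the random permutation with a grouping on that meta-parameter.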
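As a minimal sketch of the simplest quality function mentioned above, the deviation between model output and marker can be turned into a quality value. Using the negative absolute deviation, so that a larger value means better quality, is an assumption of this sketch; the function name `quality_measure` is hypothetical.

```python
import numpy as np

def quality_measure(model_output, marker):
    """Negative absolute deviation: 0 is best, more negative is worse.

    The sign convention is an assumption; the text only requires that the
    quality reflects the deviation between model output and marker.
    """
    model_output = np.asarray(model_output, dtype=float)
    marker = np.asarray(marker, dtype=float)
    return -np.abs(model_output - marker)

# Hypothetical model outputs and markers for three verification data points.
q = quality_measure([1.0, 2.5, 4.0], [1.0, 2.0, 5.0])
print(q)
```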

Then, for each verification data point, a distance value from the closest training data point of the training set is determined. The closest training data point is in each case determined with respect to only one of the parameter classes of the data points. The closest training data point corresponds to the training data point having the smallest distance value from the verification data point in the corresponding parameter class. The distance value may be determined as a simple difference or, in the case of a multi-dimensional parameter, as a Euclidean distance. This is carried out for each of the parameter classes in the data points of the verification data records.

If, for example, the validation and training data records consist of a time-series vector of a pressure signal, a time parameter (scalar) and a temperature parameter (scalar), the distance to the nearest training data point is determined for every validation data record separately for the pressure signal, the time parameter and the temperature parameter, in each case with respect to the dimension of the relevant parameter: as the Euclidean distance between the pressure time-series vectors of the two data points considered, as the difference between their time parameters, or as the difference between their temperature parameters.

Thus, for each verification data record and for each parameter class in the data point considered, a quality measure and a distance value are obtained. For each of these parameter classes, the maximum of the quality measure is now determined and the associated distance value is assigned to the parameter class. Edge effects are suppressed in the process, for example by considering only distance values between 5% and 95% of the determined maximum distance value.

The distance values thus determined for the parameter classes of the data points can now be used to assign weights to the respective parameter classes. For this purpose, the assigned distance values can be normalized to 1 and used to weight the squared terms assigned to the respective parameter classes when determining the Euclidean distance.

In this manner, a distance measure is obtained for determining the distance between data points of any heterogeneous format.
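The per-class nearest-neighbor search described above might look as follows. The dictionary layout (one array per parameter class), the class names and the numeric values are assumptions made for this sketch.

```python
import numpy as np

def nearest_distance_per_class(train, val):
    """For each parameter class, return the distance from every validation
    point to its closest training point, measured in that class only."""
    result = {}
    for name in train:
        t = np.asarray(train[name], dtype=float)
        v = np.asarray(val[name], dtype=float)
        t = t.reshape(len(t), -1)  # scalars become 1-element vectors
        v = v.reshape(len(v), -1)
        # Pairwise Euclidean distances, shape (n_val, n_train); for scalar
        # classes this reduces to the absolute difference.
        pairwise = np.linalg.norm(v[:, None, :] - t[None, :, :], axis=2)
        result[name] = pairwise.min(axis=1)
    return result

# Hypothetical records: pressure time series, time and temperature scalars.
train = {
    "pressure_signal": np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]),
    "time": np.array([0.1, 0.5]),
    "temperature": np.array([20.0, 80.0]),
}
val = {
    "pressure_signal": np.array([[0.0, 0.0, 1.0]]),
    "time": np.array([0.45]),
    "temperature": np.array([30.0]),
}
distances = nearest_distance_per_class(train, val)
print(distances)
```

Note that the closest training point may differ from class to class: here the first training record is closest in pressure signal and temperature, while the second is closest in time.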
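The selection of the distance value belonging to the maximum quality, restricted to the 5% to 95% band, can be sketched as below. Taking the maximum over individual verification points, rather than over a smoothed or binned distribution, is a simplifying assumption, as are the function name and the sample values.

```python
import numpy as np

def distance_at_max_quality(distances, qualities, lower=0.05, upper=0.95):
    """Distance value of the best-quality point inside the distance band."""
    distances = np.asarray(distances, dtype=float)
    qualities = np.asarray(qualities, dtype=float)
    d_max = distances.max()
    # Suppress edge effects: ignore very small and very large distances.
    in_band = (distances >= lower * d_max) & (distances <= upper * d_max)
    if not in_band.any():
        raise ValueError("no verification point inside the distance band")
    band_idx = np.flatnonzero(in_band)
    best = band_idx[np.argmax(qualities[in_band])]
    return float(distances[best])

# Hypothetical per-class distances and quality values of five points.
d = [0.01, 0.2, 0.5, 0.8, 1.0]
g = [9.0, 3.0, 5.0, 4.0, 8.0]
peak = distance_at_max_quality(d, g)
print(peak)
```

The two highest-quality points lie at the edges (0.01 and 1.0) and are excluded by the band, so the value at distance 0.5 is selected.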

Such a distance dimension determined using the distance metric thus determined may be used, for example, to determine anomalies based on the distance of the data point to be evaluated from other data points. Such distance dimensions may also be used to evaluate a data point or set of data points by analyzing the training data space in order to determine gaps in the training data space or to determine outliers from data points in the training data space and thus to determine training data records for further training of the corresponding model.

According to another aspect, there is provided an apparatus for performing one of the above methods.

Drawings

Embodiments are described in more detail below with reference to the accompanying drawings, in which:

FIG. 1 shows a schematic diagram of a technical system with a series of sensors for detecting status information about the technical system at a certain point in time or period of time;

FIG. 2 shows a flow chart illustrating a method for determining a distance metric for a technical system;

FIGS. 3a to 3c show graphs of different parameter classes to present a distribution of quality metrics and distance values limited to the respective parameter class; and

FIG. 4 illustrates an injection system corresponding to FIG. 1 with a sensor system and anomaly identification.

Detailed Description

Fig. 1 shows a schematic diagram of a technical system 1 with a series of sensors 2, which are designed to detect status information of the technical system 1. These sensors may include, for example, pressure sensors, temperature sensors, acceleration sensors, vibration sensors, radiation sensors, mass flow sensors, and cameras, lidar or radar sensors, and the like. The state information has a format and can be present, with respect to a time step, i.e. a point in time or a time period, as a scalar parameter, as time-series information of a parameter, as image information or as moving image information.

For further processing of the sensor data, these are detected at predetermined points in time or within predetermined time periods and converted in the formatting block 3 into a corresponding data format having a plurality of parameter classes. The data format results in data points that may be provided in the form of data vectors or data tensors. The data point integrates the different parametric classes of the state information and maps the parametric classes with different numbers of elements in the data point, respectively.

The data points can now be evaluated in the data-based system model 4 in order to determine system variables which are used for monitoring and/or controlling the technical device 5, in particular for controlling downstream functions based on the system variables, for controlling the technical device 5 or for monitoring the technical system 1.

The value ranges of the individual state information in these parameter classes can differ significantly from one another.

An anomaly identification block 6 may also be provided, which receives the corresponding data points to be evaluated. The anomaly identification block 6 may be designed to determine the distance from a reference data point representing normal operation. The reference data points are predefined, and the distance of a data point to be evaluated from them can be determined by means of a distance measure.

The distance metric may be based on, for example, a weighted L2 norm, which provides a separate weight factor for each parametric class.

If the distance dimension determined from the predefined distance measure is higher than the predefined anomaly threshold value, an anomaly is detected and correspondingly reported via a signal S.
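The anomaly check described above might be sketched with a weighted L2 norm that applies one weight factor per parameter class. The function names, the use of the nearest of several reference points, and all weights, values and the threshold are assumptions of this sketch.

```python
import numpy as np

def weighted_l2(point, reference, weights):
    """Weighted L2 norm with a separate weight factor per parameter class."""
    total = 0.0
    for name, w in weights.items():
        diff = np.atleast_1d(
            np.asarray(point[name], dtype=float)
            - np.asarray(reference[name], dtype=float)
        )
        total += w * float(np.sum(diff ** 2))  # weighted squared terms
    return float(np.sqrt(total))

def is_anomaly(point, references, weights, threshold):
    """Report an anomaly if even the nearest reference point is too far."""
    nearest = min(weighted_l2(point, ref, weights) for ref in references)
    return nearest > threshold

# Hypothetical weights, reference point for normal operation and threshold.
weights = {"signal": 1.0, "time": 8.0}
references = [{"signal": [0.0, 0.0], "time": 0.0}]
normal_point = {"signal": [0.1, 0.0], "time": 0.0}
odd_point = {"signal": [0.0, 0.0], "time": 1.0}

print(is_anomaly(normal_point, references, weights, threshold=1.0))
print(is_anomaly(odd_point, references, weights, threshold=1.0))
```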

Fig. 2 schematically shows a flow chart for elucidating a method for determining a metric parameter of a distance metric that can be used in the anomaly identification block 6. The method may be performed in a computer and used for parameterizing the anomaly identification model in the anomaly identification block 6.

For this purpose, training data records are first provided in step S1, which assign data points to markers, wherein the markers correspond to measured, simulated or modeled system variables or other variables describing the system behavior. In particular, the system variable is selected such that it depends on all of the parameter classes used in the data points of the training data records. These training data records are divided into a training set and a verification set.

In step S2, a data-based system model 4 or other data-based model is trained by means of the training sets of these training data records, which maps training data points of the training sets to correspondingly assigned markers.

In step S3, for each data point of the validation set of training data records, a quality measure is determined according to the predefined quality function. One possible quality function may correspond to the simple difference between the model output of the data-based system model 4 and the marker at each of the data points in the validation set. In this way, a quality measure is obtained for each data point of the validation set.

Further, in step S4, a distance value from the closest data point of the training set is determined for each data point of the validation set. The distance value is determined only with respect to the corresponding parameter class. That is, in the case of scalar parameters, the distance value corresponds to the simple difference or squared difference from the value of the corresponding parameter class in the closest data point of the training set. The closest data point of the training set corresponds to the data point with the smallest distance value in the corresponding parameter class.

In the case of time series or multidimensional parameter classes, the distance value between the data point to be evaluated of the validation set and the data point of the training set corresponds to, for example, the Euclidean distance. Now, a quality measure is obtained for each of the data points of the validation set, and a distance value is obtained for each of its parameter classes. This is illustrated by way of example in the graphs of FIGS. 3a to 3c for different parameter classes, wherein FIG. 3a corresponds to a scalar parameter of the pressure, FIG. 3b corresponds to the parameter class of a time-series signal of an exemplary piezoelectric sensor, and FIG. 3c corresponds to the parameter class of a time parameter.

Now, in step S5, the maximum of the quality measure is determined for each parameter class within the middle range of all distance values of that parameter class. That is, particularly small and particularly large distance values with respect to the parameter class are disregarded when determining the maximum of the corresponding quality measure. For example, only distance values between 5% and 95%, preferably between 10% and 90%, of the maximum distance value of the relevant parameter class may be considered. This yields, for each parameter class, the distance value of the maximum quality measure; the relative relation between these distance values determines the weight factors of the distance measure.

In particular, after the distance values of the maximum quality measure have been normalized with respect to one another, the following applies:

$$w_i = \frac{A_{\max}}{A_i}, \qquad A_{\max} = \max_i A_i$$

where $A_i$ is the distance value of the maximum quality measure determined in this way for parameter class $i$, and $A_{\max}$ is the maximum of these distance values. The parameter class with the distance value $A_{\max}$ receives a weight of 1. The quotients $A_{\max}/A_i$ of these distance values determine the weights of the other parameter classes. In the figures, the peak for the sensor signal is at 0.8 and the other two peaks are at 0.1, whereby the weight factors for time and pressure are 8 and the weight factor for the signal is 1.

For example, let data points have three different parameter classes: a sensor signal $x$, a time parameter $t$ and a pressure parameter $p$. The distance between two data points is determined separately for each parameter class. Each of these distances is multiplied by the determined weight, and the weighted distances are then added together.

Alternatively, weight factors may be determined by fitting the distribution of the quality measure with respect to the distance value of each of the parameter classes. From these weight factors, the corresponding weight factors for the distance measure can likewise be determined after normalization. The weight factors $w_i$ thus determined may now be used in an anomaly identification model.
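The weight determination from the peak positions can be reproduced with a few lines. The function name and dictionary keys are hypothetical; the peak values 0.8 and 0.1 and the resulting weights 1 and 8 are those stated in the example above.

```python
def weights_from_peak_distances(peak_distances):
    """Weight of each parameter class: largest peak distance divided by the
    peak distance of that class, so the largest peak gets weight 1."""
    d_max = max(peak_distances.values())
    return {name: d_max / d for name, d in peak_distances.items()}

# Peak positions of the quality distributions as given in the text.
peaks = {"signal": 0.8, "time": 0.1, "pressure": 0.1}
class_weights = weights_from_peak_distances(peaks)
print(class_weights)
```

A parameter class whose quality peak lies at a small distance value thus receives a large weight, so differences in that class count more strongly in the combined distance.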

Fig. 4 shows, as an example of a sensor system 1, an injection system 10 of an internal combustion engine 12 of a motor vehicle, of which one cylinder 13 (of, in particular, a plurality of cylinders) is shown by way of example. The internal combustion engine 12 is preferably designed as a diesel engine with direct injection, but may also be provided as a gasoline engine.

The cylinder 13 has an inlet valve 14 and an outlet valve 15 for delivering fresh air and for discharging combustion exhaust gases.

In addition, fuel for operating the internal combustion engine 12 is injected into the combustion chamber 17 of the cylinder 13 via the injection valve 16. For this purpose, the fuel is supplied to the injection valve 16 at high fuel pressure via a fuel supply line 18 in a manner known per se, for example from a common rail.

The injection valve 16 has an electromagnetically or piezoelectrically controllable actuator unit 21, which is coupled to a valve needle 22. In the closed state of the injection valve 16, the valve needle 22 is located on the needle seat 23. By manipulation of the actuator unit 21, the valve needle 22 is moved in the longitudinal direction and releases a portion of the valve bore in the needle seat 23 to inject fuel under pressure into the combustion chamber 17 of the cylinder 13.

Injection valve 16 also has a piezoelectric sensor 25, which is arranged in injection valve 16. The piezoelectric sensor 25 deforms due to pressure variation in the fuel guided through the injection valve 16, and generates a voltage signal as a sensor signal.

The injection is performed under the control of the control unit 30, which specifies the amount of fuel to be injected by energizing the actuator unit 21. The sensor signal is sampled in time by means of an A/D converter 31 in the control unit 30, in particular at a sampling rate of 0.5 to 5 MHz.

A pressure sensor 18 is also provided to determine the fuel pressure (rail pressure) upstream of the injection valve 16.

During operation of the internal combustion engine 12, the sensor signal is used to determine the correct opening or closing time point of the injection valve 16. For this purpose, the sensor signal is digitized into a time series of evaluation points by means of the A/D converter 31 and evaluated by means of a suitable sensor model, from which the opening duration of the injection valve 16 and accordingly the injected fuel quantity can be determined as a function of the fuel pressure and other operating variables. To determine the opening duration, the opening time point and the closing time point are required in particular, the opening duration being the time difference between these two variables.

The opening time point and/or the closing time point may be determined by considering the time series of the sampled sensor signal. The opening time point and/or the closing time point can be determined in particular by means of a data-based system model. In this system model, the rail pressure and the time parameters for actuating the opening and/or closing of injection valve 16 can be evaluated as additional state variables. The data points to be evaluated now contain a time series of the sensor signal, a scalar value of the rail pressure and a scalar value of the time parameter.

Thus, in connection with the above-described sensor system 1, the training data records consist of data points together with change-point time points, i.e. opening and/or closing time points, as markers.

For the example of injection system 10 described above, a distance measure for the parameter classes of the data points may be determined according to the method described above. To determine the distance measure for injection system 10, a quality measure G may be determined for each data point of the verification set in accordance with the classification model described above, and the graphs of FIGS. 3a to 3c are determined correspondingly therefrom. The graph of FIG. 3a shows the distribution of the quality measure G over the distance value $A_{rail}$ of the rail pressure, FIG. 3b shows the distribution of the quality measure G over the distance value $A_{sens}$ of the sensor-signal time series of the piezoelectric voltage, and FIG. 3c shows the distribution of the quality measure G over the distance value $A_{time}$ of the time parameter. The maxima of these quality distributions are marked at about 0.11 in FIG. 3a, about 0.9 in FIG. 3b and about 0.115 in FIG. 3c, respectively. This results in a distance measure calculated as follows:

The distance between two data points $(x, t, p)$ and $(y, s, q)$ is:

$$d = w_{sens}\,\lVert x - y \rVert + w_{time}\,|t - s| + w_{rail}\,|p - q|$$

with $w_{sens} = 1$, $w_{time} = 0.9/0.115 \approx 8$ and $w_{rail} = 0.9/0.11 \approx 8$. In this case, x and y are the first parameter classes (sensor signals), t and s are the time parameters, and p and q are the pressure parameters.
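For the injection example, a weighted sum of per-class distances between two data points (x, t, p) and (y, s, q) might be sketched as follows. The weighted-sum form, the derivation of the weights as quotients of the peak positions, and the function and key names are assumptions of this sketch; only the approximate peak positions 0.11, 0.9 and 0.115 are taken from the text.

```python
import numpy as np

# Approximate peak positions of FIGS. 3a to 3c as stated in the text.
PEAKS = {"rail": 0.11, "sens": 0.9, "time": 0.115}
# Assumed weighting rule: largest peak position divided by each peak position.
WEIGHTS = {name: max(PEAKS.values()) / peak for name, peak in PEAKS.items()}

def injection_distance(point_a, point_b, weights=WEIGHTS):
    """Weighted sum of per-class distances between (x, t, p) and (y, s, q)."""
    x, t, p = point_a
    y, s, q = point_b
    d_sens = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    d_time = abs(t - s)
    d_rail = abs(p - q)
    return float(
        weights["sens"] * d_sens
        + weights["time"] * d_time
        + weights["rail"] * d_rail
    )

# Hypothetical points that differ only in the sensor signal.
a = ([0.0, 0.0], 0.0, 0.0)
b = ([3.0, 4.0], 0.0, 0.0)
print(injection_distance(a, b))
```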

The distance measure is now used for anomaly identification in the anomaly identification block 6 in order to determine anomalies from the distance dimensions of the data points from the training data points.

Claims

1. A computer-implemented method for determining a distance metric to determine a distance from a data point having a heterogeneous parameter class, the method having the steps of:

-providing (S1) training data records, which respectively assign data points to markers and which are divided into training data points of a training set and verification data points of a verification set;

-training (S2) a data-based system model (4) with the training set such that the system model (4) assigns model outputs to data points, respectively;

-determining (S3, S4), for each verification data point of the verification set, a quality measure of the system model (4) and a distance value from a closest training data point of each of the parameter classes, wherein the distance value from the closest training data point is determined separately with respect to the relevant parameter class;

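-determining (S5) a distance value of a maximum quality measure for each of the parameter classes; and

-determining the distance measure from the distance value of the maximum quality measure for each of the parameter classes.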

2. A method according to claim 1, wherein the quality measure is determined for the respective verification data record as a function of a difference between a model output of the system model (4) and a marker of the associated verification data record.

3. The method according to claim 1 or 2, wherein the data points comprise a plurality of parameter classes, which respectively correspond to a single physical quantity, a time sequence of physical quantities, image data of an imaging device or moving image data of an imaging device, wherein at least two of the parameter classes have assigned value ranges which deviate from one another by more than 50%.

4. A method according to any one of claims 1 to 3, wherein the distance value from the closest training data point is determined as euclidean distance in relation to the relevant parameter class in the case of a multidimensional parameter.

5. The method of any of claims 1 to 4, wherein a distance value of the maximum quality measure for each of the parameter classes is determined only within a range between 5% and 95% of a maximum distance value of the associated parameter class.

6. The method according to any of claims 1 to 5, wherein the distance measure is used for determining anomalies or for evaluating data points based on the distance dimensions of the data point to be evaluated from other data points determined with the distance measure, in order to find voids in a training data space or outliers from data points in the training data space.

7. An apparatus for performing the method of any one of claims 1 to 6.

8. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to perform the steps of the method according to any one of claims 1 to 6.

9. A machine-readable storage medium comprising instructions which, when executed by a computer, cause the computer to perform the steps of the method according to any one of claims 1 to 6.