CN116662847A

CN116662847A - Data distortion identification method based on association density clustering and application thereof

Info

Publication number: CN116662847A
Application number: CN202310279317.2A
Authority: CN
Inventors: 王娟; 杜晓莹; 申祖晨; 祁鑫; 杨娜; 白凡
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2023-03-21
Filing date: 2023-03-21
Publication date: 2023-08-29

Abstract

The application discloses a data distortion identification method based on association density clustering, comprising the following steps: for the historical time series of n strain measuring points of a strain sensor, the degree of association between the measuring point sequences is measured, based on gray type B association, from three angles: the overall displacement difference, which represents the closeness of the n strain measuring points; the overall first-order slope difference, which represents the similarity of the development trend; and the overall second-order slope difference, which represents the similarity of the development speed. A classification model is then constructed, and real-time monitoring data are identified based on the classification model. The application can identify data distortion caused by hardware faults, can diagnose faults of multi-measuring-point equipment by exploiting the correlation of multi-source data, and improves the reliability and robustness of clustering.

Description

Data distortion identification method based on association density clustering and application thereof

Technical Field

The application relates to the technical field of data processing, in particular to a data distortion identification method based on association density clustering and application thereof.

Background

When the safety state of a structure is evaluated from monitoring data, real and reliable measured data are a precondition for determining the actual state of the structure, so the data obtained by monitoring sensors cannot simply be assumed to be accurate. In most health monitoring systems, sensors and other equipment have service lives of only a decade or a few decades, compared with the hundreds of years an ancient wood structure remains in service, and faults caused by the external environment, aging, electromagnetic interference and other factors are common during service. Identifying data distortion caused by equipment faults in real time therefore allows faults to be handled promptly, ensures the reliability of the monitoring system, provides reliable and accurate data for structural safety evaluation, and reduces misjudgment of structural abnormality by the monitoring system.

Data distortion caused by system hardware faults differs from abnormalities caused by structural damage, and the two must be distinguished correctly. Data distortion caused by equipment abnormality usually affects only a single measuring point or a few measuring points, whereas structural strain, under the influence of ambient temperature, shows an obvious correlation among the strain measuring points. Considering that the data from a single sensor are limited and one-sided, this correlation among measuring points can be exploited together with the relatively stable time-series characteristics of structural strain.

Disclosure of Invention

The application aims to provide a data distortion identification method based on association density clustering, which aims to solve the technical problems in the prior art.

In order to solve the technical problems, the application specifically provides the following technical scheme:

in a first aspect of the present application, there is provided a data distortion identification method based on association density clustering, including:

for the historical time series $X = \{x_1(t), x_2(t), \ldots, x_n(t)\}$ of n strain measuring points of a strain sensor, where $0 \le t \le m$, measuring the degree of association between the measuring point sequences, based on gray type B association, from the angles of the overall displacement difference representing the closeness of the n strain measuring points, the overall first-order slope difference representing the similarity of the development trend, and the overall second-order slope difference representing the similarity of the development speed; constructing a classification model; and identifying real-time monitoring data based on the classification model.

Preferably, the method further comprises a normalization process for the historical time series:

dividing the historical time series of the n measuring points over a given period into step-by-step clustering windows, wherein the window length w determines the time span within which an abnormality is located, and the shorter the window, the more timely the judgment;

standardizing each sequence in the window using the z-score, to eliminate the dimension or balance the measurement scale of each strain sensor;

the time series of a window after standardization is shown in formula (6),

wherein n is the measuring point sequence number, and m is the time sequence number.
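The per-window standardization described above can be sketched as follows. This is a minimal illustration, not the patent's formula (6), which did not survive in this text. It assumes the population standard deviation and maps a constant sequence to zeros; the function names and the sample readings are hypothetical.

```python
from statistics import mean, pstdev

def zscore(sequence):
    """Standardize one measuring-point sequence to zero mean and unit spread."""
    mu = mean(sequence)
    sigma = pstdev(sequence)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant sequence has no spread; map it to all zeros.
        return [0.0 for _ in sequence]
    return [(v - mu) / sigma for v in sequence]

def standardize_window(window):
    """Apply the z-score to each measuring-point sequence of a window independently."""
    return [zscore(seq) for seq in window]

# Hypothetical readings from two sensors with different scales.
window = [
    [100.0, 102.0, 104.0, 106.0],
    [10.0, 10.5, 11.0, 11.5],
]
standardized = standardize_window(window)
```

After standardization the two hypothetical sequences coincide, which is the sense in which the z-score balances the measurement scale of different sensors.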

Preferably, the method further comprises smoothing and denoising the normalized historical time series:

the Savitzky-Golay method is selected for smoothing and denoising; it combines convolution with polynomial regression to implement a smoothing filter, and different smoothing effects are obtained by adjusting the sliding window and the fitting order, expressed as follows:

wherein $2l+1$ is the sliding window length, $x_k$ is the center of the sliding window, and $h_i$ is the smoothing coefficient, obtained by least-squares polynomial fitting;

wherein the lower the polynomial fitting order and the longer the window, the more pronounced the smoothing effect.
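As an illustration of how the coefficients $h_i$ arise from a least-squares polynomial fit, the following sketch computes them exactly and applies them as a convolution over a window of length $2l+1$. It is a hedged example under stated assumptions, not the patent's own expression: the helper names are hypothetical, edge samples are left unsmoothed for brevity, and exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Solve a small linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def savgol_coefficients(l, order):
    """Return the 2l+1 coefficients h_i (i = -l..l) for the window center."""
    offsets = range(-l, l + 1)
    p = order + 1
    # Normal-equation matrix A^T A for a polynomial in the offset i.
    ata = [[sum(i ** (r + c) for i in offsets) for c in range(p)] for r in range(p)]
    # The fitted value at the center is the constant term of the polynomial,
    # so h_i = sum_c z_c * i^c with (A^T A) z = e_0.
    z = solve(ata, [1] + [0] * order)
    return [sum(z[c] * i ** c for c in range(p)) for i in offsets]

def savgol_smooth(x, l, order):
    """Smooth interior samples; the first and last l samples are kept as-is."""
    h = savgol_coefficients(l, order)
    smoothed = list(x)
    for k in range(l, len(x) - l):
        smoothed[k] = float(sum(h[j] * x[k - l + j] for j in range(2 * l + 1)))
    return smoothed

# Window length 25 (l = 12) with a third-order fit.
h = savgol_coefficients(12, 3)
```

For a five-point window and a cubic fit, this reproduces the familiar coefficients (-3, 12, 17, 12, -3)/35. Library routines such as SciPy's `savgol_filter` provide the same filter with proper edge handling.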

Preferably, constructing the classification model includes:

differencing the smoothed and denoised window sequence by columns to obtain the overall displacement difference, the overall first-order slope difference and the overall second-order slope difference, merging these with the matrix X to form a new matrix, and applying weight coefficients to the new matrix to obtain the clustering feature matrix Y;

performing cluster analysis, with the Manhattan distance chosen as the inter-sequence distance in the clustering computation;

determining the parameters (ε, MinPts) according to actual engineering requirements, obtaining the final clustering result, and constructing the classification model.
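The clustering step can be sketched with a small DBSCAN-style routine over feature rows. This is an illustration under assumptions, not the patent's implementation: ε is read as a neighborhood radius in Manhattan distance, `min_pts` is read as the number of other rows required within that radius (the row itself is not counted, which differs from some library conventions), and the label -1 marks an outlier. All names and the sample rows are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equally long feature rows."""
    return sum(abs(p - q) for p, q in zip(a, b))

def density_cluster(rows, eps, min_pts):
    """Label each row with a cluster id (0, 1, ...) or -1 for an outlier."""
    n = len(rows)
    # Neighbors within eps, excluding the row itself (assumption).
    neighbors = [
        [j for j in range(n) if j != i and manhattan(rows[i], rows[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point; keep expanding
    return labels

# Three nearby hypothetical feature rows and one distant row.
rows = [[0.0, 0.0], [0.5, 0.2], [0.3, 0.4], [9.0, 9.0]]
labels = density_cluster(rows, eps=1.0, min_pts=1)
```

With these sample rows the first three are grouped together and the distant row is marked as an outlier.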

Preferably, the new matrix comprises:

overall displacement difference matrix:

first order differential matrix:

second order differential matrix:

clustering feature matrix:

wherein $dif_{ij} = x_{i(j+1)} - x_{ij}$, and $w_1$, $w_2$, $w_3$ are weights: if the cluster analysis should depend more on the closeness feature, increase $w_1$ and reduce $w_2$ and $w_3$; if it should depend more on the similarity features, increase $w_2$ and $w_3$ and reduce $w_1$.
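Because the matrices themselves did not survive in this text, the following is only a sketch of one way to assemble a weighted feature row per measuring point. It assumes the displacement block is the standardized sequence itself, so that the Manhattan distance between two rows sums the weighted displacement, first-order and second-order differences between the two sequences. The function names are hypothetical.

```python
def diff(seq):
    """First-order difference along time: seq[j+1] - seq[j]."""
    return [b - a for a, b in zip(seq, seq[1:])]

def feature_row(seq, w1, w2, w3):
    """Concatenate weighted values, first differences and second differences."""
    d1 = diff(seq)   # slope (development trend)
    d2 = diff(d1)    # change of slope (development speed)
    return [w1 * v for v in seq] + [w2 * v for v in d1] + [w3 * v for v in d2]

def feature_matrix(window, w1, w2, w3):
    """One feature row per measuring-point sequence in the window."""
    return [feature_row(seq, w1, w2, w3) for seq in window]

# Distinct hypothetical weights make the three blocks easy to tell apart.
row = feature_row([1, 2, 4, 7], 1, 10, 100)
```

A sequence of m samples yields a row of 3m - 3 features under this layout.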

Preferably, the Manhattan distance is given by formula (11):
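Formula (11) is not reproduced in this text. For reference, the standard definition of the Manhattan distance between two rows $Y_a$ and $Y_b$ of the clustering feature matrix, each with $K$ entries, is given below; the indices $a$, $b$ and the length $K$ are notation introduced here for illustration and may differ from the patent's own symbols.

```latex
d(Y_a, Y_b) = \sum_{k=1}^{K} \left| y_{ak} - y_{bk} \right|
```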

Preferably, identifying real-time monitoring data based on the classification model includes:

performing cluster analysis step by step on the monitoring data collected in real time, using the same window length w;

when the clustering results of all measuring points are consistent with the clustering model, judging that the data for that period are normal;

when the clustering results are inconsistent with the model and outliers appear, treating the outlying measuring points as distorted;

when the clustering results are inconsistent with the model and no outliers appear, carrying out further data distortion identification.
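The three-way decision above can be sketched as a small function. This is an illustrative reading, not the patent's procedure: it assumes cluster labels are integers with -1 marking an outlier, and it treats two results as consistent when they describe the same grouping regardless of how clusters are numbered. The function names are hypothetical.

```python
def canonical(labels, outlier_label=-1):
    """Renumber clusters in order of first appearance so groupings can be compared."""
    mapping = {}
    out = []
    for lab in labels:
        if lab == outlier_label:
            out.append(outlier_label)
        else:
            out.append(mapping.setdefault(lab, len(mapping)))
    return out

def identify(labels, model_labels, outlier_label=-1):
    """Return (verdict, indices of measuring points flagged as distorted)."""
    if canonical(labels, outlier_label) == canonical(model_labels, outlier_label):
        return "normal", []
    outliers = [i for i, lab in enumerate(labels) if lab == outlier_label]
    if outliers:
        return "distortion", outliers
    return "further identification", []

# Hypothetical model: all 9 measuring points form a single cluster.
model = [0] * 9
verdict = identify([0, 0, 0, -1, 0, 0, 0, 0, 0], model)
```

Here the fourth measuring point (index 3) is flagged, since it is the only outlier in a window whose grouping differs from the model.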

In another aspect, the application provides an application of the data distortion identification method to the health monitoring of wood structures.

Preferably, the measurement points include strain measurement points of the homogeneous component and strain measurement points of the heterogeneous component.

Compared with the prior art, the application has the following beneficial effects:

Based on the gray type B association density clustering method, the application identifies data distortion caused by hardware faults and can diagnose faults of multi-measuring-point equipment by exploiting the correlation of multi-source data. The improved gray type B association degree is used to determine and compute the feature attributes, which improves the reliability and robustness of clustering and allows distorted data to be identified more effectively.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those of ordinary skill in the art that the drawings in the following description are exemplary only and that other implementations can be obtained from the extensions of the drawings provided without inventive effort.

FIG. 1 is a strain sensor layout of the present application;

FIG. 2 is a schematic representation of a strain and temperature and humidity sensor arrangement according to the present application;

FIG. 3 is a schematic diagram of a similar component strain measurement point cluster mode I according to the present application;

FIG. 4 is a schematic diagram of a similar component strain measurement point cluster mode II according to the present application;

FIG. 5 is a schematic diagram of a similar component strain measurement point cluster mode III of the present application.

FIG. 6 is a schematic diagram of a cluster mode one of strain measurement points of a heterogeneous component according to the present application;

FIG. 7 is a schematic diagram of a cluster pattern two of strain measurement points of a heterogeneous component according to the present application;

FIG. 8 is a schematic diagram of a cluster mode III of strain measurement points of a heterogeneous component according to the present application;

FIG. 9 is a flow chart of data distortion identification based on density clustering in accordance with the present application;

FIG. 10 is a schematic diagram of C2-4 outlier identification of the present application;

FIG. 11 is a schematic diagram of C2-4 constant offset identification of the present application;

FIG. 12 is a schematic diagram of C2-3 data stuck identification of the present application;

fig. 13 is a flow chart of the overall method of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

As shown in fig. 1, the application provides a data distortion identification method based on association density clustering. For the historical time series $X = \{x_1(t), x_2(t), \ldots, x_n(t)\}$ of n strain measuring points of a strain sensor, where $0 \le t \le m$, the degree of association between the measuring point sequences is measured, based on gray type B association, from the angles of the overall displacement difference representing the closeness of the n strain measuring points, the overall first-order slope difference representing the similarity of the development trend, and the overall second-order slope difference representing the similarity of the development speed; a classification model is constructed, and real-time monitoring data are identified based on the classification model.

Wherein, in theory, closeness and similarity can be expressed as:

proximity:

similarity:

where the first quantity is the overall displacement difference, representing closeness; the second is the overall first-order slope difference, representing the similarity of the development trend; and the third is the overall second-order slope difference, representing the similarity of the development speed. The degree of association is further defined as:

The method retains the core idea of the original theory, which measures the degree of association between sequences from the two angles of closeness and similarity.
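The expressions for closeness, similarity and the degree of association are not reproduced in this text. As a hedged illustration of the idea, the sketch below computes three totals for a pair of sequences: the summed absolute difference of values, of first-order differences and of second-order differences. The degree function at the end, including the division of each total by its number of terms, is an assumption of this sketch modeled on commonly cited forms of the gray type B degree of association, and is not taken from the patent. All names are hypothetical.

```python
def b_type_terms(x, y):
    """Return (d0, d1, d2): displacement, first-order and second-order totals."""
    e = [a - b for a, b in zip(x, y)]            # pointwise difference series
    e1 = [b - a for a, b in zip(e, e[1:])]       # its first-order difference
    e2 = [b - a for a, b in zip(e1, e1[1:])]     # its second-order difference
    d0 = sum(abs(v) for v in e)
    d1 = sum(abs(v) for v in e1)
    d2 = sum(abs(v) for v in e2)
    return d0, d1, d2

def b_type_degree(x, y):
    """Illustrative degree in (0, 1]; equals 1 only for identical sequences."""
    n = len(x)  # requires at least 3 samples
    d0, d1, d2 = b_type_terms(x, y)
    # Assumed normalization: average each total over its number of terms.
    return 1.0 / (1.0 + d0 / n + d1 / (n - 1) + d2 / (n - 2))

# Two hypothetical sequences with identical shape but a constant gap of 1.
terms = b_type_terms([0, 1, 2, 3], [1, 2, 3, 4])
degree = b_type_degree([0, 1, 2, 3], [1, 2, 3, 4])
```

In this example only the displacement total is non-zero: the sequences differ in level (closeness) but not in trend or speed (similarity).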

When the degree of association is calculated, the data scales of the two measures, closeness and similarity, are balanced according to the characteristics of the monitoring data. Taking into account the practical engineering application and the relative importance of the two measures, the final association features are obtained by multiplying the overall displacement difference, the overall first-order slope difference and the overall second-order slope difference by their respective weight coefficients, as shown in formula (5):

If the cluster analysis should depend more on the closeness feature, increase the weight $w_1$ and reduce the weights $w_2$ and $w_3$. If it should depend more on the similarity features, increase the weights $w_2$ and $w_3$ and reduce the weight $w_1$.

The normalization of the historical time series is as follows:

Smoothing and denoising the standardized historical time sequence as follows:

wherein $2l+1$ is the sliding window length, $x_k$ is the center of the sliding window, and $h_i$ is the smoothing coefficient, obtained by least-squares polynomial fitting;

The construction of the classification model comprises the following steps:

the new matrix comprises:

overall displacement difference matrix:

first order differential matrix:

second order differential matrix:

clustering feature matrix:

the Manhattan distance is given by formula (11):

The parameters (ε, MinPts) are determined according to actual engineering requirements, the final clustering result is obtained, and the classification model is constructed; MinPts is the number of clusters generated by clustering, and ε refers to the weight parameters $w_1$, $w_2$ and $w_3$.

Identifying real-time monitoring data based on the classification model as follows:

The application also provides application of the data distortion identification method to health monitoring of the wood structure. The measuring points comprise strain measuring points of the same type of component and strain measuring points of different type of component.

Specific examples are provided below for illustration:

The object is an ancient timber building equipped with a health monitoring system; the layout of the strain sensors is shown in fig. 1 and fig. 2. The sampling interval is 10 minutes and the step-by-step clustering window is 24 hours long, so each window contains 144 sampling points.
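The window size follows directly from the sampling interval: 24 hours at one sample every 10 minutes gives 144 samples. A minimal sketch of cutting a sequence into such windows is shown below. It assumes consecutive, non-overlapping windows and drops an incomplete trailing window; the function names are hypothetical.

```python
def samples_per_window(window_hours=24, interval_minutes=10):
    """Number of samples in one clustering window."""
    return window_hours * 60 // interval_minutes

def split_windows(series, w):
    """Cut one sequence into consecutive, non-overlapping windows of w samples."""
    return [series[s:s + w] for s in range(0, len(series) - w + 1, w)]

w = samples_per_window()
# Hypothetical record slightly longer than two days; the extra 10 samples
# do not fill a window and are dropped.
record = list(range(2 * w + 10))
windows = split_windows(record, w)
```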

Standardization and smoothing-denoising preprocessing are carried out as described above. Based on the characteristics of the acquired data, the smoothing order is set to 3 and the moving window length to 25, which filters noise while keeping the effect of smoothing on distorted data as small as possible.

When the new feature attribute matrix is constructed after smoothing and denoising, the characteristics of the strain data in this application and the requirements on the closeness and similarity features are considered, and the weight coefficients are chosen as $w_1 = 0.1$, $w_2 = 0.45$, $w_3 = 0.45$.

Cluster analysis of strain measuring points of similar components:

Taking 12 through-column strain measuring points as an example, training is carried out on historical strain data. With the clustering parameters (ε, MinPts) set to (5.4, 1), a stable clustering result is essentially obtained, indicating good robustness. The final clustering result presents three modes:

Cluster mode one: all 12 measuring points are clustered into one class. This mode appears in two forms: in one the time-series plots are nearly sinusoidal, and in the other they have no fixed form, as shown in fig. 3.

Cluster mode two: the three measuring points C1-2, C1-3 and C1-4 tend to be classified as outliers, while all other measuring points are clustered into one class. The measuring points in that class again show two forms: nearly sinusoidal time-series plots, or plots with no fixed form, as shown in fig. 4.

Cluster mode three: the time-series plots of the measuring points are disordered, the clustering shows no obvious regularity, most measuring points cannot be classified, and no consistent clustering result is obtained, as shown in fig. 5.

Cluster analysis of strain measurement points of heterogeneous components:

Frame beam strain is added to the through-column strain analysis, giving cluster analysis over 24 measuring points in total. The clustering results under both higher and lower ambient humidity are basically consistent with those for through-column strain alone. Besides the three column strain measuring points C1-2, C1-3 and C1-4, the beam strain measuring points B1-4, B3-2 and B3-4 also tend to be marked as outliers. The strain clustering results for heterogeneous components therefore show consistent regularity, which indicates that the cluster analysis method based on gray theory is robust and can provide a reliable clustering model for the subsequent real-time identification of data distortion, as shown in figs. 6, 7 and 8.

Verification of the distorted data identification method:

Constructing the clustering model: a classification model is built using 9 measuring points, including C1-1, C2-2 and C2-3. The working conditions during training fall into two cases, and two clustering models are constructed accordingly: "clustered into one class" and "irregular". The "clustered into one class" case corresponds to mode one in training, as shown in table 2, and accounts for 86% of the total, which indicates that the clustering method can be used to handle data distortion in at least 86% of cases for the 9 measuring point time series simultaneously. Therefore, in cluster analysis of the real-time multi-measuring-point time series, a result consistent with model one indicates normal data; if outliers appear, the corresponding measuring points can be judged as distorted; and if a different classification occurs, subsequent structural abnormality identification is carried out for further judgment. The specific identification flow is shown in fig. 9.

Table 2 Constructed clustering models

Simulated distortion identification is verified using the 9 through-column strain measuring points for June 2020. Three forms of simulated distortion (outliers, constant offset and stuck data) are added to the strain sequences of the 4th, 6th, 8th, 14th and 15th, and the method identifies the corresponding distorted measuring points, as shown in figs. 10, 11 and 12 respectively.
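The three simulated distortion forms can be sketched as simple transformations of a strain sequence. The helpers below are illustrative only: their names, the placeholder baseline and the chosen positions and magnitudes are assumptions, not the values used in the verification.

```python
def add_outlier(seq, index, magnitude):
    """Single-point outlier: shift one sample by the given magnitude."""
    out = list(seq)
    out[index] += magnitude
    return out

def add_constant_offset(seq, start, length, offset):
    """Constant offset: shift a run of consecutive samples by a fixed amount."""
    return [v + offset if start <= i < start + length else v
            for i, v in enumerate(seq)]

def add_stuck(seq, start, length):
    """Stuck data: hold the value at `start` for a run of consecutive samples."""
    held = seq[start]
    return [held if start <= i < start + length else v
            for i, v in enumerate(seq)]

# Hypothetical one-day baseline of 144 samples (placeholder values).
base = [float(i % 12) for i in range(144)]
with_outlier = add_outlier(base, 70, 350.0)
with_offset = add_constant_offset(base, 50, 42, 60.0)   # 42 samples = 7 hours
with_stuck = add_stuck(base, 50, 42)
```

Each helper returns a new list and leaves the baseline unchanged, so several distortion forms can be applied to the same day of data independently.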

To further explore the sensitivity of the method in identifying the three types of data distortion, graded simulations were performed on the anomalous data.

(1) Outliers

Analysis of the historical strain time series shows that the annual strain fluctuation range of each measuring point is less than 400 με on average. Single-point outlier anomalies were therefore divided into three grades, as shown in table 3; outlier distortion data were applied 10 times for each grade, with the anomaly in each simulation being a random value within the grade's range. The probability that the method identifies the outlier at each grade is listed in table 3. The results show that the method can effectively identify outliers with absolute values above 300 με, but cannot identify outliers below 200 με.

Table 3 Comparison of outlier identification sensitivity

(2) Stuck data

For stuck data, a continuous type of data distortion, the sensitivity of the method to the time span of the continuously distorted data needs to be explored. At a sampling interval of 10 min, calculation shows that the method can effectively identify stuck data lasting more than 40 consecutive sampling points, that is, a time span of more than about 6.7 hours. Because the method performs cluster analysis with a window length of 24 hours, this type of data distortion can be identified in a timely manner.

(3) Constant offset

Constant offset is also a continuous type of data distortion and involves two variables: the duration of the distortion and the offset. Treating the duration as the controlled variable and following the stuck-data result, the sensitivity of the method to the offset is explored with the duration fixed at 7 hours. The offset is divided into three grades, with the range of each grade shown in table 4; distortion data are applied 10 times for each grade, and the offset in each simulation is a random value within the grade's range. The results show that the method can effectively identify distorted data when the absolute value of the offset is greater than 50 με.

Table 4 Comparison of constant offset identification sensitivity

The sensitivity analysis of the three anomaly types shows that, owing to the limitation of the smoothing and denoising method, the identification sensitivity for outliers is lower than for the other types of data distortion, while continuous types such as constant offset and stuck data are identified with better sensitivity.

The above embodiments are only exemplary embodiments of the present application and are not intended to limit the present application, the scope of which is defined by the claims. Various modifications and equivalent arrangements of this application will occur to those skilled in the art, and are intended to be within the spirit and scope of the application.

Claims

1. A data distortion identification method based on association density clustering, comprising:

for the historical time series $X = \{x_1(t), x_2(t), \ldots, x_n(t)\}$ of n strain measuring points of a strain sensor, measuring the degree of association between the measuring point sequences, based on gray type B association, from the angles of the overall displacement difference representing the closeness of the n strain measuring points, the overall first-order slope difference representing the similarity of the development trend, and the overall second-order slope difference representing the similarity of the development speed; constructing a classification model; and identifying real-time monitoring data based on the classification model.

2. The data distortion identification method based on association density clustering as claimed in claim 1, further comprising a normalization process for the historical time series:

3. The data distortion identification method based on association density clustering as claimed in claim 2, further comprising smoothing and denoising the normalized historical time series:

4. A method of data distortion identification based on association density clustering as claimed in claim 3, wherein constructing the classification model comprises:

5. The method for identifying data distortion based on association density clustering as claimed in claim 4, wherein the new matrix comprises:

overall displacement difference matrix:

first order differential matrix:

second order differential matrix:

clustering feature matrix:

wherein $dif_{ij}^{(0)} = x_{(i+1)j} - x_{ij}$, $dif_{ij} = x_{i(j+1)} - x_{ij}$, and $w_1$, $w_2$, $w_3$ are weights: if the cluster analysis should depend more on the closeness feature, increase $w_1$ and reduce $w_2$ and $w_3$; if it should depend more on the similarity features, increase $w_2$ and $w_3$ and reduce $w_1$.

6. The method for recognizing data distortion based on association density clustering as claimed in claim 5, wherein the manhattan distance is as follows:

7. The method of claim 6, wherein identifying real-time monitoring data based on the classification model comprises:

8. Use of a data distortion identification method as claimed in any of claims 1-7 for health monitoring of wood structures.

9. The use according to claim 8, wherein,

the measuring points comprise strain measuring points of the same type of component and strain measuring points of different type of component.