CN115628776A

CN115628776A - Water supply pipe network abnormal data detection method

Info

Publication number: CN115628776A
Application number: CN202211312033.0A
Authority: CN
Inventors: 李守俊; 李江; 金波; 何必仕; 徐哲
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2023-01-20

Abstract

The invention discloses a method for detecting abnormal data of a water supply network, and belongs to the technical field of online monitoring of water supply networks. First, the monitoring points are reasonably grouped on the basis of clustering of normal working conditions. Second, based on the grouping and on the normal-condition clustering result of each measuring point group, the distance from every sample in a group to the center of its class is calculated. Then, a box plot is used to determine the discrimination threshold of each class in each measuring point group, and all sample data are checked. Finally, actual abnormal data detection is performed. The method combines k-means clustering with box-plot discrimination, makes full use of the space-time correlation among nodes, establishes clustering models of local nodes, can accurately identify abnormal data in water supply network monitoring, and provides a guarantee for correctly analyzing the running state of the water supply network.

Description

Water supply pipe network abnormal data detection method

Technical Field

The invention relates to the technical field of water supply network online monitoring, in particular to a method for detecting abnormal data of a water supply network.

Background

With the development of Internet of Things technology, supervisory control and data acquisition (SCADA) systems for water supply networks have become widespread. Manually processing the massive measured data generated by online monitoring is very difficult. In recent years, various data-driven methods have been developed that automatically detect potential outliers in the data, giving a preliminary screening of massive data and greatly reducing the manual identification workload. Document [1] segments 56 months of online monitoring data from a monitoring station of a water supply network in Beijing by time period and season, constructs an autoregressive moving average (ARMA) model, and identifies outliers in an artificially simulated sequence through the confidence interval established by the ARMA model, thereby realizing self-identification of the data of an independent node. Document [2] uses the measured pressure data of a single pressure monitoring point in a residential community as sample data, denoises the data through wavelet analysis, takes sample data reduced by 10% as abnormal data, and identifies abnormal data with the local outlier factor algorithm and the K-means algorithm. For outlier detection, the samples are also processed by differencing adjacent data of the sample under test; the detection results before and after this processing, and under the different algorithms, are compared. The results show that the processed sample data give a better detection effect and that the K-means algorithm performs relatively better.
Document [3] provides a big-data-based method for detecting abnormal data of a water supply network, which determines the abnormal data of each sensor by applying preprocessing, normal-distribution processing and the three-standard-deviation (3σ) criterion to the water supply network monitoring data, and achieves high detection precision. Documents [1], [2] and [3] use only the time series of a single monitoring point and do not exploit the implicit space-time correlation among multiple monitoring points of a water supply network, so their false positive rate is high.

Document [4] segments 8 months of online monitoring data from 10 monitoring points of a domestic water supply network by time period, determines the optimal number of monitoring points by analyzing the spatial topological relations among them, and uses the data of the selected monitoring points to construct a support vector regression (SVR) model. After the model is optimized, the confidence interval it establishes is used to identify artificially simulated outliers. The results show that the SVR model fits well, the outlier detection rate reaches 90%, and cross-identification among the data of the selected monitoring points is realized. Document [4] thus uses an SVR model to mine the implicit logical relationship among monitoring points, predict the trend of the data at a specific monitoring point, and provide a prediction confidence interval for identifying outliers. However, to ensure the accuracy of the SVR model, the monitoring data are sliced into 96 segments at 15-minute intervals to reduce data fluctuation, so 96 SVR models must be constructed, which makes actual processing complicated and inefficient. If the season changes, the data fluctuation within the segments increases, the SVR model fits poorly, and outlier identification is directly affected.

Therefore, there is a need for a water supply network abnormal data detection method that makes full use of the space-time correlation among monitoring points and is simple to model and compute.

Reference documents:

[1] Liu Shuming, Wu Yipeng, Che Han. Quality control of water supply network monitoring data using self-identification [J]. Journal of Tsinghua University (Science and Technology), 2017, 57(9).

[2] Yan sailing, water supply network anomaly detection data identification research [ D ] Tianjin university of science institute, 2022. DOI.

[3] Liu boat, liu red, lang dynasty, tianwuping, hanxing, a water supply network abnormal data detection method [ P ] based on big data: CN112612824A,2021-04-06.

[4] Liu Shuming, Wu Yipeng, Che Han. Outlier detection of water supply network data based on cross-identification [J]. Water & Wastewater Engineering, 2015, 41(11).

Disclosure of Invention

In view of the above problems, the invention provides a method for detecting abnormal data of a water supply network that combines a k-means clustering model with box-plot discrimination. Exploiting the space-time correlation among water supply network nodes, the method builds clustering models from the flow and pressure data of local adjacent nodes, determines a specific critical threshold for each class, and thereby detects abnormal flow and pressure data of the water supply network.

The invention comprises the following steps:

Step 1: Reasonably group the monitoring points based on clustering of normal working conditions.

A water supply network is composed of a number of nodes and connecting pipes. The water utility generally places flow and pressure monitoring points at terminal water-demand nodes and at branch nodes of the network in order to sense the hydraulic operating state of the network. Because the hydraulic states of adjacent nodes are closely related, the monitoring points should be divided reasonably so that strongly correlated measuring points fall into the same group as far as possible, which helps in mining the operating conditions of the area concerned.

The monitoring points covering the whole pipe network are divided into 1 group, 2 groups, …, and N/2 groups (N is the number of measuring points; if N is odd, N/2 is rounded down). Assuming there are L grouping modes, the working conditions of the measuring point groups under each grouping mode are analyzed by k-means clustering. From the clustering result of each grouping mode, the separability index ε(l)' and the contour coefficient S(l)' of the clustering are calculated, l = 1, 2, …, L. The larger the two indexes, the better the clustering effect of the grouping mode, that is, the more reasonable the grouping of the monitoring points. Because the two indexes follow different trends as the number of measuring points grows, a dual-axis line chart is drawn with the separability index ε(l)' and the contour coefficient S(l)' on the Y axes and the number of monitoring points on the X axis, and a reasonable grouping of adjacent monitoring points is selected at the intersection of the two lines.

The concrete process of reasonably grouping the monitoring points is as follows:

Firstly, the monitoring points covering the whole water supply network are divided into 1 group, 2 groups, …, and N/2 groups according to the proximity principle (N is the number of measuring points; if N is odd, N/2 is rounded down), giving L grouping modes in total. Under each grouping mode, k-means clustering is performed on the normal flow and pressure data of each measuring point group to obtain the clustering result of the group. During k-means clustering, the number of cluster classes k is selected by the elbow method, whose core index is the sum of squared errors (SSE), calculated by formula (1):

$$\mathrm{SSE} = \sum_{d=1}^{k} \sum_{a \in E_d} \lVert a - c_d \rVert^2 \tag{1}$$

where k denotes the number of cluster classes, E_d denotes the d-th class, a denotes a sample point in E_d, and c_d denotes the center of E_d. The SSE is calculated for several values of k, and a curve is drawn with the number of cluster classes k on the X axis and the SSE value on the Y axis to obtain an elbow diagram; the k value at the elbow is selected as the number of cluster classes.
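As an illustration only, the SSE used by the elbow method can be sketched in a few lines of Python. The function name `sse` and the toy flow and pressure values are hypothetical and not part of the disclosure; the sketch assumes the SSE is the squared Euclidean distance of every sample to its own class center, summed over all classes.

```python
def sse(clusters, centers):
    """Sum of squared errors: squared Euclidean distance of every sample
    to the center of the class it belongs to, summed over all classes."""
    total = 0.0
    for points, center in zip(clusters, centers):
        for point in points:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


# Two toy classes of (flow, pressure) samples; the values are made up.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(10.0, 5.0), (10.0, 7.0)]]
centers = [(2.0, 1.0), (10.0, 6.0)]
print(sse(clusters, centers))  # each sample contributes 1.0, so the total is 4.0
```

Evaluating this quantity for a range of k values and plotting it against k gives the elbow diagram described above.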

The k-means clustering comprises the following specific processes:

Given a data set X containing n sample points, each of dimension m, that is, X = {X_1, X_2, X_3, …, X_n}:

(1) First, k sample points D = {D_1, D_2, D_3, …, D_k} are randomly selected from the data set X as the initial class centers. The Euclidean distance from each sample point X_a to every cluster center is calculated by formula (2):

$$d(X_a, D_b) = \sqrt{\sum_{t=1}^{m} (X_{at} - D_{bt})^2} \tag{2}$$

where X_a represents the a-th sample point, a ∈ [1, n]; D_b represents the b-th cluster center, b ∈ [1, k]; X_at represents the t-th attribute of the a-th sample point, t ∈ [1, m]; and D_bt represents the t-th attribute of the b-th cluster center.

(2) The distances from each sample to all cluster centers are compared, the nearest class center is found, and the sample point is assigned to the class of that center, giving data sets of k classes {F_1, F_2, F_3, …, F_k}.

(3) According to the classes obtained in step (2), the central point of each class is calculated as the new cluster center, as shown in formula (3):

$$C_g = \frac{1}{N_g} \sum_{h=1}^{N_g} X_h \tag{3}$$

where C_g represents the g-th cluster center, g ∈ [1, k]; F_g denotes the g-th class; N_g is the number of sample points contained in F_g; and X_h is the h-th sample point in F_g, h ∈ [1, N_g].

(4) Steps (2) and (3) are iterated until the cluster centers no longer change.
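The four k-means steps above can be sketched as follows. This is a minimal standard-library illustration, not the patented implementation: the names `kmeans` and `euclidean`, the fixed random seed, the iteration cap and the toy samples are all assumptions. The loop stops when the class assignments stop changing, which implies the centers no longer change.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(samples, k, seed=0, max_iter=100):
    """Plain k-means; returns (labels, centers) with 0-based class labels."""
    rng = random.Random(seed)
    centers = rng.sample(samples, k)  # step (1): random initial centers
    labels = None
    for _ in range(max_iter):
        # Step (2): assign every sample to its nearest center.
        new_labels = [
            min(range(k), key=lambda b: euclidean(s, centers[b])) for s in samples
        ]
        if new_labels == labels:  # step (4): assignments (and centers) are stable
            break
        labels = new_labels
        # Step (3): recompute each center as the mean of its members.
        for g in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == g]
            if members:  # keep the old center if a class happens to be empty
                centers[g] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Two well-separated toy blobs of (flow, pressure) samples; values are made up.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(samples, k=2)
print(labels)
print(centers)
```

Because the initial centers are drawn at random, the class numbering may differ between runs with different seeds, while the partition of these toy samples stays the same.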

Secondly, the separability index ε of the grouping is calculated according to formula (4),

where D_pq represents the Euclidean distance between the cluster centers of class p and class q, calculated according to formula (5):

$$D_{pq} = \sqrt{\sum_{t=1}^{m} (C_{pt} - C_{qt})^2} \tag{5}$$

In formula (5), m represents the dimension of the cluster center points, and C_pt and C_qt respectively represent the t-th attribute of the p-th and q-th class centers.

d_p and d_q represent the intra-class distances of class p and class q. The intra-class distance σ is the standard deviation of the distances from the sample points to the cluster center within a class and is calculated according to formula (6), where N represents the total number of sample-to-center distances in the class, x_e represents the distance from sample point e to the class center, and μ represents the average of all the distance values, calculated according to formula (7):

$$\mu = \frac{1}{N} \sum_{e=1}^{N} x_e \tag{7}$$

Since each clustering usually has multiple classes, there is one separability index for every pair of classes. The mean of all the separability indexes is therefore taken as the separability index ε(l)' of the clustering under the grouping mode, as shown in formula (8):

$$\varepsilon(l)' = \frac{1}{U} \sum_{o=1}^{U} \varepsilon_o \tag{8}$$

where ε_o represents the o-th separability index of the grouping, o ∈ [1, U], and U is the total number of separability indexes under the grouping mode.
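The averaging over class pairs can be sketched as below. Two points are assumptions and not taken from the text: the pairwise index is assumed to be ε = D_pq / (d_p + d_q), because formula (4) is not reproduced here, and the intra-class distance uses the population standard deviation (`statistics.pstdev`). The function names and toy data are hypothetical. At least two classes are required, since otherwise there is no pair to average.

```python
import math
from itertools import combinations
from statistics import pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def intra_class_distance(points, center):
    """Standard deviation of sample-to-center distances (population form assumed)."""
    return pstdev([euclidean(p, center) for p in points])


def mean_separability(clusters, centers):
    """Mean of the pairwise separability indexes over all class pairs."""
    spreads = [intra_class_distance(pts, c) for pts, c in zip(clusters, centers)]
    values = []
    for p, q in combinations(range(len(centers)), 2):
        d_pq = euclidean(centers[p], centers[q])
        # ASSUMED pairwise form: center distance over summed intra-class spread.
        values.append(d_pq / (spreads[p] + spreads[q]))
    return sum(values) / len(values)


# Two toy classes whose sample-to-center distances are 1, 1, 3, 3 (spread 1.0).
class_a = [(0.0, 1.0), (0.0, -1.0), (3.0, 0.0), (-3.0, 0.0)]
class_b = [(10.0, 1.0), (10.0, -1.0), (13.0, 0.0), (7.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
print(mean_separability([class_a, class_b], centers))  # 10 / (1 + 1) = 5.0
```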

Then, the contour coefficient S(l)' of the grouping mode is calculated. The specific calculation process of the contour coefficient is as follows:

(1) For a sample point w in a class, the distances between w and all other elements of the same class are calculated, and their average is denoted a(w); it expresses the degree of aggregation within the class.

(2) For each class other than the one to which w belongs, the distances between w and all sample points of that class are calculated and averaged. All other classes are traversed to find the nearest one, and the average distance from w to all sample points of that class is denoted b(w); it expresses the degree of separation between classes.

(3) For the sample point w, the contour coefficient is calculated by formula (9):

$$s(w) = \frac{b(w) - a(w)}{\max\{a(w),\, b(w)\}} \tag{9}$$

(4) The contour coefficients of all sample points are calculated, and their average is the contour coefficient S(l)' of the grouping mode.
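The contour coefficient (commonly called the silhouette coefficient) can be sketched as follows, assuming the usual form s(w) = (b(w) − a(w)) / max{a(w), b(w)}. The function names and toy samples are hypothetical, and assigning 0 to a sample that is alone in its class is a common convention rather than something stated in the text.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance(w, points):
    """Average distance from point w to every point in `points`."""
    return sum(euclidean(w, p) for p in points) / len(points)


def contour_coefficient(clusters):
    """Mean contour (silhouette) coefficient over all samples of all classes."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for wi, w in enumerate(cluster):
            same = [p for pi, p in enumerate(cluster) if pi != wi]
            if not same:
                scores.append(0.0)  # convention for a single-sample class
                continue
            a = mean_distance(w, same)  # aggregation within the class
            b = min(                    # separation from the nearest other class
                mean_distance(w, other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy one-dimensional layout at positions 0, 1, 10 and 11; values are made up.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(0.0, 10.0), (0.0, 11.0)]]
print(contour_coefficient(clusters))
```

For this layout the two outer points score 9.5/10.5 and the two inner points score 8.5/9.5, so the result is the average of those two values.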

Finally, the separability indexes and contour coefficients under each grouping mode are obtained and plotted in a dual-axis line chart; the grouping mode with the better clustering effect is taken as the reasonable grouping of the measuring points.

Step 2: On the basis of the reasonable grouping of monitoring points obtained in step 1 by clustering of normal working conditions, calculate the Euclidean distances from all samples in each group to the centers of their classes.

Step 1 yields the i reasonably grouped measuring point groups and the j classes of each group's clustering. The Euclidean distance from every sample in each group to the center of its class is then calculated according to formula (2). After all distance values are obtained, the distance sets Dis_ij are labeled by measuring point group i and cluster class j and collectively denoted Distance,

where Dis_ij represents the set of Euclidean distances from all sample points of the j-th class of the i-th measuring point group to their cluster center. Because the number of classes may differ between measuring point groups, the value of j depends on the group: j ∈ {j_1, j_2, j_3, …, j_i}, where j_i represents the number of cluster classes of the i-th measuring point group.
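A small sketch of how the labeled distance sets Dis_ij might be collected for one measuring point group is given below. The function name `distance_sets`, the dictionary keyed by (group, class) and the toy values are illustrative assumptions; class labels are 0-based here.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_sets(group_id, samples, labels, centers):
    """Return {(i, j): [distances]} for measuring point group i = group_id,
    where j is the 0-based class label of each sample."""
    sets = defaultdict(list)
    for sample, j in zip(samples, labels):
        sets[(group_id, j)].append(euclidean(sample, centers[j]))
    return dict(sets)


# Toy samples of one group with their class labels and class centers.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
labels = [0, 0, 1]
centers = [(0.0, 1.0), (10.0, 10.0)]
dis = distance_sets(5, samples, labels, centers)
print(dis)  # {(5, 0): [1.0, 1.0], (5, 1): [0.0]}
```

Merging the dictionaries of all groups gives one structure playing the role of Distance.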

Step 3: Determine the discrimination threshold of each class in each measuring point group using a box plot, and check all sample data.

The distance data Distance of the reasonable grouping are analyzed with box plots to obtain the box-plot parameters of every class in each measuring point group. The threshold interval for normal data runs from the lower limit to the upper limit of the box plot, and sample data outside this interval are abnormal. The upper discrimination threshold max_ij and the lower discrimination threshold min_ij are labeled by measuring point group i and cluster class j and denoted [min, max],

where max_ij and min_ij respectively represent the upper and lower discrimination thresholds of the j-th class of the i-th measuring point group. The value of j depends on the measuring point group: j ∈ {j_1, j_2, j_3, …, j_i}, where j_i represents the number of cluster classes of the i-th measuring point group.

The box-plot parameters are calculated as follows:

(1) Calculate the upper quartile Q_3 and the lower quartile Q_1 of the distance data. For example, given v distance values sorted in ascending order, Q_3 is the value at position T_1 and Q_1 is the value at position T_2, where T_1 and T_2 are given by formula (11) and formula (12).

(2) Calculate the interquartile range IQR, as in formula (13):

$$\mathrm{IQR} = Q_3 - Q_1 \tag{13}$$

(3) Calculate the upper limit Max and the lower limit Min according to formula (14) and formula (15):

$$\mathrm{Max} = Q_3 + W \cdot \mathrm{IQR} \tag{14}$$

$$\mathrm{Min} = Q_1 - W \cdot \mathrm{IQR} \tag{15}$$

where W is the weighting coefficient applied to the interquartile range IQR.
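A hedged sketch of the box-plot thresholds follows. It uses the conventional fences, with the upper limit taken from Q_3 and the lower limit from Q_1. The quartile position formulas (11) and (12) are not reproduced in this text, so `statistics.quantiles` with the inclusive method stands in for them as an assumption. Clamping the lower limit at 0 reflects the fact that distances cannot be negative. The function name `box_thresholds` and the sample values are hypothetical.

```python
from statistics import quantiles


def box_thresholds(distances, w=1.5):
    """Return (lower, upper) discrimination thresholds for one class.

    Quartiles come from statistics.quantiles (inclusive method), which is an
    assumed stand-in for the position formulas (11) and (12)."""
    q1, _median, q3 = quantiles(distances, n=4, method="inclusive")
    iqr = q3 - q1
    upper = q3 + w * iqr
    lower = max(0.0, q1 - w * iqr)  # distances are never negative
    return lower, upper


print(box_thresholds([1.0, 2.0, 3.0, 4.0, 5.0]))       # (0.0, 7.0)
print(box_thresholds([10.0, 11.0, 12.0, 13.0, 14.0]))  # (8.0, 16.0)
```

A larger W widens the interval and flags fewer samples as abnormal; a smaller W does the opposite.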

All sample data are divided into their corresponding groups according to the measuring point grouping, and the distance from each sample to the class centers of its group is calculated according to formula (2). Each sample is assigned to its nearest class, and the distance values dis_ij are labeled by measuring point group i and cluster class j and collectively denoted distance,

where dis_ij represents the set of distances from the sample data of the j-th class of the i-th measuring point group to the class center. The value of j ranges over the cluster classes of the reasonable grouping: j ∈ {j_1, j_2, j_3, …, j_i}, where j_i represents the number of cluster classes of the i-th measuring point group.

The distance values in dis_ij are compared with the corresponding discrimination threshold interval [min_ij, max_ij]; values outside the interval are abnormal.

A confusion matrix is then obtained for each measuring point group from the detection results, and the detection accuracy is calculated. The confusion matrix of a single measuring point group is shown in Table 1.

TABLE 1 Confusion matrix

Classification	Actually abnormal	Actually normal
Judged abnormal	TP	FP
Judged normal	FN	TN

The data detection accuracy Accuracy_i of a measuring point group is calculated according to formula (16):

$$\mathrm{Accuracy}_i = \frac{TP + TN}{TP + FP + FN + TN} \tag{16}$$

where Accuracy_i is the detection accuracy of the i-th measuring point group, TP is the number of abnormal data correctly detected as abnormal, FP is the number of normal data detected as abnormal, FN is the number of abnormal data detected as normal, and TN is the number of normal data correctly detected as normal.
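The confusion matrix and accuracy of one measuring point group can be sketched as follows, taking accuracy as (TP + TN) / (TP + FP + FN + TN), the usual definition. The function names and the toy labels are hypothetical; `True` marks an abnormal record.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN), treating 'abnormal' (True) as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of records whose judgment matches the actual state."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Toy labels: True = abnormal, False = normal.
actual = [True, True, False, False, False]
predicted = [True, False, False, False, True]
counts = confusion_counts(actual, predicted)
print(counts)             # (1, 1, 1, 2)
print(accuracy(*counts))  # 3 of 5 records are judged correctly: 0.6
```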

If Accuracy_i is below 95%, the discrimination threshold intervals of the classes of that measuring point group need to be adjusted by slightly changing the weighting coefficient W applied to the IQR in formula (14) and formula (15); the default value of this coefficient is 1.5. After each small adjustment, all sample data are checked again, and the adjustment continues until Accuracy_i reaches the standard.

Step 4: Actual abnormal data detection.

For the node flow and pressure data currently sampled at the monitoring points:

(1) Group the data by measuring point group and construct the current sample of each group.

(2) For the current sample of each measuring point group (denoted group r), calculate its distance to the center of every class in the group, assign it to the nearest class s, and record the intra-class distance dis'_rs.

(3) Compare the distance value in dis'_rs of each group's current sample with the discrimination threshold interval [min_rs, max_rs] of its class; the sample is judged normal if the value lies within the interval and abnormal otherwise.

Detection of the currently sampled data is finished once the current samples of all groups have been checked.
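The online check of step 4 can be sketched for a single measuring point group as below. The function name `detect`, the 0-based class indexes, and the toy centers and thresholds are assumptions for illustration; in practice the centers and the [min, max] intervals would come from steps 1 to 3.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect(sample, centers, thresholds):
    """Return (nearest class index, distance, is_abnormal) for one current sample.

    `thresholds[s]` is the (min, max) discrimination interval of class s."""
    distances = [euclidean(sample, c) for c in centers]
    s = min(range(len(centers)), key=lambda idx: distances[idx])
    lower, upper = thresholds[s]
    is_abnormal = not (lower <= distances[s] <= upper)
    return s, distances[s], is_abnormal


# Toy class centers and threshold intervals for one measuring point group.
centers = [(0.0, 0.0), (10.0, 10.0)]
thresholds = [(0.0, 2.0), (0.0, 3.0)]
print(detect((1.0, 0.0), centers, thresholds))    # class 0, distance 1.0, normal
print(detect((10.0, 15.0), centers, thresholds))  # class 1, distance 5.0, abnormal
```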

The beneficial effects of the invention are as follows: the method combines k-means clustering with box-plot discrimination, makes full use of the space-time correlation among nodes, establishes clustering models of local nodes, can accurately identify abnormal data in water supply network monitoring, and provides a guarantee for correctly analyzing the running state of the water supply network.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is an elbow diagram in different groupings;

FIG. 3 is a line graph of the separability index and the contour coefficient in different grouping modes.

Detailed Description

1955 records of flow and pressure data from July 2021 of the JS water diversion project are selected to illustrate abnormal data detection in this embodiment; they comprise 257 abnormal records and 1698 normal records. Actual abnormal data detection is verified with part of the data from August 6, 2021. The specific flow of the invention is shown in FIG. 1, and the specific steps are as follows:

Step 1: Reasonably group the monitoring points based on clustering of normal working conditions.

All monitoring points are grouped according to the distribution of measuring points in the water supply network and the proximity principle. The diversion project has 21 monitoring points, which are divided into 1, 4, 5, 7 and 10 groups as shown in Table 2:

TABLE 2 monitoring Point grouping

Because of the number of monitoring points, the groups cannot all contain exactly the same number of points. Preliminary clustering is therefore performed on the data of the most common group size in each grouping mode, using the elbow diagram as the basis. For the 5 grouping modes (1, 4, 5, 7 and 10 groups), the most common group sizes are 21, 6, 4, 3 and 2 measuring points respectively, and the corresponding elbow diagrams are shown in FIG. 2. In FIG. 2, the horizontal axis is K and the vertical axis is SSE. As K increases, the SSE first falls quickly and then more slowly, giving the curve its elbow shape. Once the decrease becomes gradual, increasing K no longer reduces the SSE significantly, so the K value at which the SSE changes from a marked decrease to a gradual one is generally a suitable number of clusters. Reading the elbow diagrams in FIG. 2 in this way, the K values of grouping modes 1 to 5 are selected as 12, 4, and 4 in sequence. For grouping mode 1, the SSE falls markedly before K = 12 and flattens afterwards, so 12 is selected as its clustering K value.

The separability index ε(l)' and the contour coefficient S(l)' of the 5 grouping modes are calculated, as shown in Table 3.

TABLE 3 Separability index ε(l)' and contour coefficient S(l)'

Grouping mode	1	2	3	4	5
Number of groups	1	4	5	7	10
Separability index ε(l)'	5.08	4.61	4.55	4.44	4.23
Contour coefficient S(l)'	0.287	0.391	0.418	0.461	0.466

The corresponding dual-axis line chart of the separability index and the contour coefficient is drawn, as shown in FIG. 3. According to the intersection point, grouping mode 4 is the reasonable grouping, that is, the monitoring points are divided into 7 groups of 3 monitoring points each.

Taking the 5th and 7th measuring point groups of grouping mode 4 as examples, the Euclidean distances from all samples in each group to their class centers are calculated according to formula (2). The distance sets Dis_ij are labeled by group i and class j and denoted Distance:

$$\mathrm{Distance} = \{\{Dis_{51}, Dis_{52}, Dis_{53}, Dis_{54}, Dis_{55}\},\ \{Dis_{71}, Dis_{72}, Dis_{73}, Dis_{74}\}\}$$

Step 3: Determine the discrimination threshold of each class in each measuring point group using a box plot, and check all sample data.

The distance sets Dis_ij of each class in the Distance sets of groups 5 and 7 are analyzed with the box-plot method to obtain the discrimination threshold interval [min_ij, max_ij] of the distance values in each class, as shown in Table 4. Since distances cannot be negative, any negative lower limit is set to 0.

TABLE 4 Discrimination thresholds for groups 5 and 7

All sample data are divided into measuring point groups according to grouping mode 4. The distance from each sample to every class center of its measuring point group is calculated, the sample is assigned to the nearest class, and the distance value sets dis_ij are labeled by group i and class j to obtain distance.

The distance from each sample to the center of its class is compared with the discrimination threshold interval of that class: a sample whose distance lies outside the interval is judged abnormal, and a sample whose distance lies within the interval is judged normal. The resulting confusion matrix for the detection of all sample data is shown in Table 5.

TABLE 5 Confusion matrix of sample data detection for groups 5 and 7

The data detection accuracies Accuracy_5 and Accuracy_7 are calculated; the results are shown in Table 6.

TABLE 6 Abnormal data detection accuracy

Group	Group 5	Group 7
Accuracy	98%	99.2%

The data detection accuracy of both the 5th and 7th measuring point groups reaches the standard, so the parameter W in formula (14) and formula (15) does not need to be adjusted.

Step 4: Actual abnormal data detection.

The discrimination threshold intervals obtained in the first 3 steps (Table 4) are used to check part of the data from August 6, 2021. Groups 5 and 7 are taken as examples:

(1) Group the data by measuring point group to construct the current samples of groups 5 and 7.

(2) For the current samples of groups 5 and 7, calculate the distance to each class center and assign every sample to its nearest class j. The 12 samples of group 5 are assigned to classes {4,4,4,4,1,4,3,2,2,2,2,2}, and the 12 samples of group 7 to classes {4,2,2,2,2,2,2,2,4,2,4,2}.

The intra-class distances of the group 5 sample points in classes 1, 2, 3 and 4 are recorded as dis'_51, dis'_52, dis'_53 and dis'_54, where dis'_51 = {51.69}, dis'_52 = {62.16, 70.18, 65.95, 65.41, 65.07}, dis'_53 = {97.23}, and dis'_54 = {45.27, 45.56, 45.58, 45.15}. The intra-class distances of the group 7 sample points in classes 2 and 4 are recorded as dis'_72 and dis'_74, where dis'_72 = {208.06, 207.93, 208.51, 208.28, 208.30, 208.04, 207.86, 208.16, 173.87} and dis'_74 = {53.25, 53.48, 53.73}.

(3) The distance values in dis'_ij of the current samples of groups 5 and 7 are compared with the discrimination threshold interval [min_ij, max_ij] of their class; a sample is judged normal if its value lies within the interval and abnormal otherwise. The comparison shows that the group 5 sample points assigned to classes 1, 2, 3 and 4 neither exceed the upper threshold nor fall below the lower threshold, so they are judged normal. In group 7, distances from sample points assigned to class 2 to the class 2 center exceed the upper threshold, so those points are judged abnormal, while the sample points assigned to class 4 neither exceed the upper threshold nor fall below the lower threshold and are judged normal.

The detected abnormal data agree with the actual data, as shown in Table 7, so a good abnormal data detection effect is obtained.

TABLE 7 Actual abnormal data detection results

Group 5	Actually abnormal	Actually normal
Predicted abnormal	0	0
Predicted normal	0	12
Group 7	Actually abnormal	Actually normal
Predicted abnormal	8	0
Predicted normal	0	4

Claims

1. A method for detecting abnormal data of a water supply network is characterized by comprising the following steps:

step 1, reasonably grouping the monitoring points based on clustering of normal working conditions;

S1.1, dividing the monitoring points covering the whole water supply network into 1 group, 2 groups, …, and N/2 groups according to the proximity principle, wherein N is the number of measuring points and, if N is odd, N/2 is rounded down, giving L grouping modes in total;

under each grouping mode, performing k-means clustering on the normal flow and pressure data of each measuring point group to obtain the clustering result of the measuring point group;

during k-means clustering, selecting the number of cluster classes k according to the elbow method, the core index of which is the sum of squared errors (SSE);

calculating the SSE for a plurality of k values, drawing a curve with the number of cluster classes k on the X axis and the SSE value on the Y axis to obtain an elbow diagram, and selecting the k value at the elbow as the number of cluster classes;

S1.2, calculating the separability index ε of the grouping, as shown in formula (4)

wherein D_pq represents the Euclidean distance between the cluster centers of class p and class q, calculated according to formula (5), and d_p and d_q represent the intra-class distances of class p and class q, respectively;

where m represents the dimension of the cluster center points, and C_pt and C_qt represent the t-th attribute of the centers of class p and class q, respectively;

the intra-class distance σ is the standard deviation of the distances from the sample points to the cluster center within each class, calculated according to formula (6):

wherein N represents the total number of distances from the sample points to the cluster center within the same class, x_e represents the distance from sample point e to the class center, and μ represents the average of all the distance values, calculated according to formula (7):

taking the mean value of all the separability indexes as the separability index ε(l)' of the clustering under a grouping mode, as shown in formula (8)

wherein ε_o represents the o-th separability index of the grouping mode, o ∈ [1, U], and U represents the total number of separability indexes under the grouping mode;
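Written out from the verbal definitions in S1.2, the quantities named in formulas (5) to (8) would take roughly the following form. This is a reconstruction from the symbol definitions rather than a copy of the original formula images; in particular, the 1/N normalization of the standard deviation in (6) is an assumption (a 1/(N-1) form is equally consistent with the wording).

```latex
D_{pq} = \sqrt{\sum_{t=1}^{m} \left( C_{pt} - C_{qt} \right)^{2}} \tag{5}

\sigma = \sqrt{\frac{1}{N} \sum_{e=1}^{N} \left( x_{e} - \mu \right)^{2}} \tag{6}

\mu = \frac{1}{N} \sum_{e=1}^{N} x_{e} \tag{7}

\varepsilon(l)' = \frac{1}{U} \sum_{o=1}^{U} \varepsilon_{o} \tag{8}
```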

S1.3, calculating the contour coefficient (silhouette coefficient) S(l)' of a grouping mode, wherein the contour coefficient is calculated as follows:

S1.3.1, calculating the distances between a sample point w in a given class and all other elements of the same class, and taking the average of these distances, recorded as a(w), to represent the degree of aggregation within the class;

S1.3.2, taking a class other than the one to which sample point w belongs, calculating the distances between w and all sample points in that class and averaging them; traversing all other classes in this way, finding the nearest class, and recording the average distance from w to all sample points in that class as b(w), to represent the degree of separation between classes;

S1.3.3, for the sample point w, the contour coefficient is calculated by the following formula

S1.3.4, calculating the contour coefficients of all sample points and taking their average, which is the contour coefficient S(l)' of the grouping mode;
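Steps S1.3.1 to S1.3.4 can be sketched as below. The per-point formula used, s(w) = (b(w) - a(w)) / max{a(w), b(w)}, is the standard silhouette definition and is assumed here to be the one intended in S1.3.3. The function name `contour_coefficient` and the convention of scoring a single-member class as 0 are illustrative choices, not taken from the claim.

```python
import math

def contour_coefficient(points, labels):
    """Mean silhouette value over all points; requires at least two classes."""
    classes = {}
    for p, lab in zip(points, labels):
        classes.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = classes[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for a single-member class
            continue
        # a(w): mean distance to the other members of the same class
        # (the zero distance to itself is in the sum, so divide by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(w): smallest mean distance to the members of any other class.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in classes.items()
            if key != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated pairs of points (illustrative data).
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = contour_coefficient(pts, [0, 0, 1, 1])
bad = contour_coefficient(pts, [0, 1, 0, 1])
```

A value close to 1 indicates compact, well-separated classes, while a negative value indicates that points sit closer to another class than to their own.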

S1.4, after obtaining the separability index and the contour coefficient under each grouping mode, drawing a dual-axis line chart with the separability index and the contour coefficient on the two Y axes and the number of monitoring points on the X axis, selecting a reasonable grouping mode of adjacent monitoring points according to the intersection of the two lines, and thereby determining the reasonable grouping of the measuring points;

Step 2, on the basis of the reasonable grouping of monitoring points obtained in step 1 from clustering under normal working conditions, calculating the Euclidean distance from every sample in each group to the center of its class;

after all the Euclidean distance values are obtained, labeling them Dis_ij according to measuring point group i and cluster class j, and recording them as Distance;

Step 3, determining the difference threshold of each class in each measuring point group by means of a box plot, and checking all sample data;

analyzing the Euclidean distance data Distance of the reasonable grouping with box plots to obtain the box plot parameters of every class in each measuring point group, wherein the judgment threshold interval for normal data runs from the lower limit to the upper limit of the box plot and is denoted [min, max];

dividing all sample data into the corresponding groups according to the measuring point grouping, and calculating the Euclidean distance from each sample to every class center of its group; assigning each sample to the class with the shortest Euclidean distance, labeling the resulting Euclidean distance values dis_ij according to measuring point group i and cluster class j, and recording them as distance;

comparing each Euclidean distance value in dis_ij with the corresponding judgment threshold interval [min_ij, max_ij], a value outside the interval being abnormal;

obtaining a confusion matrix according to the detection result, and calculating the detection accuracy;
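The check against the threshold intervals and the accuracy calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(group i, class j, distance)`, the `intervals` dictionary keyed by `(i, j)`, and all function names are hypothetical, and "abnormal" is treated as the positive class of the confusion matrix.

```python
def flag_abnormal(records, intervals):
    """Return one flag per record: True if its distance lies outside [min_ij, max_ij]."""
    flags = []
    for i, j, dist in records:
        lo, hi = intervals[(i, j)]
        flags.append(not (lo <= dist <= hi))
    return flags

def confusion_matrix(predicted, actual):
    """Count TP, FP, FN, TN with 'abnormal' as the positive class."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for p, a in zip(predicted, actual):
        key = ("T" if p == a else "F") + ("P" if p else "N")
        counts[key] += 1
    return counts

def accuracy(counts):
    """Share of samples whose flag matches the known label."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

# Illustrative data: one measuring point group (0) with two classes.
intervals = {(0, 0): (0.0, 5.0), (0, 1): (0.5, 3.0)}
records = [(0, 0, 1.0), (0, 0, 9.0), (0, 1, 2.0), (0, 1, 0.1)]
known_abnormal = [False, True, False, False]

flags = flag_abnormal(records, intervals)
matrix = confusion_matrix(flags, known_abnormal)
acc = accuracy(matrix)
```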

Step 4, detecting actual abnormal data;

obtaining the node flow and pressure data currently sampled at the monitoring points;

S4.1, constructing the current sample of each measuring point group according to the measuring point grouping;

S4.2, calculating the distance from the current sample of each measuring point group to each class center in the group, assigning the sample to the nearest class s in the group, and recording the intra-class distance dis'_rs;

S4.3, for the current sample of each measuring point group, comparing the intra-class distance value dis'_rs with the difference threshold interval [min_ij, max_ij] of the class to which it belongs; if the value lies within the interval, the sample is normal; otherwise, it is abnormal.
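Steps S4.2 and S4.3 for a single measuring point group can be sketched as follows. The function `detect` and its arguments are hypothetical: `centers` stands for the class centers obtained from clustering the normal data, and `intervals` for the per-class thresholds from step 3, both assumed to be available in this form.

```python
import math

def detect(sample, centers, intervals):
    """Classify one current sample of a measuring point group.

    Returns (index s of the nearest class, distance to that class center,
    True if the distance falls outside that class's threshold interval).
    """
    dists = [math.dist(sample, c) for c in centers]
    s = min(range(len(centers)), key=dists.__getitem__)
    lo, hi = intervals[s]
    return s, dists[s], not (lo <= dists[s] <= hi)

# Illustrative centers and thresholds for one group with two classes.
centers = [(0.0, 0.0), (10.0, 10.0)]
intervals = [(0.0, 2.0), (0.0, 3.0)]

normal_result = detect((1.0, 0.0), centers, intervals)
abnormal_result = detect((10.0, 15.0), centers, intervals)
```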

2. The water supply pipe network anomaly data detection method according to claim 1, wherein: the k-means clustering in the step 1 comprises the following specific processes:

given a data set X containing n sample points, each of dimension m, that is: X = {X₁, X₂, X₃, …, X_n};

S1.1.1, first randomly selecting k sample points D = {D₁, D₂, D₃, …, D_k} in the data set X as the initial class centers, and calculating the Euclidean distance from each sample point X_i to every cluster center according to formula (2);

where X_a represents the a-th sample point, a ∈ [1, n]; D_b represents the b-th cluster center, b ∈ [1, k]; X_at represents the t-th attribute of the a-th sample point, t ∈ [1, m]; and D_bt represents the t-th attribute of the b-th cluster center;

S1.1.2, comparing the distances from each sample to the cluster centers, finding the class center with the minimum distance, and assigning the sample point to the class of that nearest cluster center, thereby obtaining the data sets of k classes {F₁, F₂, F₃, …, F_k};

S1.1.3, calculating the central point of each category as a new clustering center according to the categories divided in S1.1.2, wherein the calculation formula of the clustering center is shown as formula (3):

wherein C_g represents the g-th cluster center, g ∈ [1, k]; F_g denotes the g-th class; N_g represents the number of sample points contained in F_g; and X_h represents the h-th sample point in F_g, h ∈ [1, N_g];

S1.1.4, iterating steps S1.1.2 and S1.1.3 until the cluster centers no longer change.
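Steps S1.1.1 to S1.1.4 can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. The optional `init` argument (fixed starting centers, useful for reproducible runs), the `max_iter` safety cap, and the rule of keeping the previous center when a class becomes empty are assumptions added for the sketch and are not part of the claim.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=None):
    """Return (centers, labels) after iterating assignment and update steps."""
    rng = random.Random(seed)
    # S1.1.1: k randomly chosen sample points serve as the initial centers.
    start = init if init is not None else rng.sample(points, k)
    centers = [tuple(c) for c in start]
    labels = []
    for _ in range(max_iter):
        # S1.1.2: assign every sample to its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda b: math.dist(p, centers[b])) for p in points
        ]
        # S1.1.3: the new center of each class is the mean of its members.
        new_centers = []
        for g in range(k):
            members = [p for p, lab in zip(points, labels) if lab == g]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[g])  # assumed rule for an empty class
        # S1.1.4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs (illustrative data) with fixed starting centers.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, 2, init=[(0, 0), (10, 10)])
```

Because the result of k-means depends on the initial centers, repeated runs with different seeds may be worth comparing when `init` is not supplied.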

3. The method for detecting the abnormal data of the water supply pipe network according to claim 1, wherein: the box plot parameters in step 3 are calculated as follows:

S3.1, calculating the upper quartile Q₃ and the lower quartile Q₁ of the distance data: there are v distance values, which are sorted in ascending order; Q₃ is the value at position T₁ and Q₁ is the value at position T₂, where T₁ and T₂ are given by formula (11) and formula (12)

S3.2, calculating the interquartile range IQR as shown in formula (13)

IQR = Q₃ - Q₁ (13)

S3.3, calculating the upper limit Max and the lower limit Min according to formula (14) and formula (15)

Max = Q₃ + W*IQR (14)

Min = Q₁ - W*IQR (15)

wherein W is the weight coefficient of the interquartile range IQR.
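The box plot parameters can be sketched as follows. Two points are assumptions: the quartile positions of formulas (11) and (12) are not reproduced here, so the sketch uses the "inclusive" method of Python's `statistics.quantiles` as a stand-in, and the limits are taken in the conventional box plot form, with the upper limit above Q₃ and the lower limit below Q₁. The function name `box_limits` is illustrative.

```python
import statistics

def box_limits(distances, w=1.5):
    """Return (Min, Max) of the box plot for a list of distance values."""
    # quantiles() sorts the data and returns [Q1, median, Q3] for n=4.
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - w * iqr, q3 + w * iqr

# Illustrative distance values for one class of one measuring point group.
limits = box_limits([3, 1, 5, 2, 4])
```

A different quartile convention shifts Q₁ and Q₃ slightly for small samples, so the resulting limits should be checked against the convention actually used.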

4. The water supply pipe network abnormal data detection method according to claim 1 or 3, wherein: if the accuracy in step 3 does not reach 95%, the discrimination threshold interval of each class of each measuring point group is adjusted by making a small adjustment to the weight coefficient W of the IQR in formula (14) and formula (15); after each small adjustment, all sample data are checked again, and if the accuracy still does not reach the standard, the adjustment continues until it does.
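One way the adjustment loop might look is sketched below. The claim does not specify the step size or the direction of adjustment, so the search pattern here (trying W, then W plus and minus one step, then two steps, and so on) is purely an assumption, as are the function names, the `step` and `max_steps` parameters, and the conventional box plot form of the limits.

```python
def accuracy_for_w(distances, is_abnormal, q1, q3, w):
    """Share of samples whose threshold verdict matches the known label."""
    iqr = q3 - q1
    lo, hi = q1 - w * iqr, q3 + w * iqr
    hits = sum(
        (not (lo <= d <= hi)) == truth for d, truth in zip(distances, is_abnormal)
    )
    return hits / len(distances)

def tune_w(distances, is_abnormal, q1, q3, w=1.5, step=0.1, target=0.95, max_steps=30):
    """Return the first W near the starting value that reaches the target, else None."""
    for j in range(max_steps + 1):
        candidates = (w,) if j == 0 else (w + j * step, w - j * step)
        for cand in candidates:
            if cand > 0 and accuracy_for_w(distances, is_abnormal, q1, q3, cand) >= target:
                return cand
    return None

# Illustrative case: 7.5 is known to be normal but falls outside the
# default interval [-1, 7], so W has to grow slightly.
dists = [1, 2, 3, 4, 5, 7.5]
all_normal = [False] * 6
tuned = tune_w(dists, all_normal, q1=2.0, q3=4.0)
```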

5. The method for detecting the abnormal data of the water supply pipe network according to claim 4, wherein: the default value of the weight coefficient W is 1.5.