CN115628776A - Water supply pipe network abnormal data detection method - Google Patents

Info

Publication number
CN115628776A
Authority
CN
China
Prior art keywords
class
data
distance
sample
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211312033.0A
Other languages
Chinese (zh)
Inventor
李守俊
李江
金波
何必仕
徐哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211312033.0A priority Critical patent/CN115628776A/en
Publication of CN115628776A publication Critical patent/CN115628776A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01DMEASURING NOT SPECIALLY ADAPTED FOR A SPECIFIC VARIABLE; ARRANGEMENTS FOR MEASURING TWO OR MORE VARIABLES NOT COVERED IN A SINGLE OTHER SUBCLASS; TARIFF METERING APPARATUS; MEASURING OR TESTING NOT OTHERWISE PROVIDED FOR
    • G01D21/00Measuring or testing not otherwise provided for
    • G01D21/02Measuring two or more variables by means not covered by a single other subclass

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for detecting abnormal data in a water supply network and belongs to the technical field of online monitoring of water supply networks. First, the monitoring points are reasonably grouped based on clustering of normal working conditions. Second, on the basis of the grouping and of the normal-working-condition clustering results of each measuring point group, the distance from every sample in a group to its class center is calculated. Then, a box plot is used to determine the discrimination threshold of each class in each measuring point group, and all sample data are checked. Finally, actual abnormal data detection is carried out. The method combines k-means clustering with box-plot discrimination, makes full use of the spatio-temporal correlation among nodes, and builds clustering models of local nodes, so that abnormal monitoring data of the water supply network can be identified accurately, which supports correct analysis of the network's operating state.

Description

Water supply pipe network abnormal data detection method
Technical Field
The invention relates to the technical field of water supply network online monitoring, in particular to a method for detecting abnormal data of a water supply network.
Background
With the development of Internet of Things technology, supervisory control and data acquisition (SCADA) systems for water supply networks have become increasingly widespread. Manually processing the massive measured data generated by online monitoring is very difficult. In recent years, various data-driven methods have been developed that automatically detect potential abnormal values and pre-screen massive data, which greatly reduces the amount of manual identification. Document [1] segments 56 months of online monitoring data from a water supply network monitoring station in Beijing by time period and season, constructs an autoregressive moving average (ARMA) model, and identifies abnormal values in an artificially simulated sequence through the confidence interval established by the ARMA model, thereby realizing self-identification of the data of an independent node. Document [2] takes the measured pressure data of a single pressure monitoring point in a residential community as sample data, denoises it by wavelet analysis, uses sample values reduced by 10% as abnormal data, and identifies abnormal data with the local outlier factor algorithm and the K-means algorithm. For abnormal value detection, the sample is processed by differencing adjacent data of the sample under test; the detection results before and after this processing, and under the different algorithms, are then compared. The results show that the processed sample data give a better detection effect and that the K-means algorithm performs relatively better.
Document [3] provides a big-data-based method for detecting abnormal data of a water supply network: after preprocessing, normal-distribution processing and detection with the three-standard-deviation (3σ) criterion are applied to the water supply network monitoring data, and the abnormal data of each sensor are determined with high detection precision. Documents [1], [2] and [3] use only the time series of a single monitoring point and do not exploit the implicit spatio-temporal correlation among multiple monitoring points of a water supply network, so their false positive rate is high.
Document [4] performs time-interval segmentation on online monitoring data of 8 months of 10 monitoring points of a certain domestic water supply network, determines the number of the optimal monitoring points through analysis of spatial topological relations among the monitoring points, and selects data of the monitoring points to construct a Support Vector Regression (SVR) model. After the model is optimized, the confidence interval established by the model is used for identifying the artificial simulation abnormal value. The result shows that the SVR model has good fitting performance, the abnormal value detection rate is up to 90%, and the interactive identification between the selected monitoring point data is realized. Document [4] utilizes an SVR model to mine the implicit logical relationship between monitoring points to predict the variation trend of data of a specific monitoring point, and further provides a prediction confidence interval to identify abnormal values. However, to ensure the accuracy of the SVR model, the monitored data was sliced into 96 segments at 15-minute intervals to reduce data fluctuations. Thus, 96 SVR models must be constructed, and the actual processing is complicated and inefficient. If season change occurs, the data fluctuation of the segments is increased, the fitting performance of the SVR model is poor, and abnormal value identification is directly influenced.
Therefore, a water supply network abnormal data detection method which fully utilizes the time-space correlation among the monitoring points, is simple and convenient to model and calculate is needed.
Reference documents:
[1] liu Shuming, wu Buppon, and vehicle break quality control using self-identifying water supply network monitoring data quality control [ J ], university of Qinghua academic newspaper (Nature science edition), vol.57No.9 2017.
[2] Yan sailing, water supply network anomaly detection data identification research [ D ] Tianjin university of science institute, 2022. DOI.
[3] Liu boat, liu red, lang dynasty, tianwuping, hanxing, a water supply network abnormal data detection method [ P ] based on big data: CN112612824A,2021-04-06.
[4] Liu Shuming, wu Buppon, and vehicle Down, detection of water supply and water discharge [ J ] based on cross-identified water supply network data outliers, 2015, vol.41No.11.
Disclosure of Invention
Aiming at the problems, the invention provides a water supply network abnormal data detection method, which combines a k-means clustering model and a boxplot (Box-plot) discrimination method, selects flow pressure data of local adjacent nodes to establish a clustering model by utilizing the time-space correlation among water supply network nodes, determines various specific critical threshold values and realizes the detection of the water supply network flow pressure abnormal data.
The invention comprises the following steps:
Step 1: Reasonably group the monitoring points based on clustering of normal working conditions.
The water supply network is composed of a plurality of nodes and connecting pipes. The water department generally arranges flow and pressure monitoring points at a tail end water-requiring node and a pipe network branch node so as to sense the hydraulic operation state of the pipe network. As the hydraulic states of the adjacent nodes have close relevance, the monitoring points are reasonably divided, and the measuring points with strong relevance form a group as much as possible, so that the operation condition of the involved area is favorably excavated.
The monitoring points covering the whole pipe network are divided into 1 group, 2 groups, …, N/2 groups (N is the number of measuring points; if N is odd, N/2 is rounded down). Assuming there are L grouping modes, k-means clustering is used to analyze the working conditions of the measuring point groups under each of the L grouping modes. From the clustering result of each grouping mode, the separability index ε(l)' and the contour coefficient S(l)' of that mode are calculated, l = 1, 2, …, L. The larger the two indexes, the better the clustering effect of the grouping mode, that is, the more reasonable the grouping of the monitoring points. Because the two indexes trend differently as the number of measuring points changes, a dual-axis line chart is drawn with the separability index ε(l)' and the contour coefficient S(l)' on the two Y axes and the number of monitoring points on the X axis, and a reasonable grouping of adjacent monitoring points is selected at the intersection of the two lines.
The concrete process of reasonably grouping the monitoring points is as follows:
First, following the proximity principle, the monitoring points covering the whole water supply network are divided into 1 group, 2 groups, …, N/2 groups (N is the number of measuring points; if N is odd, N/2 is rounded down), giving L grouping modes. Under each grouping mode, k-means clustering is performed on the normal flow and pressure data of each measuring point group to obtain its clustering result. For k-means clustering, the number of classes k is selected by the elbow method, whose core index is the sum of squared errors (SSE), calculated by formula (1):
$$\mathrm{SSE} = \sum_{d=1}^{k} \sum_{a \in E_d} \lVert a - c_d \rVert^2 \tag{1}$$
where k denotes the number of classes, E_d denotes the d-th class, a denotes a sample point in E_d, and c_d denotes the center of E_d. The SSE is calculated for several values of k, and a curve is drawn with the number of classes k on the X axis and the SSE on the Y axis to obtain the elbow diagram; the k value at the elbow is selected as the number of classes.
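As an illustration of formula (1), the following minimal Python sketch computes the SSE of a given clustering. The function name `sse` and the toy data are hypothetical and are not part of the patent; in practice the value would be computed for several k and plotted to locate the elbow.

```python
from typing import Sequence

Point = Sequence[float]


def sse(clusters: Sequence[Sequence[Point]], centers: Sequence[Point]) -> float:
    """Sum of squared errors: squared distance of every sample to its class center."""
    total = 0.0
    for members, center in zip(clusters, centers):
        for sample in members:
            total += sum((x - c) ** 2 for x, c in zip(sample, center))
    return total


# Hypothetical toy data: two classes of 2-D samples and their centers.
toy_clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
toy_centers = [(1.0, 0.0), (10.0, 11.0)]
toy_sse = sse(toy_clusters, toy_centers)
```

Each of the four toy samples lies at distance 1 from its center, so the sketch accumulates four unit contributions.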
The k-means clustering comprises the following specific processes:
Given a data set X containing n sample points, each of dimension m, that is, X = {X_1, X_2, X_3, …, X_n}:
(1) First, k sample points D = {D_1, D_2, D_3, …, D_k} are randomly selected from X as the initial class centers. The Euclidean distance from each sample point X_a to every cluster center is calculated by formula (2):
$$d(X_a, D_b) = \sqrt{\sum_{t=1}^{m} (X_{at} - D_{bt})^2} \tag{2}$$
where X_a denotes the a-th sample point, a ∈ [1, n]; D_b denotes the b-th cluster center, b ∈ [1, k]; X_at denotes the t-th attribute of the a-th sample point, t ∈ [1, m]; and D_bt denotes the t-th attribute of the b-th cluster center.
(2) The distances from each sample to all cluster centers are compared, and the sample is assigned to the class whose center is nearest, giving k class data sets {F_1, F_2, F_3, …, F_k}.
(3) From the classes obtained in step (2), the center point of each class is calculated as the new cluster center by formula (3):
$$C_g = \frac{1}{N_g} \sum_{h=1}^{N_g} X_h \tag{3}$$
where C_g denotes the g-th cluster center, g ∈ [1, k]; F_g denotes the g-th class; N_g is the number of sample points in F_g; and X_h is the h-th sample point in F_g, h ∈ [1, N_g].
(4) Steps (2) and (3) are iterated until the cluster centers no longer change.
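The four k-means steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `euclidean` and `kmeans`, the fixed random seed, the iteration cap, and the rule of keeping the old center when a class becomes empty are all choices made here for the sketch.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between a sample and a center, as in formula (2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(samples: List[Point], k: int, seed: int = 0, max_iter: int = 100):
    """Return (centers, labels) after iterating steps (2) and (3)."""
    rng = random.Random(seed)
    centers = rng.sample(samples, k)  # step (1): random initial centers
    labels: List[int] = []
    for _ in range(max_iter):
        # Step (2): assign every sample to its nearest center.
        labels = [min(range(k), key=lambda b: euclidean(s, centers[b])) for s in samples]
        # Step (3): recompute each center as the mean of its members.
        new_centers = []
        for g in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == g]
            if not members:  # assumption: an empty class keeps its old center
                new_centers.append(centers[g])
                continue
            dim = len(members[0])
            new_centers.append(
                tuple(sum(p[t] for p in members) / len(members) for t in range(dim))
            )
        if new_centers == centers:  # step (4): stop when the centers no longer change
            break
        centers = new_centers
    return centers, labels


# Hypothetical toy data: two well-separated blobs of 2-D samples.
toy_samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
               (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centers, toy_labels = kmeans(toy_samples, 2)
```

Which blob receives label 0 depends on the random initialization, so only the partition itself should be relied upon.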
Second, the separability index ε of the grouping is calculated by formula (4):
Figure BDA0003907404860000042
where D_pq denotes the Euclidean distance between the cluster centers of class p and class q, calculated by formula (5), in which m denotes the dimension of the cluster center points and C_pt and C_qt denote the t-th attribute of the p-th and q-th class centers, respectively:
$$D_{pq} = \sqrt{\sum_{t=1}^{m} (C_{pt} - C_{qt})^2} \tag{5}$$
and d_p and d_q denote the intra-class distances of class p and class q. The intra-class distance σ is the standard deviation of the distances from the sample points of a class to its cluster center, calculated by formula (6), where N denotes the number of sample-to-center distances in the class, x_e denotes the distance from sample point e to the class center, and μ denotes the mean of all distance values, calculated by formula (7):
$$\sigma = \sqrt{\frac{1}{N} \sum_{e=1}^{N} (x_e - \mu)^2} \tag{6}$$
$$\mu = \frac{1}{N} \sum_{e=1}^{N} x_e \tag{7}$$
Since each clustering usually contains several classes, there is a separability index between every pair of classes. Therefore, the mean of all pairwise separability indexes is taken as the separability index ε(l)' of the clustering under a grouping mode, as in formula (8):
$$\varepsilon(l)' = \frac{1}{U} \sum_{o=1}^{U} \varepsilon_o \tag{8}$$
where ε_o denotes the o-th separability index of the grouping mode, o ∈ [1, U], and U is the total number of separability indexes in that grouping mode.
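The quantities that feed the separability index can be sketched as below. The exact pairwise expression of formula (4) is not reproduced in this text, so the sketch takes it as a caller-supplied function `pair_index`; the ratio used in the toy call is only a hypothetical stand-in, not the patent's formula. The population form of the standard deviation (division by N) is likewise an assumption.

```python
import math
from itertools import combinations
from typing import Callable, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance, used both for samples and for class centers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def intra_class_distance(members: Sequence[Point], center: Point) -> float:
    """Standard deviation of member-to-center distances (population form assumed)."""
    dists = [euclidean(s, center) for s in members]
    mu = sum(dists) / len(dists)
    return math.sqrt(sum((x - mu) ** 2 for x in dists) / len(dists))


def mean_separability(
    clusters: Sequence[Sequence[Point]],
    centers: Sequence[Point],
    pair_index: Callable[[float, float, float], float],
) -> float:
    """Average the pairwise index over all class pairs, as described for formula (8)."""
    sigmas = [intra_class_distance(m, c) for m, c in zip(clusters, centers)]
    values = [
        pair_index(euclidean(centers[p], centers[q]), sigmas[p], sigmas[q])
        for p, q in combinations(range(len(centers)), 2)
    ]
    return sum(values) / len(values)


# Hypothetical toy data: two classes whose member-to-center distances are 2, 2, 0, 0.
class_a = [(0.0, 0.0), (4.0, 0.0), (2.0, 0.0), (2.0, 0.0)]
class_b = [(0.0, 10.0), (4.0, 10.0), (2.0, 10.0), (2.0, 10.0)]
toy_centers = [(2.0, 0.0), (2.0, 10.0)]
sigma_a = intra_class_distance(class_a, toy_centers[0])
# Stand-in pairwise index (an assumption): center distance over summed intra-class distances.
toy_eps = mean_separability([class_a, class_b], toy_centers,
                            lambda d_pq, d_p, d_q: d_pq / (d_p + d_q))
```

Passing the pairwise expression as a parameter keeps the averaging step independent of whichever form formula (4) actually takes.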
Then, the contour coefficient S(l)' of the grouping mode is calculated as follows:
(1) For a sample point w in a class, the distances from w to all other points in the same class are calculated, and their average, denoted a(w), expresses the degree of aggregation within the class.
(2) For each class other than the one to which w belongs, the average distance from w to all sample points of that class is calculated. Over all other classes, the class with the smallest average distance is found, and that average is denoted b(w), expressing the degree of separation between classes.
(3) For a sample point w, the contour coefficient is calculated by formula (9):
$$s(w) = \frac{b(w) - a(w)}{\max\{a(w),\, b(w)\}} \tag{9}$$
(4) The contour coefficients of all sample points are calculated, and their average is the contour coefficient S(l)' of the grouping mode.
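The contour coefficient described in steps (1) to (4) corresponds to what is commonly called the silhouette coefficient. A minimal pure-Python sketch follows; the function name `mean_contour_coefficient` is hypothetical, and scoring a single-member class as 0 is a common convention assumed here rather than something stated in the text.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_contour_coefficient(samples: List[Point], labels: List[int]) -> float:
    """Average of (b(w) - a(w)) / max(a(w), b(w)) over all samples."""
    n = len(samples)
    classes = set(labels)
    scores = []
    for i in range(n):
        own = [euclidean(samples[i], samples[j])
               for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:  # assumption: a class with a single member scores 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # step (1): cohesion within the own class
        b = math.inf             # step (2): nearest other class by mean distance
        for c in classes - {labels[i]}:
            other = [euclidean(samples[i], samples[j]) for j in range(n) if labels[j] == c]
            b = min(b, sum(other) / len(other))
        scores.append((b - a) / max(a, b))  # step (3)
    return sum(scores) / n                   # step (4)


# Hypothetical 1-D toy data with two classes.
toy_samples = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
toy_score = mean_contour_coefficient(toy_samples, toy_labels)
```

The sketch assumes at least two classes are present; with a single class there is no "other" class and b(w) is undefined.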
Finally, with the separability index and the contour coefficient of every grouping mode obtained, the dual-axis chart is drawn and the grouping mode with the better clustering effect is selected, which determines the reasonable grouping of the measuring points.
Step 2: On the basis of the reasonable grouping of monitoring points obtained in step 1, calculate the Euclidean distance from every sample in a group to its class center.
Step 1 yields i reasonably grouped measuring point groups and, for each group, j classes from clustering. The Euclidean distance from every sample in each measuring point group to the center of its class is then calculated by formula (2). Once all distance values are obtained, the distance sets Dis_ij are labeled by measuring point group i and cluster class j, and the whole collection is denoted Distance:
$$\mathrm{Distance} = \{\{Dis_{11}, \ldots, Dis_{1 j_1}\}, \{Dis_{21}, \ldots, Dis_{2 j_2}\}, \ldots, \{Dis_{i1}, \ldots, Dis_{i j_i}\}\} \tag{10}$$
where Dis_ij denotes the set of Euclidean distances from all sample points of the j-th class of the i-th measuring point group to their cluster center. Because the number of classes may differ between measuring point groups, the value of j depends on the group: j ∈ {j_1, j_2, j_3, …, j_i}, where j_i denotes the number of classes of the i-th measuring point group.
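For a single measuring point group, collecting the distance sets per class can be sketched as below. The function name `class_distance_sets`, the 0-based class indices and the toy data are assumptions made for illustration only.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_distance_sets(
    samples: Sequence[Point], labels: Sequence[int], centers: Sequence[Point]
) -> Dict[int, List[float]]:
    """Map each class j of one group to the distances of its samples to center j."""
    sets: Dict[int, List[float]] = defaultdict(list)
    for sample, j in zip(samples, labels):
        sets[j].append(euclidean(sample, centers[j]))
    return dict(sets)


# Hypothetical group with two classes.
group_samples = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
group_labels = [0, 0, 1]
group_centers = [(0.0, 0.0), (10.0, 10.0)]
dis_sets = class_distance_sets(group_samples, group_labels, group_centers)
```

Running the same function once per measuring point group and storing the results under the group index gives a nested structure analogous to Distance.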
Step 3: Use a box plot to determine the discrimination threshold of each class in each measuring point group, and check all sample data.
The distance data Distance of the reasonable grouping are analyzed with box plots to obtain the box-plot parameters of every class in each measuring point group. The threshold interval for normal data runs from the lower limit to the upper limit of the box plot; sample data whose distance falls outside this interval are abnormal. The upper discrimination threshold max_ij and the lower discrimination threshold min_ij are labeled by measuring point group i and cluster class j, and the collection is denoted [min, max].
Figure BDA0003907404860000061
where max_ij and min_ij denote the upper and lower discrimination thresholds of the j-th class of the i-th measuring point group. The value of j depends on the measuring point group: j ∈ {j_1, j_2, j_3, …, j_i}, where j_i denotes the number of classes of the i-th measuring point group.
The calculation process of the box plot parameters is as follows:
(1) Calculate the upper quartile Q_3 and the lower quartile Q_1 of the distance data. For example, given v distance values sorted in ascending order, Q_3 is the value at position T_1 and Q_1 is the value at position T_2, where T_1 and T_2 are given by formula (11) and formula (12):
Figure BDA0003907404860000062
Figure BDA0003907404860000063
(2) Calculate the interquartile range IQR by formula (13):
IQR = Q_3 - Q_1 (13)
(3) Calculate the upper limit Max and the lower limit Min by formula (14) and formula (15):
Max = Q_3 + W*IQR (14)
Min = Q_1 - W*IQR (15)
where W is the weighting coefficient applied to the interquartile range IQR.
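A minimal sketch of the box-plot threshold calculation is given below, using the conventional fences (upper limit Q_3 + W*IQR, lower limit Q_1 - W*IQR). Several points are assumptions: the quartile positions of formulas (11) and (12) are not reproduced in this text, so Python's `statistics.quantiles` with the inclusive method stands in for them; the function name `box_thresholds` is hypothetical; and clipping a negative lower limit to 0 follows the rule stated later in the embodiment.

```python
import statistics
from typing import Sequence, Tuple


def box_thresholds(
    distances: Sequence[float], w: float = 1.5, clip_at_zero: bool = True
) -> Tuple[float, float]:
    """Return (lower, upper) discrimination thresholds for one class of one group."""
    # Assumption: inclusive linear interpolation stands in for formulas (11) and (12).
    q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - w * iqr
    upper = q3 + w * iqr
    if clip_at_zero:  # a distance cannot be negative
        lower = max(0.0, lower)
    return lower, upper


# Hypothetical distance set for one class.
toy_lower, toy_upper = box_thresholds([1.0, 2.0, 3.0, 4.0, 5.0])
```

For the toy data the quartiles are 2 and 4, so the unclipped fences would be -1 and 7; the lower one is clipped to 0.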
All sample data are then divided into their measuring point groups, and the distance from each sample to the class centers of its group is calculated by formula (2). Each sample is assigned to its nearest class, and the distance values dis_ij are labeled by measuring point group i and cluster class j; the collection is denoted distance:
$$\mathrm{distance} = \{\{dis_{11}, \ldots, dis_{1 j_1}\}, \{dis_{21}, \ldots, dis_{2 j_2}\}, \ldots, \{dis_{i1}, \ldots, dis_{i j_i}\}\}$$
where dis_ij denotes the set of distances from the sample data of the j-th class of the i-th measuring point group to the class center. The value of j corresponds to the number of classes of the reasonable grouping: j ∈ {j_1, j_2, j_3, …, j_i}, where j_i denotes the number of classes of the i-th measuring point group.
Each distance value in dis_ij is compared with the corresponding discrimination threshold interval [min_ij, max_ij]; a value outside the interval is abnormal.
A confusion matrix is then obtained for each measuring point group from the detection results, and the detection accuracy is calculated. The confusion matrix of a single measuring point group is shown in Table 1.
Table 1 Confusion matrix
Classification        Actually abnormal   Actually normal
Judged abnormal       TP                  FP
Judged normal         FN                  TN
The data detection accuracy Accuracy_i of measuring point group i is calculated by formula (16):
$$\mathrm{Accuracy}_i = \frac{TP + TN}{TP + FP + FN + TN} \tag{16}$$
where Accuracy_i denotes the detection accuracy of the i-th measuring point group, TP is the number of abnormal data correctly detected as abnormal, FP is the number of normal data detected as abnormal, FN is the number of abnormal data detected as normal, and TN is the number of normal data correctly detected as normal.
If Accuracy_i is below 95%, the discrimination threshold intervals of the classes of that measuring point group need to be adjusted by making a small change to the weighting coefficient W applied to IQR in formula (14) and formula (15); the default value of W is 1.5. After each small adjustment, all sample data are checked again, and the adjustment continues until Accuracy_i reaches the standard.
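The accuracy check and the adjustment of W can be sketched as follows. The text does not say in which direction or by what step W is adjusted, so the candidate list in `tune_w` is purely an assumption, as are the function names and the toy data; the fences use the conventional form (upper limit Q_3 + W*IQR, lower limit Q_1 - W*IQR, clipped at 0).

```python
from typing import Optional, Sequence, Tuple


def confusion_counts(
    distances: Sequence[float], is_abnormal: Sequence[bool], lower: float, upper: float
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) when values outside [lower, upper] are flagged abnormal."""
    tp = fp = fn = tn = 0
    for d, actual in zip(distances, is_abnormal):
        flagged = not (lower <= d <= upper)
        if flagged and actual:
            tp += 1
        elif flagged:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of correctly judged samples among all samples."""
    return (tp + tn) / (tp + fp + fn + tn)


def tune_w(
    distances: Sequence[float],
    is_abnormal: Sequence[bool],
    q1: float,
    q3: float,
    candidates: Sequence[float] = (1.5, 1.4, 1.6, 1.3, 1.7, 1.2, 1.8),  # assumed search order
    target: float = 0.95,
) -> Optional[Tuple[float, float]]:
    """Try W values in order; return the first (W, accuracy) that reaches the target."""
    iqr = q3 - q1
    for w in candidates:
        lower, upper = max(0.0, q1 - w * iqr), q3 + w * iqr
        acc = accuracy(*confusion_counts(distances, is_abnormal, lower, upper))
        if acc >= target:
            return w, acc
    return None


# Hypothetical labeled distances for one class, with quartiles Q1 = 2 and Q3 = 4.
toy_truth = [False, False, False, False, False, True]
result_default = tune_w([1.0, 2.0, 3.0, 4.0, 5.0, 7.2], toy_truth, q1=2.0, q3=4.0)
result_adjusted = tune_w([1.0, 2.0, 3.0, 4.0, 5.0, 6.5], toy_truth, q1=2.0, q3=4.0)
```

In the first toy call the default W = 1.5 already reaches the target; in the second, the abnormal value 6.5 lies inside the default fence, and the search only succeeds once W is lowered to 1.2.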
Step 4: Actual abnormal data detection.
For the node flow and pressure data currently sampled at the monitoring points:
(1) Group the data by measuring point group and construct the current sample of each group.
(2) For the current sample of each measuring point group (denoted group r), calculate its distance to the center of each class in the group, assign it to the nearest class s, and record the intra-class distance dis'_rs.
(3) For the current sample of each measuring point group, compare the distance value in dis'_rs with the discrimination threshold interval [min_rs, max_rs] of its class; if it lies within the interval the sample is judged normal, otherwise abnormal.
Detection of the currently sampled data is complete once the current samples of all groups have been checked.
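The online check of one current sample in step 4 can be sketched as below. The function name `detect`, the 0-based class indices and the toy centers and thresholds are hypothetical; in practice the centers come from the normal-condition clustering of the group and the thresholds from the box-plot analysis of each class.

```python
import math
from typing import Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect(
    sample: Point,
    centers: Sequence[Point],
    thresholds: Sequence[Tuple[float, float]],
) -> Tuple[int, float, bool]:
    """Return (nearest class s, distance to its center, is_abnormal) for one sample."""
    dists = [euclidean(sample, c) for c in centers]
    s = min(range(len(centers)), key=dists.__getitem__)  # nearest class
    lower, upper = thresholds[s]
    return s, dists[s], not (lower <= dists[s] <= upper)


# Hypothetical group with two classes and their threshold intervals.
toy_centers = [(0.0, 0.0), (10.0, 10.0)]
toy_thresholds = [(0.0, 2.0), (0.0, 3.0)]
near = detect((1.0, 0.0), toy_centers, toy_thresholds)  # close to class 0
far = detect((5.0, 4.0), toy_centers, toy_thresholds)   # nearest to class 0 but far away
```

A sample that sits between the known operating conditions, like the second toy sample, is still assigned to its nearest class but is flagged because its distance exceeds that class's upper threshold.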
The invention has the beneficial effects that: the method adopts k-means clustering and box-type graph distinguishing methods, makes full use of the time-space correlation among the nodes, establishes a clustering model of local nodes, can accurately identify abnormal data detected by the water supply network, and provides guarantee for correctly analyzing the running state of the water supply network.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is an elbow diagram in different groupings;
FIG. 3 is a line graph of the separability index and the contour coefficient in different grouping modes.
Detailed Description
A total of 1955 flow and pressure records from July 2021 of the JS water delivery project, comprising 257 abnormal records and 1698 normal records, are used to illustrate abnormal data detection in this embodiment, and part of the data from August 6, 2021 is used to verify actual abnormal data detection. The overall flow of the invention is shown in FIG. 1, and the specific steps are as follows:
step 1, reasonably grouping monitoring points based on normal working condition clustering.
And grouping all monitoring points according to the distribution of the measuring points of the water supply network and the proximity principle. The diversion project has 21 monitoring points, and the monitoring points are distributed as shown in the following table 2 according to 1 group, 4 groups, 5 groups, 7 groups and 10 groups:
TABLE 2 monitoring Point grouping
Figure BDA0003907404860000081
Figure BDA0003907404860000091
Because the number of monitoring points does not divide evenly, the groups within a grouping mode cannot all contain the same number of points. Preliminary clustering is therefore performed on the data of the most common group size in each grouping mode, using the elbow diagram as the basis. For the 5 grouping modes (1, 4, 5, 7 and 10 groups), the most common group sizes are 21, 6, 4, 3 and 2 measuring points, and the corresponding elbow diagrams are shown in FIG. 2. In FIG. 2, the horizontal axis is K and the vertical axis is SSE. As K increases, SSE first falls quickly and then more slowly, so the SSE curve resembles an elbow. Once the decline flattens, increasing K no longer reduces SSE significantly. In general, the K value at which SSE passes from a marked decline to a gradual one is a suitable number of clusters. Reading the elbow diagrams in FIG. 2 in this way, the K values of grouping modes 1 to 5 are selected to be 12, 4, and 4 in sequence. The elbow diagram of grouping mode 1 flattens after K = 12, while SSE drops markedly before K = 12, so 12 is selected as the clustering K value of grouping mode 1.
The separability index ε(l)' and the contour coefficient S(l)' of the 5 grouping modes were calculated, as shown in Table 3.
Table 3 Separability index ε(l)' and contour coefficient S(l)'
Grouping mode               1       2       3       4       5
Number of groups            1       4       5       7       10
Separability index ε(l)'    5.08    4.61    4.55    4.44    4.23
Contour coefficient S(l)'   0.287   0.391   0.418   0.461   0.466
The corresponding dual-axis line chart of the separability index and the contour coefficient is drawn, as shown in FIG. 3. According to the intersection point, grouping mode 4 is selected as the reasonable grouping, that is, the monitoring points are divided into 7 groups of 3 monitoring points each.
Step 2: On the basis of the reasonable grouping of monitoring points obtained in step 1, calculate the Euclidean distance from every sample in a group to its class center.
Taking the 5th and 7th measuring point groups of grouping mode 4 as examples, the Euclidean distances from all samples in each group to their class centers are calculated by formula (2). The distance value sets Dis_ij are labeled by group i and class j, and the collection is denoted Distance.
Distance = {{Dis_51, Dis_52, Dis_53, Dis_54, Dis_55}, {Dis_71, Dis_72, Dis_73, Dis_74}}
Step 3: Use a box plot to determine the discrimination threshold of each class in each measuring point group, and check all sample data.
The distance data set Dis_ij of each class in the distance value sets of groups 5 and 7 is analyzed with the box-plot rule to obtain the discrimination threshold interval [min_ij, max_ij] of the distance values in each class, as shown in Table 4. Because a distance cannot be negative, any negative lower limit is set to 0.
TABLE 4 discrimination thresholds for groups 5 and 7
Figure BDA0003907404860000101
All sample data are divided into measuring point groups according to grouping mode 4. For each sample, the distances to all class centers of its measuring point group are calculated, the sample is assigned to the nearest class, and the distance value sets dis_ij are labeled by group i and class j to obtain distance.
distance = {{dis_51, dis_52, dis_53, dis_54, dis_55}, {dis_71, dis_72, dis_73, dis_74}}
And comparing the distance values from all the sample data to the center of the belonged class with the differential threshold interval of the belonged class, if the distance values are outside the interval, judging the sample data to be abnormal data, and if the distance values are within the interval, judging the sample data to be normal data, and finally obtaining a confusion matrix of all sample data detection, wherein the result is shown in a table 5.
TABLE 5 set 5, 7 sample data testing confusion matrix
Figure BDA0003907404860000102
Figure BDA0003907404860000111
The data detection accuracies Accuracy_5 and Accuracy_7 are calculated, and the results are shown in Table 6.
Table 6 Abnormal data detection accuracy
Group       Group 5   Group 7
Accuracy    98%       99.2%
The detection accuracy of the data of the 5 th and 7 th measuring point groups reaches the standard, and the parameter W in the formula (14) and the formula (15) does not need to be adjusted.
Step 4: Actual abnormal data detection.
The discrimination threshold intervals obtained in the first 3 steps (Table 4) are used to check part of the data from August 6. Groups 5 and 7 are taken as examples:
(1) Group the data by measuring point group to construct the current samples of group 5 and group 7.
(2) For the current samples of group 5 and group 7, calculate the distance to each class center in the group and assign each sample to its nearest class j. The 12 current samples of group 5 are assigned to classes {4,4,4,4,1,4,3,2,2,2,2,2}, and the 12 current samples of group 7 to classes {4,2,2,2,2,2,2,2,4,2,4,2}.
In group 5, the intra-class distances of the sample points are recorded as dis'_51 for class 1, dis'_52 for class 2, dis'_53 for class 3 and dis'_54 for class 4, where dis'_51 = {51.69}, dis'_52 = {62.16, 70.18, 65.95, 65.41, 65.07}, dis'_53 = {97.23} and dis'_54 = {45.27, 45.56, 45.58, 45.15}. In group 7, the intra-class distances are recorded as dis'_72 for class 2 and dis'_74 for class 4, where dis'_72 = {208.06, 207.93, 208.51, 208.28, 208.30, 208.04, 207.86, 208.16, 173.87} and dis'_74 = {53.25, 53.48, 53.73}.
(3) For the current samples of group 5 and group 7, compare the distance values in dis'_ij with the discrimination threshold interval [min_ij, max_ij] of the class; a value within the interval is judged normal, otherwise abnormal. The comparison shows that, in group 5, the distances of the sample points assigned to classes 1, 2, 3 and 4 neither exceed the upper threshold nor fall below the lower threshold, so they are judged normal. In group 7, distances from sample points assigned to class 2 to the class center exceed the upper threshold and are judged abnormal, while the distances of the sample points assigned to class 4 stay within the thresholds and are judged normal.
The detected abnormal data was found to be in accordance with the actual data, and as shown in table 7, a good abnormal data detection effect was obtained.
Table 7. Detection results for actual abnormal data

Group 5 | Actually abnormal | Actually normal
Predicted abnormal | 0 | 0
Predicted normal | 0 | 12

Group 7 | Actually abnormal | Actually normal
Predicted abnormal | 8 | 0
Predicted normal | 0 | 4
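
The agreement reported in Table 7 can be checked with a little arithmetic. The sketch below is only an illustration, not part of the patented method: it assumes the usual definition of accuracy (correctly classified samples divided by all samples), and the function name `accuracy` and the dictionary layout are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Share of samples whose predicted label matches the actual label."""
    total = tp + fp + fn + tn
    return (tp + tn) / total


# Counts read from Table 7, in the order:
# (predicted abnormal & actually abnormal, predicted abnormal & actually normal,
#  predicted normal & actually abnormal,  predicted normal & actually normal)
table7 = {
    "group 5": (0, 0, 0, 12),
    "group 7": (8, 0, 0, 4),
}

results = {name: accuracy(*counts) for name, counts in table7.items()}
for name, value in results.items():
    print(f"{name}: accuracy = {value:.2f}")
```

Both groups contain 12 current samples and no off-diagonal entries, which is why the detection result matches the actual data.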

Claims (5)

1. A method for detecting abnormal data of a water supply pipe network, characterized by comprising the following steps:
step 1, reasonably grouping the monitoring points on the basis of clustering under normal working conditions;
S1.1, dividing the monitoring points covering the whole water supply pipe network, according to the adjacency principle, into 1 group, 2 groups, …, N/2 groups, where N is the number of measuring points and N/2 is rounded down if N is odd, giving L grouping modes in total;
under each grouping mode, performing k-means clustering on the normal flow and pressure data of each measuring point group to obtain the clustering result of that group;
in the k-means clustering, selecting the number of cluster categories k by the elbow method, whose core index is the sum of squared errors (SSE);
calculating the SSE for a plurality of k values, plotting a curve with k on the X axis and the SSE on the Y axis to obtain an elbow diagram, and selecting the k value at the elbow as the number of cluster categories;
S1.2, calculating the separability index ε of the grouping according to formula (4):
Figure FDA0003907404850000011
where $D_{pq}$ represents the Euclidean distance between the cluster centers of class p and class q, calculated according to formula (5), and $d_p$ and $d_q$ represent the intra-class distances of class p and class q, respectively;
$$D_{pq}=\sqrt{\sum_{t=1}^{m}\left(C_{pt}-C_{qt}\right)^{2}} \tag{5}$$
where m represents the dimension of the cluster center point, and $C_{pt}$ and $C_{qt}$ represent the t-th attribute of the centers of class p and class q, respectively;
the intra-class distance σ is the standard deviation of the distances from the sample points of a class to its cluster center, calculated according to formula (6):
Figure FDA0003907404850000013
where N represents the number of sample-point-to-cluster-center distances in the class, $x_e$ represents the distance from sample point e to the class center, and μ represents the mean of all the distance values, calculated according to formula (7):
$$\mu=\frac{1}{N}\sum_{e=1}^{N}x_{e} \tag{7}$$
taking the mean of all separability indexes as the separability index ε(l)' of the clustering under a grouping mode, calculated according to formula (8):
$$\varepsilon(l)'=\frac{1}{U}\sum_{o=1}^{U}\varepsilon_{o} \tag{8}$$
where $\varepsilon_o$ represents the o-th separability index of the grouping mode, $o \in [1, U]$, and U represents the total number of separability indexes under the grouping mode;
S1.3, calculating the silhouette coefficient S(l)' of the grouping mode, the specific calculation process being as follows:
S1.3.1, calculating the distances between a sample point w of a given class and all other sample points of the same class, and taking their mean, denoted a(w), which represents the degree of aggregation within the class;
S1.3.2, taking a class other than the one to which sample point w belongs, calculating the distances between w and all sample points of that class and taking their mean; traversing all other classes in this way, finding the nearest class, and denoting the mean distance from w to all sample points of that class as b(w), which represents the degree of separation between classes;
S1.3.3, for the sample point w, calculating the silhouette coefficient as follows:
$$s(w)=\frac{b(w)-a(w)}{\max\{a(w),\,b(w)\}}$$
S1.3.4, calculating the silhouette coefficients of all sample points and taking their mean, which is the silhouette coefficient S(l)' of the grouping mode;
S1.4, obtaining the separability index and the silhouette coefficient under each grouping mode, drawing a dual-axis line chart with the separability index and the silhouette coefficient on the Y axes and the number of monitoring points on the X axis, selecting a reasonable grouping mode of adjacent monitoring points according to the intersection of the two polylines, and thereby determining the reasonable grouping of the measuring points;
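
Steps S1.3.1 to S1.3.4 can be sketched in a few lines of Python. This is a minimal illustration of the silhouette (contour) coefficient under stated assumptions, not the applicant's implementation: the names `mean_distance` and `silhouette_of_grouping` and the toy points are hypothetical, and a sample that is alone in its class is given a score of 0, a convention the claim does not address.

```python
import math


def mean_distance(p, others):
    """Mean Euclidean distance from point p to every point in `others`."""
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette_of_grouping(points, labels):
    """Mean silhouette coefficient over all sample points (needs >= 2 classes)."""
    classes = sorted(set(labels))
    scores = []
    for w, p in enumerate(points):
        # S1.3.1: a(w), mean distance to the other members of w's own class
        own = [q for i, q in enumerate(points) if labels[i] == labels[w] and i != w]
        if not own:
            scores.append(0.0)  # assumption: singleton class scores 0
            continue
        a = mean_distance(p, own)
        # S1.3.2: b(w), smallest mean distance to the members of any other class
        b = min(
            mean_distance(p, [q for q, lab in zip(points, labels) if lab == c])
            for c in classes
            if c != labels[w]
        )
        # S1.3.3: s(w) = (b - a) / max(a, b)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    # S1.3.4: average over all sample points
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_of_grouping(points, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_of_grouping(points, [0, 1, 0, 1])   # classes mixed across space
```

A value close to 1 indicates compact, well-separated classes, while a negative value indicates that samples sit closer to another class than to their own.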
step 2, on the basis of the reasonable grouping of the monitoring points obtained in step 1, calculating the Euclidean distance from every sample in each group to the center of its class;
after all the Euclidean distance values are obtained, labeling them by measuring point group i and cluster category j as $Dis_{ij}$, the collection being recorded as Distance;
step 3, determining the difference threshold of each category in each measuring point group by means of box plots, and inspecting all sample data;
analyzing the reasonably grouped Euclidean distance data Distance with box plots to obtain the box plot parameters of every category in every measuring point group, the discrimination threshold interval for normal data running from the lower limit to the upper limit of the box plot and being denoted [min, max];
dividing all sample data into the corresponding groups according to the measuring point grouping, and calculating the Euclidean distance from each sample to every class center of its group; assigning each sample to the class with the shortest Euclidean distance, and labeling the Euclidean distance values by measuring point group i and cluster category j as $dis_{ij}$, the collection being recorded as distance;
comparing each Euclidean distance value in $dis_{ij}$ with the corresponding discrimination threshold interval $[min_{ij}, max_{ij}]$, a value outside the interval being abnormal;
obtaining a confusion matrix from the detection results, and calculating the detection accuracy;
step 4, detecting actual abnormal data in the node flow and pressure data currently sampled at the monitoring points;
S4.1, constructing the current sample of each measuring point group according to the measuring point grouping;
S4.2, calculating the distance from the current sample of each measuring point group to every class center in the group, assigning the sample to the nearest class s in the group, and recording the intra-class distance as $dis'_{rs}$;
S4.3, for the current sample of each measuring point group, comparing the distance value in the intra-class distance $dis'_{rs}$ with the difference threshold interval $[min_{ij}, max_{ij}]$ of the class to which the sample belongs, the sample being normal if the value lies within the interval and abnormal otherwise;
the detection of the current sampling data being finished after the current samples of all groups have been inspected.
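
Steps S4.2 and S4.3 amount to a nearest-center assignment followed by an interval check. The following sketch illustrates that logic for one measuring point group; it is not the applicant's code, and the function name `detect`, the class centers and the threshold intervals are all hypothetical values chosen for illustration.

```python
import math


def detect(sample, centers, thresholds):
    """Return (nearest class index, distance to its center, is_abnormal).

    centers    : one center per class of the measuring point group
    thresholds : one (min, max) interval per class, e.g. obtained in step 3
    """
    # S4.2: distance to every class center, keep the nearest class s
    distances = [math.dist(sample, c) for c in centers]
    s = min(range(len(centers)), key=distances.__getitem__)
    d = distances[s]
    # S4.3: normal if the distance lies inside [min, max] of class s
    lo, hi = thresholds[s]
    return s, d, not (lo <= d <= hi)


# Hypothetical two-class group with two attributes per sample
centers = [(0.0, 0.0), (10.0, 10.0)]
thresholds = [(0.0, 2.0), (0.0, 3.0)]

normal_case = detect((1.0, 1.0), centers, thresholds)    # near class 0, inside its interval
abnormal_case = detect((6.0, 6.0), centers, thresholds)  # nearest to class 1, but too far away
```

Running `detect` once per measuring point group covers one sampling instant; the detection is complete when every group has been checked.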
2. The method for detecting abnormal data of a water supply pipe network according to claim 1, wherein the k-means clustering in step 1 proceeds as follows:
given a data set X containing n sample points, each of dimension m, that is, $X = \{X_1, X_2, X_3, \ldots, X_n\}$;
S1.1.1, first randomly selecting k sample points $D = \{D_1, D_2, D_3, \ldots, D_k\}$ from the data set X as the initial class centers, and calculating the Euclidean distance from each sample point $X_a$ to every cluster center according to formula (2):
$$d(X_a, D_b)=\sqrt{\sum_{t=1}^{m}\left(X_{at}-D_{bt}\right)^{2}} \tag{2}$$
where $X_a$ represents the a-th sample point, $a \in [1, n]$; $D_b$ represents the b-th cluster center, $b \in [1, k]$; $X_{at}$ represents the t-th attribute of the a-th sample point, $t \in [1, m]$; and $D_{bt}$ represents the t-th attribute of the b-th cluster center;
S1.1.2, comparing the distances from each sample to all cluster centers, finding the class center with the minimum distance, and assigning the sample point to the class of that nearest cluster center, thereby obtaining the data sets of the k classes $\{F_1, F_2, F_3, \ldots, F_k\}$;
S1.1.3, according to the classes obtained in S1.1.2, calculating the center point of each class as the new cluster center according to formula (3):
$$C_{g}=\frac{1}{N_{g}}\sum_{h=1}^{N_{g}}X_{h} \tag{3}$$
where $C_g$ represents the g-th cluster center, $g \in [1, k]$; $F_g$ represents the g-th class; $N_g$ represents the number of sample points contained in $F_g$; and $X_h$ represents the h-th sample point in $F_g$, $h \in [1, N_g]$;
S1.1.4, iterating steps S1.1.2 and S1.1.3 until the cluster centers no longer change.
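
The iteration described in S1.1.1 to S1.1.4 can be sketched as follows. This is a plain-Python illustration under stated assumptions rather than the applicant's implementation: the function name `kmeans`, the `seed` and `max_iter` parameters and the toy data are hypothetical, and an empty class simply keeps its previous center, a case the claim does not discuss.

```python
import math
import random


def kmeans(X, k, seed=0, max_iter=100):
    """Cluster the points in X into k classes; returns (centers, labels)."""
    rng = random.Random(seed)
    # S1.1.1: pick k sample points at random as the initial class centers
    centers = [tuple(p) for p in rng.sample(X, k)]
    labels = []
    for _ in range(max_iter):
        # S1.1.2: assign every sample to the class of its nearest center
        labels = [min(range(k), key=lambda b: math.dist(x, centers[b])) for x in X]
        # S1.1.3: recompute each center as the attribute-wise mean of its class
        new_centers = []
        for g in range(k):
            members = [x for x, lab in zip(X, labels) if lab == g]
            if members:
                m = len(members[0])
                new_centers.append(
                    tuple(sum(x[t] for x in members) / len(members) for t in range(m))
                )
            else:
                new_centers.append(centers[g])  # assumption: keep an empty class's center
        # S1.1.4: stop once the centers no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical data: two well-separated groups of (flow, pressure)-like pairs
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(X, k=2)
```

In practice the result depends on the random initialization, so the run is usually repeated with several seeds and the solution with the lowest SSE is kept.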
3. The method for detecting abnormal data of a water supply pipe network according to claim 1, wherein the box plot parameters in step 3 are calculated as follows:
S3.1, calculating the upper quartile $Q_3$ and the lower quartile $Q_1$ of the distance data: the v distance values are sorted in ascending order, $Q_3$ is the value at position $T_1$ and $Q_1$ is the value at position $T_2$, where $T_1$ and $T_2$ are given by formulas (11) and (12):
Figure FDA0003907404850000041
Figure FDA0003907404850000042
S3.2, calculating the interquartile range IQR according to formula (13):
$$\mathrm{IQR}=Q_{3}-Q_{1} \tag{13}$$
S3.3, calculating the upper limit Max and the lower limit Min according to formulas (14) and (15):
$$\mathrm{Max}=Q_{3}+W\cdot\mathrm{IQR} \tag{14}$$
$$\mathrm{Min}=Q_{1}-W\cdot\mathrm{IQR} \tag{15}$$
where W is the weight coefficient of the interquartile range IQR.
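
The threshold interval of one class can be sketched as below. The example assumes the conventional box plot fences (upper limit $Q_3 + W \cdot \mathrm{IQR}$, lower limit $Q_1 - W \cdot \mathrm{IQR}$) and uses Python's `statistics.quantiles` with the "inclusive" method to obtain the quartiles; the quartile positions of formulas (11) and (12) may follow a different convention, so the numbers are illustrative only. The function name `box_thresholds` and the distance values are hypothetical.

```python
import statistics


def box_thresholds(distances, w=1.5):
    """Return (Min, Max) of the box plot for one class of one measuring point group."""
    # Quartiles of the distance data (quantiles() sorts the data internally)
    q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    iqr = q3 - q1                 # interquartile range, formula (13)
    upper = q3 + w * iqr          # conventional upper fence
    lower = q1 - w * iqr          # conventional lower fence
    return lower, upper


# Hypothetical distances from normal samples to their class center
normal_distances = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lower, upper = box_thresholds(normal_distances)
print(lower, upper)
```

For these nine values the quartiles are 3 and 7, so the interquartile range is 4 and the interval with W = 1.5 is [-3, 13]. Because a distance cannot be negative, only the upper limit is effective in such a case.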
4. The method for detecting abnormal data of a water supply pipe network according to claim 1 or 3, wherein: if the accuracy in step 3 does not reach 95%, the discrimination threshold interval of each category of each measuring point group is adjusted by slightly changing the weight coefficient W that multiplies IQR in formulas (14) and (15); after each small adjustment, all sample data are inspected again, and the adjustment continues until the accuracy reaches 95%.
5. The method for detecting abnormal data of a water supply pipe network according to claim 4, wherein the default value of the weight coefficient W is 1.5.
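
The adjustment loop of claims 4 and 5 can be sketched as a small search over W. The claims do not state the direction or the size of each adjustment, so the search order used here (candidates at increasing distance from the default 1.5, in steps of 0.1) is purely an assumption, as are the function names `fences`, `accuracy` and `tune_w`, the conventional box plot fences and the sample values.

```python
import statistics


def fences(normal_distances, w):
    """Box plot interval (Min, Max) using the conventional fences."""
    q1, _median, q3 = statistics.quantiles(normal_distances, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - w * iqr, q3 + w * iqr


def accuracy(distances, is_abnormal, bounds):
    """Share of samples whose interval check agrees with the known label."""
    lo, hi = bounds
    hits = sum((not (lo <= d <= hi)) == flag for d, flag in zip(distances, is_abnormal))
    return hits / len(distances)


def tune_w(normal_distances, distances, is_abnormal,
           w0=1.5, step=0.1, target=0.95, max_steps=10):
    """Try W = w0, w0 -/+ step, w0 -/+ 2*step, ... until the target accuracy is met."""
    candidates = [w0]
    for n in range(1, max_steps + 1):
        candidates += [w0 - n * step, w0 + n * step]
    for w in candidates:
        if w <= 0:
            continue  # a non-positive weight would not give a usable interval
        acc = accuracy(distances, is_abnormal, fences(normal_distances, w))
        if acc >= target:
            return w, acc
    return None  # no candidate reached the target accuracy


# Hypothetical data: distances of normal samples, plus labelled inspection samples
normal_distances = [1, 2, 3, 4, 5, 6, 7, 8, 9]
inspection = [2, 5, 8, 12, 20]
labels = [False, False, False, True, True]
best = tune_w(normal_distances, inspection, labels)
```

With these values the default W = 1.5 gives an upper limit of 13, so the abnormal sample at distance 12 is missed and the accuracy is 0.8; lowering W to about 1.2 moves the upper limit to 11.8 and every sample is classified correctly.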
CN202211312033.0A 2022-10-25 2022-10-25 Water supply pipe network abnormal data detection method Pending CN115628776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211312033.0A CN115628776A (en) 2022-10-25 2022-10-25 Water supply pipe network abnormal data detection method

Publications (1)

Publication Number Publication Date
CN115628776A true CN115628776A (en) 2023-01-20

Family

ID=84905851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211312033.0A Pending CN115628776A (en) 2022-10-25 2022-10-25 Water supply pipe network abnormal data detection method

Country Status (1)

Country Link
CN (1) CN115628776A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272216A (en) * 2023-11-22 2023-12-22 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117272216B (en) * 2023-11-22 2024-02-09 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination