Disclosure of Invention
The invention solves the technical problem of providing a method for processing converter transformer DGA online monitoring data by first removing outliers and then repairing them. In a first stage, a piecewise linearization algorithm, K-means clustering improved based on the maximum-minimum distance, and the Apriori algorithm are used to discover abnormal values in the data; in a second stage, the abnormal sampling points are repaired using a support vector regression algorithm optimized by an improved particle swarm algorithm, thereby realizing the processing of online DGA monitoring data of the power converter transformer.
The invention is realized by the following technical scheme. The method for processing converter transformer DGA online monitoring data by first removing outliers and then repairing them comprises the following steps:
s1, importing DGA online monitoring data, and setting the length and the sliding step length of a sliding window;
s2, piecewise linearization of sequence data: using a piecewise linearization algorithm of sequence data, a variable number of points in the online data are combined together according to the model to form multiple grouped data point sets; the criterion for grouping data points is that the error between the line segment fitted to all points in a group and the actual data points is less than a threshold value, and the fitted line segment is represented by its slope and its span;
s3, constructing a model for describing the similarity of different line segments: constructing a similarity model based on the slope and span of the line segments, classifying the line segments by using a K-means clustering algorithm improved based on the maximum and minimum distances, giving symbols to the line segments of the same class, and completing the symbolization of sequence data;
s4, mining the relevance among different sequences: setting a minimum confidence coefficient and a support degree based on an Apriori algorithm, mining a frequent item set existing among different sequences, and quantifying the relevance among the different sequences;
s5, extracting and screening abnormal values existing in the DGA online monitoring data: according to the strength of the correlation among the sequences, judging the types of abnormal values in the data and separating out data belonging to different abnormal modes;
s6, optimizing the key parameters of support vector regression by an improved particle swarm algorithm, and repairing the screened abnormal numerical points: defining the distance between particles in the solution set, calculating the density of different particles based on this distance, and introducing an improved fuzzy inference rule based on the density to define different particle update modes, so as to improve the diversity of solutions and the solving speed of the particle swarm algorithm; the key parameters of support vector regression are then optimized with the improved particle swarm algorithm to improve the data regression precision, the screened abnormal numerical points are repaired, and the processing of the DGA online monitoring data is completed.
Further preferably, in step S1, DGA online monitoring data is imported, the length of the sliding window is set to L, and the sliding step length is set to l; the online data set is traversed with the sliding window: the sliding window is dragged over the whole online monitoring data set with sliding step length l until all data are traversed. Let the length of the online monitoring data set be L1; after traversal, n = floor((L1 − L)/l) + 1 data windows are obtained, and the data in all windows are exported to form the data sets to be analyzed DSi, i = 1, …, n.
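A minimal Python sketch of the windowing in step S1 (the function name and the floor-division window count are illustrative assumptions, not the patent's notation):

```python
def sliding_windows(data, L, l):
    """Split a sequence into overlapping windows of length L with step l.

    Produces floor((len(data) - L) / l) + 1 windows; assumes len(data) >= L.
    """
    n = (len(data) - L) // l + 1
    # window i starts at offset i * l and covers L consecutive samples
    return [data[i * l : i * l + L] for i in range(n)]
```

Each returned window would then become one data set DSi to be analyzed by the later steps.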
Further preferably, the step S2 provides a piecewise linearization algorithm of the sequence data, which specifically comprises the following steps:
s2.1, for monitoring data XK = {x1, x2, …, xk}, data points are intercepted by a window with length L (L < k), and piecewise linear fitting is carried out on the data points contained in the intercepted window based on the idea of a sliding window;
s2.2, the first data point in the window is taken as the fitting starting point of the initial line segment and denoted xi; assuming that the fitting end point of the initial line segment is xi+m (m > 1), the m + 1 data points are fitted into a line segment;
the distance from the actual data points to the fitted line segment is used as the fitting error, which improves the fitting accuracy of the line segment to the actual numerical points; unlike conventional least-squares fitting, let dn be the linear distance from the actual data point xn to the fitted line segment; the linear distances from all actual data points within the span of the fitted line segment are calculated, and their sum is taken as the overall fitting error ER of the line segment: ER = di + di+1 + … + di+m;
where xi represents the sampled value at time i in the time series, m represents the number of numerical points contained in the fitted line segment, and tn represents the time step;
s2.3, setting the fitting error threshold to ERr. If ER < ERr, the line segment can still continue to add fitting points; let m = m + 1 and repeat the above steps. If ER = ERr, the current point is taken as the line segment fitting end point and a line segment is generated. If ER > ERr, the point cannot be fitted into the line segment; the fitting end point of the current line segment is stored as Xend = xi+m−1, the data sampling time is recorded, the procedure returns to step S2.2 with the parameter m reset, and the next part of the data is fitted with the current fitting end point as the fitting starting point of the next line segment, until all data points in the sequence are fitted.
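Steps S2.1-S2.3 can be sketched as a greedy segment-growing routine. The sketch below uses the vertical point-to-chord distance as a simplification of the patent's point-to-line distance, and all names are illustrative:

```python
def piecewise_linearize(x, er_max):
    """Greedy piecewise linear fit (a sketch of steps S2.2-S2.3).

    Each segment grows point by point until the summed vertical distance of
    the covered points to the fitted chord would exceed er_max; a new segment
    then starts at the current end point.  Returns (slope, span) pairs.
    """
    segments = []
    i = 0
    while i < len(x) - 1:
        m = 1
        while i + m + 1 < len(x):
            # trial extension of the segment end to x[i + m + 1]
            k = (x[i + m + 1] - x[i]) / (m + 1)
            er = sum(abs(x[i + j] - (x[i] + k * j)) for j in range(m + 2))
            if er > er_max:
                break          # extension rejected, keep current end point
            m += 1
        k = (x[i + m] - x[i]) / m
        segments.append((k, m))
        i += m                  # current end point starts the next segment
    return segments
```

A strictly linear run collapses to a single (slope, span) pair, while a trend reversal forces a new segment at the turning point.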
Preferably, in step S3, since there are certain order-of-magnitude differences between the different indicators in the DGA online monitoring data, all line segment triplets (ki, mi, ri) existing in the same sequence need to be standardized;
during cluster analysis, a standard for measuring line segment similarity is established; the similarity between line segments is described by the Euclidean distance, and the degree of consideration given to the different attributes of the line segments is expressed by weights; the established line segment similarity model is shown in the following formula:
dsij = sqrt(ωk(ki − kj)^2 + ωm(mi − mj)^2 + ωr(ri − rj)^2)
where dsij represents the line segment similarity, and ωk, ωm and ωr respectively represent the weights of the slope, the span and the growth rate in the line segment similarity model.
Further preferably, in step S3 of the present invention, the improved K-means algorithm based on the maximum and minimum distances includes the following main steps:
The maximum-minimum distance criterion is also based on the Euclidean distance; its difference from the K-means algorithm is that the object at maximum distance is taken as a clustering center. For a sample set of n samples, a proportionality coefficient θ (0 < θ < 1) is given, and an arbitrary sample is taken as the initial clustering center, denoted z1; from the remaining n − 1 samples, the sample farthest from z1 is taken as the second clustering center, denoted z2;
the distances of the remaining n − 2 samples to z1 and z2 are calculated and the minimum of the two is found, namely:
Dij=||xi-zj||,j=1,2 (6)
Di=min(Di1,Di2),i=1,2,…,n (7)
If
Di = max{D1, D2, …, Dn} > θ × ||z1 − z2|| (8)
then the corresponding sample si is selected as the third clustering center z3;
Assuming that K clustering centers have been obtained, the distances from the remaining n − K samples to the clustering centers are calculated, and if:
Dr = max{min(Di1, Di2, …, DiK)} > θ × ||z1 − z2|| (9)
then the corresponding sample xr is the (K + 1)-th clustering center, denoted zK+1; this process is repeated until no new clustering center appears;
when no new cluster center is present, the samples are assigned to each class according to the minimum distance principle. The improved K-means clustering algorithm based on the maximum and minimum distances has the advantages that the clustering centers are consistent during each clustering analysis, the randomness of selecting the clustering centers by the traditional K-means algorithm is eliminated, and the accuracy and the speed of the clustering analysis can be effectively improved.
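The center-selection loop above can be sketched as follows (a simplified illustration; the arbitrary initial center is taken as the first sample, and `theta` plays the role of the proportionality coefficient θ):

```python
import math

def max_min_centers(samples, theta):
    """Select cluster centers by the maximum-minimum distance rule.

    A new center is added while the largest "distance to the nearest
    existing center" exceeds theta * ||z1 - z2||; samples are feature
    vectors (lists of floats) and theta lies in (0, 1).
    """
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    z = [samples[0]]                                      # z1: arbitrary sample
    z.append(max(samples, key=lambda s: dist(s, z[0])))   # z2: farthest from z1
    base = dist(z[0], z[1])
    while True:
        # for every sample, the distance to its nearest existing center
        d = [min(dist(s, c) for c in z) for s in samples]
        i = max(range(len(samples)), key=lambda j: d[j])
        if d[i] > theta * base:
            z.append(samples[i])                          # new center found
        else:
            return z                                      # no new center appears
```

The returned centers would then seed the K-means pass, making each run of the cluster analysis start from the same centers.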
Further preferably, in step S4 of the present invention, the process of mining the association between different sequences is as follows:
s4.1, setting the minimum support and minimum confidence parameters; the confidence and support thresholds are the basis for judging sequence association and frequent item sets, and suitable threshold parameters help enhance the reliability of the association relations. The minimum support thresholds of the frequent-1 and frequent-2 item sets are denoted min sup1 and min sup2, and the minimum confidence threshold in the sequence association mining is min con;
s4.2, generating the frequent item sets; the two symbolized sequences after the merging operation are used as the transaction set, in which all symbol categories corresponding to the two sequences are {A1, A2, …, ACA} and {B1, B2, …, BCB}; based on the basic idea of the Apriori algorithm, the frequent item sets of the sequences are obtained by scanning the transaction set in two stages. The confidence for each symbol in the sequence is calculated according to equation (10):
where X and Y represent the two index objects for which association rules are to be mined, and Nt represents the number of transactions in the transaction set, i.e., the number of elements in the sequence. The support represents the proportion of an item in the transaction set; when exploring the frequent-1 item sets, items whose support is greater than min sup1 are placed in the frequent-1 item set;
the collections of frequent-1 item sets of the two sequences in the association mining are denoted PA and PB; the items in the sets are paired according to the index parameters to form 2-item sets of the form (PAi, PBi); the support of each item in the 2-item set is calculated, and items whose support is greater than min sup2 are placed in the frequent-2 item set, denoted {PA, PB}freq;
S4.3, mining of sequence relevance: all sequences are combined pairwise, and the support of the items in the frequent-2 item sets within the sequences and the confidence between the corresponding association mining sequences are counted respectively;
First, the supports of all frequent-2 item sets between two index parameters are accumulated according to equations (12) and (13), and the accumulated support is taken as the support of the two parameter sequences among all the multivariate sequences.
σ(XA)=sum(σ(PA)) (12)
σ(XB)=sum(σ(PB)) (13)
where m = CA + CB, CA and CB are the total numbers of line segment categories divided after cluster analysis of the two sequences, and m is the number of line segment categories after the two sequences are merged; meanwhile, the minimum support threshold of the index sequence layer is min sup3. If the support at the parameter index level is greater than the set threshold, the confidence con(XA → XB) of the symbol item set combination in the two sequences is calculated as shown in equation (14):
con(XA → XB) = σ(XA ∪ XB) / σ(XA) (14)
When the confidence is greater than the set minimum confidence threshold, the association rule XA → XB is retained; the confidence describes the strength of the association between the two indexes, and the two indexes are judged to be strongly associated.
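The support/confidence bookkeeping of steps S4.1-S4.3 can be illustrated on two symbolized sequences. This sketch only covers the pairwise 2-item-set case and uses the standard Apriori definitions of support and confidence; all names are assumptions:

```python
from collections import Counter

def mine_pairs(seq_a, seq_b, min_sup, min_con):
    """Frequent symbol pairs and rule confidences for two symbolized sequences.

    Each time index contributes one transaction (a_i, b_i); support is the
    fraction of transactions containing an item, and the confidence of the
    rule a -> b is sup(a, b) / sup(a), as in the Apriori algorithm.
    """
    n = len(seq_a)
    pair_count = Counter(zip(seq_a, seq_b))   # counts of 2-item sets
    a_count = Counter(seq_a)                  # counts of 1-item sets in A
    rules = {}
    for (a, b), c in pair_count.items():
        sup = c / n
        if sup >= min_sup:
            con = c / a_count[a]              # con(a -> b)
            if con >= min_con:
                rules[(a, b)] = (sup, con)
    return rules
```

Pairs below the support threshold (rare co-occurrences) are discarded before any confidence is evaluated, mirroring the two-stage scan of the transaction set.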
The improved particle swarm algorithm in the step S6 mainly comprises the following steps:
s6.1, the number of variables m is defined, and N m-dimensional particles are generated in the feasible solution space; St denotes the t-th generation of particles in the iteration.
S6.2, determining the inertial weight: the self-adaptive weight method can better find a balance point between the two, and the inertia weight is properly increased when the target values of all particles tend to be consistent; when the target values of the particles are relatively dispersed, the inertia weight value is properly reduced, and the specific expression is as follows:
where wa and wz represent the maximum and minimum values of the inertia weight, and f, fz and fpj respectively represent the fitness value of the particle, the minimum fitness value of all particles and the average fitness value of all particles;
s6.3, defining the fuzzy inference rule input variables: the density of the population in which a particle is located is expressed via the Euclidean distance between particles, from which the calculation formula of the particle density is obtained; here ni is the number of particles in the population of particle i, N is the number of particles in the generated solution set, and ci records the density of the particle. The density and the current iteration number k are normalized and taken as the two input variables of the fuzzy rule, and their membership degrees for the different states are calculated respectively;
s6.4, fuzzy inference rules: for the normalized input variables, the fuzzy sets low, medium and high density are defined; the membership function expressions are shown below, where c1-c3 are the interval thresholds of the membership functions, and x may be either of the two input variables, particle density or iteration number. The calculated membership degrees of the particle density and of the iteration number are cross-combined to form a 3 × 3 particle state fuzzy matrix K, which is multiplied by the vector cl formed by the particle density membership degrees to obtain the probability vector cb of the particle belonging to the different density intervals; the density interval with the maximum value for the current particle is taken, and different particle update modes are formulated accordingly.
S6.5, particle update rules: the two learning factors μ1 and μ2 of the algorithm are initialized.
When cbl is the maximum, the particle is solved only toward its own optimum, and the velocity update mode is as follows:
When cbm or cbh is the maximum, the algorithm is solved toward the global optimum and subgroup optimum directions, adopting the same update mode as the traditional particle swarm algorithm.
The technical effect of the invention is to provide a two-stage online DGA data processing method based on the idea of "first remove outliers, then repair", in which the online data is treated as a time series according to the characteristics of the returned data. The first stage introduces the idea of a sliding window algorithm, uses a piecewise linearization algorithm to divide the sequence data into a number of line segments characterized by slope and span, then symbolizes the online monitoring data with K-means clustering improved based on the maximum-minimum distance, and finally uses the Apriori algorithm to mine the relevance among the different indexes in the DGA and the abnormal values existing in it. The second stage, according to the screened abnormal sampling points, provides a support vector regression algorithm optimized by an improved particle swarm algorithm: the distance between particles in the solution set is defined, different types of particles are divided using fuzzy inference rules, and different update formulas are defined for the different particle types, which guarantees both the solving speed and the diversity of solutions of the algorithm; the key parameters of the support vector regression algorithm are optimized to repair the sampling points, realizing the processing of online DGA monitoring data of the power converter transformer.
Detailed Description
The present invention will be explained in further detail with reference to examples.
Referring to fig. 1, a method for processing converter transformer DGA online monitoring data by first removing outliers and then repairing them comprises the following steps:
s1, importing DGA online monitoring data, and setting the length and the sliding step length of a sliding window;
s2, piecewise linearization of sequence data: since online data are usually numerical variables, they are not directly suitable for relevance mining of sequence data; using a piecewise linearization algorithm of sequence data, a variable number of points in the online data are combined together according to the model to form multiple grouped data point sets; the criterion for grouping data points is that the error between the line segment fitted to all points in a group and the actual data points is less than a threshold value, and the fitted line segment is characterized by its slope and its span;
s3, constructing a model for describing the similarity of different line segments: constructing a similarity model based on the slope and span of the line segments, classifying the line segments by using a K-means clustering algorithm improved based on the maximum and minimum distances, giving symbols to the line segments of the same class, and completing the symbolization of sequence data;
s4, mining the relevance among different sequences: setting a minimum confidence coefficient and a support degree based on an Apriori algorithm, mining a frequent item set existing among different sequences, and quantifying the relevance among the different sequences;
s5, extracting and screening abnormal values existing in the DGA online monitoring data: according to the strength of the correlation among the sequences, judging the types of abnormal values in the data and separating out data belonging to different abnormal modes;
s6, optimizing the key parameters of support vector regression by an improved particle swarm algorithm, and repairing the screened abnormal numerical points: defining the distance between particles in the solution set, calculating the density of different particles based on this distance, and introducing an improved fuzzy inference rule based on the density to define different particle update modes, so as to improve the diversity of solutions and the solving speed of the particle swarm algorithm; the key parameters of support vector regression are then optimized with the improved particle swarm algorithm to improve the data regression precision, the screened abnormal numerical points are repaired, and the processing of the DGA online monitoring data is completed.
Specifically, in step S1, DGA online monitoring data is imported, the length of the sliding window is set to L, and the sliding step length is set to l; the online data set is traversed with the sliding window: the sliding window is dragged over the whole online monitoring data set with sliding step length l until all data are traversed. Let the length of the online monitoring data set be L1; after traversal, n = floor((L1 − L)/l) + 1 data windows are obtained, and the data in all windows are exported to form the data sets to be analyzed DSi, i = 1, …, n.
Specifically, the specific steps of the piecewise linearization algorithm of the sequence data set forth in step S2 are:
s2.1, for monitoring data XK = {x1, x2, …, xk}, data points are intercepted by a window with length L (L < k), and piecewise linear fitting is carried out on the data points contained in the intercepted window based on the idea of a sliding window.
S2.2, the first data point in the window is taken as the fitting starting point of the initial line segment and denoted xi; assuming that the fitting end point of the initial line segment is xi+m (m > 1), the m + 1 data points are fitted into a line segment.
The distance from the actual data points to the fitted line segment is used as the fitting error, which improves the fitting accuracy of the line segment to the actual numerical points; unlike conventional least-squares fitting, let dn be the linear distance from the actual data point xn to the fitted line segment; the linear distances from all actual data points within the span of the fitted line segment are calculated, and their sum is taken as the overall fitting error ER of the line segment: ER = di + di+1 + … + di+m;
where xi represents the sampled value at time i in the time series, m represents the number of numerical points contained in the fitted line segment, and tn represents the time step;
s2.3, setting the fitting error threshold to ERr. If ER < ERr, the line segment can still continue to add fitting points; let m = m + 1 and repeat the above steps. If ER = ERr, the current point is taken as the line segment fitting end point and a line segment is generated. If ER > ERr, the point cannot be fitted into the line segment; the fitting end point of the current line segment is stored as Xend = xi+m−1, the data sampling time is recorded, the procedure returns to step S2.2 with the parameter m reset, and the next part of the data is fitted with the current fitting end point as the fitting starting point of the next line segment, until all data points in the sequence are fitted.
Assume that the slope of the fitted line segment is ki and the number of fitted numerical points in the line segment is mi; the actual growth rate ri of the line segment fitting data can then be expressed accordingly. The three elements ki, mi and ri constitute the line segment triplet (ki, mi, ri), and this array represents one fitted line segment.
Since piecewise linearization is a data fitting process, the quality of the fitting effect is related to the error magnitude. Considering the characteristic attributes of a general line segment, the present invention uses the slope ki, the fitted span mi and the growth rate ri of a line segment, forming the array {ki, mi, ri} to represent each line segment.
In particular, in step S3, during cluster analysis, a standard for measuring line segment similarity needs to be established. The DGA online monitoring data reflect real-time indexes of the equipment, and the change trend and form of the parameters best reflect changes in the equipment operating state. The invention extracts two key parameters, the slope and the span of a line segment, describes the similarity between line segments by the Euclidean distance, and defines a line segment similarity model. Based on the similarity model, cluster analysis is performed on the line segment set using the K-means algorithm improved based on the maximum-minimum distance, and similar line segments are divided into the same category.
In particular, in step S3, since there are certain order-of-magnitude differences between the different indicators in the online DGA monitoring data, all line segment triplets (ki, mi, ri) existing in the same sequence need to be standardized;
specifically, in step S3, during cluster analysis, a criterion for measuring line segment similarity is established; the similarity between line segments is described by the Euclidean distance, and the degree of consideration given to the different attributes of the line segments is expressed by weights; the established line segment similarity model is shown in the following formula:
dsij = sqrt(ωk(ki − kj)^2 + ωm(mi − mj)^2 + ωr(ri − rj)^2)
where dsij represents the line segment similarity, and ωk, ωm and ωr respectively represent the weights of the slope, the span and the growth rate in the line segment similarity model.
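Assuming the similarity model is the weighted Euclidean distance over the (slope, span, growth-rate) triplets, a minimal sketch is:

```python
import math

def segment_similarity(s1, s2, wk, wm, wr):
    """Weighted Euclidean distance between two segment triplets (k, m, r).

    Smaller values mean more similar segments; wk, wm and wr weight the
    slope, span and growth-rate attributes (assumed to sum to 1 after the
    triplets have been standardized).
    """
    return math.sqrt(wk * (s1[0] - s2[0]) ** 2
                     + wm * (s1[1] - s2[1]) ** 2
                     + wr * (s1[2] - s2[2]) ** 2)
```

This distance is what the improved K-means pass would minimize when grouping line segments into symbol classes.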
In step S3 of the present invention, the K-means algorithm improved based on the maximum and minimum distances includes the following main steps:
The maximum-minimum distance criterion is also based on the Euclidean distance; its difference from the K-means algorithm is that the object at maximum distance is taken as a clustering center. For a sample set of n samples, a proportionality coefficient θ (0 < θ < 1) is given, and an arbitrary sample is taken as the initial clustering center, denoted z1; from the remaining n − 1 samples, the sample farthest from z1 is taken as the second clustering center, denoted z2;
the distances of the remaining n − 2 samples to z1 and z2 are calculated and the minimum of the two is found, namely:
Dij=||xi-zj||,j=1,2 (6)
Di=min(Di1,Di2),i=1,2,…,n (7)
If
Di = max{D1, D2, …, Dn} > θ × ||z1 − z2|| (8)
then the corresponding sample si is selected as the third clustering center z3;
Assuming that K clustering centers have been obtained, the distances from the remaining n − K samples to the clustering centers are calculated, and if:
Dr=max{min(Di1,Di2,…Dik)}>θ×||z1-z2|| (9)
then the corresponding sample xr is the (K + 1)-th clustering center, denoted zK+1; this process is repeated until no new clustering center appears;
when no new cluster center is present, the samples are assigned to each class according to the minimum distance principle. The improved K-means clustering algorithm based on the maximum and minimum distances has the advantages that the clustering centers are consistent during each clustering analysis, the randomness of selecting the clustering centers by the traditional K-means algorithm is eliminated, and the accuracy and the speed of the clustering analysis can be effectively improved.
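Once the centers are fixed, the minimum-distance assignment can be sketched as:

```python
import math

def assign_to_centers(samples, centers):
    """Assign each sample to its nearest cluster center.

    Implements the minimum distance principle applied once no new
    clustering center appears; returns the center index for each sample.
    """
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return [min(range(len(centers)), key=lambda j: dist(s, centers[j]))
            for s in samples]
```

Because the centers are selected deterministically by the maximum-minimum distance rule, this assignment is the same on every run of the cluster analysis.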
In step S4 of the present invention, the process of mining the association between different sequences is as follows:
s4.1, setting the minimum support and minimum confidence parameters; the confidence and support thresholds are the basis for judging sequence association and frequent item sets, and suitable threshold parameters help enhance the reliability of the association relations. The minimum support thresholds of the frequent-1 and frequent-2 item sets are denoted min sup1 and min sup2, and the minimum confidence threshold in the sequence association mining is min con.
S4.2, generating the frequent item sets; the two symbolized sequences after the merging operation are used as the transaction set, in which all symbol categories corresponding to the two sequences are {A1, A2, …, ACA} and {B1, B2, …, BCB}; based on the basic idea of the Apriori algorithm, the invention obtains the frequent item sets of the sequences by scanning the transaction set in two stages. The confidence for each symbol in the sequence is calculated according to equation (10):
where X and Y represent the two index objects for which association rules are to be mined, and Nt represents the number of transactions in the transaction set, i.e., the number of elements in the sequence. The support represents the proportion of an item in the transaction set; when exploring the frequent-1 item sets, items whose support is greater than min sup1 are placed in the frequent-1 item set.
The collections of frequent-1 item sets of the two sequences in the association mining are denoted PA and PB; the items in the sets are paired according to the index parameters to form 2-item sets of the form (PAi, PBi); the support of each item in the 2-item set is calculated, and items whose support is greater than min sup2 are placed in the frequent-2 item set, denoted {PA, PB}freq.
S4.3, mining of sequence relevance: all sequences are combined pairwise, and the support of the items in the frequent-2 item sets within the sequences and the confidence between the corresponding association mining sequences are counted respectively;
First, the supports of all frequent-2 item sets between two index parameters are accumulated according to equations (12) and (13), and the accumulated support is taken as the support of the two parameter sequences among all the multivariate sequences.
σ(XA)=sum(σ(PA)) (12)
σ(XB)=sum(σ(PB)) (13)
where m = CA + CB, CA and CB are the total numbers of line segment categories divided after cluster analysis of the two sequences, and m is the number of line segment categories after the two sequences are merged. Meanwhile, the minimum support threshold of the index sequence layer is min sup3. If the support at the parameter index level is greater than the set threshold, the confidence con(XA → XB) of the symbol item set combination in the two sequences is calculated as shown in equation (14):
con(XA → XB) = σ(XA ∪ XB) / σ(XA) (14)
When the confidence is greater than the set minimum confidence threshold, the association rule XA → XB is retained; the confidence describes the strength of the association between the two indexes, and the two indexes are judged to be strongly associated.
Based on the idea of the Apriori algorithm, the invention sets minimum support thresholds min supi at different levels for the two sequences after the merging operation, continuously mines the frequent item sets existing among the sequences, and finally judges the strength of the association relations among the indexes.
The main idea of the improved particle swarm optimization of support vector regression in step S6 is as follows: for the vacant numerical points caused by the deletion of abnormal values, a support vector regression algorithm optimized by an improved particle swarm algorithm is proposed for repair. In the classification and regression problems of support vector machines, a kernel function is introduced to convert the nonlinear problem in the input space into a linear problem in a high-dimensional space, which can effectively reduce the complexity of the algorithm. The present invention uses a radial basis function (RBF) kernel. To obtain the optimal parameters of the RBF function, the mean square error is used as the fitness function, and the parameters C and γ of the support vector machine are optimized using the improved particle swarm algorithm.
The particle swarm algorithm tends to converge prematurely and fall into local optima. In the particle iteration process, the density of different particles is defined through the Euclidean distance between particles, and the particles are updated with different update modes according to the density of the cluster to which they belong; this guarantees the convergence speed of the algorithm while preserving the diversity of the solution set and avoiding local optima. The main steps of the improved particle swarm optimization algorithm are as follows:
1) The number of variables m is defined, and N m-dimensional particles are generated in the space of feasible solutions; St denotes the t-th generation of particles in the iteration.
2) The inertia weight is determined; it represents how much of the velocity from the previous iteration the particle inherits. When its value is larger, the global optimization capability of the population is stronger and the local optimization capability is weaker; when its value is small, the learning ability of the particles is strong and they converge to a local optimum at a higher speed. The adaptive weight method can find a balance point between the two: when the target values of the particles tend to be consistent, the inertia weight is appropriately increased; when the target values of the particles are relatively dispersed, the inertia weight is appropriately decreased. The specific expression is as follows:
where wa and wz represent the maximum and minimum values of the inertia weight, and f, fz and fpj respectively represent the fitness value of the particle, the minimum fitness value of all particles and the average fitness value of all particles.
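The patent's exact weight expression is not reproduced above, so the sketch below uses a common adaptive-inertia-weight form (minimization convention; the default weight bounds are illustrative):

```python
def adaptive_weight(f, f_min, f_avg, w_max=0.9, w_min=0.4):
    """Adaptive inertia weight (a common form, not the patent's exact one).

    Particles better than the population average (smaller fitness, for
    minimization) get a smaller weight so they refine locally; worse
    particles keep the maximum weight so they continue exploring.
    """
    if f <= f_avg and f_avg != f_min:
        return w_min + (w_max - w_min) * (f - f_min) / (f_avg - f_min)
    return w_max
```

The weight thus shrinks toward `w_min` for the best particle of the swarm and stays at `w_max` for dispersed, poorly-performing particles.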
3) Defining fuzzy inference rule input variables, density of population where particles are located, and expressing the distance between each particle by Euclidean distance:
from which the calculation formula of the particle density c_i = n_i / N is obtained, where n_i is the number of particles in the subpopulation containing particle i and N is the total number of generated particles. The density c_i and the current iteration count k are normalized and used as the two input variables of the fuzzy rules, and their membership degrees with respect to the different states are calculated separately.
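The density computation of step 3) can be sketched as follows; the neighbourhood radius that delimits a subpopulation is an assumed parameter, not fixed by the patent.

```python
import math

def particle_density(swarm, i, radius):
    """Density c_i = n_i / N, where n_i is the number of particles lying
    within `radius` (Euclidean distance) of particle i and N is the total
    number of particles."""
    n_i = sum(1 for xj in swarm if math.dist(swarm[i], xj) <= radius)
    return n_i / len(swarm)

def normalize(value, v_min, v_max):
    """Scale a density or iteration count into [0, 1] for the fuzzy rules."""
    return (value - v_min) / (v_max - v_min) if v_max > v_min else 0.0
```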
4) Fuzzy inference rules. For the fuzzy sets of the normalized input variables, define low density (L), medium density (M), and high density (H); the membership function is expressed by the following formula, where c_1 to c_3 are the interval thresholds of the membership function.
Here x may be either of the two input variables, particle density and iteration count. The calculated membership degrees of particle density and iteration count are cross-combined to form a 3 x 3 fuzzy particle-state matrix K, which is multiplied by the vector c_l formed from the particle-density membership degrees to obtain the probability vector c_b of the particle belonging to the different density intervals.
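A sketch of the fuzzy machinery in step 4). The triangular membership shape and the way K is weighted by c_l follow one plausible reading of the description; the patent's exact piecewise definitions are not reproduced here.

```python
def membership(x, c1, c2, c3):
    """Membership degrees of a normalized input x in the Low / Medium / High
    fuzzy sets, with interval thresholds c1 < c2 < c3 (triangular shape
    assumed)."""
    low = max(0.0, min(1.0, (c2 - x) / (c2 - c1)))
    high = max(0.0, min(1.0, (x - c2) / (c3 - c2)))
    mid = max(0.0, 1.0 - low - high)
    return [low, mid, high]

def density_interval_probabilities(c_l, c_k):
    """Cross-combine the density memberships c_l and the iteration-count
    memberships c_k into the 3x3 state matrix K, then weight K by c_l to
    obtain the probability vector c_b = (c_bl, c_bm, c_bh)."""
    K = [[a * b for b in c_k] for a in c_l]
    return [sum(c_l[i] * K[i][j] for i in range(3)) for j in range(3)]
```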
The density interval with the maximum probability is taken as the current state of the particle, and a different particle update rule is formulated for each state.
5) Particle update rules. Initialize the two learning factors of the algorithm, μ_1 and μ_2.
When c_bl is the maximum, the particle moves only toward its own best solution, and its velocity is updated as follows:
When c_bm or c_bh is the maximum, the algorithm searches toward the global optimum and the subgroup optimum, adopting the same update scheme as the conventional particle swarm algorithm. Here, a particle's own best refers to the locally optimal solution within a low-density particle subpopulation. The flow of the APSO-SVR algorithm is shown in FIG. 2.
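The two update modes of step 5) can be sketched as follows; this is a minimal sketch in which the update formula is assumed to be the standard PSO form, with the density-dependent switch described above.

```python
import random

def update_velocity(v, x, p_best, g_best, w, mu1, mu2, mode, rng):
    """Density-dependent velocity update. In "self" mode (c_bl maximal) the
    particle moves only toward its own best; otherwise the standard PSO
    update toward both the personal best and the global/subgroup best is
    used."""
    r1, r2 = rng.random(), rng.random()
    new_v = []
    for d in range(len(v)):
        vd = w * v[d] + mu1 * r1 * (p_best[d] - x[d])
        if mode != "self":
            vd += mu2 * r2 * (g_best[d] - x[d])
        new_v.append(vd)
    return new_v
```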
Application case
1. The hydrogen and methane gas indices in the historical DGA online monitoring data of a certain main transformer are taken as the research objects. Since oil chromatogram online monitoring data are generally sampled once per day, the invention takes roughly two years of samples (720 points) as the data window length and drags the window across the entire historical data set with one quarter of samples (90 points) as the step size.
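The windowing described above can be sketched as:

```python
def sliding_windows(series, window=720, step=90):
    """Yield successive views of roughly two years of daily samples
    (720 points), advancing one quarter (90 points) at a time."""
    for start in range(0, len(series) - window + 1, step):
        yield series[start:start + window]

# e.g. a 1000-day history yields windows starting at days 0, 90, 180, 270
n_windows = sum(1 for _ in sliding_windows(list(range(1000))))
```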
2. Piecewise linearization of sequence data: the intercepted window sequence data are fitted piecewise-linearly with the method provided by the invention; the fitting results for each index are shown in figures 3 and 4. As can be seen from figures 3 and 4, the fitting of the DGA online monitoring data indices is successful, and the line segment connecting the two end points represents all data points within the segment span.
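The grouping criterion of step 2 (a segment grows while the fit error stays under a threshold, and each segment is represented by its slope and span) can be sketched as follows; the greedy growth strategy is an assumption, since the patent does not fix a particular segmentation scheme.

```python
def piecewise_linearize(data, max_err):
    """Greedy piecewise-linear fit: grow each segment until some point's
    vertical distance to the line through the segment endpoints exceeds
    max_err; report each segment as (slope, span)."""
    segments, start = [], 0
    while start < len(data) - 1:
        end = start + 1
        while end + 1 < len(data):
            cand = end + 1
            slope = (data[cand] - data[start]) / (cand - start)
            err = max(abs(data[start] + slope * (k - start) - data[k])
                      for k in range(start, cand + 1))
            if err > max_err:
                break
            end = cand
        slope = (data[end] - data[start]) / (end - start)
        segments.append((slope, end - start))
        start = end
    return segments
```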
3. Constructing a model describing the similarity of different line segments: a similarity model is constructed based on the slope and span of each line segment, the line segments are classified with the K-means clustering algorithm improved on the basis of the maximum-minimum distance, line segments of the same class are assigned a symbol, and the symbolization of the sequence data is completed.
4. Mining the relevance between different sequences: after the corresponding frequent item sets are obtained, the relevance between the two indices is analyzed with the method provided by the invention, the support degree characterizing the coverage of the association and the confidence its strength. The rule H2 → CH4 has a support of 0.5050 and a confidence of 0.6804, both greater than the set minimum thresholds, indicating that it is a strong association rule and that a strong association exists between the hydrogen and methane indices.
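The support and confidence figures above follow the standard Apriori definitions, which can be checked with a small sketch (the symbol windows below are illustrative, not the patent's data):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support = P(antecedent and consequent); confidence =
    P(consequent | antecedent), over a list of symbol sets."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / n, (both / ante if ante else 0.0)

# Illustrative windows of co-occurring trend symbols for H2 and CH4
windows = [{"H2_up", "CH4_up"}, {"H2_up"}, {"H2_up", "CH4_up"}, {"CH4_up"}]
support, confidence = rule_metrics(windows, {"H2_up"}, {"CH4_up"})
```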
5. Extracting and screening abnormal values in the DGA online monitoring data: given the known strong correlation between hydrogen and methane, outlier detection is performed on the two index sequences. Abnormal values of the hydrogen online data are found at sampling points 42 to 54, 85 to 91, and 201 to 206, while the methane online data show no abnormality in the sampling periods around these points; these abnormal sampling points are therefore judged to be caused by an abnormal operating state of the monitoring device, are removed from the cleaned data set, and serve as a basis for judging the operating state of the online monitoring device. At sampling points 466 to 471 the methane online monitoring data are abnormal, and at sampling points 466 to 473 the hydrogen online monitoring data are abnormal; since the abnormal periods of the two indices coincide, the index data in the nearby sampling period are retained and marked as abnormal points of the equipment operating state.
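The decision rule applied in step 5 — an outlier confirmed in only one gas points to the monitoring device, while coinciding outliers point to the equipment — can be sketched as follows; the ±5-sample tolerance is an assumption, not a value stated in the patent.

```python
def classify_anomalies(anom_a, anom_b, tol=5):
    """Label each anomalous sampling point of sequence A as an equipment-state
    anomaly if sequence B is also anomalous within `tol` samples, otherwise
    as a monitoring-device anomaly (to be cleaned from the data set)."""
    return {p: ("equipment" if any(abs(p - q) <= tol for q in anom_b)
                else "device")
            for p in anom_a}

labels = classify_anomalies({50, 470}, {468})
```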
6. The improved particle swarm optimization algorithm optimizes the key parameters of support vector regression, and the screened abnormal value points are repaired. To verify the effectiveness of the APSO-SVR algorithm, a section of converter transformer online data recorded under normal operation is intercepted and used to validate the data repair algorithm proposed herein. Regression analysis models are constructed with ordinary PSO and with the APSO proposed herein; the DGA online monitoring data serve as the verification object, with hydrogen as the test data set and the other four gases as the training data sets. The optimization processes and regression results of the different models are shown in FIG. 5.
As the comparison in FIG. 6 shows, the prediction results of the SVR model optimized by the APSO algorithm are closer to the actual values and have a smaller relative prediction error, demonstrating the effectiveness of the data repair strategy proposed herein. The results of repairing the DGA online monitoring data with the improved particle swarm optimization support vector regression algorithm are also shown in FIG. 6: after the screened data points are repaired with the method herein, relying on the other characteristic gases, all values return to normal levels and the online data are effectively cleaned.
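The relative prediction error used for the comparison above can be computed as follows (a minimal sketch with illustrative numbers, not the case-study measurements):

```python
def mean_relative_error(predicted, actual):
    """Mean of |prediction - actual| / |actual| over all repaired points."""
    return sum(abs(p - a) / abs(a)
               for p, a in zip(predicted, actual)) / len(actual)

err = mean_relative_error([10.5, 19.0], [10.0, 20.0])  # illustrative values
```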