CN115238234B - Abnormal data determining method, electronic equipment and storage medium - Google Patents
- Publication number
- CN115238234B (application CN202210840814.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- vector
- data vector
- size information
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Computational Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an abnormal data determining method, an electronic device and a storage medium. The method comprises: acquiring an original data vector set A according to a first time length; performing vector dimension filling on each original data vector in A to obtain a first data vector set B; traversing each first data vector in B according to a preset data threshold and counting each bi_j that is greater than or equal to the preset data threshold, to obtain a first number set S; performing first clustering processing on B according to S to obtain a first clustering result V; obtaining a mean vector set U according to V; performing second clustering processing on B according to U to obtain a second clustering result; determining, according to the second clustering result, whether an isolated data vector exists in B; and if so, determining an abnormal data vector from the original data vector set A according to the isolated data vector. The method can determine abnormal data using only the data uploaded by RTUs and sensors that use a non-standard protocol.
Description
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a method for determining abnormal data, an electronic device, and a storage medium.
Background
IEC104 is an international standard communication protocol widely applied in industries such as electric power and urban rail transit. It has the advantages of large communication data volume, convenient upgrading, good real-time performance and high reliability. A management system sends the monitoring data acquired by a remote terminal unit (RTU) to a dispatching center through the IEC104 protocol for use by control personnel.
However, as the demand for customization increases, many RTUs modify IEC104 and upload data/data packets using the modified, non-standard protocol. Because the data/data packets obtained by the dispatching center are uploaded with a non-standard protocol, abnormal data in them cannot be determined with the abnormal data determining methods designed for IEC104.
Disclosure of Invention
In view of the foregoing, the present application provides an abnormal data determining method, an electronic device, and a storage medium, which at least partially solve the problems existing in the prior art.
According to an aspect of the present invention, there is provided an abnormal data determination method including:
step S100, acquiring, according to a first time length L, an original data vector set A = {A1, A2, A3, ..., Am}, Ai = (ai_1, ai_2, ai_3, ..., ai_{n(i)}); where i = 1, 2, ..., m, Ai is the original data vector corresponding to the i-th RTU, m is the number of RTUs, ai_g is the g-th original data size information in the i-th original data vector, g = 1, 2, ..., n(i), and n(i) is the number of original data size information in the i-th original data vector; each RTU has a unique corresponding target sensor using a non-standard protocol;
step S200, performing vector dimension filling on each original data vector in the original data vector set A to obtain a first data vector set B = {B1, B2, B3, ..., Bm}, Bi = (bi_1, bi_2, bi_3, ..., bi_W), so that the number of dimensions of each first data vector is the same; where Bi is the first data vector obtained by vector dimension filling of Ai, bi_j is the j-th first data size information in the i-th first data vector, j = 1, 2, ..., W, W is the number of dimensions of each first data vector, W = max(n(1), n(2), n(3), ..., n(m)), and, when vector dimension filling is performed, the data size information of each filled dimension is 0;
step S300, traversing each first data vector in the first data vector set B according to a preset data threshold, to obtain a first number set S = {s1, s2, s3, ..., sm}; where si is the number of first data size information in Bi that is greater than or equal to the preset data threshold;
step S400, performing a first clustering process on the first data vectors in the first data vector set B according to the first number set S, to obtain a first clustering result V = {V1, V2, V3, ..., Vk}, VX = {VX_1, VX_2, VX_3, ..., VX_{c(X)}}; where X = 1, 2, ..., k, VX is the X-th second data vector set, k is the number of second data vector sets, k < m, VX_{c(X)} is the c(X)-th second data vector in the X-th second data vector set, and c(X) is the number of second data vectors in the X-th second data vector set;
step S500, obtaining a mean vector set U = {u1, u2, u3, ..., uk} according to each second data vector set, where uX is the mean vector corresponding to VX; uX = (uX_1, uX_2, uX_3, ..., uX_W), uX_j = (Σ_{e=1}^{c(X)} VX_{e,j}) / c(X); where j = 1, 2, ..., W, uX_j is the j-th mean data size information in uX, VX_{e,j} is the j-th second data size information of the e-th second data vector in VX, and e = 1, 2, ..., c(X);
step S600, performing a second clustering process on the first data vectors in the first data vector set B according to the mean vector set U to obtain a second clustering result; where the number of cluster categories of the second clustering process is k, uX is used as the cluster initial vector of the X-th cluster category, and the clustering condition is that the similarity F_Xt is less than the corresponding similarity threshold λ_X; F_Xt is the similarity of Bt to uX, Bt is the t-th first data vector in B, and t = 1, 2, ..., m;
step S700, determining whether an isolated data vector exists in the first data vector set B according to the second clustering result; if so, determining an abnormal data vector from the original data vector set A according to the isolated data vector.
where bt_r is the r-th first data size information in Bt, and uX_r is the r-th mean data size information in uX.
In an exemplary embodiment of the present application, λ_X satisfies the following condition:
where uY_r is the r-th mean data size information in uY, uY is the mean vector corresponding to VY, VY is the Y-th second data vector set in V, and Y = X+1; uZ_r is the r-th mean data size information in uZ, uZ is the mean vector corresponding to VZ, VZ is the Z-th second data vector set in V, and Z = X-1.
In an exemplary embodiment of the present application, before the step S100, the method further includes:
identifying the data message of each candidate sensor, so as to determine, among a plurality of candidate sensors, each candidate sensor that uses a non-standard protocol as a target sensor.
In an exemplary embodiment of the present application, before the step S100, the method further includes:
determining a data uploading period corresponding to each RTU to obtain a period set Q = {Q1, Q2, Q3, ..., Qm}, wherein Qi is the data uploading period corresponding to the i-th RTU;
obtaining a maximum period max (Q), wherein max () is a preset maximum value determining function;
determining a first time length L according to the maximum period max (Q); wherein L is equal to or greater than max (Q).
In an exemplary embodiment of the present application, L = H×max(Q), H being a positive integer greater than 1.
In one exemplary embodiment of the present application, H = 10.
In an exemplary embodiment of the present application, the preset data threshold is 0.8kb.
According to one aspect of the present invention, there is provided an electronic device including a processor and a memory;
the processor is configured to perform the steps of any of the methods described above by invoking a program or instruction stored in the memory.
According to one aspect of the present invention, there is provided a non-transitory computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of any of the methods described above.
According to the abnormal data determining method, vector dimension filling is first performed on each original data vector in A, so that every first data vector has the same number of dimensions, i.e. the same length. Next, the number of first data size information in each first data vector that is greater than or equal to the preset data threshold is determined, giving S. The first data vectors in the first data vector set are then clustered according to the first number set to obtain a plurality of second data vector sets, where the first numbers (which may be understood as the numbers of valid original data size information in the original data vectors) corresponding to the second data vectors in each second data vector set are similar (their difference is smaller than a threshold). The mean vector of each second data vector set is determined from the second data vectors in that set, which yields the number k of cluster categories used by the second clustering process and the cluster initial vector of each category, and the second clustering process is then performed. Any first data vector that cannot be clustered in the second clustering process is determined as an isolated data vector, and the abnormal data vector is finally determined from the original data vector set according to the correspondence between the original data vector set and the first data vector set. In this way, abnormal data is determined using only the data uploaded by the RTUs and sensors that use the non-standard protocol, without needing to know the content of that protocol.
Meanwhile, since the lengths of the first data vectors are the same, when the mean value vector is obtained, the mean value vector can be directly obtained for the first data vector corresponding to each second data vector set. The problem that the mean value vector cannot be obtained due to different numbers of original data vector dimensions (different lengths) is avoided.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic block diagram of a scenario to which the abnormal data determination method provided in this embodiment is applied.
Detailed Description
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be noted that, without conflict, the following embodiments and features in the embodiments may be combined with each other; and, based on the embodiments in this disclosure, all other embodiments that may be made by one of ordinary skill in the art without inventive effort are within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
Referring to Fig. 1, according to an aspect of the present invention, an abnormal data determining method is provided. The method is applied to a host computer, which may be a server, a PC, or another electronic device capable of receiving data and having a certain processing capability. The host computer is communicatively connected to a plurality of RTUs and can receive the sampled data they upload; the sampled data may include the upload time, the original sampled data (data acquired from the corresponding sensor), the size information of the original sampled data, and the like. Each RTU is configured to upload sampled data once at the end of each data upload period (each RTU has its own data upload period). In practice, an RTU affected by network fluctuation may also upload sampled data to the host computer before the end of a data upload period is reached. In this embodiment, the sampled data may be a traffic data packet.
The method specifically comprises the following steps:
step S000, the data message of each candidate sensor is identified, so that the candidate sensor using a nonstandard protocol in a plurality of candidate sensors is determined as a target sensor; each target sensor is provided with a unique corresponding RTU, and the RTU is used for uploading sampling data of the corresponding target sensor. The nonstandard protocol refers to a customized IEC104 protocol, namely a modified IEC104 protocol. The target sensor may be a temperature sensor, a humidity sensor, a pressure sensor, or the like.
Step S100, acquiring, according to a first time length L, an original data vector set A = {A1, A2, A3, ..., Am}, Ai = (ai_1, ai_2, ai_3, ..., ai_{n(i)}); where i = 1, 2, ..., m, Ai is the original data vector corresponding to the i-th RTU, m is the number of RTUs, ai_g is the g-th original data size information in the i-th original data vector, g = 1, 2, ..., n(i), and n(i) is the number of original data size information in the i-th original data vector; each RTU has a unique corresponding target sensor using a non-standard protocol. Each original data vector can be obtained from the sampled data uploaded by the corresponding RTU within the first time length. Because the data upload period and the start working time differ from RTU to RTU, and the number of spurious uploads caused by network fluctuation also differs, the number of original data size information in each original data vector differs as well. Therefore, in this embodiment, n() is not a preset processing function; n(i) is a value uniquely determined by i, and different values of i may give different values of n(i).
Step S200, performing vector dimension filling on each original data vector in the original data vector set A to obtain a first data vector set B = {B1, B2, B3, ..., Bm}, Bi = (bi_1, bi_2, bi_3, ..., bi_W), so that the number of dimensions of each first data vector is the same; where Bi is the first data vector obtained by vector dimension filling of Ai, bi_j is the j-th first data size information in the i-th first data vector, j = 1, 2, ..., W, W is the number of dimensions of each first data vector, W = max(n(1), n(2), n(3), ..., n(m)), and, when vector dimension filling is performed, the data size information of each filled dimension is 0. Specifically, in this embodiment, vector dimension filling does not simply append a varying number of consecutive 0s to the head or tail of each original data vector. Instead, a certain number of 0s is inserted between two adjacent pieces of original data size information according to the actual time of each piece of original data size information in each original data vector. This number is determined by the time interval between the two adjacent pieces of data size information: the longer the interval, the more 0s are inserted. As a result, data in the same dimension of different first data vectors correspond to the same or similar times, where similar means a time difference below a limit of 0.01 seconds to 0.1 seconds. Because the filled dimensions hold 0, they do not affect the actual values of the corresponding first data vector in subsequent processing; they only align the first data size information of different first data vectors in time and position.
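As a non-limiting illustration, the time-aligned filling described for step S200 could be sketched as follows. The sketch assumes a shared time grid: the names `fill_by_time`, `start`, `slot` and `width` are hypothetical and do not appear in the embodiment, and using a fixed grid of `width` slots is only one possible way of making the same dimension correspond to the same or a similar time in every vector.

```python
from typing import List, Tuple


def fill_by_time(samples: List[Tuple[float, float]], start: float,
                 slot: float, width: int) -> List[float]:
    """Place (upload_time, size) samples on a shared time grid.

    Assumption: the grid (start, slot, width) is a hypothetical construct;
    every slot that receives no sample keeps the fill value 0.
    """
    vec = [0.0] * width                   # filled dimensions hold 0
    for t, size in samples:
        j = int((t - start) // slot)      # index of the slot containing time t
        if 0 <= j < width:
            vec[j] = size                 # a later sample in the same slot overwrites
    return vec


# One RTU that uploaded at t = 0 s, 1 s and 3 s (nothing at 2 s and 4 s).
b1 = fill_by_time([(0.0, 2.0), (1.0, 2.5), (3.0, 2.2)],
                  start=0.0, slot=1.0, width=5)
```

Here the longer gap between the uploads at 1 s and 3 s produces a 0 in between, so position j of every vector built on the same grid refers to the same time slot.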
Step S300, traversing each first data vector in the first data vector set B according to a preset data threshold, and counting each bi_j that is greater than or equal to the preset data threshold, so as to obtain a first number set S = {s1, s2, s3, ..., sm}; where si is the number of first data size information in Bi that is greater than or equal to the preset data threshold, i.e. si is the first number corresponding to Bi. The first number can be understood as the amount of valid data in the original data vector, i.e. the amount of original data size information that was not generated by network fluctuation.
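As a non-limiting illustration, the counting in step S300 can be sketched in a few lines. The function name `first_numbers` and the sample values are hypothetical; the threshold 0.8 mirrors the preset data threshold mentioned in this embodiment, with the unit left aside.

```python
from typing import List


def first_numbers(b: List[List[float]], threshold: float) -> List[int]:
    """Return S: for each first data vector, count entries >= threshold."""
    return [sum(1 for x in vec if x >= threshold) for vec in b]


# Two zero-filled vectors; zeros and small spurious values are not counted.
B = [[2.0, 0.0, 0.5, 1.0],
     [0.8, 0.9, 0.0, 0.1]]
S = first_numbers(B, threshold=0.8)
```

Because the filled dimensions hold 0, they never reach the threshold and therefore do not contribute to the count.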
Step S400, performing a first clustering process on the first data vectors in the first data vector set B according to the first number set S, to obtain a first clustering result V = {V1, V2, V3, ..., Vk}, VX = {VX_1, VX_2, VX_3, ..., VX_{c(X)}}; where X = 1, 2, ..., k, VX is the X-th second data vector set, k is the number of second data vector sets, VX_{c(X)} is the c(X)-th second data vector in the X-th second data vector set, c(X) is the number of second data vectors in the X-th second data vector set, and k < m.
That is, B is clustered according to the number of first data size information in each first data vector that is greater than or equal to the preset data threshold, so that first data vectors with similar acquisition periods, similar start and end times, and similar actual sampling durations are clustered into one second data vector set. In other words, for first data vectors with similar first numbers, the corresponding RTUs may use the same or similar data upload periods, or have the same or similar start and end times, or have the same or similar actual sampling durations. Determining the first number with the preset data threshold avoids the influence of differences in original data vector length caused by network fluctuation and of differences in the number of filled dimensions caused by vector dimension filling. Specifically, the clustering condition may be that first data vectors whose first numbers differ from one another by less than a set number difference are clustered into one category; the set number difference takes a value from 1 to 5, and is specifically 2. The clustering method may be an existing clustering method. It will be appreciated that a second data vector in a second data vector set is actually a first data vector in the first data vector set: neither the vector nor the first data size information in it is modified; the vectors are merely regrouped. Different names are used in this embodiment only for convenience of distinction. Specifically, the preset data threshold may be the mean or the maximum of the data size information marked in historical data as generated by network fluctuation; the maximum is used in this embodiment. In this embodiment, the preset data threshold is 0.8kb.
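As a non-limiting illustration, one simple way of satisfying the clustering condition of step S400 is sketched below. The embodiment only states that an existing clustering method may be used, so the greedy grouping over sorted first numbers, the function name `cluster_by_first_number` and the parameter `max_diff` are assumptions made for this sketch.

```python
from typing import List


def cluster_by_first_number(s: List[int], max_diff: int = 2) -> List[List[int]]:
    """Group vector indices so that any two first numbers in a group
    differ by less than max_diff (the set number difference).

    Assumption: greedy grouping over the sorted first numbers; this is
    one possible method, not necessarily the one used in practice.
    """
    order = sorted(range(len(s)), key=lambda i: s[i])
    groups: List[List[int]] = []
    for i in order:
        # groups[-1][0] holds the smallest first number of the current group,
        # so comparing against it bounds every pairwise difference.
        if groups and s[i] - s[groups[-1][0]] < max_diff:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


S = [10, 11, 20, 21, 10]
V_idx = cluster_by_first_number(S, max_diff=2)
```

Each inner list holds the 0-based indices of the first data vectors that form one second data vector set; the number of inner lists plays the role of k.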
Step S500, obtaining a mean vector set U = {u1, u2, u3, ..., uk} according to each second data vector set, where uX is the mean vector corresponding to VX; uX = (uX_1, uX_2, uX_3, ..., uX_W), uX_j = (Σ_{e=1}^{c(X)} VX_{e,j}) / c(X); where j = 1, 2, ..., W, uX_j is the j-th mean data size information in uX, VX_{e,j} is the j-th second data size information of the e-th second data vector in VX, and e = 1, 2, ..., c(X).
In this embodiment, because vector dimension filling is performed on each original data vector, all first data vectors, and therefore all second data vectors, have the same length. Thus, when U is derived from V, the formula uX_j = (Σ_{e=1}^{c(X)} VX_{e,j}) / c(X) can be applied directly to average all second data vectors in VX position by position (dimension by dimension), giving the mean vector corresponding to each second data vector set. This avoids the problem that a mean vector cannot be obtained when the original data vectors have different numbers of dimensions. In addition, all original data size information in the original data vectors (including that generated by network fluctuation) is used when obtaining the mean vector, so the mean vector is more accurate.
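As a non-limiting illustration, the dimension-by-dimension averaging of step S500 can be sketched as follows. The function name `mean_vectors` and the representation of each second data vector set as a list of 0-based indices into B are assumptions made for this sketch.

```python
from typing import List


def mean_vectors(b: List[List[float]], groups: List[List[int]]) -> List[List[float]]:
    """For each group of indices, average the equal-length vectors of b
    dimension by dimension: uX_j = (sum over e of VX_{e,j}) / c(X)."""
    means = []
    for members in groups:
        width = len(b[members[0]])        # all vectors share the same length W
        means.append([sum(b[i][j] for i in members) / len(members)
                      for j in range(width)])
    return means


B = [[2.0, 0.0, 4.0],
     [4.0, 0.0, 2.0],
     [10.0, 10.0, 0.0]]
U = mean_vectors(B, groups=[[0, 1], [2]])
```

The averaging is only well defined because every vector has the same length after filling, which is the point made in the paragraph above.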
Step S600, performing a second clustering process on the first data vectors in the first data vector set B according to the mean vector set U to obtain a second clustering result; where the number of cluster categories of the second clustering process is k, uX is used as the cluster initial vector of the X-th cluster category, and the clustering condition is that the similarity F_Xt is less than the corresponding similarity threshold λ_X; F_Xt is the similarity of Bt to uX, Bt is the t-th first data vector in B, and t = 1, 2, ..., m. Specifically, the second clustering process may be a K-means clustering process. The number of cluster categories is the K value used in K-means clustering, and u1, u2, u3, ..., uk are the cluster initial vectors. The cluster initial vectors are in fact derived from the second data vectors in B, and the vector set that K-means clusters is also B, so the second data vectors in B can be clustered more accurately.
where bt_r is the r-th first data size information in Bt, and uX_r is the r-th mean data size information in uX. The vector distance (i.e. the similarity) between each second data vector and each cluster initial vector can be obtained through this formula; specifically, the smaller F_Xt is, the more similar the two vectors are.
λ_X satisfies the following condition:
where uY_r is the r-th mean data size information in uY, uY is the mean vector corresponding to VY, VY is the Y-th second data vector set in V, and Y = X+1; uZ_r is the r-th mean data size information in uZ, uZ is the mean vector corresponding to VZ, VZ is the Z-th second data vector set in V, and Z = X-1.
In the clustering condition, the similarity threshold of each cluster category is not a fixed value; it is determined from the vector distance between the current cluster initial vector and its one or two adjacent cluster initial vectors, which makes the final clustering result more accurate. Clustering of the second data vectors in B is thereby achieved. In this embodiment, the isolated data vector is not determined from the first clustering process because the first clustering process clusters according to S, and each first number in S is a positive integer; clustering by the first number alone serves to group first data vectors with similar acquisition periods, similar start and end times, and similar actual sampling durations. Therefore, to determine abnormal data more accurately, this embodiment uses two clustering processes. The first clustering process yields the number of cluster categories used by the second clustering process and the cluster initial vector of each cluster category, so the first clustering process improves the clustering accuracy of the second clustering process.
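As a non-limiting illustration, the clustering condition of step S600 (assign Bt to a category only if F_Xt is less than λ_X) could be sketched as below. Several points are assumptions of this sketch and are not taken from the text: the Euclidean distance stands in for the similarity formula, which is not reproduced here; the thresholds `lam` are passed in as given values because the condition defining λ_X is likewise not reproduced; and a single assignment pass is shown instead of a full iterative K-means procedure.

```python
import math
from typing import List, Tuple


def euclidean(p: List[float], q: List[float]) -> float:
    """Assumed distance: the smaller the value, the more similar the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def second_clustering(b: List[List[float]], u: List[List[float]],
                      lam: List[float]) -> Tuple[List[List[int]], List[int]]:
    """One assignment pass: put Bt in the closest category X whose condition
    F_Xt < lam[X] holds; vectors matching no category are isolated."""
    clusters: List[List[int]] = [[] for _ in u]
    isolated: List[int] = []
    for t, bt in enumerate(b):
        dists = [euclidean(bt, ux) for ux in u]
        candidates = [x for x in range(len(u)) if dists[x] < lam[x]]
        if candidates:
            clusters[min(candidates, key=lambda x: dists[x])].append(t)
        else:
            isolated.append(t)
    return clusters, isolated


B = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [20.0, 0.0]]
U = [[1.0, 1.0], [5.0, 5.0]]            # initial vectors from the first clustering
clusters, isolated = second_clustering(B, U, lam=[1.0, 1.0])
```

In this sketch the last vector is far from both initial vectors, so it satisfies no clustering condition and ends up isolated.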
Step S700, determining whether an isolated data vector exists in the first data vector set B according to the second clustering result; if so, determining an abnormal data vector from the original data vector set A according to the isolated data vector.
The second clustering result may exist in the form of a cluster map or cluster set, with isolated data vectors being the first data vectors that are not clustered into any cluster type. That is, the isolated data vector has a large difference from each first data vector, which can indicate that there is abnormal first data size information in the isolated data vector. And finally, determining the abnormal data vector in the A according to the corresponding relation between the original data vector set and the first data vector set, and marking correspondingly. Wherein, the corresponding relationship is that A1 corresponds to B1, A2 corresponds to B2, and the like, namely Ai corresponds to Bi.
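As a non-limiting illustration, the mapping from isolated data vectors back to A in step S700 relies only on the correspondence Ai to Bi. The function name `abnormal_vectors` and the use of 0-based list positions for the isolated vectors are assumptions made for this sketch.

```python
from typing import Dict, List


def abnormal_vectors(a: List[List[float]], isolated: List[int]) -> Dict[int, List[float]]:
    """Map each isolated position in B to the original vector at the same
    position in A, keyed by the 1-based RTU index i used in the text."""
    return {t + 1: a[t] for t in isolated}


A = [[1.0], [2.0, 3.0], [4.0]]           # original vectors of differing length
marked = abnormal_vectors(A, isolated=[1])
```

The original vectors are returned unfilled, since the zeros added in step S200 exist only in B.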
According to the abnormal data determining method provided by this embodiment, vector dimension filling is first performed on each original data vector in A, so that every first data vector has the same number of dimensions, i.e. the same length. Next, the number of first data size information in each first data vector that is greater than or equal to the preset data threshold is determined, giving S. The first data vectors in the first data vector set are then clustered according to the first number set to obtain a plurality of second data vector sets, where the first numbers (which may be understood as the numbers of valid original data size information in the original data vectors) corresponding to the second data vectors in each second data vector set are similar (their difference is smaller than a threshold). The mean vector of each second data vector set is determined from the second data vectors in that set, which yields the number k of cluster categories used by the second clustering process and the cluster initial vector of each category, and the second clustering process is then performed. Any first data vector that cannot be clustered in the second clustering process is determined as an isolated data vector, and the abnormal data vector is finally determined from the original data vector set according to the correspondence between the original data vector set and the first data vector set. In this way, abnormal data is determined using only the data uploaded by the RTUs and sensors that use the non-standard protocol, without needing to know the content of that protocol.
Meanwhile, since the lengths of the first data vectors are the same, when the mean value vector is obtained, the mean value vector can be directly obtained for the first data vector corresponding to each second data vector set. The problem that the mean value vector cannot be obtained due to different numbers of original data vector dimensions (different lengths) is avoided.
In an exemplary embodiment of the present application, before the step S100, the method further includes:
determining a data uploading period corresponding to each RTU to obtain a period set Q = {Q1, Q2, Q3, ..., Qm}, wherein Qi is the data uploading period corresponding to the i-th RTU;
obtaining a maximum period max (Q), wherein max () is a preset maximum value determining function;
determining a first time length L according to the maximum period max(Q); wherein L is equal to or greater than max(Q). Specifically, L is greater than or equal to H×max(Q), and H is a positive integer greater than 1. Preferably, H = 10. L has a determined start time L_start and a determined end time, from which the original data vector corresponding to each RTU is obtained.
To ensure that the amount of valid data in A can support the subsequent abnormal data determination, in this embodiment L must be greater than max(Q) when L is determined, i.e. each original data vector contains at least one piece of valid data. Moreover, in the subsequent processing, the clustering condition of the first clustering process depends on the first number, and the clustering condition of the second clustering process depends on both the first number and the actual values of the first data size information in each first data vector. If an original data vector contained only one piece of valid data, the final clustering result could be affected. Therefore, in this embodiment L is greater than or equal to 10×max(Q), which ensures that each original data vector contains at least 10 pieces of valid data.
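As a non-limiting illustration, the choice of the first time length can be sketched as follows. The function name `first_time_length` is hypothetical, and the sketch returns the smallest L that satisfies L ≥ H×max(Q); any larger value would also satisfy the condition.

```python
from typing import List


def first_time_length(periods: List[float], h: int = 10) -> float:
    """Return the smallest L with L >= H * max(Q), for an integer H > 1."""
    if h <= 1:
        raise ValueError("H must be a positive integer greater than 1")
    return h * max(periods)


# Hypothetical upload periods, in seconds, of three RTUs.
Q = [5.0, 30.0, 60.0]
L = first_time_length(Q)                  # 10 * 60.0
```

With H = 10, the RTU with the longest period completes 10 upload periods within L, and every RTU with a shorter period completes more.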
Further, in an exemplary embodiment of the present application, after step S300, the method may further include:
determining the acquisition time of the first original data size information in the original data vector corresponding to each RTU, and determining the maximum acquisition time $T_{start}^{max}$ among these acquisition times;
obtaining $H_\Delta = (L_{start} - T_{start}^{max}) / \max(Q)$, wherein $H_\Delta$ is rounded up.
Traversing S: if sα is smaller than H-HΔ, the first data vector corresponding to sα (namely Bα) is deleted from B, where α = 1, 2, ..., m.
And determining the original data vector corresponding to the Bα in the A as an abnormal data vector. Since L is greater than or equal to h×max (Q), it indicates that, if the RTU is normal, the number of first data size information in the corresponding first data vector, which is greater than or equal to the preset data threshold, is at least H-hΔ. Therefore, if sα is smaller than H-hΔ, it can be explained that the corresponding RTU has a problem of missing data, and the original data vector corresponding to the RTU can be directly determined as the abnormal data vector.
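The missing-data check described above can be sketched as follows. The function name and arguments are hypothetical, and H and HΔ are taken as already computed inputs rather than derived from timestamps.

```python
from typing import List, Sequence


def flag_missing_data(counts: Sequence[int], h: int, h_delta: int) -> List[int]:
    """Return 0-based indices of RTUs whose valid-value count is below H - HΔ.

    `counts` is the first number set S; each flagged index corresponds to a
    first data vector that would be removed from B and whose original data
    vector would be treated as abnormal.
    """
    min_expected = h - h_delta
    return [i for i, s in enumerate(counts) if s < min_expected]


# Hypothetical counts for four RTUs with H = 10 and HΔ = 1.
flagged = flag_missing_data([10, 9, 4, 8], h=10, h_delta=1)
```

Running this filter before the clustering steps removes vectors that are already known to be incomplete, so they cannot distort the mean vectors computed later.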
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the present application may be implemented as a system, method, or program product. Accordingly, aspects of the present application may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein collectively as a "circuit," "module," or "system."
An electronic device according to this embodiment of the present application is described below. The electronic device is only one example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
The electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: at least one processor, at least one memory, and a bus connecting the various system components (including the memory and the processor).
Wherein the memory stores program code that is executable by the processor to cause the processor to perform steps according to various exemplary embodiments of the present application described in the above section of the "exemplary method" of the present specification.
The storage may include readable media in the form of volatile storage, such as Random Access Memory (RAM) and/or cache memory, and may further include Read Only Memory (ROM).
The storage may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus may be one or more of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter. The network adapter communicates with other modules of the electronic device via a bus. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible implementations, the various aspects of the present application may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the present application as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described figures are only illustrative of the processes involved in the method according to exemplary embodiments of the present application, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. An abnormal data determination method, comprising:
step S100, acquiring, according to a first time length L, an original data vector set $A = \{A_1, A_2, A_3, \ldots, A_m\}$, $A_i = (a_{i,1}, a_{i,2}, a_{i,3}, \ldots, a_{i,n(i)})$; where $i = 1, 2, \ldots, m$; $A_i$ is the original data vector corresponding to the i-th RTU; m is the number of RTUs; $a_{i,g}$ is the g-th original data size information in the i-th original data vector, $g = 1, 2, \ldots, n(i)$; n(i) is the number of pieces of original data size information in the i-th original data vector; each RTU has a unique corresponding target sensor using a non-standard protocol;
step S200, performing vector dimension filling on each original data vector in the original data vector set A to obtain a first data vector set $B = \{B_1, B_2, B_3, \ldots, B_m\}$, $B_i = (b_{i,1}, b_{i,2}, b_{i,3}, \ldots, b_{i,W})$, so that the number of dimensions of each first data vector is the same; wherein $B_i$ is the first data vector obtained by vector dimension filling of $A_i$; $b_{i,j}$ is the j-th first data size information in the i-th first data vector, $j = 1, 2, \ldots, W$; W is the number of dimensions in each first data vector, $W = \max(n(1), n(2), n(3), \ldots, n(m))$; and when vector dimension filling is performed, the data size information of the filled dimensions is 0;
step S300, traversing each first data vector in the first data vector set B according to a preset data threshold to obtain a first number set $S = \{s_1, s_2, s_3, \ldots, s_m\}$; wherein $s_i$ is the number of pieces of first data size information in $B_i$ that are greater than or equal to the preset data threshold;
step S400, performing a first clustering process on the first data vectors in the first data vector set B according to the first number set S to obtain a first clustering result $V = \{V_1, V_2, V_3, \ldots, V_k\}$, $V_X = \{V_X^{1}, V_X^{2}, V_X^{3}, \ldots, V_X^{c(X)}\}$; wherein $X = 1, 2, \ldots, k$; $V_X$ is the X-th second data vector set; k is the number of second data vector sets, $k < m$; $V_X^{c(X)}$ is the c(X)-th second data vector in the X-th second data vector set; c(X) is the number of second data vectors in the X-th second data vector set;
step S500, obtaining a mean vector set $U = \{u_1, u_2, u_3, \ldots, u_k\}$ according to each second data vector set, where $u_X$ is the mean vector corresponding to $V_X$; $u_X = (u_{X,1}, u_{X,2}, u_{X,3}, \ldots, u_{X,W})$, $u_{X,j} = \frac{1}{c(X)} \sum_{e=1}^{c(X)} V_{X,j}^{e}$; where $j = 1, 2, \ldots, W$; $u_{X,j}$ is the j-th mean data size information in $u_X$; $V_{X,j}^{e}$ is the j-th second data size information of the e-th second data vector in $V_X$, $e = 1, 2, \ldots, c(X)$;
step S600, performing a second clustering process on the first data vectors in the first data vector set B according to the mean vector set U to obtain a second clustering result; wherein the number of cluster categories of the second clustering process is k, $u_X$ is used as the clustering initial vector of the X-th cluster category, and the clustering condition is that the similarity $F_{Xt}$ is less than the corresponding similarity threshold $\lambda_X$; $F_{Xt}$ is the similarity between $B_t$ and $u_X$; $B_t$ is the t-th first data vector in B, $t = 1, 2, \ldots, m$;
step S700, determining whether an isolated data vector exists in the first data vector set B according to the second clustering result; if so, determining an abnormal data vector from the original data vector set A according to the isolated data vector.
3. The abnormal data determination method according to claim 2, wherein $\lambda_X$ satisfies the following condition:
wherein $u_{Y,r}$ is the r-th mean data size information in $u_Y$, $u_Y$ is the mean vector corresponding to $V_Y$, $V_Y$ is the Y-th second data vector set in V, and $Y = X + 1$; $u_{Z,r}$ is the r-th mean data size information in $u_Z$, $u_Z$ is the mean vector corresponding to $V_Z$, $V_Z$ is the Z-th second data vector set in V, and $Z = X - 1$.
4. The abnormal data determination method according to claim 1, characterized in that, before said step S100, said method further comprises:
identifying the data message of each candidate sensor, so as to determine, among the plurality of candidate sensors, each candidate sensor that uses a non-standard protocol as a target sensor.
5. The abnormal data determination method according to claim 1, further comprising, prior to the step S100:
determining a data uploading period corresponding to each RTU to obtain a period set Q = {Q1, Q2, Q3, ..., Qm}, wherein Qi is the data uploading period corresponding to the i-th RTU;
obtaining a maximum period max (Q), wherein max () is a preset maximum value determining function;
determining a first time length L according to the maximum period max (Q); wherein L is equal to or greater than max (Q).
6. The abnormal data determination method according to claim 5, wherein L = H × max (Q), H being a positive integer greater than 1.
7. The abnormal data determination method according to claim 6, wherein h=10.
8. The abnormal data determination method according to claim 1, wherein the preset data threshold is 0.8kb.
9. An electronic device comprising a processor and a memory;
the processor is adapted to perform the steps of the method according to any of claims 1 to 7 by invoking a program or instruction stored in the memory.
10. A non-transitory computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210840814.0A CN115238234B (en) | 2022-07-18 | 2022-07-18 | Abnormal data determining method, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115238234A CN115238234A (en) | 2022-10-25 |
CN115238234B (en) | 2023-04-28 |
Family
ID=83673920
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210840814.0A Active CN115238234B (en) | 2022-07-18 | 2022-07-18 | Abnormal data determining method, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115238234B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN115238234A (en) | 2022-10-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A method for determining abnormal data, electronic devices, and storage media; Granted publication date: 20230428; Pledgee: Rizhao Bank Co.,Ltd. Jinan Branch; Pledgor: Shandong Yuntian Safety Technology Co.,Ltd.; Registration number: Y2024980008627 |