CN115238234B

CN115238234B - Abnormal data determining method, electronic equipment and storage medium

Info

Publication number: CN115238234B
Application number: CN202210840814.0A
Authority: CN
Inventors: 李峰; 孙瑞勇; 靳海燕
Original assignee: Shandong Yuntian Safety Technology Co ltd
Current assignee: Shandong Yuntian Safety Technology Co ltd
Priority date: 2022-07-18
Filing date: 2022-07-18
Publication date: 2023-04-28
Anticipated expiration: 2042-07-18
Also published as: CN115238234A

Abstract

The application discloses an abnormal data determination method, an electronic device and a storage medium. The method comprises: acquiring an original data vector set A according to a first time length; padding the vector dimensions of each original data vector in A to obtain a first data vector set B; traversing each first data vector in B according to a preset data threshold and counting whenever bi_j is greater than or equal to the preset data threshold, to obtain a first quantity set S; performing a first clustering process on B according to S to obtain a first clustering result V; obtaining a mean vector set U according to V; performing a second clustering process on B according to U to obtain a second clustering result; determining, according to the second clustering result, whether an isolated data vector exists in B; and, if so, determining an abnormal data vector from the original data vector set A according to the isolated data vector. The method can determine abnormal data using only the data uploaded by RTUs and sensors that use a non-standard protocol.

Description

Abnormal data determination method, electronic device and storage medium

Technical Field

The present application relates to the field of data processing, and in particular to an abnormal data determination method, an electronic device, and a storage medium.

Background Art

IEC104 is an international standard communication protocol widely used in industries such as electric power and urban rail transit. Its advantages include large data capacity, ease of upgrading, good real-time performance and high reliability. A management system uses the IEC104 protocol to send the monitoring data collected by remote terminal units (RTU, Remote Terminal Unit) to a dispatching center for use by control personnel.

However, as demand for customization grows, many RTUs modify IEC104 and upload their data/data packets using the resulting non-standard protocol. Because the data/data packets received by the dispatching center are uploaded using a non-standard protocol, the abnormal data determination methods designed for IEC104 cannot be used to determine the abnormal data in them.

Summary of the Invention

In view of this, the present application provides an abnormal data determination method, an electronic device and a storage medium, which at least partially solve the problems existing in the prior art.

According to one aspect of the present invention, there is provided an abnormal data determination method, comprising:

Step S100: according to a first time length L, obtain an original data vector set A = {A1, A2, A3, ..., Am}, Ai = (ai_1, ai_2, ai_3, ..., ai_n(i)); where i = 1, 2, ..., m, Ai is the original data vector corresponding to the i-th RTU, m is the number of RTUs, ai_g is the g-th original data size information in the i-th original data vector, g = 1, 2, ..., n(i); n(i) is the number of original data size information items in the i-th original data vector; each RTU has a unique corresponding target sensor that uses a non-standard protocol;

Step S200: pad the vector dimensions of each original data vector in the original data vector set A to obtain a first data vector set B = {B1, B2, B3, ..., Bm}, Bi = (bi_1, bi_2, bi_3, ..., bi_W), so that every first data vector has the same number of dimensions; where Bi is the first data vector obtained by padding Ai, bi_j is the j-th first data size information in the i-th first data vector, j = 1, 2, ..., W, W is the number of dimensions in each first data vector, W = max(n(1), n(2), n(3), ..., n(m)), and the data size information of each padded dimension is 0;

Step S300: traverse each first data vector in the first data vector set B according to a preset data threshold to obtain a first quantity set S = {s1, s2, s3, ..., sm}; where si is the number of first data size information items in Bi that are greater than or equal to the preset data threshold;

Step S400: perform a first clustering process on the first data vectors in the first data vector set B according to the first quantity set S to obtain a first clustering result V = {V1, V2, V3, ..., Vk}, VX = {VX_1, VX_2, VX_3, ..., VX_c(X)}; where X = 1, 2, ..., k, VX is the X-th second data vector set, k is the number of second data vector sets, k < m, VX_c(X) is the c(X)-th second data vector in the X-th second data vector set, and c(X) is the number of second data vectors in the X-th second data vector set;

Step S500: obtain, from each second data vector set, a mean vector set U = {u1, u2, u3, ..., uk}, where uX is the mean vector corresponding to VX; uX = (uX_1, uX_2, uX_3, ..., uX_W), uX_j = (∑_{e=1}^{c(X)} VX_e^j) / c(X); where j = 1, 2, ..., W, uX_j is the j-th mean data size information in uX, VX_e^j is the j-th second data size information of the e-th second data vector in VX, and e = 1, 2, ..., c(X);

Step S600: perform a second clustering process on the first data vectors in the first data vector set B according to the mean vector set U to obtain a second clustering result; where the number of cluster categories of the second clustering process is k, uX is used as the initial cluster vector of the X-th cluster category, and the clustering condition is that the similarity F_Xt is less than the corresponding similarity threshold λ_X; F_Xt is the similarity between Bt and uX, Bt is the t-th first data vector in B, and t = 1, 2, ..., m;

Step S700: determine, according to the second clustering result, whether an isolated data vector exists in the first data vector set B; if so, determine an abnormal data vector from the original data vector set A according to the isolated data vector.

In an exemplary embodiment of the present application, the similarity F_Xt is calculated by a formula that is not reproduced in this text, in which bt_r is the r-th first data size information in Bt and uX_r is the r-th mean data size information in uX.

In an exemplary embodiment of the present application, λ_X satisfies a condition whose formula is likewise not reproduced in this text, in which uY_r is the r-th mean data size information in uY, uY is the mean vector corresponding to VY, VY is the Y-th second data vector set in V, and Y = X+1; uZ_r is the r-th mean data size information in uZ, uZ is the mean vector corresponding to VZ, VZ is the Z-th second data vector set in V, and Z = X-1.

In an exemplary embodiment of the present application, before step S100, the method further comprises:

identifying the data messages of each candidate sensor, so as to determine the candidate sensors that use a non-standard protocol as target sensors.

In an exemplary embodiment of the present application, before step S100, the method further comprises:

determining the data upload period corresponding to each RTU to obtain a period set Q = {Q1, Q2, Q3, ..., Qm}, where Qi is the data upload period corresponding to the i-th RTU;

obtaining the maximum period max(Q), where max() is a preset maximum value function;

determining the first time length L according to the maximum period max(Q), where L is greater than or equal to max(Q).

In an exemplary embodiment of the present application, L = H*max(Q), where H is a positive integer greater than 1.

In an exemplary embodiment of the present application, H = 10.
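As a minimal sketch of the rule L = H*max(Q), the following Python function computes the first time length from a list of upload periods. The function name, the use of seconds as the unit, and the sample periods are assumptions made for illustration only.

```python
def first_time_length(upload_periods, h=10):
    """Return L = H * max(Q) for a list of per-RTU upload periods.

    upload_periods: the period set Q (unit assumed to be seconds).
    h: the multiplier H; 10 is the value named in the text.
    """
    if not upload_periods:
        raise ValueError("at least one upload period is required")
    if h < 1:
        raise ValueError("H must be a positive integer")
    return h * max(upload_periods)


Q = [5.0, 30.0, 12.0]      # hypothetical upload periods, in seconds
L = first_time_length(Q)   # 10 * 30.0
```

Tying L to the longest period means that even the slowest RTU contributes several uploads to the observation window.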

In an exemplary embodiment of the present application, the preset data threshold is 0.8 kb.

According to one aspect of the present invention, there is provided an electronic device, comprising a processor and a memory;

the processor is configured to execute the steps of any of the above methods by calling a program or instructions stored in the memory.

According to one aspect of the present invention, there is provided a non-transitory computer-readable storage medium that stores a program or instructions, the program or instructions causing a computer to execute the steps of any of the above methods.

The abnormal data determination method provided by the present application first pads the vector dimensions of each original data vector in A so that every first data vector has the same number of dimensions, that is, the same length. Next, the number of first data size information items in each first data vector that are greater than or equal to the preset data threshold is determined, giving S. The first data vectors in the first data vector set are then clustered according to the first quantity set to obtain several second data vector sets, where the second data vectors in each second data vector set have similar first quantities (the first quantity can be understood as the number of valid original data size information items in the original data vector; similar means the difference in quantity is less than a threshold). A mean vector is then determined for each second data vector set from its second data vectors, which yields the number k of cluster categories used in the second clustering process and the initial cluster vector for each category, and the second clustering process is performed. Any first data vector that cannot be clustered in the second clustering process is determined to be an isolated data vector, and, using the correspondence between the original data vector set and the first data vector set, the abnormal data vector is finally determined from the original data vector set. Abnormal data can thus be determined from the uploaded data alone, without knowing the content of the non-standard protocol used by the RTUs and sensors. In addition, because every first data vector has the same length, the mean vector of each second data vector set can be computed directly from its first data vectors, which avoids the problem that a mean vector cannot be computed when the original data vectors have different numbers of dimensions (different lengths).

Brief Description of the Drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic structural block diagram of a scenario in which the abnormal data determination method provided in this embodiment is applied.

Detailed Description

The embodiments of the present application are described in detail below with reference to the accompanying drawings.

It should be noted that, where no conflict arises, the following embodiments and the features in them may be combined with each other; and all other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present disclosure, without creative effort, fall within the scope of protection of the present disclosure.

It should also be noted that various aspects of embodiments within the scope of the appended claims are described below. The aspects described herein may be embodied in a wide variety of forms, and any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, those skilled in the art should understand that an aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, a device may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such a device may be implemented and/or such a method practiced using structures and/or functionality other than one or more of the aspects set forth herein.

Referring to FIG. 1, according to one aspect of the present invention, an abnormal data determination method is provided, applied to a host computer. The host computer may be a server, a PC, or another electronic device that can receive data and has a certain processing capability. The host computer is communicatively connected to multiple RTUs and can receive the sampled data they upload. The sampled data may include the upload time, the original sampled data (the data collected by the corresponding sensor), the size information of the original sampled data, and so on. Each RTU is configured to upload sampled data once at the end of each data upload period (each RTU has its own data upload period). In practice, because RTUs are affected by network fluctuations, an RTU may also upload sampled data to the host computer before the end of the current data upload period. In this embodiment, the sampled data may be a traffic packet.

The method specifically comprises the following steps:

Step S000: identify the data messages of each candidate sensor, so as to determine the candidate sensors that use a non-standard protocol as target sensors; each target sensor has a unique corresponding RTU, which uploads the sampled data of that target sensor. Here, a non-standard protocol is a customized IEC104 protocol, that is, a modified IEC104 protocol. The target sensor may be a temperature sensor, a humidity sensor, a pressure sensor, or the like.

Step S100: according to a first time length L, obtain an original data vector set A = {A1, A2, A3, ..., Am}, Ai = (ai_1, ai_2, ai_3, ..., ai_n(i)); where i = 1, 2, ..., m, Ai is the original data vector corresponding to the i-th RTU, m is the number of RTUs, ai_g is the g-th original data size information in the i-th original data vector, g = 1, 2, ..., n(i); n(i) is the number of original data size information items in the i-th original data vector; each RTU has a unique corresponding target sensor that uses a non-standard protocol. The original data vector can be obtained from the sampled data uploaded by the corresponding RTU within the first time length. Because the RTUs differ in data upload period and in the time at which they start working, and because the number of spurious uploads caused by network fluctuations also differs, the number of original data size information items differs from one original data vector to another. Therefore, in this embodiment, n() is not a predefined processing function; n(i) is simply the uniquely determined value for each i, and different values of i may give different values of n(i).
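The collection in step S100 can be sketched as follows. The tuple layout `(rtu_id, upload_time, size)`, the function name and the sample values are hypothetical; the text only states that each original data vector is built from the sizes of the data an RTU uploads within the first time length.

```python
def collect_raw_vectors(uploads, start, length):
    """Group upload sizes by RTU for uploads with start <= time < start + length.

    uploads: iterable of (rtu_id, upload_time, size) tuples (hypothetical layout).
    Returns a dict mapping rtu_id to a time-ordered list of (upload_time, size).
    """
    end = start + length
    vectors = {}
    for rtu_id, t, size in uploads:
        if start <= t < end:
            vectors.setdefault(rtu_id, []).append((t, size))
    for samples in vectors.values():
        samples.sort(key=lambda s: s[0])  # order each vector by upload time
    return vectors


uploads = [
    ("rtu-1", 0.0, 1.2), ("rtu-1", 10.0, 1.1), ("rtu-1", 14.0, 0.3),
    ("rtu-2", 5.0, 2.0), ("rtu-2", 35.0, 2.1),
]
A = collect_raw_vectors(uploads, start=0.0, length=30.0)
```

In this example the two vectors have different lengths (three items and one item), which illustrates why n(i) varies with i.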

Step S200: pad the vector dimensions of each original data vector in the original data vector set A to obtain a first data vector set B = {B1, B2, B3, ..., Bm}, Bi = (bi_1, bi_2, bi_3, ..., bi_W), so that every first data vector has the same number of dimensions; where Bi is the first data vector obtained by padding Ai, bi_j is the j-th first data size information in the i-th first data vector, j = 1, 2, ..., W, W is the number of dimensions in each first data vector, W = max(n(1), n(2), n(3), ..., n(m)), and the data size information of each padded dimension is 0. Specifically, in this embodiment, padding does not mean appending a varying number of consecutive zeros to the head or tail of each original data vector. Instead, a certain number of zeros is inserted between two adjacent data size information items according to the actual times of the items already in each original data vector. The number of zeros is determined by the time interval between the two adjacent items: the longer the interval, the more zeros are inserted. As a result, the data in the same dimension of different first data vectors correspond to the same or similar times, where similar means that the time difference is below a bound of 0.01 to 0.1 seconds. The padded dimensions are set to 0 so that, in subsequent processing, they do not affect the actual values of the corresponding first data vector; their only purpose is to align the first data size information in different first data vectors in time and position.
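One possible way to realize time-aligned zero padding is to place every sample on a shared time grid, as sketched below. This is a simplification and an assumption: the text fixes W = max(n(1), ..., n(m)) and derives the number of inserted zeros from the gaps between adjacent samples, whereas the sketch takes a hypothetical slot width and slot count as inputs and sums samples that fall into the same slot.

```python
def pad_to_grid(samples, start, slot, n_slots):
    """Place (time, size) samples on a shared time grid; empty slots stay 0.

    samples: time-ordered list of (upload_time, size) for one RTU.
    start, slot, n_slots: hypothetical grid origin, slot width and slot count.
    """
    vector = [0.0] * n_slots
    for t, size in samples:
        j = int((t - start) // slot)   # index of the slot that contains time t
        if 0 <= j < n_slots:
            vector[j] += size          # assumption: colliding samples are summed
    return vector


A1 = [(0.0, 1.2), (10.0, 1.1), (20.0, 1.3)]
A2 = [(0.0, 2.0), (20.0, 2.1)]         # no upload around t = 10
B1 = pad_to_grid(A1, start=0.0, slot=10.0, n_slots=3)
B2 = pad_to_grid(A2, start=0.0, slot=10.0, n_slots=3)
```

Here the zero in `B2` sits in the middle of the vector rather than at its tail, so position j refers to the same time in both vectors.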

Step S300: traverse each first data vector in the first data vector set B according to a preset data threshold, counting whenever bi_j is greater than or equal to the preset data threshold, to obtain a first quantity set S = {s1, s2, s3, ..., sm}; where si is the number of first data size information items in Bi that are greater than or equal to the preset data threshold, that is, si is the first quantity corresponding to Bi. The first quantity can be understood as the number of valid data items in the original data vector, that is, the number of original data size information items that were not produced by network fluctuations.
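The counting in step S300 is small enough to show directly. The sketch below assumes sizes are expressed in the same unit as the threshold; the function name and the sample vectors are illustrative.

```python
def first_quantity(vector, threshold=0.8):
    """Count the entries of one first data vector that are >= threshold."""
    return sum(1 for b in vector if b >= threshold)


B = [
    [1.2, 0.3, 1.1, 0.0],   # 0.3 is below the threshold, 0.0 is padding
    [2.0, 0.0, 2.1, 1.9],
]
S = [first_quantity(Bi) for Bi in B]
```

Padded zeros and small packets both fall below the threshold, so neither contributes to the first quantity.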

Step S400: perform a first clustering process on the first data vectors in the first data vector set B according to the first quantity set S to obtain a first clustering result V = {V1, V2, V3, ..., Vk}, VX = {VX_1, VX_2, VX_3, ..., VX_c(X)}; where X = 1, 2, ..., k, VX is the X-th second data vector set, k is the number of second data vector sets, VX_c(X) is the c(X)-th second data vector in the X-th second data vector set, c(X) is the number of second data vectors in the X-th second data vector set, and k < m.

Clustering B by the number of first data size information items in each first data vector that are greater than or equal to the preset data threshold groups first data vectors with similar acquisition periods, similar start and end times, and similar actual sampling durations into one second data vector set. That is, for first data vectors with similar first quantities, the corresponding RTUs may use the same or similar data upload periods, may have the same or similar start and end times, or may have the same or similar actual sampling durations. Using the preset data threshold to determine the first quantity avoids the effects of the differing original vector lengths and differing numbers of padded dimensions that result from network fluctuations and from the vector dimension padding described above. Specifically, the clustering condition may be that any two first data vectors whose first quantities differ by less than a set quantity difference are clustered into one category; the set quantity difference ranges from 1 to 5 and is 2 in this embodiment. Any existing clustering method may be used. It should be understood that the second data vectors in a second data vector set are in fact still first data vectors of the first data vector set: none of the first data size information in any first data vector is modified, and the vectors are merely regrouped. Different names are used in this embodiment only to make the distinction easier. The preset data threshold can be taken as the mean or the maximum of the data size information that is marked in historical data as having been produced by network fluctuations; this embodiment uses the maximum. In this embodiment, the preset data threshold is 0.8 kb.
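The text leaves the first clustering method open ("an existing clustering method"). As one possible interpretation, the sketch below sorts the first quantities and starts a new group whenever a quantity differs from the smallest quantity of the current group by the set quantity difference or more, so that any two members of a group differ by less than that value. The greedy strategy and the function name are assumptions.

```python
def cluster_by_count(S, max_diff=2):
    """Group vector indices so that first quantities within a group differ by < max_diff.

    S: list of first quantities, one per first data vector.
    Returns a list of index groups (the second data vector sets, by index).
    """
    order = sorted(range(len(S)), key=lambda i: S[i])
    groups = []
    for i in order:
        # groups[-1][0] holds the smallest count in the current group
        if groups and S[i] - S[groups[-1][0]] < max_diff:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


S = [10, 11, 30, 10, 31, 50]
V = cluster_by_count(S)
```

The groups hold indices into B rather than copies of the vectors, which matches the remark that second data vectors are simply regrouped first data vectors.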

Step S500: obtain, from each second data vector set, a mean vector set U = {u1, u2, u3, ..., uk}, where uX is the mean vector corresponding to VX; uX = (uX_1, uX_2, uX_3, ..., uX_W), uX_j = (∑_{e=1}^{c(X)} VX_e^j) / c(X); where j = 1, 2, ..., W, uX_j is the j-th mean data size information in uX, VX_e^j is the j-th second data size information of the e-th second data vector in VX, and e = 1, 2, ..., c(X).

In this embodiment, because the vector dimensions of every original data vector have been padded, every first data vector and every second data vector has the same length. Therefore, when U is obtained from V, the formula uX_j = (∑_{e=1}^{c(X)} VX_e^j) / c(X) can be applied directly to average all second data vectors in VX position by position (dimension by dimension), giving the mean vector of each second data vector set. This avoids the problem that a mean vector cannot be computed when the original data vectors have different numbers of dimensions. Moreover, the mean vector is computed from all original data size information in the original data vectors (including the items produced by network fluctuations), which makes the mean vector more accurate.
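The dimension-wise mean uX_j = (∑_{e=1}^{c(X)} VX_e^j) / c(X) can be sketched as follows, assuming the groups from the first clustering are given as lists of indices into B (a representation chosen here for illustration).

```python
def mean_vector(vectors):
    """Dimension-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    width = len(vectors[0])
    if any(len(v) != width for v in vectors):
        raise ValueError("all vectors must have the same length W")
    return [sum(v[j] for v in vectors) / count for j in range(width)]


B = [[1.0, 0.0, 3.0], [3.0, 0.0, 1.0], [10.0, 10.0, 10.0]]
V = [[0, 1], [2]]   # index groups from the first clustering (hypothetical)
U = [mean_vector([B[i] for i in group]) for group in V]
```

The length check makes explicit why padding has to come first: without equal lengths the sum over position j is not defined.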

Step S600: perform a second clustering process on the first data vectors in the first data vector set B according to the mean vector set U to obtain a second clustering result; where the number of cluster categories of the second clustering process is k, uX is used as the initial cluster vector of the X-th cluster category, and the clustering condition is that the similarity F_Xt is less than the corresponding similarity threshold λ_X; F_Xt is the similarity between Bt and uX, Bt is the t-th first data vector in B, and t = 1, 2, ..., m. Specifically, the second clustering process may be K-means clustering. The number of cluster categories is the "K value" used by K-means, and u1, u2, u3, ..., uk are the initial cluster values of the k cluster categories. These initial values are themselves obtained from the second data vectors in B, and the vector set that K-means operates on is also B, so the second data vectors in B can be clustered more accurately.
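A minimal sketch of K-means seeded with the mean vectors is shown below. Two points are assumptions: Euclidean distance is used as the vector distance, since the source formula for F_Xt is not reproduced in this text, and the threshold condition F_Xt < λ_X is left out, so every vector is assigned to its nearest centroid.

```python
import math


def distance(p, q):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def seeded_kmeans(B, seeds, iterations=10):
    """Lloyd-style K-means whose initial centroids are the given seed vectors."""
    centroids = [list(u) for u in seeds]
    labels = [0] * len(B)
    for _ in range(iterations):
        # assignment step: nearest centroid for every vector
        labels = [
            min(range(len(centroids)), key=lambda x: distance(b, centroids[x]))
            for b in B
        ]
        # update step: recompute each centroid from its members
        for x in range(len(centroids)):
            members = [B[t] for t, lab in enumerate(labels) if lab == x]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[x] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


B = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
U = [[1.1, 0.9], [5.1, 4.9]]   # seeds, e.g. mean vectors from the first clustering
labels, centroids = seeded_kmeans(B, U)
```

Seeding with the mean vectors fixes both k and the starting points, which removes the usual dependence of K-means on random initialization.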

In this embodiment, the similarity F_Xt is calculated by a formula that is not reproduced in this text, in which bt_r is the r-th first data size information in Bt and uX_r is the r-th mean data size information in uX. The formula gives the vector distance (that is, the similarity) between each second data vector and each initial cluster vector; the smaller F_Xt is, the more similar the two vectors are.

λ_X satisfies a condition whose formula is also not reproduced in this text.

In the clustering condition, the similarity threshold for each cluster category is not a fixed value; it is determined from the vector distance between the current initial cluster vector and its one or two adjacent initial cluster vectors, which makes the final clustering result more accurate. In this way the second data vectors in B are clustered. In this embodiment, isolated data vectors are not determined from the first clustering process because that process clusters on S, and every first quantity in S is a positive integer; clustering on the first quantity alone can only group first data vectors with similar acquisition periods, similar start and end times, and similar actual sampling durations. Therefore, to determine abnormal data more precisely, this embodiment clusters twice. The first clustering process also supplies the number of cluster categories and the initial cluster value of each category used by the second clustering process, so the first clustering process improves the accuracy of the second.

The second clustering result may take the form of a clustering diagram or a set of clusters. An isolated data vector is a first data vector that has not been clustered into any cluster category. Such a vector differs greatly from every other first data vector, which indicates that it contains abnormal first data size information. Finally, the abnormal data vector is determined in A according to the correspondence between the original data vector set and the first data vector set, and is marked accordingly. The correspondence is that A1 corresponds to B1, A2 corresponds to B2, and so on; that is, Ai corresponds to Bi.
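The isolated-vector check and the mapping back to A can be sketched as follows. Euclidean distance stands in for F_Xt and the thresholds λ_X are passed in as given numbers; both are assumptions, because the source formulas are not reproduced in this text. The sample vectors and threshold are invented for illustration.

```python
import math


def distance(p, q):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def isolated_indices(B, U, thresholds):
    """Return indices t for which no centroid uX satisfies F_Xt < lambda_X."""
    isolated = []
    for t, bt in enumerate(B):
        if not any(distance(bt, u) < lam for u, lam in zip(U, thresholds)):
            isolated.append(t)
    return isolated


A = [[1.0, 1.1, 1.0], [1.2, 0.9], [9.0, 9.5, 9.0]]             # original vectors
B = [[1.0, 1.1, 1.0], [1.2, 0.0, 0.9], [9.0, 9.5, 9.0]]        # padded vectors
U = [[1.1, 0.55, 0.95]]                                        # one mean vector
thresholds = [2.0]                                             # hypothetical lambda_1

abnormal = [A[t] for t in isolated_indices(B, U, thresholds)]  # Ai corresponds to Bi
```

Because Ai and Bi share the same index, the isolated positions found in B select the abnormal original vectors in A directly.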

In the abnormal data determination method provided by this embodiment, vector dimension filling is first performed on each original data vector in A so that every first data vector has the same number of dimensions, that is, the same length. Next, the number of first data size information items in each first data vector that are greater than or equal to the preset data threshold is determined, giving S. The first data vectors in the first data vector set are then clustered according to the first number set to obtain several second data vector sets. Within each second data vector set, the first numbers corresponding to the second data vectors (which can be understood as the number of valid original data size information items in the original data vector) are similar, meaning that their difference is less than a threshold. The mean vector of each second data vector set is then determined from its second data vectors, which yields the number k of cluster categories used by the second clustering process and the clustering initial vector of each category, and the second clustering process is performed. Any first data vector that cannot be clustered in the second clustering process is determined to be an isolated data vector, and the abnormal data vectors are finally determined from the original data vector set according to the correspondence between the original data vector set and the first data vector set. Abnormal data can therefore be determined solely from the data uploaded by the RTUs and sensors that use non-standard protocols, without any knowledge of the content of those protocols. In addition, because every first data vector has the same length, the mean vector of each second data vector set can be computed directly from its first data vectors, which avoids the problem that a mean vector cannot be computed when the original data vectors have different numbers of dimensions (different lengths).
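The preparatory steps summarized above (dimension filling, counting, first clustering, and mean vectors) can be sketched as follows. The zero padding and the "greater than or equal to the threshold" count follow the text. The grouping rule is an assumption, since the text only states that counts within a group differ by less than a threshold: here the vectors are sorted by count and a new group starts whenever two consecutive counts differ by `max_gap` or more. All function and parameter names are hypothetical.

```python
from typing import List, Sequence


def pad_vectors(A: Sequence[Sequence[float]]) -> List[List[float]]:
    """Zero-pad every original vector to the length of the longest one."""
    W = max(len(a) for a in A)
    return [list(a) + [0.0] * (W - len(a)) for a in A]


def count_valid(B: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """For each vector, count the entries >= the preset data threshold."""
    return [sum(1 for x in b if x >= threshold) for b in B]


def group_by_count(S: Sequence[int], max_gap: int) -> List[List[int]]:
    """Assumed first clustering: group indices whose sorted counts are close."""
    if not S:
        return []
    order = sorted(range(len(S)), key=lambda i: S[i])
    groups = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if S[cur] - S[prev] < max_gap:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def mean_vectors(
    B: Sequence[Sequence[float]], groups: Sequence[Sequence[int]]
) -> List[List[float]]:
    """Component-wise mean of the padded vectors in each group."""
    W = len(B[0])
    return [[sum(B[i][j] for i in g) / len(g) for j in range(W)] for g in groups]
```

With `A = [[1.0, 1.2, 0.9, 1.1], [1.1, 1.0, 1.0], [5.0]]`, a threshold of 0.8 and `max_gap = 2`, the counts are 4, 3 and 1, so the first two vectors share a group and the third forms its own. The number of groups gives k, and the mean vectors serve as the clustering initial vectors.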

The first time length L is determined according to the maximum period max(Q), where L is greater than or equal to max(Q). Specifically, L ≥ H*max(Q), where H is a positive integer greater than 1; preferably, H = 10. L has a definite start time L_start and a definite end time, and the original data vector corresponding to each RTU is acquired over this window.

To ensure that A contains enough valid data to support the subsequent determination of abnormal data, L must be no smaller than max(Q) in this embodiment, which guarantees that every original data vector contains at least one valid data item. In the subsequent processing, the clustering condition of the first clustering process depends on the first number, and the clustering condition of the second clustering process depends on both the first number and the actual values of the first data size information in each first data vector. An original data vector that contains only one valid data item could distort the final clustering result. Therefore, in this embodiment L ≥ 10*max(Q), so that every original data vector contains at least 10 valid data items.
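The choice of L can be illustrated with a short sketch. It assumes that the upload periods in Q are expressed in one common unit (for example seconds) and returns the smallest L that satisfies L ≥ H*max(Q); the helper names are hypothetical.

```python
from typing import Sequence


def first_time_length(periods: Sequence[float], H: int = 10) -> float:
    """Smallest window length that satisfies L >= H * max(Q)."""
    if not periods:
        raise ValueError("at least one upload period is required")
    if H <= 1:
        raise ValueError("H must be an integer greater than 1")
    return H * max(periods)


def expected_uploads(L: float, period: float) -> int:
    """Approximate number of uploads an RTU with this period makes within L."""
    return int(L // period)
```

For example, with periods of 30, 60 and 300 seconds and H = 10, the window is 3000 seconds. The slowest RTU is then expected to upload about 10 times within L, and faster RTUs upload more often.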

Further, in an exemplary embodiment of the present application, the following may also be included after step S300:

Determine the acquisition time of the first original data size information in the original data vector corresponding to each RTU, and determine the maximum acquisition time T_start^max among these acquisition times.

Obtain HΔ = (L_start − T_start^max)/max(Q), where HΔ is rounded up.

Traverse S. If sα is less than H − HΔ, delete the first data vector corresponding to sα (that is, Bα) from B, where α ranges from 1 to m.

The original data vector in A that corresponds to Bα is then determined to be an abnormal data vector. Since L ≥ H*max(Q), the first data vector of a normal RTU contains at least H − HΔ items of first data size information that are greater than or equal to the preset data threshold. Therefore, if sα is less than H − HΔ, the corresponding RTU has failed to transmit some of its data, and the original data vector corresponding to that RTU can be determined directly to be an abnormal data vector.
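The missing-transmission check can be sketched as follows. One point is an assumption: the sketch uses the magnitude of the difference between L_start and T_start^max, so that HΔ counts how many maximum periods elapse before the slowest RTU first reports. Times and periods are assumed to share one unit, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def missing_transmission_indices(
    S: Sequence[int],
    H: int,
    l_start: float,
    first_times: Sequence[float],
    max_period: float,
) -> List[int]:
    """Return the indices of RTUs whose valid-data count is below H - H_delta."""
    t_start_max = max(first_times)
    # Assumption: magnitude of the offset, rounded up to whole maximum periods.
    h_delta = math.ceil(abs(l_start - t_start_max) / max_period)
    lower_bound = H - h_delta
    return [i for i, s in enumerate(S) if s < lower_bound]
```

The returned indices identify both the first data vectors to delete from B and, through the Ai to Bi correspondence, the original data vectors in A to mark as abnormal.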

In addition, although the steps of the method in the present disclosure are described in the drawings in a specific order, this neither requires nor implies that the steps must be performed in that order, or that all of the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, several steps may be combined into one, and/or one step may be split into several.

From the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of the present disclosure may therefore be embodied as a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) or on a network, and which includes instructions that cause a computing device (such as a personal computer, a server, a mobile terminal, or a network device) to perform the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

Those skilled in the art will appreciate that aspects of the present application may be implemented as a system, a method, or a program product. Accordingly, aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, and the like), or an embodiment combining hardware and software aspects, all of which may be referred to collectively herein as a "circuit", "module", or "system".

The electronic device according to this embodiment of the present application is only an example and should not limit the functions or scope of use of the embodiments of the present application in any way.

The electronic device takes the form of a general-purpose computing device. Its components may include, but are not limited to, the at least one processor, the at least one memory, and a bus connecting the different system components (including the memory and the processor).

The memory stores program code that can be executed by the processor, so that the processor performs the steps according to the various exemplary embodiments of the present application described in the "Exemplary Method" section of this specification.

The memory may include readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, and may further include read-only memory (ROM).

The memory may also include a program/utility having a set of (at least one) program modules. Such program modules include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.

The bus may represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor bus, or a local bus using any of a variety of bus architectures.

The electronic device may also communicate with one or more external devices (such as a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device, and/or with any device (such as a router or a modem) that enables the electronic device to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface. The electronic device may also communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet, through a network adapter. The network adapter communicates with the other modules of the electronic device through the bus. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

From the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of the present disclosure may therefore be embodied as a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) or on a network, and which includes instructions that cause a computing device (such as a personal computer, a server, a terminal device, or a network device) to perform the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the above method of this specification is stored. In some possible embodiments, aspects of the present application may also be implemented as a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present application described in the "Exemplary Method" section of this specification.

The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of these. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of these.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave and carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of these. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.

The program code contained on a readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of these.

Program code for performing the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computing device (for example, through the Internet using an Internet service provider).

In addition, the above figures are only schematic illustrations of the processing included in the method according to the exemplary embodiments of the present application and are not intended to be limiting. The processing shown in the figures neither indicates nor limits the chronological order of the processes. These processes may also be performed synchronously or asynchronously, for example in multiple modules.

It should be noted that, although the detailed description above mentions several modules or units of the device for performing actions, this division is not mandatory. According to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.

The above is only a specific implementation of the present application, and the protection scope of the present application is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. The protection scope of the present application shall therefore be determined by the protection scope of the claims.

Claims

1. An abnormal data determination method, comprising:

step S100, acquiring, according to a first time length L, an original data vector set $A = \{A_1, A_2, A_3, \ldots, A_m\}$, $A_i = (a_{i,1}, a_{i,2}, a_{i,3}, \ldots, a_{i,n(i)})$; where $i = 1, 2, \ldots, m$, $A_i$ is the original data vector corresponding to the i-th RTU, m is the number of RTUs, $a_{i,g}$ is the g-th original data size information in the i-th original data vector, $g = 1, 2, \ldots, n(i)$, and $n(i)$ is the number of original data size information in the i-th original data vector; each RTU has a uniquely corresponding target sensor that uses a non-standard protocol;

step S200, performing vector dimension filling on each original data vector in the original data vector set A to obtain a first data vector set $B = \{B_1, B_2, B_3, \ldots, B_m\}$, $B_i = (b_{i,1}, b_{i,2}, b_{i,3}, \ldots, b_{i,W})$, so that the number of dimensions of each first data vector is the same; wherein $B_i$ is the first data vector obtained by vector dimension filling of $A_i$, $b_{i,j}$ is the j-th first data size information in the i-th first data vector, $j = 1, 2, \ldots, W$, W is the number of dimensions of each first data vector, $W = \max(n(1), n(2), n(3), \ldots, n(m))$, and when vector dimension filling is performed, the data size information of each filled dimension is 0;

step S300, traversing each first data vector in the first data vector set B according to a preset data threshold to obtain a first number set $S = \{s_1, s_2, s_3, \ldots, s_m\}$; wherein $s_i$ is the number of first data size information in $B_i$ that is greater than or equal to the preset data threshold;

step S400, performing a first clustering process on the first data vectors in the first data vector set B according to the first number set S to obtain a first clustering result $V = \{V_1, V_2, V_3, \ldots, V_k\}$, $V_X = \{V_{X,1}, V_{X,2}, V_{X,3}, \ldots, V_{X,c(X)}\}$; wherein $X = 1, 2, \ldots, k$, $V_X$ is the X-th second data vector set, k is the number of second data vector sets, $k < m$, $V_{X,c(X)}$ is the c(X)-th second data vector in the X-th second data vector set, and $c(X)$ is the number of second data vectors in the X-th second data vector set;

step S500, obtaining a mean vector set $U = \{u_1, u_2, u_3, \ldots, u_k\}$ according to each second data vector set, wherein $u_X$ is the mean vector corresponding to $V_X$; $u_X = (u_{X,1}, u_{X,2}, u_{X,3}, \ldots, u_{X,W})$, $u_{X,j} = \left(\sum_{e=1}^{c(X)} V_{X,e}^{j}\right) / c(X)$; where $j = 1, 2, \ldots, W$, $u_{X,j}$ is the j-th mean data size information in $u_X$, $V_{X,e}^{j}$ is the j-th second data size information of the e-th second data vector in $V_X$, and $e = 1, 2, \ldots, c(X)$;

step S600, performing a second clustering process on the first data vectors in the first data vector set B according to the mean vector set U to obtain a second clustering result; wherein the number of cluster categories of the second clustering process is k, $u_X$ is used as the clustering initial vector of the X-th cluster category, and the clustering condition is that the similarity $F_{Xt}$ is less than the corresponding similarity threshold $\lambda_X$, where $F_{Xt}$ is the similarity between $B_t$ and $u_X$, $B_t$ is the t-th first data vector in B, and $t = 1, 2, \ldots, m$;

step S700, determining, according to the second clustering result, whether an isolated data vector exists in the first data vector set B; and if so, determining an abnormal data vector from the original data vector set A according to the isolated data vector.

2. The method for determining abnormal data according to claim 1, wherein,

wherein $b_{t,r}$ is the r-th first data size information in $B_t$, and $u_{X,r}$ is the r-th mean data size information in $u_X$.

3. The abnormal data determination method according to claim 2, wherein $\lambda_X$ satisfies the following condition:

wherein $u_{Y,r}$ is the r-th mean data size information in $u_Y$, $u_Y$ is the mean vector corresponding to $V_Y$, $V_Y$ is the Y-th second data vector set in V, and $Y = X + 1$; $u_{Z,r}$ is the r-th mean data size information in $u_Z$, $u_Z$ is the mean vector corresponding to $V_Z$, $V_Z$ is the Z-th second data vector set in V, and $Z = X - 1$.

4. The abnormal data determination method according to claim 1, characterized in that, before said step S100, said method further comprises:

identifying the data message of each candidate sensor, so as to determine, among the plurality of candidate sensors, each candidate sensor that uses a non-standard protocol as a target sensor.

5. The abnormal data determination method according to claim 1, further comprising, prior to the step S100:

determining a data uploading period corresponding to each RTU to obtain a period set $Q = \{Q_1, Q_2, Q_3, \ldots, Q_m\}$, wherein $Q_i$ is the data uploading period corresponding to the i-th RTU;

obtaining a maximum period max (Q), wherein max () is a preset maximum value determining function;

determining a first time length L according to the maximum period max (Q); wherein L is equal to or greater than max (Q).

6. The abnormal data determination method according to claim 5, wherein $L = H \times \max(Q)$, H being a positive integer greater than 1.

7. The abnormal data determination method according to claim 6, wherein $H = 10$.

8. The abnormal data determination method according to claim 1, wherein the preset data threshold is 0.8kb.

9. An electronic device comprising a processor and a memory;

the processor is adapted to perform the steps of the method according to any of claims 1 to 7 by invoking a program or instruction stored in the memory.

10. A non-transitory computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of the method of any one of claims 1 to 7.