WO2020010677A1 - Method for acquiring consecutive missing values, data analysis device, terminal, and storage medium

Info

Publication number
WO2020010677A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
data
feature
value
target time
Prior art date
Application number
PCT/CN2018/103333
Other languages
French (fr)
Chinese (zh)
Inventor
郑立颖
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020010677A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Operations Research (AREA)
  • Fuzzy Systems (AREA)
  • Algebra (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The present invention discloses a method for acquiring consecutive missing values, a data analysis device, a data analysis terminal, and a computer-readable storage medium. The method comprises: if it is detected that a target time sequence acquired at a preset time interval has consecutive missing values, acquiring, at the preset time interval, all sequence feature values from all time sequence samples, so as to generate a feature data sequence for each time sequence sample; performing anomaly detection calculation on each feature data sequence, so as to determine the normal data sequences among all the feature data sequences; acquiring the target time points corresponding to the consecutive missing values in the target time sequence, and acquiring the sequence feature values at all the target time points in all the normal data sequences; and calculating the mean of all the sequence feature values at each target time point, and using the feature mean as the filling reference value of the consecutive missing value at the corresponding target time point. The present invention improves the authenticity of time sequence data.

Description

Continuous missing value filling method, data analysis device, terminal, and storage medium
This application claims priority to Chinese patent application No. 201810748247.X, filed with the Chinese Patent Office on July 9, 2018 and entitled "Continuous missing value filling method, data analysis device, terminal, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the technical field of data analysis, and in particular to a continuous missing value filling method, a data analysis device, a data analysis terminal, and a computer-readable storage medium.
Background
In practice, collected indicator data is compiled into statistics. Continuous changes in indicator data usually reflect a historical trend and help predict the subsequent trend. However, unexpected events often occur during collection. For example, indicator data cannot be collected during a period of system failure or equipment replacement, so consecutive missing values appear in that continuous stretch of the time series. Existing uniform mean filling produces fill values that do not conform to the distribution of the time series itself, while moving mean filling introduces abnormal data values. Traditional single-point missing value filling methods therefore tend to shift the filled indicator data substantially and cannot guarantee the authenticity of the data.
Summary of the invention
The main purpose of the present invention is to provide a continuous missing value filling method, a data analysis device, a data analysis terminal, and a computer-readable storage medium, aiming to solve the technical problem that traditional single-point missing value filling methods easily introduce abnormal data values when filling consecutive missing values, which causes large offsets in the calculated fill values and reduces data authenticity.
To achieve the above purpose, an embodiment of the present invention provides a continuous missing value filling method, which includes:
when consecutive missing values are detected in a target time series collected at a preset time interval, collecting all sequence feature values from all time series samples at the preset time interval to generate a feature data sequence for each time series sample;
performing anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences;
obtaining the target time points corresponding to the consecutive missing values in the target time series, and obtaining the sequence feature values at all target time points in all normal data sequences;
averaging all sequence feature values at each target time point to obtain the feature mean at each target time point, and using the feature mean as the filling reference value for the consecutive missing value at the corresponding target time point.
The present invention also provides a data analysis device, which includes: a collection module, configured to, when consecutive missing values are detected in a target time series collected at a preset time interval, collect all sequence feature values from all time series samples at the preset time interval to generate a feature data sequence for each time series sample; a detection module, configured to perform anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences; an obtaining module, configured to obtain the target time points corresponding to the consecutive missing values in the target time series, and to obtain the sequence feature values at all target time points in all normal data sequences; and a filling module, configured to average all sequence feature values to obtain the feature mean at each target time point, and to use the feature mean as the filling reference value for the consecutive missing value at the corresponding target time point.
In addition, to achieve the above purpose, the present invention further provides a data analysis terminal, which includes: a memory, a processor, a communication bus, and computer-readable instructions stored in the memory, wherein the computer-readable instructions, when executed by the processor, implement the steps of the continuous missing value filling method described above.
In addition, to achieve the above purpose, the present invention further provides a computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the steps of the continuous missing value filling method described above.
When consecutive missing values are detected in a target time series collected at a preset time interval, the present invention collects all sequence feature values from all time series samples at the preset time interval to generate a feature data sequence for each time series sample; performs anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences; obtains the target time points corresponding to the consecutive missing values in the target time series, and obtains the sequence feature values at all target time points in all normal data sequences; and averages all sequence feature values at each target time point to obtain the feature mean at each target time point, using the feature mean as the filling reference value for the consecutive missing value at the corresponding target time point. The invention extracts sequence feature values from time series samples, identifies normal data sequences through anomaly detection, averages the feature values at the target time points across multiple normal data sequences, and uses each mean as the fill value of the consecutive missing value at the corresponding time point. This reduces interference from abnormal feature values, ensures the reliability of the filling reference values, and improves the efficiency of filling consecutive missing values. It solves the technical problem that traditional single-point missing value filling methods easily introduce abnormal data values when filling consecutive missing values, which causes large offsets in the calculated fill values and reduces data authenticity, while preserving the distribution characteristics of the time series itself and reducing computational complexity.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a first embodiment of the continuous missing value filling method of the present invention;
FIG. 2 is a detailed flowchart of step S20 in FIG. 1;
FIG. 3 is a schematic diagram of the functional modules of the data analysis device of the present invention;
FIG. 4 is a schematic structural diagram of a device in the hardware operating environment involved in the method of an embodiment of the present invention.
The realization of the purpose, the functional characteristics, and the advantages of the present invention are further described below with reference to the embodiments and the drawings.
Detailed description
It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
The present invention provides a continuous missing value filling method. In a first embodiment, referring to FIG. 1, the continuous missing value filling method includes:
Step S10: when consecutive missing values are detected in a target time series collected at a preset time interval, collect all sequence feature values from all time series samples at the preset time interval to generate a feature data sequence for each time series sample;
The target time series is the set of data indicators that the system collects at a preset time interval, and the consecutive missing values are sequence feature values that could not be recorded normally in the target time series for special reasons. When the target time series contains consecutive missing values, the system collects all sequence feature values from the time series samples at the preset time interval to serve as reference data for the target time series and so fill in the missing values.
It can be understood that the time interval at which the sequence feature values of the target time series were collected may differ from the preset time interval used in the present invention to collect sequence feature values from the time series samples, but the time interval of the target time series must be greater than or equal to the preset time interval used for the time series samples. For example, if the target time series collects one data point every hour, the preset time interval for the time series samples must be one hour or less, such as one data point every 30 minutes or every 20 minutes. All sequence feature values collected from the time series samples can then serve as reference values for the target time series. Otherwise, if the time series samples use a preset interval of 2 hours, which is larger than the interval of the target time series, then within one day the time series samples yield 12 sequence feature values (at a 2-hour interval) while the target time series has 24 (at a 1-hour interval). The two do not correspond: if three consecutive missing values of the target time series fall within a continuous 1.5 hours, the time series samples offer at most one sequence feature value as a reference within that 1.5 hours, which cannot solve the technical problem addressed by the present invention.
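The interval constraint above can be illustrated with a small check. This is a minimal sketch, not part of the patent text; the function names `sample_interval_is_usable` and `points_per_day` are hypothetical and chosen only for illustration.

```python
from datetime import timedelta


def sample_interval_is_usable(target_interval: timedelta,
                              sample_interval: timedelta) -> bool:
    """True when the sample interval is no coarser than the target interval."""
    return sample_interval <= target_interval


def points_per_day(interval: timedelta) -> int:
    """Number of values collected in one day at the given interval."""
    return timedelta(days=1) // interval


hourly = timedelta(hours=1)
# A 30-minute sample interval can serve an hourly target series.
print(sample_interval_is_usable(hourly, timedelta(minutes=30)))
# A 2-hour sample interval cannot: it yields 12 values per day against 24.
print(sample_interval_is_usable(hourly, timedelta(hours=2)))
print(points_per_day(timedelta(hours=2)), points_per_day(hourly))
```

The counts 12 and 24 match the figures given in the paragraph above for the 2-hour and 1-hour intervals.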
After detecting consecutive missing values, the system collects all sequence feature values from all time series samples at the preset time interval, maps the sequence feature values to their respective time series samples, and generates the feature data sequence of each time series sample.
To aid understanding, this step can be illustrated as follows. Suppose the power consumption statistics sequence (the target time series) was collected every hour on January 13, and the power consumption from 15:00 to 18:00 was not collected. The three power consumption values between 15:00 and 18:00 are then the consecutive missing values. The system collects the power consumption for the 13th from the historical power consumption statistics sequences (the time series samples) at an interval of one data point per hour, and obtains the power consumption values at 24 time points. These 24 values form the feature data sequence.
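The detection part of step S10 can be sketched as a scan for runs of missing entries. The patent does not specify how missing values are represented or detected, so the use of `None` for a missing value, the `min_length` threshold, and the function name `find_missing_runs` are all assumptions made for this illustration.

```python
from typing import List, Optional, Tuple


def find_missing_runs(values: List[Optional[float]],
                      min_length: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end inclusive, for each run of
    None entries that is at least min_length long."""
    runs = []
    start = None
    for i, value in enumerate(values):
        if value is None:
            if start is None:
                start = i          # a new run of missing values begins
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # A run may extend to the end of the series.
    if start is not None and len(values) - start >= min_length:
        runs.append((start, len(values) - 1))
    return runs


# Hypothetical hourly series for one day with three values missing.
hourly: List[Optional[float]] = [float(h) for h in range(24)]
for h in (15, 16, 17):
    hourly[h] = None
print(find_missing_runs(hourly))
```

An isolated single gap is ignored here because the method targets consecutive missing values; a caller could lower `min_length` to 1 to catch those as well.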
Step S20: perform anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences;
Any feature data sequence may contain abnormal data, for example because of a system failure or a data entry error, and abnormal data in a feature data sequence affects the accuracy of the filled consecutive missing values. Anomaly detection calculation is therefore performed on each feature data sequence to select the normal data sequences. The purpose of the anomaly detection calculation is to detect whether a sequence contains stray abnormal data, for example by principal component analysis, a multivariate Gaussian distribution method, or the isolation forest algorithm, so that the normally distributed feature data sequences can be selected.
Referring to FIG. 2, step S20 includes:
Step S21: determine all feature time points and the corresponding sequence feature values in each feature data sequence, generate a data point set according to the positions in the model space of the data points corresponding to the feature time points and sequence feature values, and count the total number of data points in the data point set;
It can be understood that each feature data sequence contains two kinds of data, feature time points and sequence feature values, which map to each other. The data points of each feature data sequence can therefore be obtained from its feature time points and sequence feature values. Each data point is fed into the isolation forest model, which is configured with a model space that holds all data points. The model space is equivalent to a coordinate space: from the coordinate values of each data point, the system determines the coordinate positions of all data points in each feature data sequence and generates the corresponding data point set in the model space. For example, suppose sequence A contains a power consumption value of 5 at 0:00, 8 at 6:00, 10 at 12:00, and 8 at 18:00. The data points in sequence A are then A1 = (0, 5), A2 = (6, 8), A3 = (12, 10), and A4 = (18, 8). These data points are arranged in the model space according to their coordinates to form the data point set, and the total number of data points in the set is counted. This is only an example and does not mean that a data point set is limited to these four values.
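Step S21 can be sketched with the sequence A example. This is a minimal illustration that assumes a feature data sequence is held as a mapping from hour to power consumption value; the name `to_data_points` is hypothetical.

```python
from typing import Dict, List, Tuple


def to_data_points(feature_sequence: Dict[int, float]) -> List[Tuple[int, float]]:
    """Turn {feature time point: sequence feature value} into (time, value)
    coordinates, ordered by time."""
    return sorted(feature_sequence.items())


# Sequence A from the text: hour of day mapped to power consumption.
sequence_a = {0: 5, 6: 8, 12: 10, 18: 8}
points = to_data_points(sequence_a)
total_points = len(points)
print(points, total_points)
```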
Step S22: perform iterative space cutting on all data points in the data point set according to the preset cutting rule of the isolation forest algorithm, until every data point has been cut into a space of its own as a single data point;
The preset cutting rule of the isolation forest algorithm is to cut the whole data point set iteratively. Space cutting means cutting the data point set in the model space according to the preset rule and counting the number of data points in each resulting space. If the data points in the set are concentrated, a cut is unlikely to leave one data point alone in a space. If some data points are loosely scattered or lie at the edge of the set, those stray data points are easily cut into a space of their own. Through iterative space cutting, the system obtains all the single data points, each cut into its own space. Whenever a data point ends up alone in a space, a single data point is produced and the system records it. The number of single data points equals the number of data points in the data point set.
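The iterative cutting can be sketched as follows. The patent does not spell out the preset cutting rule, so this sketch assumes the usual isolation forest style of cut (a random axis and a random threshold between the minimum and maximum on that axis) and applies one cut per region per round. The function name `isolation_rounds` and the sample points are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (feature time point, sequence feature value)


def isolation_rounds(points: List[Point], seed: int = 0) -> Dict[Point, int]:
    """Map each distinct point to the cutting round after which it sat alone."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    isolated: Dict[Point, int] = {}
    # Start with one region that holds every distinct point.
    pending = [sorted(set(points))] if points else []
    round_no = 0
    while pending:
        next_pending = []
        for region in pending:
            if len(region) == 1:
                isolated[region[0]] = round_no
                continue
            # Cut only along an axis on which the region's points differ.
            axes = [a for a in (0, 1) if len({p[a] for p in region}) > 1]
            axis = rng.choice(axes)
            low = min(p[axis] for p in region)
            high = max(p[axis] for p in region)
            cut = rng.uniform(low, high)
            left = [p for p in region if p[axis] < cut]
            right = [p for p in region if p[axis] >= cut]
            # An empty side means the cut missed; the region is retried.
            next_pending.extend(side for side in (left, right) if side)
        pending = next_pending
        round_no += 1
    return isolated


# Sequence A from the text plus one hypothetical stray point.
sample_points = [(0, 5), (6, 8), (12, 10), (18, 8), (9, 40)]
rounds = isolation_rounds(sample_points)
print(len(rounds) == len(sample_points))
```

Every distinct point eventually receives a round number, which mirrors the statement that the number of single data points equals the number of data points in the set. Stray points tend to receive small round numbers, but with random cuts this is a tendency rather than a guarantee for any single run.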
Step S23: obtain the iteration number at which each single data point was produced, and obtain, from all single data points, the target data points whose iteration numbers fall within the first preset number of iterations;
Step S24: count the number of target data points, calculate the ratio of that number to the total number of data points, and set the ratio as the anomaly score;
The system obtains the iteration number at which each single data point was produced. For example, single data point A is produced by the first space cut, single data points B and C by the second, single data points D, E, F, and G by the third, and so on. Suppose the preset number is 2. The system then takes the target data points A, B, and C, which were produced in the first two iterations, and counts three target data points. If the data point set contains 15 data points in total, the ratio of target data points to the total is 3/15 = 0.2. The system sets this ratio as the anomaly score, which serves as the reference value for the subsequent numerical comparison.
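The ratio in steps S23 and S24 can be reproduced with the numbers from the example. This is a minimal sketch; `anomaly_score` is a hypothetical name, and the eight points labelled `P0` to `P7` are invented only to bring the total to 15.

```python
from typing import Dict


def anomaly_score(isolation_round: Dict[str, int], preset_rounds: int) -> float:
    """Share of data points that became single data points within the
    first preset_rounds cutting iterations."""
    if not isolation_round:
        raise ValueError("the data point set is empty")
    target_count = sum(1 for r in isolation_round.values() if r <= preset_rounds)
    return target_count / len(isolation_round)


# Iteration numbers from the text: A in round 1, B and C in round 2,
# D to G in round 3.
example_rounds = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 3, "F": 3, "G": 3}
# Hypothetical remaining points, isolated later, for a total of 15.
example_rounds.update({f"P{i}": 4 for i in range(8)})

score = anomaly_score(example_rounds, preset_rounds=2)
print(round(score, 3))  # 3 of 15 points
```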
Specifically, in this embodiment, anomaly detection calculation is performed on the power consumption values at the 24 time points, for example with the isolation forest algorithm, to remove abnormal values and filter out invalid stray data, leaving normal data that follows a normal distribution pattern. The sequence feature values in the feature data sequence are cut spatially, and the sequence feature values within each space are cut again, until each sequence feature value has been cut into a data space of its own. This process takes the form of a layered binary tree: all sequence feature values cut into the same side of the data space continue to be cut iteratively and the binary tree continues to grow downward, while a sequence feature value left alone in a data space is not cut further and stays at the height of the current layer of the binary tree. The isolation forest algorithm calculates the anomaly score of the feature data sequence from the heights of all the isolated sequence feature values.
Step S25: if the anomaly score is greater than zero, determine that the feature data sequence corresponding to the anomaly score is a normal data sequence.
The anomaly score reflects the overall degree of deviation of all the power consumption values. An anomaly score greater than zero indicates that the current distribution of the power consumption values is normal, that the sequence feature values are normal values, and that the corresponding feature data sequence (the power consumption values) is a normal data sequence. The isolation forest algorithm captures invalid stray feature values and quantifies them, and the anomaly score it produces is the parameter that reflects the sequence feature values. The system therefore only needs to evaluate the value of the anomaly score.
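The decision in step S25 reduces to a threshold test on each sequence's score. The sketch below applies the rule exactly as stated (a score greater than zero marks a normal data sequence); the function name `select_normal_sequences`, the month labels, and the scores are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def select_normal_sequences(scores: Dict[str, float]) -> List[str]:
    """Keep the names of feature data sequences whose anomaly score is
    greater than zero, following the rule stated in step S25."""
    return [name for name, score in scores.items() if score > 0]


# Hypothetical scores for three monthly feature data sequences.
scores = {"february": 0.2, "march": 0.0, "april": 0.13}
print(select_normal_sequences(scores))
```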
Step S30: obtain the target time points corresponding to the consecutive missing values in the target time series, and obtain the sequence feature values at all target time points in all normal data sequences;
Each consecutive missing value has its own target time point in the target time series, and each normal data sequence has a sequence feature value at that same target time point. These sequence feature values serve as the reference values for calculating the fill values. Once a feature data sequence has been determined to be a normal data sequence, the system can directly retrieve its sequence feature values at the target time points.
For example, if the consecutive missing values in the power consumption statistics sequence fall at the target time points 15:00, 16:00, 17:00, and 18:00 on the 13th, the system obtains the power consumption values at 15:00, 16:00, 17:00, and 18:00 on the 13th from the normal power consumption data sequences of each month.
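Step S30 amounts to a lookup across the normal data sequences. This is a minimal sketch assuming each normal data sequence is a mapping from hour to value; `values_at_targets`, the month labels, and the readings are hypothetical.

```python
from typing import Dict, List


def values_at_targets(normal_sequences: Dict[str, Dict[int, float]],
                      target_hours: List[int]) -> Dict[int, List[float]]:
    """Collect, for each target time point, the feature values found at
    that time point in every normal data sequence."""
    return {
        hour: [seq[hour] for seq in normal_sequences.values() if hour in seq]
        for hour in target_hours
    }


# Hypothetical readings on the 13th of two different months.
normal = {
    "february": {15: 10.0, 16: 12.0, 17: 11.0, 18: 9.0},
    "march": {15: 14.0, 16: 16.0, 17: 13.0, 18: 11.0},
}
by_hour = values_at_targets(normal, [15, 16, 17, 18])
print(by_hour[15])
```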
Step S40: average all sequence feature values at each target time point to obtain the feature mean at each target time point, and use the feature mean as the filling reference value for the consecutive missing value at the corresponding target time point.
The sequence feature values obtained by the system are the feature values of multiple time series samples at the corresponding target time points. Because every one of them can serve as a reference value for the target time series, the system averages the feature values at each target time point across all normal data sequences, and the resulting mean can be used as the fill value for the consecutive missing value. Calculating the feature mean smooths out the fluctuations between different normal data sequences at the same target time point, so that the filling reference value better reflects the distribution at that time point.
For example, the system obtains the power consumption values at 15:00, 16:00, 17:00, and 18:00 on the 13th of different months and calculates the mean at each of these four time points. Let a be the mean at 15:00 on the 13th across the months, b the mean at 16:00, c the mean at 17:00, and d the mean at 18:00. Then a, b, c, and d are used as the fill values for the consecutive missing values from 15:00 to 18:00 in the target time series.
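The averaging and filling in step S40 can be sketched as follows. The function names `fill_reference_values` and `fill_missing`, the use of `None` for a missing value, and the sample numbers are assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List, Optional


def fill_reference_values(values_by_time: Dict[int, List[float]]) -> Dict[int, float]:
    """Average the feature values gathered at each target time point."""
    return {t: mean(vals) for t, vals in values_by_time.items() if vals}


def fill_missing(target: List[Optional[float]],
                 references: Dict[int, float]) -> List[Optional[float]]:
    """Replace each missing entry with the reference value for its index,
    leaving it missing when no reference value exists."""
    return [references.get(i) if v is None else v for i, v in enumerate(target)]


# Hypothetical values from two normal data sequences at two target hours.
references = fill_reference_values({15: [10.0, 14.0], 16: [12.0, 16.0]})

target_series: List[Optional[float]] = [1.0] * 24
target_series[15] = None
target_series[16] = None
filled = fill_missing(target_series, references)
print(filled[15], filled[16])
```

Recorded values in the target series are left untouched; only the missing positions take the per-time-point means.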
In the present invention, when consecutive missing values are detected in a target time series collected at a preset time interval, all sequence feature values are collected from all time series samples at the preset time interval to generate a feature data sequence for each time series sample; an anomaly detection calculation is performed on each feature data sequence to determine the normal data sequences among all feature data sequences; the target time points corresponding to the consecutive missing values in the target time series are obtained, and the sequence feature values at all target time points in all normal data sequences are obtained; the mean of all sequence feature values at each target time point is computed to obtain the feature mean at each target time point, and the feature mean is used as the filling reference value for the consecutive missing value at the corresponding target time point. The present invention extracts sequence feature values from time series samples, identifies normal data sequences through anomaly detection, computes the mean of the feature values at the target time points across multiple normal data sequences, and uses the mean as the filling value for the consecutive missing value at the corresponding time point. This reduces interference from abnormal feature values, ensures the data reliability of the filling reference value, and improves the efficiency of filling consecutive missing values. It solves the technical problem that traditional single-point missing value filling methods, when applied to consecutive missing values, tend to introduce abnormal data values, so that the calculated filling values deviate substantially and the authenticity of the data is reduced. At the same time, it preserves the distribution characteristics of the time series itself and reduces computational complexity.
Further, based on the first embodiment of the continuous missing value filling method of the present invention, a second embodiment of the continuous missing value filling method is proposed. It differs from the foregoing embodiment in that, after step S20, the method further includes:
Step S50: count the number of current normal data sequences.
In practice, there may be many feature data sequences but very few normal data sequences left after screening. In this embodiment, if the number of normal data sequences falls below a certain value, the accuracy of the final filling reference value is affected. Only when the sample of normal data sequences is large enough can the normal data sequences provide a reliable reference for the filling reference value. For example, in a power consumption statistics series, consumption in summer and winter may be higher than in spring and autumn, so the final filling reference value is accurate only if the number of normal data sequences is kept within a reasonable range. The system therefore counts the number of current normal data sequences.
Step S60: if the number of sequences is less than a first preset value, import new time series samples from a preset sample database and obtain new normal data sequences from the new time series samples, until the number of normal data sequences is not less than the first preset value.
The system can set the first preset value according to actual business requirements and adjust it dynamically as those requirements change. For example, the system can specify that when there are N consecutive missing values, the number of normal data sequences must not be less than 2N; that is, the required number of sequences is adjusted according to what the system specifies. The first preset value is the minimum threshold for the number of sequences. If the number of sequences is less than the first preset value, there are currently too few sequences, which would affect the accuracy of the final filling reference value. The system then imports new time series samples from the preset sample database and obtains new normal data sequences by performing the steps of the first embodiment on those samples.
In this embodiment, given the system's strict accuracy requirements, the system executes steps S50 and S60 in a loop: it keeps obtaining new normal data sequences, counts all current normal data sequences, and compares the count with the first preset value, until the number of sequences is not less than the first preset value. These steps ensure that the normal data sequences provide enough data samples, which improves the data reliability of the final filling reference value.
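The S50/S60 loop can be sketched as below. The sketch rests on assumptions: `ensure_minimum_sequences` is a hypothetical name, the sample database is modeled as a plain list, and `is_normal` stands in for whatever anomaly detection the first embodiment applies.

```python
def ensure_minimum_sequences(normal_sequences, sample_pool, is_normal, minimum):
    """Import samples until at least `minimum` normal sequences are available."""
    sequences = list(normal_sequences)
    pool = iter(sample_pool)
    while len(sequences) < minimum:      # S50: count the current normal sequences
        candidate = next(pool, None)     # S60: import one new sample
        if candidate is None:
            raise ValueError("sample database exhausted before reaching the minimum")
        if is_normal(candidate):         # keep only samples that pass screening
            sequences.append(candidate)
    return sequences


# Hypothetical data: the stand-in screening rule rejects any value of 50 or more.
existing = [[1.0, 2.0], [2.0, 3.0]]
database = [[100.0, 200.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
result = ensure_minimum_sequences(
    existing, database, lambda seq: max(seq) < 50.0, minimum=4
)
print(result)
```

With the 2N rule mentioned above, `minimum` would be set to twice the number of consecutive missing values.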
Further, based on the second embodiment of the continuous missing value filling method of the present invention, a third embodiment of the continuous missing value filling method is proposed. It differs from the foregoing embodiment in that, after step S40, the method further includes:
Step S70: mark the filling reference value corresponding to each consecutive missing value, and apply a mapping mark to the sequence feature values in each normal data sequence that each filling reference value refers to.
Indicator data generally carries statistical meaning. The filling reference values obtained in the present invention are in essence inferred from other historical data and do not represent real data. To prevent users from citing them as real data, this embodiment marks the values in the target time series that were filled with filling reference values, and applies a mapping mark to the sequence feature values in each normal data sequence that each filling reference value refers to.
Suppose there is a power consumption statistics series and the user wants to analyze its values to identify a trend. Because the filling reference values in it are not real data, the system retrieves the feature mean behind each filling reference value, looks up the sequence feature values used to compute that feature mean, and traces them back to their respective feature data sequences, using the target time point of the filling reference value to map each sequence feature value to the corresponding time point. Finally, the sequence feature values referenced at the target time points in each feature data sequence are marked as reference values.
The effect of this embodiment is therefore that each consecutive missing value is labeled with all the sequence feature values used for it and the feature data sequences they come from, so the user can easily trace the data back to its source before performing further analysis and calculation.
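One way to picture the marking in step S70 is a record that carries both a "filled" flag and the source values behind the mean. The `FilledValue` class, its field names, and the month labels below are assumptions for illustration; the patent does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class FilledValue:
    time_point: str
    value: float
    is_filled: bool = True                        # marks the value as not real data
    sources: dict = field(default_factory=dict)   # sequence name -> contributing value


def fill_with_provenance(normal_sequences, time_point):
    """Compute the fill value and record which sequence supplied which value."""
    sources = {
        name: seq[time_point]
        for name, seq in normal_sequences.items()
        if time_point in seq
    }
    return FilledValue(time_point, mean(sources.values()), sources=sources)


# Hypothetical normal sequences keyed by month.
normal = {"2018-01": {"13T15": 10.0}, "2018-02": {"13T15": 14.0}}
filled = fill_with_provenance(normal, "13T15")
print(filled)
```

A user who later analyzes the series can check `is_filled` to exclude inferred values and follow `sources` back to the originating sequences.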
Further, based on the third embodiment of the continuous missing value filling method of the present invention, a fourth embodiment of the continuous missing value filling method is proposed. It differs from the foregoing embodiment in that the step of obtaining the sequence feature values at all target time points in all normal data sequences further includes:
If the sequence feature value at a target time point in any normal data sequence is detected to be a missing value, delete that normal data sequence.
Although an obtained normal data sequence is guaranteed to contain normal feature values, if its sequence feature value at a target time point is itself missing, the sequence contributes no data to the calculation of the final filling reference value, adds computational complexity, and cannot serve as a valid data source for filling the consecutive missing values. The system therefore deletes such a normal data sequence as an invalid data sequence, which both reduces computational complexity and avoids introducing invalid data that would lower the reliability of the filling reference value.
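The deletion rule of this embodiment can be sketched as a simple filter. The helper names and the convention that a missing value is `None`, `NaN`, or an absent key are assumptions made for the example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption about how gaps are encoded)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_sequences_missing_targets(normal_sequences, target_points):
    """Keep only sequences that have a value at every target time point."""
    return [
        seq
        for seq in normal_sequences
        if not any(is_missing(seq.get(point)) for point in target_points)
    ]


# Hypothetical sequences keyed by hour.
sequences = [
    {"15": 10.0, "16": 11.0},
    {"15": None, "16": 12.0},          # missing at 15:00
    {"15": 9.0},                       # no reading at 16:00
    {"15": 8.0, "16": float("nan")},   # NaN at 16:00
]
kept = drop_sequences_missing_targets(sequences, ["15", "16"])
print(kept)
```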
Further, based on the fourth embodiment of the continuous missing value filling method of the present invention, a fifth embodiment of the continuous missing value filling method is proposed. It differs from the foregoing embodiment in that, after the step of deleting a normal data sequence whose sequence feature value at a target time point is detected to be a missing value, the method further includes:
Step S80: if the number of sequence feature values at any target time point across all normal data sequences is detected to be less than a second preset value, import new time series samples from the preset sample database.
In this embodiment, deleting a normal data sequence whose sequence feature value is missing reduces the number of current normal data sequences by one. If the number of sequences is still not less than the first preset value, the remaining normal data sequences remain usable. Correspondingly, however, the number of sequence feature values at the target time point also decreases by one because one normal data sequence was deleted. In other words, the number of normal data sequences may meet the requirement while a given normal data sequence is nevertheless invalid data. For example, in a power consumption statistics series, the monthly consumption corresponding to power consumption data sequence A may be normal, yet most of its values may be for new-energy electricity (for example, electricity generated from wind power) rather than conventional electricity (for example, electricity generated from thermal power). Although the total consumption is unchanged, the present invention is meant to count thermal power consumption, so this normal data sequence cannot be included.
To keep the sequence feature values useful as reference data, the system usually requires that the number of sequence feature values reach a reasonable value, so that the samples are covered broadly and the mean is computed more accurately. The system therefore sets a second preset value, which serves as the reference threshold for the number of values. The system counts the number of sequence feature values at each target time point across all normal data sequences. If the count is less than the second preset value, the current sample of sequence feature values is insufficient and may affect the accuracy of the filling reference value, so more sequence feature values from normal data sequences are needed. The system then imports new time series samples from the preset sample database.
Step S90: perform the step of obtaining new normal data sequences from the new time series samples, and obtain the sequence feature values at all target time points from the new normal data sequences, until the number of sequence feature values at any target time point across all normal data sequences is not less than the second preset value.
After obtaining the new time series samples, the system performs the steps of the first embodiment for obtaining normal data sequences, obtains the sequence feature values at the target time points from the new normal data sequences derived from the new time series samples, and then re-executes steps S80 and S90 until the number of sequence feature values at every target time point across all normal data sequences is not less than the second preset value.
An example follows. There are currently five normal data sequences in total, so the number of sequence feature values at each target time point is also five. If the second preset value set by the system is 6, the count is less than the second preset value, new time series samples are needed, and the system imports them from the preset sample database. Based on the second preset value and the current count, the system imports one new time series sample, performs the anomaly detection calculation on it, obtains the sequence feature values once a normal data sequence has been obtained, recounts the sequence feature values across all normal data sequences, and finally compares the count with the threshold. If the final count is greater than or equal to the second preset value, execution of this embodiment ends.
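The worked example above (five values, threshold of six, one sequence imported) can be sketched as follows. The function name `top_up_until_covered` is hypothetical, and the sketch assumes the imported sequences have already passed anomaly detection.

```python
def top_up_until_covered(normal_sequences, target_points, new_sequences, minimum_count):
    """Add sequences until every target time point has at least `minimum_count` values."""
    sequences = list(normal_sequences)
    pool = iter(new_sequences)

    def counts():
        # S80: count the available (non-missing) values per target time point.
        return {
            point: sum(1 for seq in sequences if seq.get(point) is not None)
            for point in target_points
        }

    while min(counts().values()) < minimum_count:
        candidate = next(pool, None)     # S90: take the next new normal sequence
        if candidate is None:
            raise ValueError("sample database exhausted before reaching the threshold")
        sequences.append(candidate)
    return sequences, counts()


# Five existing sequences, a threshold of six, and two candidates on offer.
five = [{15: float(v)} for v in range(5)]
extra = [{15: 9.0}, {15: 7.0}]
result, per_point = top_up_until_covered(five, [15], extra, minimum_count=6)
print(len(result), per_point)
```

Only the first candidate is consumed, which matches the example in which a single new sample is enough to reach the second preset value.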
Further, based on the first embodiment of the continuous missing value filling method of the present invention, a sixth embodiment of the continuous missing value filling method is proposed. It differs from the foregoing embodiment in that, after step S40, the method further includes:
Step a: convert all normal data sequences into corresponding normal sequence distribution curves, and convert the target time series containing the filling reference values into a target sequence distribution curve.
Step b: display the normal sequence distribution curves and the target sequence distribution curve in a preset coordinate system for analysis by the user.
In this embodiment, so that the user can see and analyze the differences in sequence feature values between the normal data sequences and the target time series at a glance, after the feature means have been used as the filling reference values for the consecutive missing values, the system converts the normal data sequences and the target time series containing the filling reference values into normal sequence distribution curves and a target sequence distribution curve, respectively. The user can view the normal distribution of the normal data sequences and the actual distribution of the target time series in the preset coordinate system. Visualizing the data as curves lets the user directly observe and analyze whether the filling reference values deviate from the normal distribution, and carry out further analysis based on what is observed.
Referring to FIG. 3, the present invention provides a data analysis device, and the data analysis device includes:
a collection module 10, configured to, when consecutive missing values are detected in a target time series collected at a preset time interval, collect all sequence feature values from all time series samples at the preset time interval to generate a feature data sequence for each time series sample; a detection module 20, configured to perform an anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences; an acquisition module 30, configured to obtain the target time points corresponding to the consecutive missing values in the target time series and to obtain the sequence feature values at all target time points in all normal data sequences; and a filling module 40, configured to compute the mean of all sequence feature values to obtain the feature mean at each target time point and to use the feature mean as the filling reference value for the consecutive missing value at the corresponding target time point.
Further, the detection module includes:
a determining unit, configured to determine all feature time points and the corresponding sequence feature values in each feature data sequence; a generating unit, configured to generate a data point set according to the positions of the data points in model space that correspond to the feature time points and sequence feature values; a statistics unit, configured to count the total number of data points in the data point set; a cutting unit, configured to perform iterative space cutting on all data points in the data point set according to a preset cutting rule of the isolation forest algorithm, until every data point has been cut off on its own into a single space as a single data point; an obtaining unit, configured to obtain the iteration number at which each single data point was produced, and to obtain, among all the single data points, the target data points whose iteration number falls within the first preset number of iterations; the statistics unit being further configured to count the number of target data points; and a calculation unit, configured to calculate the ratio of the number of target data points to the total number of data points and to set that ratio as the anomaly score; the determining unit being further configured to, if the anomaly score is greater than zero, determine that the feature data sequence corresponding to the anomaly score is a normal data sequence.
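The scoring performed by the detection module can be sketched in a heavily simplified form. The sketch below is an assumption-laden illustration, not the patented procedure and not a full isolation forest: it uses a single random partitioning rather than an ensemble of trees, it interprets the "preset cutting rule" as a uniformly random cut along either the time axis or the value axis, and the names `isolation_iterations`, `anomaly_score`, and `early_iterations` are hypothetical. It only computes the ratio described above and leaves the final normal or abnormal decision to the rule stated in the text.

```python
import random


def isolation_iterations(points, rng):
    """Map each point index to the iteration in which it was cut off on its own."""
    if len(points) == 1:
        return {0: 0}
    iterations = {}
    regions = [list(range(len(points)))]
    iteration = 0
    while regions:
        iteration += 1
        remaining = []
        for region in regions:
            # Only cut along an axis on which the region still has some spread.
            axes = [a for a in (0, 1) if len({points[i][a] for i in region}) > 1]
            if not axes:
                # Duplicated points cannot be separated; close them together.
                for i in region:
                    iterations[i] = iteration
                continue
            axis = rng.choice(axes)
            low = min(points[i][axis] for i in region)
            high = max(points[i][axis] for i in region)
            cut = rng.uniform(low, high)
            left = [i for i in region if points[i][axis] < cut]
            right = [i for i in region if points[i][axis] >= cut]
            for side in (left, right):
                if len(side) == 1:
                    iterations[side[0]] = iteration   # the point is now isolated
                elif len(side) > 1:
                    remaining.append(side)            # keep cutting this region
        regions = remaining
    return iterations


def anomaly_score(points, early_iterations, seed=0):
    """Share of points isolated within the first `early_iterations` iterations."""
    if not points:
        raise ValueError("at least one data point is required")
    iterations = isolation_iterations(points, random.Random(seed))
    early = [i for i, n in iterations.items() if n <= early_iterations]
    return len(early) / len(points), iterations


# Hypothetical (hour, consumption) points with one obvious spike at the end.
pts = [(float(h), 10.0 + 0.1 * h) for h in range(24)] + [(24.0, 500.0)]
score, its = anomaly_score(pts, early_iterations=3)
print(score)
```

Points that differ sharply from the rest tend to be isolated in fewer cuts, which is why the share of early-isolated points can serve as a score. A production system would more likely average over many random trees, as isolation forest implementations generally do.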
Further, the data analysis device also includes: a statistics module, configured to count the number of current normal data sequences; and a first importing module, configured to import new time series samples from the preset sample database if the number of sequences is less than the first preset value; the acquisition module 30 being further configured to obtain new normal data sequences from the new time series samples until the number of normal data sequences is not less than the first preset value.
Further, the data analysis device also includes: a marking module, configured to mark the filling reference value corresponding to each consecutive missing value, and to apply a mapping mark to the sequence feature values in each normal data sequence that each filling reference value refers to.
Further, the acquisition module 30 is also configured to delete a normal data sequence if its sequence feature value at a target time point is detected to be a missing value.
Further, the data analysis device also includes: a second importing module, configured to import new time series samples from the preset sample database if the number of sequence feature values at any target time point across all normal data sequences is detected to be less than the second preset value; and an execution module, configured to perform the step of obtaining new normal data sequences from the new time series samples; the acquisition module 30 being further configured to obtain the sequence feature values at all target time points from the new normal data sequences, until the number of sequence feature values at any target time point across all normal data sequences is not less than the second preset value.
Further, the data analysis device also includes: a conversion module, configured to convert all normal data sequences into corresponding normal sequence distribution curves and to convert the target time series containing the filling reference values into a target sequence distribution curve; and a display module, configured to display the normal sequence distribution curves and the target sequence distribution curve in a preset coordinate system for analysis by the user.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a device in the hardware operating environment involved in the method of the embodiments of the present invention.
The terminal in the embodiments of the present invention may be a PC, or a terminal device such as a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a portable computer.
As shown in FIG. 4, the data analysis terminal may include a processor 1001 (for example, a CPU), a memory 1005, and a communication bus 1002. The communication bus 1002 implements connection and communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM memory or a stable memory (non-volatile memory), such as disk storage. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.
Optionally, the data analysis terminal may further include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi module, and so on. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface).
Those skilled in the art will understand that the structure shown in FIG. 4 does not limit the data analysis terminal, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in FIG. 4, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, and computer-readable instructions. The operating system is a program that manages and controls the hardware and software resources of the data analysis terminal and supports the running of the computer-readable instructions and other software and/or programs. The network communication module implements communication between the components inside the memory 1005 and communication with other hardware and software in the data analysis terminal.
In the data analysis terminal shown in FIG. 4, the processor 1001 is configured to execute the computer-readable instructions stored in the memory 1005 to implement the steps of the continuous missing value filling method described above.
The specific implementation of the data analysis terminal of the present invention is basically the same as the embodiments of the continuous missing value filling method described above and is not repeated here.
The present invention also provides a computer-readable storage medium, which may be a non-volatile readable storage medium. The computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of the continuous missing value filling method described above.
The specific implementation of the computer-readable storage medium of the present invention is basically the same as the embodiments of the continuous missing value filling method described above and is not repeated here.
The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (20)

  1. A continuous missing value filling method, characterized in that the continuous missing value filling method comprises:
    when consecutive missing values are detected in a target time series collected at a preset time interval, collecting all sequence feature values from all time series samples at the preset time interval to generate a feature data sequence for each time series sample;
    performing an anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences;
    obtaining the target time points corresponding to the consecutive missing values in the target time series, and obtaining the sequence feature values at all target time points in all normal data sequences;
    computing the mean of all sequence feature values at each target time point to obtain the feature mean at each target time point, and using the feature mean as the filling reference value for the consecutive missing value at the corresponding target time point.
  2. The consecutive missing value filling method according to claim 1, wherein the step of performing an isolation-forest-based anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences comprises:
    determining all feature time points and the corresponding sequence feature values in each feature data sequence, generating a data point set according to the positions in model space of the data points corresponding to the feature time points and sequence feature values, and counting the total number of data points in the data point set;
    performing iterative space cutting on all data points in the data point set according to a preset cutting rule of the isolation forest algorithm, until every data point has been cut into a space of its own as a single data point;
    acquiring the iteration number at which each single data point was produced, and acquiring, from among all the single data points, the target data points whose iteration number falls within the first preset number of iterations;
    counting the number of target data points, calculating the ratio of that number to the total number of data points, and setting the ratio as the anomaly score;
    if the anomaly score is greater than zero, determining that the feature data sequence corresponding to the anomaly score is a normal data sequence.
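The scoring steps of claim 2 can be sketched roughly as below. Several choices here are assumptions rather than anything the claim specifies: the "preset cutting rule" is taken to be a random cut placed in a randomly chosen gap between adjacent distinct coordinate values, data points are `(time index, value)` pairs, the sequence contains no missing values, and the names `isolation_iterations`, `anomaly_score`, and `first_k` are hypothetical. The sketch stops at the score and leaves the threshold decision to the claim text.

```python
import random


def isolation_iterations(points, rng):
    """Map each point to the iteration at which it ends up alone in a region.

    Assumption: points are distinct (time, value) tuples, so every region
    with two or more points can always be split on some dimension.
    """
    if len(points) < 2:
        return {p: 0 for p in points}
    isolated = {}
    regions = [list(points)]
    iteration = 0
    while regions:
        iteration += 1
        next_regions = []
        for region in regions:
            # Only cut along a dimension on which the region's points differ.
            dims = [d for d in (0, 1) if len({p[d] for p in region}) > 1]
            d = rng.choice(dims)
            values = sorted({p[d] for p in region})
            i = rng.randrange(len(values) - 1)  # cut falls after values[i]
            lower = [p for p in region if p[d] <= values[i]]
            upper = [p for p in region if p[d] > values[i]]
            for part in (lower, upper):
                if len(part) == 1:
                    isolated[part[0]] = iteration
                else:
                    next_regions.append(part)
        regions = next_regions
    return isolated


def anomaly_score(sequence, first_k, seed=0):
    """Share of points that were isolated within the first `first_k` iterations."""
    points = [(t, v) for t, v in enumerate(sequence)]
    if not points:
        return 0.0
    iterations = isolation_iterations(points, random.Random(seed))
    early = sum(1 for it in iterations.values() if it <= first_k)
    return early / len(points)


sequence = [1.0, 1.1, 0.9, 9.0, 1.0]
score = anomaly_score(sequence, first_k=2)
```

Because every cut splits a region into two non-empty parts, a sequence of n points is fully isolated after at most n - 1 iterations, so the loop always terminates.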
  3. The consecutive missing value filling method according to claim 1, wherein, after the step of performing an anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences, the method further comprises:
    counting the current number of normal data sequences;
    if the number of sequences is less than a first preset value, importing new time series samples from a preset sample database and obtaining new normal data sequences from the new time series samples, until the number of normal data sequences is not less than the first preset value.
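The top-up loop of claim 3 might look like the sketch below. The preset sample database is modelled as an iterable of batches and the anomaly check as a caller-supplied predicate; both, along with the name `top_up_normal_sequences`, are hypothetical stand-ins. Stopping when the database is exhausted is an added safeguard that the claim does not mention.

```python
def top_up_normal_sequences(normal, sample_batches, is_normal, first_preset):
    """Import further sample batches until at least `first_preset` normal
    sequences are available, or until no batches remain."""
    normal = list(normal)
    batches = iter(sample_batches)
    while len(normal) < first_preset:
        batch = next(batches, None)
        if batch is None:
            break  # assumption: stop rather than loop forever on an empty database
        # Keep only the newly imported sequences that pass the anomaly check.
        normal.extend(seq for seq in batch if is_normal(seq))
    return normal


existing = [[1, 2]]
database = [[[1, 2], [100, 2]], [[3, 4]]]
result = top_up_normal_sequences(
    existing, database, is_normal=lambda seq: max(seq) < 10, first_preset=3
)
```

In this example the first batch contributes one normal sequence and the second batch contributes another, which brings the count up to the first preset value of 3.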
  4. The consecutive missing value filling method according to claim 1, wherein, after the step of calculating the mean of all sequence feature values at each target time point to obtain a feature mean for each target time point and using the feature mean as the fill reference value for the consecutive missing value at the corresponding target time point, the method further comprises:
    marking the fill reference value corresponding to each consecutive missing value, and mapping and marking the sequence feature values in the normal data sequences from which each fill reference value was derived.
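One way to read the marking step of claim 4 is as a provenance record kept alongside the filled series. The sketch below assumes list-based sequences with `None` for missing values and normal sequences that have no gaps at the target time points; the name `fill_with_provenance` and the dictionary layout are hypothetical.

```python
def fill_with_provenance(target, normal_sequences):
    """Fill missing values with per-position means and record, for each filled
    position, which normal sequence contributed which value."""
    filled = list(target)
    provenance = {}  # time index -> {normal sequence index: contributing value}
    for t, value in enumerate(target):
        if value is None:
            sources = {i: seq[t] for i, seq in enumerate(normal_sequences)}
            filled[t] = sum(sources.values()) / len(sources)
            provenance[t] = sources  # marks t as filled and maps it to its sources
    return filled, provenance


target = [5.0, None]
normal = [[5.0, 2.0], [5.0, 4.0]]
filled, provenance = fill_with_provenance(target, normal)
```

The keys of `provenance` mark which positions hold fill reference values rather than observations, and each entry points back to the values that were averaged.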
  5. The consecutive missing value filling method according to claim 1, wherein the step of acquiring the sequence feature values at all target time points in all normal data sequences further comprises:
    if the sequence feature value at a target time point in any normal data sequence is detected to be a missing value, deleting that normal data sequence.
  6. The consecutive missing value filling method according to claim 5, wherein, after the step of deleting a normal data sequence if its sequence feature value at a target time point is detected to be a missing value, the method further comprises:
    if the number of sequence feature values at any target time point across all normal data sequences is detected to be less than a second preset value, importing new time series samples from the preset sample database;
    performing the step of obtaining new normal data sequences on the new time series samples, and acquiring the sequence feature values at all target time points from the new normal data sequences, until the number of sequence feature values at every target time point across all normal data sequences is not less than the second preset value.
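The checks in claims 5 and 6 can be sketched as two small helpers. As before, this assumes list-based sequences with `None` for missing values; the names `usable_sequences` and `needs_more_samples` are hypothetical, and the import of new samples itself is left out.

```python
def usable_sequences(normal_sequences, target_points):
    """Drop every normal sequence that is itself missing a value at any target time point."""
    return [
        seq for seq in normal_sequences
        if all(seq[t] is not None for t in target_points)
    ]


def needs_more_samples(sequences, target_points, second_preset):
    """True if some target time point has fewer than `second_preset` available values."""
    return any(
        sum(1 for seq in sequences if seq[t] is not None) < second_preset
        for t in target_points
    )


normal = [[1, None, 3], [1, 2, 3], [1, 5, None]]
target_points = [1, 2]
kept = usable_sequences(normal, target_points)
more_needed = needs_more_samples(kept, target_points, second_preset=2)
```

Only the second sequence survives the filter here, so a second preset value of 2 would trigger the import of further samples.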
  7. The consecutive missing value filling method according to claim 1, wherein, after the step of calculating the mean of all sequence feature values at each target time point to obtain a feature mean for each target time point and using the feature mean as the fill reference value for the consecutive missing value at the corresponding target time point, the method further comprises:
    converting all normal data sequences into corresponding normal sequence distribution curves, and converting the target time series, with the fill reference values applied, into a target sequence distribution curve;
    displaying the normal sequence distribution curves and the target sequence distribution curve in a preset coordinate system for analysis by a user.
  8. A data analysis device, wherein the data analysis device comprises:
    a collection module, configured to, when consecutive missing values are detected in a target time series collected at a preset time interval, collect all sequence feature values from all time series samples at the preset time interval, so as to generate a feature data sequence for each time series sample;
    a detection module, configured to perform an anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences;
    an acquisition module, configured to acquire the target time points corresponding to the consecutive missing values in the target time series, and to acquire the sequence feature values at all target time points in all normal data sequences;
    a filling module, configured to calculate the mean of all sequence feature values to obtain a feature mean for each target time point, and to use the feature mean as the fill reference value for the consecutive missing value at the corresponding target time point.
  9. The data analysis device according to claim 8, wherein the detection module comprises:
    a determining unit, configured to determine all feature time points and the corresponding sequence feature values in each feature data sequence;
    a generating unit, configured to generate a data point set according to the positions in model space of the data points corresponding to the feature time points and sequence feature values;
    a statistics unit, configured to count the total number of data points in the data point set;
    a cutting unit, configured to perform iterative space cutting on all data points in the data point set according to a preset cutting rule of the isolation forest algorithm, until every data point has been cut into a space of its own as a single data point;
    an acquisition unit, configured to acquire the iteration number at which each single data point was produced, and to acquire, from among all the single data points, the target data points whose iteration number falls within the first preset number of iterations;
    the statistics unit being further configured to count the number of target data points;
    a calculation unit, configured to calculate the ratio of the number of target data points to the total number of data points, and to set the ratio as the anomaly score;
    the determining unit being further configured to, if the anomaly score is greater than zero, determine that the feature data sequence corresponding to the anomaly score is a normal data sequence.
  10. The data analysis device according to claim 8, wherein the data analysis device further comprises:
    a statistics module, configured to count the current number of normal data sequences;
    a first import module, configured to import new time series samples from a preset sample database if the number of sequences is less than a first preset value;
    the acquisition module being further configured to obtain new normal data sequences from the new time series samples, until the number of normal data sequences is not less than the first preset value.
  11. The data analysis device according to claim 8, wherein the data analysis device further comprises:
    a marking module, configured to mark the fill reference value corresponding to each consecutive missing value, and to map and mark the sequence feature values in the normal data sequences from which each fill reference value was derived.
  12. The data analysis device according to claim 8, wherein the acquisition module is further configured to delete a normal data sequence if the sequence feature value at a target time point in that normal data sequence is detected to be a missing value.
  13. The data analysis device according to claim 12, wherein the data analysis device further comprises:
    a second import module, configured to import new time series samples from the preset sample database if the number of sequence feature values at any target time point across all normal data sequences is detected to be less than a second preset value;
    an execution module, configured to perform the step of obtaining new normal data sequences on the new time series samples;
    the acquisition module being further configured to acquire the sequence feature values at all target time points from the new normal data sequences, until the number of sequence feature values at every target time point across all normal data sequences is not less than the second preset value.
  14. The data analysis device according to claim 8, wherein the data analysis device further comprises:
    a conversion module, configured to convert all normal data sequences into corresponding normal sequence distribution curves, and to convert the target time series, with the fill reference values applied, into a target sequence distribution curve;
    a display module, configured to display the normal sequence distribution curves and the target sequence distribution curve in a preset coordinate system for analysis by a user.
  15. A data analysis terminal, wherein the data analysis terminal comprises a memory, a processor, a communication bus, and computer-readable instructions stored in the memory, the processor being configured to execute the computer-readable instructions to implement the following steps:
    when consecutive missing values are detected in a target time series collected at a preset time interval, collecting all sequence feature values from all time series samples at the preset time interval, so as to generate a feature data sequence for each time series sample;
    performing an anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences;
    acquiring the target time points corresponding to the consecutive missing values in the target time series, and acquiring the sequence feature values at all target time points in all normal data sequences;
    calculating the mean of all sequence feature values at each target time point to obtain a feature mean for each target time point, and using the feature mean as the fill reference value for the consecutive missing value at the corresponding target time point.
  16. The data analysis terminal according to claim 15, wherein the step of performing an isolation-forest-based anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences comprises:
    determining all feature time points and the corresponding sequence feature values in each feature data sequence, generating a data point set according to the positions in model space of the data points corresponding to the feature time points and sequence feature values, and counting the total number of data points in the data point set;
    performing iterative space cutting on all data points in the data point set according to a preset cutting rule of the isolation forest algorithm, until every data point has been cut into a space of its own as a single data point;
    acquiring the iteration number at which each single data point was produced, and acquiring, from among all the single data points, the target data points whose iteration number falls within the first preset number of iterations;
    counting the number of target data points, calculating the ratio of that number to the total number of data points, and setting the ratio as the anomaly score;
    if the anomaly score is greater than zero, determining that the feature data sequence corresponding to the anomaly score is a normal data sequence.
  17. The data analysis terminal according to claim 15, wherein, after the step of performing an anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences, the steps further comprise:
    counting the current number of normal data sequences;
    if the number of sequences is less than a first preset value, importing new time series samples from a preset sample database and obtaining new normal data sequences from the new time series samples, until the number of normal data sequences is not less than the first preset value.
  18. The data analysis terminal according to claim 15, wherein, after the step of calculating the mean of all sequence feature values at each target time point to obtain a feature mean for each target time point and using the feature mean as the fill reference value for the consecutive missing value at the corresponding target time point, the steps further comprise:
    marking the fill reference value corresponding to each consecutive missing value, and mapping and marking the sequence feature values in the normal data sequences from which each fill reference value was derived.
  19. The data analysis terminal according to claim 15, wherein the step of acquiring the sequence feature values at all target time points in all normal data sequences further comprises:
    if the sequence feature value at a target time point in any normal data sequence is detected to be a missing value, deleting that normal data sequence.
  20. A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and the computer-readable instructions, when executed by a processor, implement the following steps:
    when consecutive missing values are detected in a target time series collected at a preset time interval, collecting all sequence feature values from all time series samples at the preset time interval, so as to generate a feature data sequence for each time series sample;
    performing an anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences;
    acquiring the target time points corresponding to the consecutive missing values in the target time series, and acquiring the sequence feature values at all target time points in all normal data sequences;
    calculating the mean of all sequence feature values at each target time point to obtain a feature mean for each target time point, and using the feature mean as the fill reference value for the consecutive missing value at the corresponding target time point.
PCT/CN2018/103333 2018-07-09 2018-08-30 Method for acquiring consecutive missing values, data analysis device, terminal, and storage medium WO2020010677A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810748247.XA CN109947812B (en) 2018-07-09 2018-07-09 Continuous missing value filling method, data analysis device, terminal and storage medium
CN201810748247.X 2018-07-09

Publications (1)

Publication Number Publication Date
WO2020010677A1 true WO2020010677A1 (en) 2020-01-16

Family

ID=67005851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/103333 WO2020010677A1 (en) 2018-07-09 2018-08-30 Method for acquiring consecutive missing values, data analysis device, terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN109947812B (en)
WO (1) WO2020010677A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885101A (en) * 2021-03-30 2021-06-01 浙江大华技术股份有限公司 Method and device for determining abnormal equipment, storage medium and electronic device
CN114997313A (en) * 2022-06-07 2022-09-02 厦门大学 Anomaly detection method for ocean online monitoring data

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046027B (en) * 2019-11-25 2023-07-25 北京百度网讯科技有限公司 Missing value filling method and device for time series data
CN111177221B (en) * 2019-12-26 2021-05-04 苏州亿歌网络科技有限公司 Statistical data acquisition method, device and equipment
WO2021164028A1 (en) * 2020-02-21 2021-08-26 Siemens Aktiengesellschaft Method and apparatus for filling missing industrial longitudinal data
CN111612032A (en) * 2020-04-08 2020-09-01 深圳市水务科技有限公司 Data processing method and system
CN111797143B (en) * 2020-07-07 2023-12-15 长沙理工大学 Aquaculture electricity larceny detection method based on electricity consumption statistical distribution skewness coefficient
CN113077357B (en) * 2021-03-29 2023-11-28 国网湖南省电力有限公司 Power time sequence data anomaly detection method and filling method thereof
CN113377753A (en) * 2021-06-09 2021-09-10 国网吉林省电力有限公司 Heat accumulating type electric boiler load data cleaning system
CN113568343B (en) * 2021-07-20 2023-04-07 苏州伟创电气科技股份有限公司 Method, device, equipment and storage medium for capturing arbitrary data
CN114168586A (en) * 2022-02-10 2022-03-11 北京宝兰德软件股份有限公司 Abnormal point detection method and device
CN116627953B (en) * 2023-05-24 2023-10-27 首都师范大学 Method for repairing loss of groundwater level monitoring data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080033630A1 (en) * 2006-07-26 2008-02-07 Eun-Mi Lee System and method of predicting traffic speed based on speed of neighboring link
CN107577649A (en) * 2017-09-26 2018-01-12 广州供电局有限公司 The interpolation processing method and device of missing data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529704A (en) * 2016-10-31 2017-03-22 国家电网公司 Monthly maximum power load forecasting method and apparatus
CN107491832A (en) * 2017-07-12 2017-12-19 国网上海市电力公司 Energy quality steady-state index prediction method based on chaology
CN108090558B (en) * 2018-01-03 2021-06-08 华南理工大学 Automatic filling method for missing value of time sequence based on long-term and short-term memory network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, KUN.: "Models and Algorithms for Travel Time Reliability Assessment of Urban Road Networks Based-on Moving Source Data", CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE, ENGINEERING SCIENCE AND TECHNOLOGY II, 15 August 2008 (2008-08-15) *
ZHANG, RONGCHANG: "Analysis of Abnormal Electro-data Based on Data Mining", CHINESE MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY, 15 January 2018 (2018-01-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885101A (en) * 2021-03-30 2021-06-01 浙江大华技术股份有限公司 Method and device for determining abnormal equipment, storage medium and electronic device
CN112885101B (en) * 2021-03-30 2022-06-14 浙江大华技术股份有限公司 Method and device for determining abnormal equipment, storage medium and electronic device
CN114997313A (en) * 2022-06-07 2022-09-02 厦门大学 Anomaly detection method for ocean online monitoring data
CN114997313B (en) * 2022-06-07 2024-05-07 厦门大学 Abnormality detection method for ocean on-line monitoring data

Also Published As

Publication number Publication date
CN109947812B (en) 2023-11-10
CN109947812A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
WO2020010677A1 (en) Method for acquiring consecutive missing values, data analysis device, terminal, and storage medium
WO2020019403A1 (en) Electricity consumption abnormality detection method, apparatus and device, and readable storage medium
WO2020015061A1 (en) Monitoring alarm method, device and system for weblogic server, and computer storage medium
CN101534021A (en) Multimode data acquisitions and processing method applied to power automation system
CN101996271B (en) Software interface for automatically generating simulation calculation model of PSCAD power system
CN102169158A (en) Steady state oscillograph for power system
CN202759287U (en) Intelligent distribution monitoring system
WO2020015060A1 (en) Power consumption anomaly estimation method and apparatus, device, and computer storage medium
WO2020119383A1 (en) Medical insurance supervision method, device, apparatus and computer readable storage medium
AU2022204116A1 (en) Verification method for electrical grid measurement data
CN109391923A (en) A kind of building energy consumption management method and system based on 5G framework
CN107247797A (en) Time scale measurement data-storage system and method in electric power scheduling automatization system based on Redis
CN102338835A (en) Power quality dynamic monitoring system
WO2020143296A1 (en) Data collection method, device, equipment and computer readable storage medium
WO2020224090A1 (en) Body temperature information-based depression prediction system
WO2020143306A9 (en) Object inter-conversion method and apparatus, storage medium, and server
CN106096013A (en) Frequency queries based on electromagnetic environment data base and planing method
CN109256865A (en) A kind of distribution network automated terminal with self-checking function
CN114814695A (en) Data processing method and device, electronic equipment and storage medium
CN210323340U (en) Remote verification system of intelligent electric energy meter
CN211905518U (en) Anti-electricity-theft online monitoring device
CN203929939U (en) Platform district identifier
CN112488478A (en) Method and device for identifying topology of low-voltage transformer area and storage medium
CN113204592A (en) Data processing method, system and device under scene of Internet of things and storage medium
CN105223850A (en) A kind of energy monitor based on smart bluetooth and control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18926108

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22/04/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18926108

Country of ref document: EP

Kind code of ref document: A1