WO2020010677A1

WO2020010677A1 - Method for acquiring consecutive missing values, data analysis device, terminal, and storage medium

Info

Publication number: WO2020010677A1
Application number: PCT/CN2018/103333
Authority: WO
Inventors: 郑立颖; 徐亮; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-07-09
Filing date: 2018-08-30
Publication date: 2020-01-16
Also published as: CN109947812B; CN109947812A

Abstract

The present invention discloses a method for acquiring consecutive missing values, a data analysis device, a data analysis terminal, and a computer-readable storage medium. The method for acquiring consecutive missing values comprises: if it is detected that a target time sequence acquired on the basis of a preset time interval has consecutive missing values, acquiring, according to the preset time interval, all sequence feature values from all time sequence samples, so as to generate a feature data sequence for each time sequence sample; performing anomaly detection calculation on each feature data sequence, so as to determine the normal data sequences among all the feature data sequences; acquiring the target time points corresponding to the consecutive missing values in the target time sequence, and acquiring the sequence feature values at all the target time points in all the normal data sequences; and calculating a mean value of all the sequence feature values at each target time point, and using the feature mean values as filling reference values for the consecutive missing values corresponding to the target time points. The present invention improves the authenticity of time sequence data.

Description

Continuous missing value filling method, data analysis device, terminal and storage medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 9, 2018, with application number 201810748247.X and the invention title "Continuous Missing Value Filling Method, Data Analysis Device, Terminal, and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

The invention relates to the technical field of data analysis, and in particular, to a method for filling continuous missing values, a data analysis device, a data analysis terminal, and a computer-readable storage medium.

Background art

In practice, people compile statistics on collected indicator data. Continuous changes in indicator data usually reflect a historical trend and help predict the subsequent trend. However, incidents often occur during collection. For example, indicator data cannot be collected while a system has failed or equipment is being replaced, which leaves consecutive missing values in an otherwise continuous time series. Existing uniform mean filling produces fill values that do not conform to the distribution of the time series itself, and moving mean filling introduces abnormal data values. Traditional single-point missing value filling methods are therefore likely to cause a large deviation in the filled indicator data and cannot guarantee the authenticity of the data.

Summary of the invention

The main purpose of the present invention is to provide a continuous missing value filling method, a data analysis device, a data analysis terminal, and a computer-readable storage medium. It aims to solve the technical problem that traditional single-point missing value filling methods, when used to fill consecutive missing values, easily introduce abnormal data values, so that the calculated fill values deviate substantially and the authenticity of the data is reduced.

To achieve the foregoing objective, an embodiment of the present invention provides a continuous missing value filling method. The continuous missing value filling method includes:

When continuous missing values are detected in the target time series collected based on the preset time interval, collecting all sequence feature values from all time series samples according to the preset time interval to generate a feature data sequence for each time series sample;

Performing anomaly detection calculations on each feature data sequence to determine the normal data sequences among all feature data sequences;

Acquiring the target time points corresponding to the continuous missing values in the target time series, and acquiring the sequence feature values at all target time points in all normal data sequences;

Performing a mean calculation on all the sequence feature values at each target time point to obtain the feature mean value at each target time point, and using the feature mean value as the filling reference value of the consecutive missing values corresponding to the target time point.

The present invention also provides a data analysis device. The data analysis device includes: an acquisition module, configured to collect, when continuous missing values are detected in a target time series collected based on a preset time interval, all sequence feature values from all time series samples according to the preset time interval, so as to generate a feature data sequence for each time series sample; a detection module, configured to perform anomaly detection calculations on each feature data sequence to determine the normal data sequences among all feature data sequences; an obtaining module, configured to obtain the target time points corresponding to the continuous missing values in the target time series, and to obtain the sequence feature values at all target time points in all normal data sequences; and a filling module, configured to perform a mean calculation on all the sequence feature values at each target time point to obtain the feature mean value at each target time point, and to use the feature mean value as the filling reference value of the consecutive missing values corresponding to the target time point.

In addition, in order to achieve the above object, the present invention further provides a data analysis terminal. The data analysis terminal includes: a memory, a processor, a communication bus, and computer-readable instructions stored on the memory. When the computer-readable instructions are executed by the processor, the steps of the continuous missing value filling method described above are implemented.

In addition, in order to achieve the above object, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the continuous missing value filling method described above are implemented.

In the present invention, when continuous missing values are detected in a target time series collected based on a preset time interval, all sequence feature values are collected from all time series samples according to the preset time interval to generate a feature data sequence for each time series sample; anomaly detection calculations are performed on each feature data sequence to determine the normal data sequences among all feature data sequences; the target time points corresponding to the continuous missing values in the target time series are obtained, together with the sequence feature values at all target time points in all normal data sequences; and a mean calculation is performed on all sequence feature values at each target time point to obtain the feature mean value at each target time point, which is used as the filling reference value of the continuous missing value at the corresponding target time point.

The present invention extracts sequence feature values from time series samples, determines normal data sequences through anomaly detection, averages the feature values at the target time points across multiple normal data sequences, and uses the mean as the fill value for the consecutive missing values at the corresponding time points. This reduces the interference of abnormal feature values, ensures the reliability of the filling reference values, and improves the efficiency of filling continuous missing values. It solves the technical problem that traditional single-point missing value filling methods, when used to fill consecutive missing values, easily introduce abnormal data values, so that the calculated fill values deviate substantially and the authenticity of the data is reduced. At the same time, it retains the distribution characteristics of the time series itself and reduces the computational complexity.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flowchart of a first embodiment of a continuous missing value filling method according to the present invention;

FIG. 2 is a detailed flowchart of step S20 in FIG. 1;

FIG. 3 is a schematic diagram of functional modules of a data analysis device of the present invention;

FIG. 4 is a schematic structural diagram of a device in a hardware operating environment involved in a method according to an embodiment of the present invention.

The realization of the purpose, functional characteristics and advantages of the present invention will be further described with reference to the embodiments and the drawings.

Detailed description

It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit the present invention.

The present invention provides a continuous missing value filling method. In a first embodiment of the continuous missing value filling method, referring to FIG. 1, the continuous missing value filling method includes:

In step S10, when continuous missing values are detected in the target time series collected based on the preset time interval, all sequence feature values are collected from all time series samples according to the preset time interval to generate a feature data sequence for each time series sample;

The target time series refers to a data index set collected by the system based on a preset time interval, and the continuous missing value refers to a sequence feature value that cannot be recorded normally in the target time series due to special reasons. When there are continuous missing values in the target time series, in order to supplement the continuous missing values, the system will collect all sequence feature values from the time series samples at preset time intervals as the reference data of the target time series.

It can be understood that the time interval at which the sequence feature values of the target time series were originally collected may differ from the preset time interval used in the present invention to collect sequence feature values from the time series samples, but the time interval of the target time series must be greater than or equal to the preset time interval used for the time series samples. For example, if the target time series collects one data point every hour, the preset time interval for the time series samples must be less than or equal to one hour, such as one data point every 30 minutes or every 20 minutes. In this way, all sequence feature values collected from the time series samples can serve as reference values for the target time series. Otherwise, if the time series sample uses a preset time interval of 2 hours, which is greater than the time interval of the target time series, then within one day the time series sample yields 12 sequence feature values (at a 2-hour interval) while the target time series has 24 sequence feature values (at a 1-hour interval), so the two do not correspond one-to-one. If three consecutive missing values of the target time series fall within a span of 1.5 hours, the time series sample offers at most one sequence feature value as a reference within those 1.5 hours, which cannot solve the technical problem addressed by the present invention.
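The interval constraint described above can be expressed as a small check. The following is a minimal sketch only; the function names `sample_interval_is_usable` and `points_per_day` and the use of minutes as the unit are assumptions made for illustration and do not appear in the application.

```python
def sample_interval_is_usable(target_interval_min: int, sample_interval_min: int) -> bool:
    """Return True when the samples are collected at least as often as the target series.

    Both arguments are collection intervals in minutes (an assumed unit).
    """
    if target_interval_min <= 0 or sample_interval_min <= 0:
        raise ValueError("intervals must be positive")
    return sample_interval_min <= target_interval_min


def points_per_day(interval_min: int) -> int:
    """Number of values collected in one day at the given interval."""
    return (24 * 60) // interval_min


# Target series collected hourly; candidate sample intervals of 30 and 120 minutes.
print(sample_interval_is_usable(60, 30))    # sample is dense enough
print(sample_interval_is_usable(60, 120))   # sample is too sparse
print(points_per_day(60), points_per_day(120))
```

With an hourly target series, the 2-hour sample yields 12 values per day against 24 in the target, which is the mismatch described in the paragraph above.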

After detecting consecutive missing values, the system collects all sequence feature values from all time series samples at the preset time interval, and maps all sequence feature values to their respective time series samples to generate a feature data sequence for each time series sample.

To facilitate understanding, this step can be illustrated with the following example. Assume that an electricity consumption statistics sequence (that is, the target time series) was collected once every hour on January 13th, and that electricity consumption was not collected from 15:00 to 18:00, so the 3 power consumption values from 15:00 to 18:00 are continuous missing values. The system then collects the power consumption on the 13th from the historical electricity consumption statistics sequences (the time series samples) at an interval of one data point every hour, obtaining power consumption values for 24 time points. These 24 power consumption values form a feature data sequence.
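Detecting the run of missing values in the example can be sketched as follows. This is an illustrative sketch, not the application's implementation: representing a missing value as `None`, the helper name `find_missing_runs`, and the threshold of at least two consecutive gaps are all assumptions.

```python
from typing import List, Optional, Tuple


def find_missing_runs(series: List[Optional[float]], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs for runs of None of length >= min_len."""
    runs: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, value in enumerate(series):
        if value is None:
            if start is None:
                start = i          # a new run of missing values begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None           # the run ended at the previous index
    if start is not None and len(series) - start >= min_len:
        runs.append((start, len(series) - 1))  # run reaches the end of the series
    return runs


# Hourly series for one day with three consecutive gaps starting at index 15.
target = [1.0] * 15 + [None] * 3 + [1.0] * 6
print(find_missing_runs(target))
```

The returned index pairs identify the target time points whose values later need filling reference values.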

Step S20: Perform an abnormality detection calculation on each feature data sequence to determine a normal data sequence in all feature data sequences;

Each feature data sequence may contain abnormal data, for example because of a system failure or a data entry error, and such abnormal data would affect the accuracy of the values filled in for the consecutive missing values. Anomaly detection calculation is therefore performed on each feature data sequence so as to select the normal data sequences among the feature data sequences. The anomaly detection calculation detects whether a sequence contains outlier data; for example, principal component analysis, a multivariate Gaussian distribution method, or the isolation forest algorithm can be used to select the feature data sequences with a normal distribution.

Referring to FIG. 2, the step S20 includes:

Step S21: Determine all feature time points and corresponding sequence feature values in each feature data sequence, generate a data point set according to the positions in the model space of the data points corresponding to the feature time points and sequence feature values, and count the total number of data points in the data point set;

It can be understood that each feature data sequence contains two types of data, feature time points and sequence feature values, and these two types of data map to each other. The corresponding data points of each feature data sequence can therefore be obtained from its feature time points and sequence feature values, and each data point is then fed into the isolation forest algorithm model. The model is configured with a model space in which all data points are placed; that is, the model space is equivalent to a coordinate space. From the coordinate values of each data point, the system can determine the coordinate positions of all data points in each feature data sequence, thereby generating the corresponding data point set in the model space. For example, sequence A includes a power consumption value of 5 at 0:00, 8 at 6:00, 10 at 12:00, and 8 at 18:00. The data points in sequence A are therefore A1 = (0,5), A2 = (6,8), A3 = (12,10), and A4 = (18,8). These data points are placed in the model space in order according to their coordinates, yielding the data point set, and the total number of data points in the data point set is counted. This is only an example and does not mean that the data point set includes only these four data points.
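The construction of the data point set in step S21 can be sketched with the values of sequence A. The helper name `to_data_points` and the choice of a sorted list of `(time, value)` tuples as the "model space" representation are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_data_points(times: Sequence[float], values: Sequence[float]) -> Tuple[List[Point], int]:
    """Pair each feature time point with its sequence feature value.

    Returns the data point set, ordered by time, and the total number of points.
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    points = sorted(zip(times, values))
    return points, len(points)


# Sequence A from the example: hour of day and power consumption value.
points, total = to_data_points([0, 6, 12, 18], [5, 8, 10, 8])
print(points, total)
```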

Step S22: Perform iterative space cutting on all data points in the data point set according to a preset cutting rule of the isolation forest algorithm until every data point has been cut into a space of its own as a single data point;

The preset cutting rule of the isolation forest algorithm is to perform iterative space cutting on the whole data point set. Space cutting means cutting the model space that holds the data points and counting the number of data points in each resulting space. If the data points in the data point set are relatively concentrated, it is unlikely that an individual data point will be cut into a space of its own during space cutting. If some data points are sparse or scattered at the edge of the data point set, those scattered data points are easily cut into a space of their own. Through iterative space cutting, the system obtains all single data points, that is, data points that each occupy a space alone. It can be understood that whenever a data point in the data point set is cut into a space of its own, a single data point is generated and recorded by the system, and that the number of all single data points is equal to the number of all data points in the data point set.

Step S23: Obtain the iteration number at which each single data point is generated, and obtain, among all the single data points, the target data points generated within a preset number of iterations;

Step S24: Count the number of target data points, calculate the ratio of that number to the total number of data points, and set the ratio value as the abnormal score;

The system obtains the iteration number at which each single data point is generated. For example, single data point A is generated during the first space cut, single data points B and C during the second space cut, single data points D, E, F, and G during the third space cut, and so on. The system records the iteration number for each single data point. Assuming the preset number of iterations is 2, the system obtains the target data points A, B, and C generated in the first 2 cutting iterations. The number of target data points counted by the system is therefore 3. Assuming that the total number of data points in the current data point set is 15, the ratio of the number of target data points to the total is 3/15 = 0.2. The system sets this ratio value as the abnormal score, which serves as a reference for subsequent numerical comparisons.
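Steps S22 to S24 can be sketched as follows. This is a deliberately simplified, deterministic illustration and not the randomized isolation forest algorithm: each region is cut at the midpoint of its wider axis, which is an assumed cutting rule. The names `isolation_iterations` and `anomaly_score`, and the handling of identical points, are also assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def isolation_iterations(points: List[Point], max_iter: int = 64) -> Dict[int, int]:
    """Map each point index to the cutting iteration at which it sits alone in a region."""
    isolated: Dict[int, int] = {}
    regions: List[List[int]] = [list(range(len(points)))] if points else []
    iteration = 0
    while regions and iteration < max_iter:
        iteration += 1
        next_regions: List[List[int]] = []
        for region in regions:
            # Choose the axis along which the region's points are most spread out.
            spreads = []
            for axis in (0, 1):
                coords = [points[i][axis] for i in region]
                spreads.append(max(coords) - min(coords))
            axis = 0 if spreads[0] >= spreads[1] else 1
            if spreads[axis] == 0:
                # Assumption: identical points cannot be separated, so record them now.
                for i in region:
                    isolated[i] = iteration
                continue
            coords = [points[i][axis] for i in region]
            cut = (max(coords) + min(coords)) / 2  # midpoint cut (assumed rule)
            left = [i for i in region if points[i][axis] <= cut]
            right = [i for i in region if points[i][axis] > cut]
            for part in (left, right):
                if len(part) == 1:
                    isolated[part[0]] = iteration  # a single data point is generated
                elif part:
                    next_regions.append(part)      # keep cutting this region
        regions = next_regions
    return isolated


def anomaly_score(points: List[Point], preset_iterations: int) -> float:
    """Ratio of points isolated within the preset number of iterations to all points."""
    if not points:
        raise ValueError("the data point set is empty")
    iterations = isolation_iterations(points)
    early = sum(1 for it in iterations.values() if it <= preset_iterations)
    return early / len(points)


# Four clustered points and one point far from the cluster.
pts = [(0, 5), (1, 5), (2, 5), (3, 5), (23, 50)]
print(isolation_iterations(pts))
print(anomaly_score(pts, preset_iterations=2))
```

In this sketch the distant point is separated by the first cut, while the clustered points need further cuts, which mirrors the observation that scattered points are cut into a space of their own early.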

Specifically, in this embodiment, anomaly detection calculation is performed on the power consumption values corresponding to the 24 time points, for example by the isolation forest algorithm, to eliminate abnormal values and filter out invalid outlier data, so as to obtain normal data with a regular distribution. The sequence feature values in the feature data sequence are cut spatially, and the values in each resulting space are cut again, until sequence feature values that sit alone in a space are obtained. This process takes the form of binary tree layering: sequence feature values that fall on the same side of a cut continue to be cut iteratively, so the binary tree keeps growing downward, whereas a sequence feature value left alone in a space is not cut further and stays at the current layer of the binary tree. The isolation forest algorithm calculates the abnormal score of the feature data sequence from the heights of all isolated sequence feature values.

In step S25, if the abnormal score is greater than zero, it is determined that the feature data sequence corresponding to the abnormal score is a normal data sequence.

The abnormal score reflects the overall degree of deviation of all power consumption values. When the abnormal score is greater than zero, the current distribution of all power consumption values is considered normal, the sequence feature values are normal values, and the corresponding feature data sequence (that is, the power consumption values) is a normal data sequence. The isolation forest algorithm captures invalid feature values and quantifies the data, and the abnormal score it produces is a parameter that reflects the sequence feature values, so the system only needs to evaluate the value of the abnormal score.
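The decision in step S25 reduces to a threshold test on each sequence's abnormal score. A minimal sketch follows; the helper name `select_normal_sequences`, the dictionary layout, and the sequence labels are hypothetical, and the scores are made-up inputs rather than results.

```python
from typing import Dict, List


def select_normal_sequences(scores: Dict[str, float]) -> List[str]:
    """Return the names of feature data sequences whose abnormal score is greater than zero."""
    return [name for name, score in scores.items() if score > 0]


# Hypothetical abnormal scores keyed by an illustrative sequence label.
scores = {"seq_a": 0.2, "seq_b": 0.0, "seq_c": 0.05}
print(select_normal_sequences(scores))
```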

Step S30: Obtain target time points corresponding to the continuous missing values in the target time series, and obtain sequence feature values at all target time points in all normal data sequences;

Each consecutive missing value has its own target time point in the target time series, and each normal data sequence has a corresponding sequence feature value at that target time point. These sequence feature values in the normal data sequences serve as reference values for the subsequent calculation of the consecutive missing values. Once a feature data sequence has been determined to be a normal data sequence, the system can directly read the sequence feature values corresponding to the target time points from it.

For example, if the consecutive missing values in the power consumption statistics sequence are located at the target time points 15:00, 16:00, 17:00, and 18:00 on the 13th, the system obtains the power consumption values at 15:00, 16:00, 17:00, and 18:00 on the 13th from the normal data sequences of each month's power consumption.

Step S40: Perform a mean calculation on all sequence feature values at each target time point to obtain a feature mean value at each target time point, and use the feature mean value as a filling reference value for consecutive missing values corresponding to the target time point.

The sequence feature values obtained by the system are the feature values of multiple time series samples at the corresponding target time points. Because each sequence feature value can serve as a reference value for the target time series, the system averages the feature values at each target time point across all normal data sequences to obtain the mean for that target time point. The mean can be used as the fill value for the consecutive missing values. Calculating the feature mean smooths out the fluctuation between the values of different normal data sequences at the same target time point, so that the filling reference value better reflects the distribution at that time point.

For example, the system obtains the four power consumption values at 15:00, 16:00, 17:00, and 18:00 on the 13th of each month and calculates the mean at each time point across months. Suppose the mean at 15:00 on the 13th of the different months is a, the mean at 16:00 is b, the mean at 17:00 is c, and the mean at 18:00 is d. Then a, b, c, and d are used as the fill values for the consecutive missing values from 15:00 to 18:00 in the target time series.
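The per-time-point mean of step S40 can be sketched as below. The function name `fill_reference_values`, the dictionary-per-sequence layout keyed by hour, and the sample numbers are assumptions for illustration only.

```python
from statistics import mean
from typing import Dict, List, Optional


def fill_reference_values(
    normal_sequences: List[Dict[int, Optional[float]]],
    target_points: List[int],
) -> Dict[int, float]:
    """Average the values of all normal sequences at each target time point."""
    result: Dict[int, float] = {}
    for t in target_points:
        values = [seq[t] for seq in normal_sequences if seq.get(t) is not None]
        if not values:
            raise ValueError(f"no reference value available at time point {t}")
        result[t] = mean(values)
    return result


# Two hypothetical months of readings on the 13th, keyed by hour of day.
month_1 = {15: 10.0, 16: 12.0, 17: 14.0, 18: 16.0}
month_2 = {15: 12.0, 16: 14.0, 17: 16.0, 18: 18.0}
print(fill_reference_values([month_1, month_2], [15, 16, 17, 18]))
```

The four results play the role of a, b, c, and d in the example above.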

Further, based on the first embodiment of the continuous missing value filling method of the present invention, a second embodiment of the continuous missing value filling method of the present invention is proposed. The difference from the foregoing embodiment is that after step S20, the method further includes:

Step S50: Count the number of all current normal data sequences;

In practice, there may be many feature data sequences but very few normal data sequences left after screening. In this embodiment, if the number of normal data sequences is below a certain value, the accuracy of the final filling reference value is affected. Only when the sample size of normal data sequences is large enough can the normal data sequences provide a reliable reference for the filling reference value. For example, in the power consumption statistics sequence, power consumption values in summer and winter may be higher than those in spring and autumn. Therefore, the accuracy of the final filling reference value can be guaranteed only if the amount of sample data in the normal data sequences is within a reasonable range. The system therefore counts the number of all current normal data sequences.

In step S60, if the number of sequences is less than the first preset value, a new time series sample is imported from the preset sample database, and a new normal data sequence is obtained according to the new time series sample, until the number of all normal data sequences is not less than the first preset value.

According to actual business needs, the system can set a first preset value, and the first preset value can be dynamically adjusted according to actual business needs. For example, the system can specify that when the number of consecutive missing values is N, the number of normal data sequences must not be less than 2N, that is, the number of sequences needs to be adjusted according to the system designation. The first preset value is the minimum threshold of the number of sequences. If the number of sequences is less than the first preset value, it indicates that the current number of sequences is too small, which will affect the accuracy of the final filling reference value. The system needs to import a new time series sample from a preset sample database, and obtain the new normal data sequence by performing the steps in the first embodiment on the time series sample.

In this embodiment, given the system's strict accuracy requirements, the system executes steps S50 and S60 in a loop: it keeps obtaining new normal data sequences, counts the number of all current normal data sequences, and compares that number with the first preset value, until the number of sequences is not less than the first preset value. These steps ensure that the normal data sequences provide sufficient data samples, thereby improving the reliability of the final filling reference value.
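The loop over steps S50 and S60 can be sketched as follows. The function name `ensure_min_sequences`, the `is_normal` callback standing in for the anomaly detection of the first embodiment, and the behaviour when the sample database runs out are all assumptions.

```python
from typing import Callable, Iterable, List, Sequence


def ensure_min_sequences(
    normal: List[Sequence[float]],
    sample_db: Iterable[Sequence[float]],
    is_normal: Callable[[Sequence[float]], bool],
    first_preset: int,
) -> List[Sequence[float]]:
    """Import samples until the number of normal sequences reaches first_preset."""
    result = list(normal)
    pool = iter(sample_db)
    while len(result) < first_preset:          # S50: count, then compare
        try:
            candidate = next(pool)             # S60: import a new time series sample
        except StopIteration:
            raise LookupError("sample database exhausted before reaching the threshold")
        if is_normal(candidate):
            result.append(candidate)
    return result


# Stand-in anomaly check: treat any sequence containing a value of 50 or more as abnormal.
db = [[100, 1], [2, 3], [3, 4]]
print(ensure_min_sequences([[1, 2]], db, lambda s: max(s) < 50, first_preset=3))
```

The text suggests a threshold such as 2N for N consecutive missing values; in this sketch that would simply be passed as `first_preset`.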

Further, based on the second embodiment of the continuous missing value filling method of the present invention, a third embodiment of the continuous missing value filling method of the present invention is proposed. The difference from the foregoing embodiment is that after step S40, the method further includes:

Step S70: Mark the padding reference values corresponding to each consecutive missing value, and map and mark the sequence feature values in each normal data sequence referenced by each padding reference value.

Generally, all indicator data has statistical significance. The filling reference value obtained in the present invention is essentially calculated from other historical data and does not represent real data. To prevent users from treating it as real data, this embodiment marks the values in the target time series that were filled with filling reference values, and maps and marks the sequence feature values in each normal data sequence that each filling reference value references.

Assume that a user wants to compile statistics on a current electricity consumption statistics sequence to identify a trend. Since a filling reference value is not real data, the system obtains the feature mean behind the filling reference value, looks up each sequence feature value used in calculating that feature mean, and maps each one back to its own feature data sequence. The target time point of the filling reference value is used to locate each sequence feature value at the corresponding time point. Finally, the sequence feature values referenced at the target time points in each feature data sequence are marked as reference values.

Therefore, the effect of this embodiment is that each consecutive missing value identifies all the sequence feature values used and the feature data sequence in which the sequence feature value is located, and the user can easily query the data source and then perform analysis and calculation.
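One way to carry the markings of step S70 is to store each filled value together with the values it was computed from. This is a sketch under assumptions: the `FilledValue` record, the function `fill_with_provenance`, and the month labels used as sequence names are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass
class FilledValue:
    time_point: int
    value: float
    is_filled: bool = True  # marks the value as computed, not measured
    sources: List[Tuple[str, float]] = field(default_factory=list)  # (sequence name, value)


def fill_with_provenance(
    named_sequences: Dict[str, Dict[int, Optional[float]]],
    target_points: List[int],
) -> Dict[int, FilledValue]:
    """Compute each filling reference value and record which sequence values it used."""
    filled: Dict[int, FilledValue] = {}
    for t in target_points:
        sources = [
            (name, seq[t])
            for name, seq in named_sequences.items()
            if seq.get(t) is not None
        ]
        if not sources:
            raise ValueError(f"no reference value available at time point {t}")
        value = mean([v for _, v in sources])
        filled[t] = FilledValue(time_point=t, value=value, sources=sources)
    return filled


# Hypothetical normal data sequences keyed by an illustrative month label.
sequences = {"month_05": {15: 10.0}, "month_06": {15: 14.0}}
print(fill_with_provenance(sequences, [15])[15])
```

A user querying a filled value can then read `sources` to find the originating sequences and values.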

Further, on the basis of the third embodiment of the continuous missing value filling method of the present invention, a fourth embodiment of the continuous missing value filling method of the present invention is proposed. The difference from the foregoing embodiment is that the step of obtaining the sequence feature values at all target time points in all normal data sequences further includes:

If a sequence feature value at a target time point in any normal data sequence is detected as a missing value, the normal data sequence is deleted.

Although an obtained normal data sequence guarantees that the feature values in the sequence are normal values, if its sequence feature value at a target time point is itself a missing value, the normal data sequence offers no data support for calculating the final filling reference value, adds computational complexity, and cannot provide a valid data source for filling the consecutive missing values. The system therefore deletes such a normal data sequence as an invalid data sequence, which reduces the computational complexity, avoids introducing invalid data, and prevents the reliability of the filling reference values from being reduced.

Further, based on the fourth embodiment of the continuous missing value filling method of the present invention, a fifth embodiment of the continuous missing value filling method of the present invention is proposed. The difference from the foregoing embodiment is that, after the step of deleting a normal data sequence when its sequence feature value at a target time point is detected as a missing value, the method further includes:
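The deletion rule of the fourth embodiment can be sketched as a filter. The function name `drop_sequences_missing_targets` and the use of `None` (or an absent key) to represent a missing value are assumptions for illustration.

```python
from typing import Dict, List, Optional


def drop_sequences_missing_targets(
    normal_sequences: List[Dict[int, Optional[float]]],
    target_points: List[int],
) -> List[Dict[int, Optional[float]]]:
    """Keep only the sequences that have a value at every target time point."""
    return [
        seq
        for seq in normal_sequences
        if all(seq.get(t) is not None for t in target_points)
    ]


complete = {15: 10.0, 16: 12.0}
gap = {15: 11.0, 16: None}      # missing at a target time point
absent = {15: 9.0}              # no entry at all for 16:00
print(drop_sequences_missing_targets([complete, gap, absent], [15, 16]))
```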

Step S80: if it is detected that the number of sequence feature values at any target time point in all normal data sequences is less than the second preset value, import a new time series sample from a preset sample database;

In this embodiment, because the normal data sequence with the missing sequence feature value is deleted, the number of sequences of the current normal data sequence is reduced by one. If the number of sequences is not less than the first preset value, other normal data sequences are still available. Correspondingly, the sequence feature value at the target time point should be deleted by one normal data sequence and reduced by one. That is, the number of normal data sequences meets the standard, and the normal data sequence may be invalid data. For example, in the electricity consumption statistics sequence, the monthly electricity consumption corresponding to the A electricity consumption data sequence is normal, but all Most of the data in the electricity consumption data are electricity data for new energy consumption (such as electricity obtained from wind power generation), rather than traditional electricity consumption (such as electricity obtained from thermal power generation), although the electricity consumption has not changed. However, what the present invention is to count is the power data of the thermal power consumption, so the normal data sequence cannot be counted.

In order to ensure the data referentiality of the sequence eigenvalues, the system usually specifies that the number of sequence eigenvalues must reach a reasonable value to ensure that the sample can be covered in a wide range and the accuracy of the mean calculation is improved. Therefore, the system sets a second preset value, and the second preset value will be used as a reference threshold for the number of values. The system will count the number of sequence diagnosis values at any target time point in all normal data sequences. If the number of values is less than the second preset value, it means that the data sample size of the current sequence characteristic value does not meet the standard, and may be a reference for filling. The calculation accuracy of the value has an influence, so it is necessary to increase the sequence characteristic value of the normal data sequence. The system will import a new time series sample from the preset sample database.

Step S90: Perform the steps of obtaining a new normal data sequence according to the new time series sample, and obtain the sequence feature values at all target time points from the new normal data sequence, until the number of sequence feature values at any target time point in all normal data sequences is not less than the second preset value.

After obtaining a new time series sample, the system performs the steps of obtaining a normal data sequence described in the first embodiment, and obtains the sequence feature values at the target time points from the new normal data sequence derived from the new time series sample. Step S80 and step S90 are repeated until the number of sequence feature values at any target time point in all normal data sequences is not less than the second preset value.

The following example illustrates this. Suppose there are currently 5 normal data sequences in total, so the number of corresponding sequence feature values at each target time point is also 5. Assuming that the second preset value set by the system is 6, the number of values is less than the second preset value. A new time series sample therefore needs to be added, and the system imports one from the preset sample database. Based on the second preset value and the number of values, the number of new time series samples imported by the system is 1. The system performs anomaly detection calculation on the new time series sample, obtains the sequence feature values once a normal data sequence has been obtained, re-counts the number of sequence feature values in all normal data sequences, and finally compares that number with the second preset value. If the final number of values is greater than or equal to the second preset value, the execution of this embodiment ends.
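The replenishment loop of steps S80 and S90 can be sketched as follows. This is only an illustrative sketch: the function name `top_up_normal_sequences`, the `sample_pool` list standing in for the preset sample database, and the `is_normal` callable standing in for the anomaly detection calculation are all hypothetical, and missing values are assumed to be represented as `None`.

```python
def top_up_normal_sequences(normal_sequences, target_points,
                            second_preset_value, sample_pool, is_normal):
    """Import new samples until every target time point has enough values."""
    sequences = list(normal_sequences)
    if not target_points:
        return sequences
    pool = iter(sample_pool)  # hypothetical stand-in for the sample database

    def smallest_count():
        # Number of available feature values at the worst-covered target point.
        return min(sum(1 for seq in sequences if seq[t] is not None)
                   for t in target_points)

    while smallest_count() < second_preset_value:
        try:
            candidate = next(pool)
        except StopIteration:
            break  # the pool is exhausted; return what is available
        if is_normal(candidate):  # assumed anomaly-detection check
            sequences.append(candidate)
    return sequences


# Worked example from the text: 5 normal sequences, second preset value 6.
existing = [[1.0, 2.0, 3.0] for _ in range(5)]
pool = [[1.5, 2.5, 3.5], [9.0, 9.0, 9.0]]
result = top_up_normal_sequences(existing, [1, 2], 6, pool, lambda s: True)
print(len(result))  # one sample is imported, giving 6 sequences
```

The loop stops as soon as the threshold is reached, so only as many samples are imported as are needed.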

Further, based on the first embodiment of the continuous missing value filling method of the present invention, a sixth embodiment of the continuous missing value filling method of the present invention is proposed. The difference from the foregoing embodiment is that after step S40, the method further includes:

Step a: Convert all normal data sequences into corresponding normal sequence distribution curves, and convert the target time series containing the filling reference values into a target sequence distribution curve;

Step b: Display the normal sequence distribution curves and the target sequence distribution curve in a preset coordinate system for analysis by a user.

In this embodiment, so that the user can intuitively view and analyze the difference in sequence feature values between the normal data sequences and the target time series, after taking the feature mean as the filling reference value of the continuous missing values, the system converts the normal data sequences and the target time series containing the filling reference values into normal sequence distribution curves and a target sequence distribution curve, respectively. The normal distribution of the normal data sequences and the actual distribution of the target time series are displayed to the user in the preset coordinate system. The value of visualizing the data as curves is that the user can intuitively observe whether the filling reference values deviate from the normal distribution and can re-analyze the observed results.

Referring to FIG. 3, the present invention provides a data analysis device, and the data analysis device includes:

A collection module 10, configured to collect, when continuous missing values are detected in a target time series collected based on a preset time interval, all sequence feature values from all time series samples according to the preset time interval, so as to generate a feature data sequence of each time series sample; a detection module 20, configured to perform anomaly detection calculation on each feature data sequence to determine the normal data sequences among all feature data sequences; an obtaining module 30, configured to obtain the target time points corresponding to the continuous missing values in the target time series, and to obtain the sequence feature values at all target time points in all normal data sequences; and a filling module 40, configured to calculate the mean of all sequence feature values at each target time point to obtain the feature mean at each target time point, and to use the feature mean as the filling reference value of the continuous missing values corresponding to that target time point.
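The work of the obtaining module and the filling module can be illustrated with a minimal sketch. It assumes that sequences are plain lists aligned on the same preset time interval, that a missing value is represented as `None`, and that "continuous missing values" means a run of at least two missing entries; the function names and the `min_run` parameter are hypothetical.

```python
from statistics import mean


def find_target_time_points(target, min_run=2):
    """Return the indices that belong to runs of consecutive missing values."""
    points, run = [], []
    for index, value in enumerate(target):
        if value is None:
            run.append(index)
        else:
            if len(run) >= min_run:
                points.extend(run)
            run = []
    if len(run) >= min_run:  # a run that reaches the end of the series
        points.extend(run)
    return points


def fill_with_feature_mean(target, normal_sequences):
    """Fill each target time point with the mean over the normal sequences."""
    filled = list(target)
    for t in find_target_time_points(target):
        values = [seq[t] for seq in normal_sequences if seq[t] is not None]
        if values:  # leave the gap untouched if no reference value exists
            filled[t] = mean(values)
    return filled


target = [10.0, None, None, 13.0]
normal = [[9.0, 11.0, 12.0, 14.0],
          [11.0, 13.0, 14.0, 12.0],
          [10.0, 12.0, 13.0, 13.0]]
filled = fill_with_feature_mean(target, normal)
print(filled)  # [10.0, 12.0, 13.0, 13.0]
```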

Further, the detection module includes:

A determining unit, configured to determine all feature time points and corresponding sequence feature values in each of the feature data sequences; a generating unit, configured to position the corresponding data points in model space according to the feature time points and the sequence feature values, so as to generate a data point set; a statistics unit, configured to count the total number of data points in the data point set; a cutting unit, configured to perform iterative space cutting on all data points in the data point set according to a preset cutting rule of an isolated forest algorithm until every data point has been cut into a space of its own as a single data point; an obtaining unit, configured to obtain the number of iterations to which each single data point belongs, and to obtain, among all the single data points, the target data points whose number of iterations falls within the first preset number of iterations; the statistics unit is further configured to count the number of data points of all the target data points; a calculation unit, configured to calculate the ratio of that number of data points to the total number of data points, and to set the ratio as an abnormal score; the determining unit is further configured to, if the abnormal score is greater than zero, determine that the feature data sequence corresponding to the abnormal score is a normal data sequence.
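The scoring performed by these units can be sketched roughly as below. This is a simplified, single-tree illustration of the space cutting described above, not a standard isolation forest implementation (which averages path lengths over many trees). The random choice of axis and cut position, the `first_k` parameter (the preset number of iterations), the `max_depth` guard, and all function names are assumptions. Only the score is computed; the comparison of the score against the threshold is left as stated in the text.

```python
import random


def isolation_depths(points, rng, depth=0, max_depth=50):
    """Cut the space recursively; map each point index to the cuts needed."""
    if len(points) <= 1 or depth >= max_depth:
        return {idx: depth for idx, _ in points}
    axis = rng.randrange(2)  # 0 = time axis, 1 = feature-value axis
    coords = [p[axis] for _, p in points]
    if min(coords) == max(coords):
        # All points share this coordinate, so cut along the other axis.
        axis = 1 - axis
        coords = [p[axis] for _, p in points]
    cut = rng.uniform(min(coords), max(coords))
    left = [(i, p) for i, p in points if p[axis] < cut]
    right = [(i, p) for i, p in points if p[axis] >= cut]
    depths = isolation_depths(left, rng, depth + 1, max_depth)
    depths.update(isolation_depths(right, rng, depth + 1, max_depth))
    return depths


def anomaly_score(sequence, first_k=2, seed=0):
    """Share of data points isolated within the first `first_k` cuts."""
    points = [(i, (float(i), v)) for i, v in enumerate(sequence)
              if v is not None]
    total = len(points)
    if total == 0:
        return 0.0
    depths = isolation_depths(points, random.Random(seed))
    early = sum(1 for d in depths.values() if d <= first_k)
    return early / total


sequence = [5.0, 5.1, 4.9, 5.0, 50.0, 5.2]
score = anomaly_score(sequence, first_k=2, seed=1)
print(score)
```

Points that lie far from the rest, such as the value 50.0 here, tend to be isolated after fewer cuts, which is why the count of early-isolated points is used as the basis of the score.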

Further, the data analysis device further includes: a statistics module, configured to count the number of sequences of all current normal data sequences; and a first importing module, configured to import a new time series sample from a preset sample database if the number of sequences is less than a first preset value; the obtaining module 30 is further configured to obtain a new normal data sequence according to the new time series sample until the number of sequences of all normal data sequences is not less than the first preset value.

Further, the data analysis device further includes: a marking module, configured to mark the filling reference value corresponding to each consecutive missing value, and map and mark the sequence feature value in each normal data sequence referenced by each filling reference value.
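One possible way to record this mapping is sketched below. The dictionary layout, the function name `fill_with_provenance`, and the use of sequence names as keys are assumptions made for illustration; missing values are assumed to be `None`.

```python
from statistics import mean


def fill_with_provenance(normal_sequences, target_points):
    """Map each target time point to its fill value and the values it used."""
    marks = {}
    for t in target_points:
        # Sequence feature values referenced at this target time point.
        references = {name: seq[t] for name, seq in normal_sequences.items()
                      if seq[t] is not None}
        if references:
            marks[t] = {"fill_value": mean(references.values()),
                        "references": references}
    return marks


normal = {"A": [1.0, 2.0], "B": [3.0, 4.0]}
marks = fill_with_provenance(normal, [1])
print(marks)  # {1: {'fill_value': 3.0, 'references': {'A': 2.0, 'B': 4.0}}}
```

Keeping the referenced values next to each filling reference value lets a user trace any filled point back to the normal data sequences it was derived from.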

Further, the obtaining module 30 is further configured to delete any normal data sequence in which the sequence feature value at a target time point is detected as a missing value.
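The deletion rule can be expressed in a few lines. This sketch assumes missing values are represented as `None`; the function name is hypothetical.

```python
def drop_incomplete_sequences(normal_sequences, target_points):
    """Keep only sequences that have a value at every target time point."""
    return [seq for seq in normal_sequences
            if all(seq[t] is not None for t in target_points)]


normal = [[1.0, 2.0, 3.0], [1.0, None, 3.0], [4.0, 5.0, None]]
kept = drop_incomplete_sequences(normal, [1])
print(kept)  # [[1.0, 2.0, 3.0], [4.0, 5.0, None]]
```

A sequence is deleted only when its gap coincides with a target time point; gaps elsewhere do not affect the mean calculation and are tolerated in this sketch.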

Further, the data analysis device further includes: a second importing module, configured to import a new time series sample from a preset sample database if it is detected that the number of sequence feature values at any target time point in all normal data sequences is less than a second preset value; an execution module, configured to perform the steps of obtaining a new normal data sequence according to the new time series sample; the obtaining module 30 is further configured to obtain the sequence feature values at all target time points from the new normal data sequence, until the number of sequence feature values at any target time point in all normal data sequences is not less than the second preset value.

Further, the data analysis device further includes: a conversion module, configured to convert all normal data sequences into corresponding normal sequence distribution curves, and to convert the target time series containing the filling reference values into a target sequence distribution curve; and a display module, configured to display the normal sequence distribution curves and the target sequence distribution curve in a preset coordinate system for user analysis.

Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a device in a hardware operating environment involved in a method according to an embodiment of the present invention.

In the embodiment of the present invention, the terminal may be a PC, or a terminal device such as a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a portable computer.

As shown in FIG. 4, the data analysis terminal may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as disk storage. The memory 1005 may optionally be a storage device independent of the foregoing processor 1001.

Optionally, the data analysis terminal may further include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi module, and so on. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).

Those skilled in the art can understand that the structure of the data analysis terminal shown in FIG. 4 does not constitute a limitation on the data analysis terminal, which may include more or fewer components than shown in the figure, combine some components, or use a different component layout.

As shown in FIG. 4, the memory 1005 as a computer storage medium may include an operating system, a network communication module, and computer-readable instructions. The operating system is a program that manages and controls the hardware and software resources of the data analysis terminal, and supports the operation of the computer-readable instructions and other software and/or programs. The network communication module is used to implement communication between components in the memory 1005 and to communicate with other hardware and software in the data analysis terminal.

In the data analysis terminal shown in FIG. 4, the processor 1001 is configured to execute computer-readable instructions stored in the memory 1005 to implement the steps of the continuous missing value filling method described above.

The specific implementation manner of the data analysis terminal of the present invention is basically the same as each embodiment of the continuous missing value filling method described above, and details are not described herein again.

The invention also provides a computer-readable storage medium, which may be a non-volatile readable storage medium. The computer-readable storage medium stores one or more programs, and the one or more programs can also be executed by one or more processors for implementing the steps of the continuous missing value filling method as described above.

The specific implementation manner of the computer-readable storage medium of the present invention is basically the same as each embodiment of the continuous missing value filling method described above, and details are not described herein again.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made by using the description and drawings of the present invention, whether used directly or indirectly in other related technical fields, is likewise included in the patent protection scope of the present invention.

Claims

A continuous missing value filling method, characterized in that the continuous missing value filling method includes:

When continuous missing values are detected in the target time series collected based on the preset time interval, all sequence characteristic values are collected from all time series samples according to the preset time interval to generate a characteristic data sequence of each time series sample;

Perform anomaly detection calculations on each feature data sequence to determine normal data sequences in all feature data sequences;

Acquiring target time points corresponding to the continuous missing values in the target time series, and acquiring sequence feature values at all target time points in all normal data sequences;

The mean value calculation is performed on all the sequence feature values at each target time point to obtain the feature mean value at each target time point, and the feature mean value is used as the filling reference value of the consecutive missing values corresponding to the target time point.
The continuous missing value filling method according to claim 1, wherein the step of performing anomaly detection calculation based on an isolated forest algorithm on each feature data sequence to determine a normal data sequence in all feature data sequences comprises:

Determine all feature time points and corresponding sequence feature values in each feature data sequence, generate a data point set according to the positions in the model space of the data points corresponding to the feature time points and the sequence feature values, and count the total number of data points in the data point set;

Perform iterative space cutting on all data points in the data point set according to a preset cutting rule of the isolated forest algorithm until all single data points that are individually cut into a single space are obtained;

Obtaining the number of iterations to which each single data point belongs, and obtaining a target data point in a preset number of iterations among all the single data points;

Counting the number of data points of all target data points, calculating a ratio value of the number of data points in the total number of data points, and setting the ratio value as an abnormal score;

If the abnormal score is greater than zero, it is determined that the characteristic data sequence corresponding to the abnormal score is a normal data sequence.
The continuous missing value filling method according to claim 1, wherein after the step of performing an abnormality detection calculation on each feature data sequence to determine a normal data sequence in all feature data sequences, further comprising:

Count the number of sequences of all current normal data sequences;

If the number of sequences is less than the first preset value, a new time series sample is imported from the preset sample database, and a new normal data sequence is obtained according to the new time series sample, until the number of sequences of all normal data sequences is not less than the first preset value.
The continuous missing value filling method according to claim 1, wherein after the step of performing mean value calculation on all sequence feature values at each target time point to obtain the feature mean value at each target time point, and using the feature mean value as the filling reference value of the consecutive missing values corresponding to the target time point, the method further includes:

Mark the filled reference values corresponding to each consecutive missing value, and map and mark the sequence feature values in each normal data sequence referenced by each filled reference value.
The continuous missing value filling method according to claim 1, wherein the step of obtaining sequence feature values at all target time points in all normal data sequences further comprises:

If a sequence feature value at a target time point in any normal data sequence is detected as a missing value, the normal data sequence is deleted.
The continuous missing value filling method according to claim 5, characterized in that, after the step of deleting the normal data sequence if the sequence feature value at the target time point in any normal data sequence is detected as a missing value, the method further includes:

If it is detected that the number of sequence feature values at any target time point in all normal data sequences is less than the second preset value, importing a new time series sample from a preset sample database;

Perform the steps of obtaining a new normal data sequence according to the new time series sample, and obtain the sequence feature values at all target time points from the new normal data sequence, until the number of sequence feature values at any target time point in all normal data sequences is not less than the second preset value.
The continuous missing value filling method according to claim 1, wherein after the step of performing mean value calculation on all sequence feature values at each target time point to obtain the feature mean value at each target time point, and using the feature mean value as the filling reference value of the consecutive missing values corresponding to the target time point, the method further includes:

Convert all normal data sequences into corresponding normal sequence distribution curves, and transform the target time series based on the filled reference value into the target sequence distribution curve;

The normal sequence distribution curve and the target sequence distribution curve are displayed in a preset coordinate system for user analysis.
A data analysis device, characterized in that the data analysis device includes:

A collection module, configured to collect, when continuous missing values are detected in the target time series collected based on the preset time interval, all sequence feature values from all time series samples according to the preset time interval, so as to generate a feature data sequence of each time series sample;

A detection module for performing anomaly detection calculations on each feature data sequence to determine a normal data sequence in all feature data sequences;

An obtaining module, configured to obtain target time points corresponding to the continuous missing values in the target time series, and obtain sequence feature values at all target time points in all normal data sequences;

A filling module is used to calculate the mean value of all sequence feature values to obtain the feature mean value at each target time point, and use the feature mean value as a filling reference value of consecutive missing values corresponding to the target time point.
The data analysis device according to claim 8, wherein the detection module comprises:

A determining unit, configured to determine all feature time points in the feature data sequence and corresponding sequence feature values;

A generating unit, configured to generate a data point set according to the positions in the model space of the data points corresponding to the feature time points and the sequence feature values;

A statistics unit, configured to count the total number of data points in the data point set;

A cutting unit, configured to perform iterative spatial cutting on all data points in the data point set according to a preset cutting rule of an isolated forest algorithm until all single data points that are individually cut into a single space are obtained;

An obtaining unit, configured to obtain the number of iterations to which each single data point belongs, and to obtain a target data point of a preset number of iterations among all the single data points;

The statistics unit is further configured to count the number of data points of all the target data points;

A calculation unit, configured to calculate a ratio of the number of data points to the total number of data points, and set the ratio to an abnormal score;

The determining unit is further configured to, if the abnormal score is greater than zero, determine that the feature data sequence corresponding to the abnormal score is a normal data sequence.
The data analysis device according to claim 8, wherein the data analysis device further comprises:

Statistics module, for counting the number of sequences of all current normal data sequences;

A first import module, configured to import a new time series sample from a preset sample database if the number of sequences is less than a first preset value;

The obtaining module is further configured to obtain a new normal data sequence according to the new time series sample until the number of sequences of all normal data sequences is not less than the first preset value.
The data analysis device according to claim 8, wherein the data analysis device further comprises:

A marking module, configured to mark the filling reference values corresponding to each consecutive missing value, and map and mark the sequence feature values in each normal data sequence referenced by each filling reference value.
The data analysis device according to claim 8, wherein the obtaining module is further configured to delete a normal data sequence if a sequence feature value at a target time point in any normal data sequence is detected as a missing value.
The data analysis device according to claim 12, wherein the data analysis device further comprises:

A second import module, configured to import a new time series sample from a preset sample database if the number of sequence feature values at any target time point in all normal data sequences is less than the second preset value;

An execution module, configured to perform the steps of obtaining a new normal data sequence according to the new time series sample;

The obtaining module is further configured to obtain sequence feature values at all target time points from the new normal data sequence, until the number of sequence feature values at any target time point in all normal data sequences is not less than a second preset value.
The data analysis device according to claim 8, wherein the data analysis device further comprises:

A conversion module for converting all normal data sequences into corresponding normal sequence distribution curves, and converting a target time series based on a filled reference value into a target sequence distribution curve;

A display module is configured to display the normal sequence distribution curve and the target sequence distribution curve in a preset coordinate system for user analysis.
A data analysis terminal, characterized in that the data analysis terminal includes: a memory, a processor, a communication bus, and computer-readable instructions stored on the memory, and the processor is configured to execute the computer-readable instructions to implement the following steps:

When continuous missing values are detected in the target time series collected based on the preset time interval, all sequence characteristic values are collected from all time series samples according to the preset time interval to generate a characteristic data sequence of each time series sample;

Perform anomaly detection calculations on each feature data sequence to determine normal data sequences in all feature data sequences;

Acquiring target time points corresponding to the continuous missing values in the target time series, and acquiring sequence feature values at all target time points in all normal data sequences;

The mean value calculation is performed on all the sequence feature values at each target time point to obtain the feature mean value at each target time point, and the feature mean value is used as the filling reference value of the consecutive missing values corresponding to the target time point.
The data analysis terminal according to claim 15, wherein the step of performing anomaly detection calculation based on an isolated forest algorithm for each feature data sequence to determine a normal data sequence in all feature data sequences comprises:

Determine all feature time points and corresponding sequence feature values in each feature data sequence, generate a data point set according to the positions in the model space of the data points corresponding to the feature time points and the sequence feature values, and count the total number of data points in the data point set;

Perform iterative space cutting on all data points in the data point set according to a preset cutting rule of the isolated forest algorithm until all single data points that are individually cut into a single space are obtained;

Obtaining the number of iterations to which each single data point belongs, and obtaining a target data point in a preset number of iterations among all the single data points;

Counting the number of data points of all target data points, calculating a ratio value of the number of data points in the total number of data points, and setting the ratio value as an abnormal score;

If the abnormal score is greater than zero, it is determined that the characteristic data sequence corresponding to the abnormal score is a normal data sequence.
The data analysis terminal according to claim 15, wherein after the step of performing an abnormality detection calculation on each feature data sequence to determine a normal data sequence in all feature data sequences, further comprising:

Count the number of sequences of all current normal data sequences;

If the number of sequences is less than the first preset value, a new time series sample is imported from the preset sample database, and a new normal data sequence is obtained according to the new time series sample, until the number of sequences of all normal data sequences is not less than the first preset value.
The data analysis terminal according to claim 15, wherein the mean value calculation is performed on all sequence feature values at each target time point to obtain the feature mean value at each target time point, and the feature mean value is used as After the step of filling the reference value with consecutive missing values corresponding to the target time point, the method further includes:

Mark the filled reference values corresponding to each consecutive missing value, and map and mark the sequence feature values in each normal data sequence referenced by each filled reference value.
The data analysis terminal according to claim 15, wherein the step of obtaining sequence feature values at all target time points in all normal data sequences further comprises:

If a sequence feature value at a target time point in any normal data sequence is detected as a missing value, the normal data sequence is deleted.
A computer-readable storage medium, characterized in that computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following steps are implemented:

When continuous missing values are detected in the target time series collected based on the preset time interval, all sequence characteristic values are collected from all time series samples according to the preset time interval to generate a characteristic data sequence of each time series sample;

Perform anomaly detection calculations on each feature data sequence to determine normal data sequences in all feature data sequences;

Acquiring target time points corresponding to the continuous missing values in the target time series, and acquiring sequence feature values at all target time points in all normal data sequences;

The mean value calculation is performed on all the sequence feature values at each target time point to obtain the feature mean value at each target time point, and the feature mean value is used as the filling reference value of the consecutive missing values corresponding to the target time point.