WO2024029659A1

WO2024029659A1 - Electronic apparatus for performing quality verification of time series data and performing method therefor

Info

Publication number: WO2024029659A1
Application number: PCT/KR2022/013177
Authority: WO
Inventors: 문재원; 금승우; 오승택; 이지훈; 황지수
Original assignee: 한국전자기술연구원
Priority date: 2022-08-01
Filing date: 2022-09-02
Publication date: 2024-02-08
Also published as: KR20240017694A

Abstract

A method for quality verification of time series data is provided. The method comprises the steps of: refining time series data collected for certain characteristic information, on the basis of a predetermined reference period; partitioning the refined time series data according to a certain partition period; verifying the quality of data for the time series data partitioned according to the partition period; and selecting the time series data that has been completely verified and processing the data according to a data complementary condition.

Description

Electronic device and method for performing quality verification of time series data

The present invention relates to an electronic device and method for performing quality verification of time series data.

Recently, with the popularization of machine learning technology and IoT devices, attempts to collect data and extract meaningful information using sensors are continuing in various fields such as smart farms and smart factories. Since the sensor data accumulated in this way is so large, it must be processed and utilized through big data processing methods.

However, most sensor data contains errors, large and small. Sensor data depends on network status, and the sensor itself may record erroneous values.

If there is no countermeasure for these errors, the overall performance of subsequent analysis and learning will be adversely affected.

An embodiment of the present invention provides an electronic device and method for performing quality verification of time series data, which determines the quality of data based on the cycle of time series data and selects and utilizes data that meets the criteria.

However, the technical challenge that this embodiment aims to achieve is not limited to the technical challenges described above, and other technical challenges may exist.

As a technical means for achieving the above-described technical problem, the quality verification method of time series data according to the first aspect of the present invention includes the steps of: refining time series data collected for predetermined characteristic information based on a predetermined reference period; dividing the refined time series data according to a predetermined division cycle; verifying the quality of data for the time series data divided according to the division cycle; and selecting the verified time series data and processing the data according to data supplementation conditions. The step of verifying the quality of data for the time series data divided according to the division cycle includes: calculating, for the divided time series data, at least one of the degree of consecutive missing data and the degree of total missing data in the corresponding division cycle; and determining the divided time series data of the corresponding division cycle to be defective data when each calculated degree exceeds a degree set according to a reference parameter.

In some embodiments of the present invention, the step of refining the time series data collected for the predetermined characteristic information based on a predetermined reference period may include setting the reference period by inferring it from the characteristic information of the collected time series data or based on external parameters.

In some embodiments of the present invention, the step of dividing the refined time series data according to a predetermined division cycle may divide the time series data according to a division cycle calculated by applying, to the reference period, a predetermined weight determined by reflecting the characteristic information.

In some embodiments of the present invention, the step of dividing the refined time series data according to a predetermined division period may include deleting time series data that does not satisfy the division period from among the time series data.

In some embodiments of the present invention, dividing the refined time series data according to a predetermined division cycle includes dividing the refined time series data based on a first division cycle; and re-dividing the divided time series data based on a second division cycle.

In some embodiments of the present invention, the step of verifying the quality of data for the time series data divided according to the division cycle may further include recursively verifying quality, based on the second division cycle, for the verified time series data that is determined not to be defective data.

In some embodiments of the present invention, the step of verifying the quality of data for the time series data divided according to the division cycle may further include adjusting the degree of consecutive missing data and the degree of total missing data set in the reference parameter so that the number of time series data divided according to the second division cycle is varied.

In some embodiments of the present invention, when the time series data is multivariate data including a plurality of characteristic information, the multivariate data is arranged into columns and rows according to each characteristic information and according to time information groups based on the division cycle. The step of verifying the quality of data for the time series data divided according to the division cycle may include: checking, for each time information group of the multivariate data, whether missing data exists for each characteristic information; adding a first counting if missing data exists in a time information group and, if consecutive missing data exists in a plurality of time information groups adjacent to the time information group to which the first counting is added, adding a second counting based on the number of time information groups in which the missing data is consecutive; and calculating the degree of consecutive missing data for each time information group according to the division cycle by summing the first and second countings.

In some embodiments of the present invention, in the step of adding the second counting, when consecutive missing data exists, with respect to a given characteristic information, in a plurality of time information groups adjacent to the time information group to which the first counting has been added, the second counting may be added based on the number of time information groups in which the missing data is consecutive with respect to that characteristic information.

In some embodiments of the present invention, in the step of adding the second counting, when consecutive time information groups in which the entire characteristic information is missing exist among the time information group to which the first counting is added and a plurality of adjacent time information groups, the second counting may be added based on the number of those consecutive time information groups.

In addition, the electronic device according to the second aspect of the present invention includes a processor that divides time series data collected for predetermined characteristic information according to a predetermined division cycle, verifies the quality of data for the time series data divided according to the division cycle, and then selects the verified time series data and processes the data according to data supplementation conditions.

In addition to this, another method for implementing the present invention, another system, and a computer-readable recording medium recording a computer program for executing the method may be further provided.

According to an embodiment of the present invention described above, quality verification is performed based on time series data with periodic characteristics and defective data is processed, so that high-quality data can be used for learning and analysis and overall performance can be improved.

In addition, since data to be supplemented is selected based on the status of missing data included in the data, more rational and high-quality data processing is possible. In addition, since high-quality data is provided based on data supplementation conditions, unreasonable deletion or interpolation work can be avoided, allowing higher quality data analysis to be performed.

In addition, more reasonable and high-quality data processing is possible by applying optimized supplementation methods according to the status of the section containing missing data. Interpolation and replacement methods can be applied differently depending on the purpose of data use, so that higher-quality data supplementation can be performed. In addition, the method can be applied to data that is a combination of multiple single data, so high-quality data supplementation can be performed even when combining data.

The effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

Figure 1 is a block diagram showing the configuration of an electronic device according to an embodiment of the present invention.

Figure 2 is a diagram showing a flowchart of operations performed by an electronic device according to the first embodiment of the present invention.

Figures 3a and 3b are diagrams illustrating an example of refining time series data based on a reference period.

Figure 4 is a diagram showing an example of dividing time series data according to a division cycle.

Figure 5 is a diagram to explain the quality verification process of univariate time series data.

Figures 6a and 6b are diagrams for explaining the quality verification process of multivariate time series data.

Figure 7 is a diagram to explain the process of recursively verifying the quality of data.

Figure 8 is a diagram showing data including missing data.

Figure 9 is a diagram illustrating an operation flowchart of an electronic device according to a second embodiment of the present invention.

Figure 10 is a diagram illustrating setting a section of first data according to a method according to an embodiment of the present invention.

Figure 11 is a diagram illustrating generating second data according to a method according to an embodiment of the present invention.

Figure 12 is a diagram illustrating processing of second data based on data supplementation conditions according to a method according to an embodiment of the present invention.

Figure 13 is a diagram illustrating processing of second data according to a method according to an embodiment of the present invention.

Figure 14 is a diagram illustrating an operation flowchart of an electronic device according to another embodiment of the present invention.

Figure 15 is a diagram showing the operation of an electronic device according to an embodiment of the present invention.

Figure 16 is a diagram showing the operation of an electronic device according to another embodiment of the present invention.

The advantages and features of the present invention and methods for achieving them will become clear by referring to the embodiments described in detail below along with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are merely provided to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the technical field to which the present invention pertains of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used herein is for describing embodiments and is not intended to limit the invention. As used herein, singular forms also include plural forms, unless specifically stated otherwise in the context. As used in the specification, “comprises” and/or “comprising” does not exclude the presence or addition of one or more other elements in addition to the mentioned elements. Like reference numerals refer to like elements throughout the specification, and “and/or” includes each and every combination of one or more of the referenced elements. Although “first”, “second”, etc. are used to describe various components, these components are of course not limited by these terms. These terms are merely used to distinguish one component from another. Therefore, it goes without saying that the first component mentioned below may also be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification may be used with meanings commonly understood by those skilled in the art to which the present invention pertains. Additionally, terms defined in commonly used dictionaries are not interpreted ideally or excessively unless clearly specifically defined.

Below, to aid the understanding of those skilled in the art, the background on which the present invention was proposed will first be described, and then the embodiments of the present invention will be described.

As large amounts of time series data are produced and distributed due to the widespread use of IoT devices, attempts to gain insight by applying analysis, prediction, and classification techniques to time series data are continuing in various industries.

In addition, efforts are continuing to open important data and allow many users to utilize it for various purposes, such as the public data portal, Seoul Open Data Plaza, and card big data platform, centered on the domestic government and public institutions.

In addition, attempts are being made to increase productivity by collecting time series data with various sensors and applying machine learning in domains such as smart farms, smart factories, and smart cities.

As such, time series data is being actively researched in various fields, and outlier detection technology using time series data detects error data generated by sensor and network abnormalities in time series data or abnormal data sections that occur due to abnormal situations.

Additionally, as an example of a technology that uses time series data, there is also a data classification and clustering technique to find similarities and patterns between time series data. This technique reduces processing cost by reducing high-dimensional data to low dimensions, effectively extracts similar features, and displays them visually, which provides insight into the data and makes similar patterns easy to identify.

However, there are problems with using most existing time series data as is. Technologies that use time series data proceed under the assumption that the time series data is flawless. In most real-world data, however, the time series is often asynchronous or irregularly sampled, and cases frequently occur in which time points are missing or the data is incomplete because values fall outside the sensor collection range.

These data can be utilized to the extent of monitoring by identifying their approximate form, but because their contents are incomplete, they are not appropriate for use as detailed analysis and learning data.

A way to solve this problem is to restore and utilize partially lost data as if it were normal data. However, if the amount of lost data is large, restoring and using the data forcibly may lead to incorrect results.

The fundamental solution to this problem is to completely delete data including error data and use only appropriate data sections. However, when deleted, a lot of data becomes unusable and is discarded, and if the original data is insufficient, analysis and utilization may not be possible, so it is necessary to provide standards for deletion.

Therefore, it is necessary to construct a usable dataset through appropriate preprocessing of the data.

In order to solve this problem, an electronic device and method for performing quality verification of time series data according to an embodiment of the present invention aim to enable selection of time series data of usable quality based on the periodic characteristics of the time series data.

Through this, an embodiment of the present invention can expect high analysis and learning performance due to high quality when using learning analysis based on time series data for which quality verification has been completed.

Hereinafter, an electronic device and method for performing quality verification of time series data according to an embodiment of the present invention (hereinafter referred to as the first embodiment) will be described with reference to FIGS. 1 to 7. In addition, in FIGS. 8 to 16, an electronic device that performs data selection based on data supplementation conditions and a method for performing the same (hereinafter referred to as the second embodiment) will be described. Meanwhile, it goes without saying that the first and second embodiments of the present invention may mutually share or partially apply technical features depending on the embodiment at each stage.

The electronic device 100 according to an embodiment of the present invention includes an input unit 110, a communication unit 120, a display unit 130, a memory 140, and a processor 150.

The input unit 110 generates input data in response to user input of the electronic device 100. The user input may include at least one of a user input regarding data that the electronic device 100 is to process, a user input regarding a division cycle, a user input regarding quality verification conditions, a user input regarding data supplementation conditions, and a user input regarding a method for processing missing data.

The input unit 110 includes at least one input means. The input unit 110 may include a keyboard, key pad, dome switch, touch panel, touch key, mouse, menu button, and the like.

The communication unit 120 performs communication with an external device such as a server or a data collection device to receive data. This communication unit 120 may include both a wired communication module and a wireless communication module. The wired communication module can be implemented as a power line communication device, telephone line communication device, home cable (MoCA), Ethernet, IEEE1294, integrated wired home network, and RS-485 control device. In addition, the wireless communication module may be composed of modules implementing functions such as WLAN (wireless LAN), Bluetooth, HDR WPAN, UWB, ZigBee, Impulse Radio, 60GHz WPAN, Binary-CDMA, wireless USB technology, and wireless HDMI technology, as well as 5G (5th generation communication), LTE-A (long term evolution-advanced), LTE (long term evolution), and Wi-Fi (wireless fidelity).

The display unit 130 displays display data according to the operation of the electronic device 100. The display unit 130 may display data needed to verify data based on data quality verification conditions (e.g., a screen for setting quality verification conditions), data needed to select data based on data supplementation conditions (e.g., a screen for setting data supplementation conditions), a screen that displays data processing results, and the like. Alternatively, the display unit 130 may display data required to process missing data, for example, a screen for processing abnormal data among collected data, a screen for identifying information about missing data, a screen for receiving user input, a screen displaying data processing results, and the like. The display unit 130 may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a micro electro mechanical systems (MEMS) display, and an electronic paper display. The display unit 130 may be combined with the input unit 110 and implemented as a touch screen.

The memory 140 stores operation programs of the electronic device 100. Here, the memory 140 is a general term for non-volatile storage devices, which continue to retain stored information even when power is not supplied, and volatile storage devices. For example, the memory 140 may include NAND flash memory such as compact flash (CF) cards, secure digital (SD) cards, memory sticks, solid-state drives (SSD), and micro SD cards; magnetic computer storage devices such as hard disk drives (HDD); and optical disc drives such as CD-ROM and DVD-ROM.

The memory 140 may store data collected from an external device, data on data quality verification conditions, data on data supplementation conditions, information on abnormal data, information on methods for processing missing data, etc. In addition, the memory 140 may store information about a model trained to set a section of data to be processed according to the quality of the data, or about a model trained to identify at least one method of processing missing data based on information about the missing data.

The processor 150 may control at least one other component (eg, hardware or software component) of the electronic device 100 by executing software such as a program, and may perform various data processing or calculations.

The processor 150 divides the time series data collected for predetermined characteristic information according to a predetermined division cycle, verifies the quality of the data for the time series data divided according to the division cycle, and then selects the verified time series data and processes the data according to the data supplementation conditions.

Meanwhile, in one embodiment of the present invention, the processor 150 may use at least one of machine learning, neural network, or deep learning algorithms as an artificial intelligence algorithm to perform data refinement, division, quality verification, and data processing according to data supplementation conditions. Examples of neural networks include the Convolutional Neural Network (CNN), Deep Neural Network (DNN), and Recurrent Neural Network (RNN).

First, the processor 150 refines the time series data collected for certain characteristic information based on a predetermined reference period (S210).

In one embodiment, time series data shows continuous characteristics, and continuous time series data may repeat over time or show common patterns. Additionally, time series data may have periodicity, and the periods may show common and repeating patterns based on units such as 'hour, day, week, month, year'.

For example, outdoor temperature has both daily and yearly periodicity because it is affected by the Earth's rotation and revolution. In addition, changes in carbon dioxide inside schools are likely to have daily and weekly patterns due to the daily routine, and may also have yearly periodicity because window opening patterns vary depending on the external temperature. These patterns play an important role in the analysis and refinement of data and must be considered when utilizing the data.

Meanwhile, in one embodiment of the present invention, predetermined characteristic information refers to characteristic information based on a sensor of time series data. For example, when analyzing climate change in a specific city, the temperature, humidity, precipitation, traffic volume, population density, etc. sensed over time by each sensor in that city can be characteristic information. Or, when comparing the amount of fine dust in each city, Seoul, Busan, Cheongju, etc. may correspond to the characteristic information.

In one embodiment, the processor 150 may basically generate a reference period so that the time stamp of time series data used as input is uniform. However, an embodiment of the present invention is not necessarily limited to this, and the reference period can be set by various methods. As an example, the processor 150 may set the reference period by inferring through characteristic information of the collected time series data, which is the original data, or may set the reference period based on the user's judgment or external parameters.

After the reference period is set, the processor 150 sets new time stamps according to the reference period and re-describes the time series data uniformly according to those time stamps. At this time, if some data is missing from the time series data, the processor 150 may mark the missing data so that it can be distinguished (for example, as NaN).
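As a non-limiting illustration, the refinement step described above can be sketched as follows. The function names `infer_reference_period` and `refine` are hypothetical and do not appear in the specification. Inferring the period as the most frequent gap between samples, and anchoring the time stamp grid at the first sample, are assumptions made only for this sketch.

```python
from collections import Counter
from datetime import datetime, timedelta

NAN = float("nan")  # marker used here for missing data


def infer_reference_period(timestamps):
    """Guess the reference period as the most common gap between samples
    (an assumed heuristic, not taken from the specification)."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b > a]
    return Counter(gaps).most_common(1)[0][0]


def refine(samples, period):
    """Re-describe (timestamp, value) samples on a uniform grid of `period`.

    Each sample is snapped down to the start of its grid slot, and slots
    that receive no sample are filled with NaN so that gaps stay visible.
    """
    ordered = sorted(samples)
    start = ordered[0][0]  # assumption: the grid starts at the first sample
    slots = {}
    for ts, value in ordered:
        index = (ts - start) // period  # timedelta // timedelta gives an int
        slots.setdefault(index, value)  # keep the first sample in each slot
    last = max(slots)
    return [(start + i * period, slots.get(i, NAN)) for i in range(last + 1)]


# Example: 1-minute data in which the 12:03 sample was never recorded.
t0 = datetime(2022, 8, 1, 12, 0, 0)
raw = [(t0 + timedelta(minutes=m), float(m)) for m in (0, 1, 2, 4, 5)]
period = infer_reference_period([ts for ts, _ in raw])
refined = refine(raw, period)
```

In this example the refined series has six uniformly spaced time stamps, and the slot for 12:03 holds NaN.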

Figure 3a shows time series data 310 before refinement according to the reference period, and Figure 3b shows time series data 320 refined according to the reference period. Here, the reference period was set to 1 minute, and time stamps were set in 1-minute increments according to the reference period. Data missing from Figure 3a is marked as NaN.

Referring again to FIG. 2, the processor 150 divides the refined time series data according to a predetermined division cycle (S220).

In one embodiment, the processor 150 may divide time series data according to a basic division cycle. At this time, the basic division cycle can be set to 'seconds, minutes, hours, days, weeks, months, years'.

In another embodiment, because most time series data has periodic characteristics, the processor 150 may divide the time series data according to a division cycle calculated by applying a predetermined weight (N) to the basic division cycle, so that the data can be analyzed and used as learning data according to a specific pattern. That is, the division cycle can be set to 'basic division cycle * N', such as 3 hours, 3 days, 1 year, or 2 months. As another example, when analyzing indoor air quality in a school, a weekly or daily pattern is likely to appear, so a weight corresponding to that pattern is reflected; when analyzing subway usage, a weight corresponding to a daily pattern can be reflected in the basic division cycle.
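For illustration only, the relation 'basic division cycle * N' can be written as a small helper. The name `division_cycle` and the `BASIC_CYCLES` table are hypothetical. Month and year cycles are omitted from this sketch because their length depends on the calendar and cannot be expressed as a fixed duration.

```python
from datetime import timedelta

# Basic division cycles that have a fixed length (assumed table for this sketch).
BASIC_CYCLES = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}


def division_cycle(basic, weight=1):
    """Return 'basic division cycle * N' as a timedelta."""
    if weight < 1:
        raise ValueError("weight N must be a positive integer")
    return BASIC_CYCLES[basic] * weight


three_hours = division_cycle("hour", 3)
three_days = division_cycle("day", 3)
```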

Meanwhile, if there is data that does not satisfy the division cycle among the time series data, the processor 150 may delete the corresponding data.

In one embodiment, the processor 150 may delete data that does not completely satisfy the division cycle among the detailed data constituting the time series data without selecting it. However, it is not necessarily limited to the corresponding embodiment, and of course, data can be selected and utilized as needed even if the division cycle is not completely satisfied.

For example, if the first and last values of time series data do not satisfy the division cycle as shown in Table 1, the corresponding data can be excluded.

Division cycle	Period (yyyy-mm-dd HH:MM:SS)
Hour	00:00~59:59
Day	00:00:00~23:59:59
Week	MON 00:00:00~SUN 23:59:59
Month	01 00:00:00~31 23:59:59
Year	01-01 00:00:00~12-31 23:59:59

The example in Figure 4 is the result of dividing the time series data 400 of '2020-05-29 23:59:00 ~ 2020-06-20 01:00:00' according to a division cycle of 1 day; the time series data is divided into 21 parts (400-1 to 400-N) according to the division cycle. Here, the 1-day division cycle is set to '00:00:00~23:59:59', and each piece of data divided by cycle must be complete and contain the same amount of detailed data. Accordingly, the processor 150 may delete the data of dates '05-29' and '06-20', which do not satisfy the complete division cycle, from the entire time series data.

In another embodiment, the processor 150 may divide the time series data by applying a double division cycle. That is, the processor 150 may divide the refined time series data based on a first division cycle and re-divide the time series data divided according to the first division cycle based on a second division cycle (or a third division cycle, etc.). At this time, of course, the first division cycle may be set to be smaller than the second division cycle.
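The one-day division in the Figure 4 example can be sketched as follows, as a non-limiting illustration. The function name `split_by_day` is hypothetical. The sketch assumes the data has already been refined so that every reference-period slot holds exactly one row, with NaN where the value is missing.

```python
from datetime import datetime, timedelta


def split_by_day(samples, period):
    """Group refined (timestamp, value) samples by calendar day and keep
    only the days that cover the full 00:00:00~23:59:59 division cycle."""
    expected = timedelta(days=1) // period  # number of slots in a complete day
    days = {}
    for ts, value in samples:
        days.setdefault(ts.date(), []).append((ts, value))
    # Days with fewer rows than expected do not satisfy the division cycle.
    return {day: rows for day, rows in days.items() if len(rows) == expected}


# Rebuild the time range of the Figure 4 example at a 1-minute reference period.
step = timedelta(minutes=1)
ts = datetime(2020, 5, 29, 23, 59)
end = datetime(2020, 6, 20, 1, 0)
samples = []
while ts <= end:
    samples.append((ts, 0.0))  # placeholder values, not real sensor data
    ts += step

parts = split_by_day(samples, step)
```

With this input, the partial days 05-29 and 06-20 are dropped and 21 complete daily parts remain, from 05-30 through 06-19.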

For example, the processor 150 may divide the time series data divided according to the first division cycle of daily units again according to the second division cycle of weekdays and weekends and select only necessary data.

For example, the processor 150 may divide the time series data divided according to the first division cycle of daily units again according to the second division cycle of 'Monday' and 'Tuesday-Sunday', and select and use only the latter 'Tuesday-Sunday' data.

For example, when investigating weekly characteristics in a museum, similar patterns are likely to appear for each specific group of days, so the processor 150 may divide the time series data divided according to the second division cycle of 'Monday' and 'Tuesday-Sunday' again according to the third division cycle of 'Monday', 'Tuesday-Friday', and 'Saturday-Sunday'.
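As a minimal sketch of such re-division, daily parts can be regrouped by day of the week. The `DAY_GROUPS` table and the function `regroup_days` are hypothetical names, and the grouping simply mirrors the museum example above.

```python
from datetime import date

# Hypothetical day groups keyed by Python's weekday(): Monday is 0, Sunday is 6.
DAY_GROUPS = {
    "Monday": {0},
    "Tuesday-Friday": {1, 2, 3, 4},
    "Saturday-Sunday": {5, 6},
}


def regroup_days(daily_parts, groups=DAY_GROUPS):
    """Re-divide daily parts (a mapping keyed by date) into named day groups."""
    result = {name: [] for name in groups}
    for day in sorted(daily_parts):
        for name, weekdays in groups.items():
            if day.weekday() in weekdays:
                result[name].append(day)
    return result


# One week of (empty) daily parts; 2020-06-01 is a Monday.
daily_parts = {date(2020, 6, d): [] for d in range(1, 8)}
grouped = regroup_days(daily_parts)
```

Selecting only the needed data then amounts to reading one group, for example `grouped["Tuesday-Friday"]`.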

Additionally, in one embodiment of the present invention, the second division cycle may be a sub-division cycle of the first division cycle in Table 1. When time series data is divided based on this division cycle, the data can be selected and used as analysis and learning data as shown in the following examples. The reason for applying this double division cycle is that a way of describing parameters is needed when, for example, only the data from 9 to 10 o'clock in July and August of every summer must be extracted.

- For example, if time series data is divided by a year division cycle and a month division cycle, the data selection condition (year, sub={0:[0,0,0,0,0,0,0,0,0,1,0,0]}) selects only October among the months, which are the next lower basic division cycle of the year.

- For example, if time series data is divided by a year division cycle, month division cycle, and week division cycle, the data selection condition (year, sub={0:[0,0,0,0,0,0,0,0,0,1,0,0], 1:[1, 1, 0, 0, 0, 0, 0]}) selects only the Monday and Tuesday data of each week, the next lower unit, for the month of October.

- For example, if time series data is divided by a year division cycle, month division cycle, week division cycle, and day division cycle, the data selection condition (year, sub={0:[0,0,0,0,0,0,0,0,0,1,0,0], 1:[1, 1, 1, 1, 1, 1, 1], 2:[1, 1, 1, 0, ----- ]}) selects all days of the week for the month of October and, of those, only the data for days 1, 2, and 3.
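The selection conditions in the examples above can be read as nested 0/1 masks. The sketch below is one possible interpretation, not the notation's definitive semantics. It assumes that key 0 holds twelve month flags starting with January, that key 1 holds seven weekday flags starting with Monday, and that a missing key selects everything at that level. The function name `select_days` is hypothetical, and the partially elided third-level mask is not modeled.

```python
from datetime import date, timedelta


def select_days(days, sub):
    """Keep the days whose month and weekday are switched on in `sub`.

    sub[0]: 12 flags, January first (assumed order).
    sub[1]: 7 flags, Monday first (assumed order); optional.
    """
    month_mask = sub.get(0, [1] * 12)
    weekday_mask = sub.get(1, [1] * 7)
    return [
        d for d in days
        if month_mask[d.month - 1] and weekday_mask[d.weekday()]
    ]


# All days of 2020, filtered to the Mondays and Tuesdays of October.
year_days = [date(2020, 1, 1) + timedelta(days=i) for i in range(366)]
sub = {0: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 1: [1, 1, 0, 0, 0, 0, 0]}
selected = select_days(year_days, sub)
```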

Referring again to FIG. 2, the processor 150 verifies the quality of the data for time series data divided according to the division cycle (S230).

In one embodiment, the processor 150 checks the status of missing values in the corresponding division cycle for time series data divided according to the division cycle and calculates at least one of the degree of consecutive missing data and the degree of total missing data. In addition, if the calculated degree of consecutive missing data and degree of total missing data exceed the degree set according to the reference parameter, the processor 150 determines the divided time series data of the corresponding division cycle to be defective data.

Here, the degree of consecutive missing data and the degree of total missing data refer to concepts such as number, ratio, probability, etc. of data.

At this time, an embodiment of the present invention can perform quality verification by distinguishing between cases where time series data is univariate data and cases where it is multivariate data.

Univariate data is data that includes only one characteristic information. In this case, the processor 150 performs verification on only one characteristic information. In addition, the processor 150 may determine the data to be defective if the degree of consecutive missing data or the degree of total missing data is greater than the set reference parameter, or if all of these are satisfied.

Referring to FIG. 5, when verifying the univariate data 510 for characteristic information F1 among the entire time series data 500, the numbers of consecutive missing data in the F1 data 510 are 2 and 1, and the total number of missing data is calculated as 3. At this time, when the reference parameter is set to {number of consecutive missing data: 2, total number of missing data: 3}, the number of consecutive missing data in the F1 data 510 reaches 2, so the F1 data is determined to be defective data regardless of the total number of missing data. In contrast, if the reference parameter is set to {number of consecutive missing data: 5, total number of missing data: 10}, the F1 data 510 is determined to be normal data rather than defective.
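The univariate check described above can be sketched as follows. This is an illustrative sketch, not the claimed implementation: the function names and the sample column are hypothetical, and comparing with ">=" is an assumption chosen so that the result agrees with the FIG. 5 example, where a run of 2 against a limit of 2 is already treated as defective.

```python
def missing_runs(values):
    """Return the length of each run of consecutive missing (None) values."""
    runs, current = [], 0
    for v in values:
        if v is None:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def is_defective(values, max_consecutive, max_total):
    """Flag the segment when either count reaches its reference parameter.

    The ">=" comparison is an assumption made to match the FIG. 5 example.
    """
    runs = missing_runs(values)
    longest = max(runs, default=0)
    return longest >= max_consecutive or sum(runs) >= max_total

# Hypothetical F1 column with missing runs of length 2 and 1 (3 in total).
f1 = [1.0, None, None, 1.2, 1.1, None, 1.3]

strict = is_defective(f1, max_consecutive=2, max_total=3)    # True
lenient = is_defective(f1, max_consecutive=5, max_total=10)  # False
```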

FIGS. 6A and 6B are diagrams for explaining the quality verification process of multivariate time series data 610 and 620.

Unlike univariate data, multivariate data is time series data that includes a plurality of characteristic information, and quality verification of multivariate time series data means verifying the quality of a plurality of characteristic information.

If the time series data is multivariate data, the processor 150 may organize the time series data into columns and rows, with each characteristic information as a column and each time information group according to the division cycle as a row. At this time, if the multivariate time series data includes N pieces of characteristic information, the quality can be verified by selecting data corresponding to 2 to N pieces of characteristic information, that is, a plurality of characteristic information.

In one embodiment, the processor 150 may determine that a row is missing if there is at least one missing data based on the time information group (row). According to this, in the case of Figure 6a, 6 out of 7 rows are determined to be missing rows.

In another embodiment, the processor 150 may determine that a row is missing when all of the data constituting a time information group (row) is missing. According to this, in the case of FIG. 6A, one row (P1) is determined to be a missing row, and in FIG. 6B, two rows (P2) are determined to be missing rows.
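The two row-level criteria above differ only in whether one missing value or all missing values mark a row. A minimal sketch, assuming a hypothetical table that is not the data of FIG. 6A or FIG. 6B:

```python
def missing_rows(table, mode="any"):
    """Mark each row (time information group) as missing.

    mode="any": the row is missing if at least one value is missing.
    mode="all": the row is missing only if every value is missing.
    """
    check = any if mode == "any" else all
    return [check(v is None for v in row) for row in table]

# Hypothetical table: rows are time information groups, columns are features.
table = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.1],
    [None, None, None],
    [1.3, None, 3.3],
]

by_any = missing_rows(table, mode="any")  # [True, False, True, True]
by_all = missing_rows(table, mode="all")  # [False, False, True, False]
```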

In another embodiment, the processor 150 checks whether missing data exists for each characteristic information (column) in each time information group (row) of the multivariate data. If missing data exists in a time information group, a first counting is added; if missing data continues across a plurality of time information groups adjacent to the time information group to which the first counting is added, a second counting is added based on the number of time information groups (rows) in which the missing data is consecutive. Thereafter, the processor 150 may calculate the degree of consecutive missing data for each time information group according to the division cycle by adding the first and second countings.

According to this, in the case of FIG. 6A, the number of consecutive missing data is determined as (1, 0, 5, 5, 5, 5, 5). That is, in the case of the first row, since there is missing data in its own row, first counting = 1 is added, and since there is no missing data in the second row, which is the adjacent row, second counting = 0. In the case of the third row, since there is missing data in its own row, first counting = 1 is added, and since missing data continues in all of the fourth to seventh rows, which are the adjacent rows, second counting = 4 is added based on the number of those rows. The sum of the first and second countings, 5, is calculated as the number of consecutive missing data.
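One way to read the first and second countings is that each row receives the length of the run of missing rows it belongs to. The sketch below follows that reading; the function name is hypothetical, and the input flags are taken from the description of FIG. 6A above (row 1 missing, row 2 complete, rows 3 to 7 missing), not from the figure itself.

```python
def run_length_per_row(flags):
    """For each row, return the length of the run of flagged rows it belongs to.

    Rows that are not flagged receive 0. The first counting (1) plus the
    second counting (number of consecutive neighbours) equals the run length.
    """
    counts = [0] * len(flags)
    i = 0
    while i < len(flags):
        if not flags[i]:
            i += 1
            continue
        j = i
        while j < len(flags) and flags[j]:
            j += 1
        for k in range(i, j):
            counts[k] = j - i
        i = j
    return counts

# Row flags as described for FIG. 6A: a row is flagged if it has any missing value.
row_has_missing = [True, False, True, True, True, True, True]
consecutive = run_length_per_row(row_has_missing)  # [1, 0, 5, 5, 5, 5, 5]
```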

In another embodiment, the processor 150 checks whether missing data exists for each characteristic information (column) in each time information group (row) of the multivariate data. If missing data exists in a time information group, a first counting is added; if missing data that is consecutive with respect to the same characteristic information exists in a plurality of time information groups adjacent to the time information group to which the first counting is added, a second counting is added based on the number of time information groups in which the missing data is consecutive with respect to that characteristic information. Thereafter, the processor 150 may calculate the degree of consecutive missing data for each time information group according to the division cycle by adding the first and second countings.

According to this, in the case of FIG. 6A, the number of consecutive missing data is determined as (1, 0, 2, 2, 1, 1, 1), and in the case of FIG. 6B, as (4, 4, 4, 4, 1, 1, 1). As an example, looking at the third row of FIG. 6A, since there is missing data in its own row, first counting = 1 is added, and since one of the adjacent rows, the fourth row, has missing data that is consecutive in the same column, second counting = 1 is added. The sum of the first and second countings, 2, is calculated as the number of consecutive missing data. As another example, in the case of the first row of FIG. 6B, first counting = 1 is added, and since missing data consecutive to the first row exists in each of the adjacent second to fourth rows (F3 in the second row, F3 in the third row, F2 in the fourth row), second counting = 3 is added. The sum of the first and second countings, 4, is calculated as the number of consecutive missing data.
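The column-based counting can be sketched by linking two adjacent rows whenever they share at least one column in which both are missing, and then assigning each row the length of its linked chain. This linking rule is one possible reading of the description, and the table below is a hypothetical table constructed to reproduce the counts stated for FIG. 6B; it is not the actual figure data.

```python
def column_linked_counts(table):
    """Count consecutive missing rows, where adjacency requires a shared missing column."""
    n = len(table)
    missing_cols = [{c for c, v in enumerate(row) if v is None} for row in table]
    counts = [0] * n
    i = 0
    while i < n:
        if not missing_cols[i]:
            i += 1
            continue
        j = i
        # Extend the chain while the next row shares a missing column with row j.
        while j + 1 < n and missing_cols[j] & missing_cols[j + 1]:
            j += 1
        for k in range(i, j + 1):
            counts[k] = j - i + 1
        i = j + 1
    return counts

# Hypothetical table with columns F1, F2, F3 (assumed layout).
table = [
    [1.0, 2.0, None],    # row 1: F3 missing
    [None, None, None],  # row 2: all missing
    [None, None, None],  # row 3: all missing
    [1.0, None, 3.0],    # row 4: F2 missing
    [None, 2.0, 3.0],    # row 5: F1 missing
    [1.0, 2.0, None],    # row 6: F3 missing
    [None, 2.0, 3.0],    # row 7: F1 missing
]
counts = column_linked_counts(table)  # [4, 4, 4, 4, 1, 1, 1]
```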

In another embodiment, the processor 150 checks whether missing data exists for each characteristic information (column) in each time information group (row) of the multivariate data. If missing data exists in a time information group, a first counting is added; if time information groups in which all characteristic information is missing occur consecutively among the time information group to which the first counting is added and the plurality of time information groups adjacent to it, a second counting is added based on the number of those consecutive time information groups. Thereafter, the processor 150 may calculate the degree of consecutive missing data for each time information group according to the division cycle by adding the first and second countings.

According to this, in the case of Figure 6b, the number of consecutive missing data is determined as (1, 2, 2, 1, 1, 1, 1). For example, in the case of the second row, first counting = 1 is added, and since there is a third row missing all characteristic information among adjacent rows, second counting = 1 is added. And 2, which is the sum of the first and second counting, is calculated as the number of consecutive missing data.
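A sketch of this variant follows: every row with any missing value counts 1, and rows that are entirely missing additionally count the length of the run of entirely missing rows they belong to. The function name is hypothetical, and the table is a hypothetical one built to be consistent with the counts stated for FIG. 6B.

```python
def fully_missing_counts(table):
    """Count 1 for any row with missing data; use run length for fully missing rows."""
    has_any = [any(v is None for v in row) for row in table]
    is_full = [all(v is None for v in row) for row in table]
    counts = [1 if flag else 0 for flag in has_any]
    i = 0
    while i < len(table):
        if not is_full[i]:
            i += 1
            continue
        j = i
        while j < len(table) and is_full[j]:
            j += 1
        for k in range(i, j):
            counts[k] = j - i  # first counting plus consecutive fully missing neighbours
        i = j
    return counts

# Hypothetical table with columns F1, F2, F3 (assumed layout).
table = [
    [1.0, 2.0, None],
    [None, None, None],
    [None, None, None],
    [1.0, None, 3.0],
    [None, 2.0, 3.0],
    [1.0, 2.0, None],
    [None, 2.0, 3.0],
]
counts = fully_missing_counts(table)  # [1, 2, 2, 1, 1, 1, 1]
```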

As such, an embodiment of the present invention can verify whether data is defective by calculating the degree of consecutive missing data according to the above-described methods, or can verify whether data is defective based on the degree of total missing data.

In one embodiment, the processor 150 may recursively and repeatedly perform quality verification of time series data (S235), and may verify the quality in the order of long-term to short-term cycles.

In other words, when the quality of time series data divided by the division cycle is verified, the average statistical quality with respect to the overall missing data may be at a satisfactory level, but if the missing data is concentrated in a specific part, or if the nature of the distribution concentrates the problem in one place, recovery of the missing data may be difficult.

For example, this may be the case where the quality of the data is verified at one-week intervals and determined to be normal data, but most of the defective data exists only on Thursday.

Accordingly, in one embodiment of the present invention, even if a data set divided by a basic division cycle is used, re-verification can be performed by recursively re-dividing the data into sub-cycle data during quality verification.

To this end, the processor 150 may recursively verify the quality of verified time series data that is divided according to the first division cycle and then determined to be not defective data based on the second division cycle.

For example, suppose the first division cycle is set to '1 week' and the reference parameter for verifying the quality of the divided data is set to 'number of consecutive missing data = 3, total number of missing data = 30'. If 14 errors occur in a pattern where the number of consecutive missing data is 2 (total number of missing data = 28), the quality verification is passed, but the Thursday data, where these errors are concentrated, can be considered difficult to utilize.

Therefore, in one embodiment of the present invention, even if time series data is divided and used according to the basic division cycle (first division cycle), quality verification can be performed recursively using the second division cycle (or a lower unit such as the third division cycle).

Meanwhile, depending on the embodiment, the processor 150 may adjust at least one of the degree of continuous missing data and the degree of total missing data set in the reference parameter by varying the number of time series data divided according to the second division cycle.
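The one-week example can be sketched as a two-level check. The following is an illustrative sketch only: the function names, the sampling rate of 96 samples per day, and the rule of dividing the total-missing limit by the number of sub-segments are all assumptions; the description states only that the reference parameter may be adjusted for the second division cycle.

```python
def missing_stats(values):
    """Return (longest run of missing values, total number of missing values)."""
    longest = current = total = 0
    for v in values:
        if v is None:
            current += 1
            total += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest, total

def verify(values, max_run, max_total):
    """Pass when both counts stay below their reference parameters (assumed rule)."""
    longest, total = missing_stats(values)
    return longest < max_run and total < max_total

def verify_recursive(values, parts, max_run, max_total):
    """Verify the whole segment, then each of `parts` equal sub-segments.

    Scaling the total-missing limit by `parts` is an assumption.
    Returns (passed, indices of failing sub-segments).
    """
    if not verify(values, max_run, max_total):
        return False, []
    size = len(values) // parts
    sub_limit = max_total / parts
    failed = [
        p for p in range(parts)
        if not verify(values[p * size:(p + 1) * size], max_run, sub_limit)
    ]
    return not failed, failed

# Hypothetical week: 7 days x 96 samples, with 14 two-sample gaps on day index 3.
day_ok = [1.0] * 96
thursday = [None, None, 1.0] * 14 + [1.0] * 54
week = day_ok * 3 + thursday + day_ok * 3

weekly_pass = verify(week, max_run=3, max_total=30)             # True
recursive = verify_recursive(week, 7, max_run=3, max_total=30)  # (False, [3])
```

With a Monday-first week, day index 3 corresponds to Thursday: the weekly check passes with 28 missing values, while the per-day re-verification isolates the day on which they are concentrated.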

Referring again to FIG. 2, the processor 150 selects verified time series data (S240) and processes the data according to data supplementation conditions (S250).

In one embodiment, the processor 150 selects data that has completed and passed quality verification and then performs supplementary processing on the missing data. At this time, missing data processing methods include, for example, “mean”, “median”, “frequent”, “ffill”, “bfill”, “linear_interpolation”, “spline_interpolation”, “stineman_interpolation”, “KNN”, “ARIMA”, “Randomforest”, “NAOMI”, and “BRITS”, but are not limited thereto.
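Two of the listed methods, “ffill” and “linear_interpolation”, are simple enough to sketch in plain Python. The functions below are hypothetical illustrations of the general techniques and are not the implementation used by the processor 150; missing values are represented as `None` by assumption.

```python
def ffill(values):
    """Forward fill: carry the last observed value into each gap."""
    out, last = [], None
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out

def linear_interpolation(values):
    """Fill interior gaps on a straight line between neighbouring observations.

    Gaps before the first or after the last observation are left as None.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out

series = [10.0, None, None, 16.0, None]
filled_forward = ffill(series)                # [10.0, 10.0, 10.0, 16.0, 16.0]
filled_linear = linear_interpolation(series)  # [10.0, 12.0, 14.0, 16.0, None]
```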

Meanwhile, in the above description, steps S210 to S250 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. Additionally, some steps may be omitted or the order between steps may be changed as needed.

Hereinafter, with reference to FIGS. 8 to 16, an electronic device and method for performing data selection based on data supplementation conditions according to a second embodiment of the present invention will be described. Meanwhile, it goes without saying that the content described in FIG. 8 and below can be mutually applied to the electronic device and method for performing quality verification of time series data according to the first embodiment described in FIG. 1 and below.

First, to aid the understanding of those skilled in the art, the background on which the present invention was proposed will first be described, and then the embodiments of the present invention will be described.

With the development of industrial technology and information and communication technology, the amount of data is increasing explosively, and the performance of data utilization technologies such as data mining and machine learning is gradually improving. At this time, in order to obtain good results using data utilization technology, the prerequisite that the data is flawless must be satisfied. However, in real environments, missing or abnormal data frequently occurs for various reasons.

When processing data that contains missing or anomalous data, this can have a significant impact on the conclusions that can be drawn from the data.

As a way to handle missing data, for example, when each row is independent in table format data, batch deleting the rows containing missing data is the most widely used method and is simple to apply. However, in the case of time series data that depends on the passage of time, the time at which the data was acquired is important, so data continuity is difficult to guarantee if a specific row is arbitrarily deleted. Therefore, in the case of time series data, it is preferable to delete all data before and after the time the missing data occurs rather than partially deleting the missing data.

When using this method of collectively deleting missing data, the amount of data deleted varies depending on the location of the missing data, and in some cases, a lot of data may be deleted.

Therefore, in general, a method for eliminating missing values is applied to time series data by interpolating missing data as much as possible. However, this method may also produce low-quality data due to unreasonable interpolation work if the time series data includes an amount of missing data that exceeds a certain threshold, thereby reducing the meaning of recovery.

In addition, there is no consideration for missing data that inevitably appears when combining multiple pieces of data due to deletion and interpolation of missing data in batches, so a flexible processing method for missing data that appears due to combining data is needed.

To this end, an electronic device and method for performing data selection based on data supplementation conditions according to an embodiment of the present invention can variably determine the extent to which missing data can be utilized, based on the quality desired by the user. Therefore, even if the time series data includes missing data, the data can be recovered and utilized more efficiently by working only on the selected data. In addition, an embodiment of the present invention can handle missing data in consideration of the purpose of utilizing the data or the quantity and quality of the data, and can be applied not only to single data but also to data that is a combination of multiple single data.

Hereinafter, a detailed description will be given focusing on the attached drawings.

FIG. 8 is a diagram illustrating data 800 including missing data.

Data 800 in FIG. 8 is a table of data collected according to time (Time, T) for each feature information (Feature, N), and consists of 10 different feature information and 10 times.

Although integrity is assumed when analyzing data, in the process of collecting actual data, missing or abnormal data is frequently generated for various reasons.

In one embodiment of the present invention, missing data is comprehensively defined as data that cannot be converted and expressed in any way, such as numbers or letters, and is data that cannot be defined or does not exist. This means that no data was collected at that time, or that data was collected but missing during the process of transmitting to a device such as a server. Missing data values can be expressed in various ways, such as extreme values such as “-999” or fixed characters such as “NaN” or “NA”. However, there are cases where non-standardized notation of missing data makes it difficult to clearly determine normal and abnormal data after the data is recorded. Therefore, representative libraries that process data mark missing data as “NaN” or “NA” for simplicity and functionality.

Abnormal data refers to data that has a negative impact on the results when analyzing collected data. For example, it refers to erroneous data, such as collected data that has abnormal values or that is outside the allowable measurement range of the sensor that collects the data. In the present invention, abnormal data among the collected data can be processed by replacing it with missing data, or can be interpolated into appropriate data using data collected before and after the abnormal data. In the present invention, abnormal data is expressed as “NaN” or “NA” and replaced with missing data.

If a method of batch deleting data is used to handle the missing data 810, a perfect data set free of contamination from missing data can be obtained, but depending on the location of the missing data, so much may be deleted that what remains is insufficient to be used as data. For example, if rows containing missing data 810 are collectively deleted from data 800, only rows T1 and T10 remain, which may be insufficient to obtain useful information using data 800.

Alternatively, when a method of batch interpolation of data is used to process the missing data 810, the data can be preserved as much as possible by arbitrarily recovering the missing data based on adjacent data or past data. However, since the recovered data is not accurate data, excessive interpolation may result in poor data quality, contaminating the results of analysis and learning.

For example, when rows including missing data 810 in data 800 are batch interpolated, the data in column N3 is interpolated using only the data obtained in rows T1 and T10, so the quality of the data generated by interpolation may decrease. In addition, the accuracy of interpolation cannot be guaranteed because missing data occurred irregularly in the data in columns N7, N8, and N10.

Therefore, for each data in column N3, column N7, column N8, and column N10, a method of determining whether the data can be recovered and whether recovering the data will improve the quality of the data is needed.

The processor 150 according to an embodiment of the present invention sets a section of first data to be processed among data collected for at least one characteristic information (S910).

Meanwhile, as previously explained in relation to FIG. 8, characteristic information refers to the content of collected data, and the collected data is collected in time series for at least one piece of characteristic information.

The processor 150 may receive data collected from an external device such as a server, but the data may be collected by the electronic device 100 and is not limited to any one.

The processor 150 may set the section of the first data based on the required time section. At this time, the first data becomes the object to be processed among the collected data.

When analyzing collected data, for example, when applying data pattern classification using clustering, performance can be improved by excluding data with a lot of missing data from the analysis. However, in the case of data containing a certain degree of missing data, performance can be improved by recovering the data using interpolation, etc. and then utilizing it as much as possible. In other words, standards are needed for how much data containing missing data will be tolerated and selected. Therefore, appropriately setting the first data can contribute to improving the processing quality of the collected data and produce correct results.

In one embodiment, the processor 150 may set the first section of the first data based on the degree of missing data included in each section among the plurality of sections of the first data. For example, when setting a time section using collected data, there may be a plurality of sections that can be set as first data. If the degree of missing data included in a specific section among a plurality of sections is small, the quality of the data can be evaluated to be better than that of other sections. Accordingly, the processor 150 may set the section containing the least amount of missing data among the plurality of sections of the first data as the first section of the first data.

Additionally, the processor 150 may set the first section of the first data based on the degree of continuity of the missing data included in the first section or the total degree of the missing data included in the first section. For example, given one section containing three consecutive pieces of missing data and another section containing three pieces of missing data that are scattered and can be supplemented by interpolation, the latter section is the more valid data and is highly likely to be set as the first section.

In another embodiment, the processor 150 identifies the overall degree of missing data in the collected data, and may set, as the first section of the first data, a section in which the degree of missing data is lower than the overall degree of missing data.
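Choosing the section with the least missing data can be sketched as a sliding-window search. In the sketch below, the function names, the fixed window width, and the tie-breaking rule (prefer the shorter longest run when totals are equal) are assumptions for illustration; the description does not fix a particular search procedure.

```python
def missing_stats(values):
    """Return (longest run of missing values, total number of missing values)."""
    longest = current = total = 0
    for v in values:
        if v is None:
            current += 1
            total += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest, total

def best_section(values, width):
    """Return (start index, (total missing, longest run)) of the best window.

    Windows are ranked by total missing data first, then by longest run.
    """
    best_start, best_key = 0, None
    for start in range(len(values) - width + 1):
        longest, total = missing_stats(values[start:start + width])
        key = (total, longest)
        if best_key is None or key < best_key:
            best_start, best_key = start, key
    return best_start, best_key

# Hypothetical collected data with a leading gap and scattered missing values.
collected = [None, None, None, 1.0, None, 2.0, None, 3.0, 4.0, None]
start, key = best_section(collected, width=6)  # start == 3, key == (2, 1)
```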

The processor 150 according to an embodiment of the present invention generates second data by resetting the missing data included in the section of the first data (S920).

The section of the first data may include not only missing data but also uncollected data. Uncollected data refers to cases where, excluding data missing during data collection, when different data are listed in time series, no data is collected because the data collection start time or end time is different.

According to an embodiment of the present invention, resetting missing data means setting uncollected data included in the section of the first data as missing data. This is to unify the data so that it receives the same processing by changing the format of existing missing data and uncollected data to be the same.
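Resetting uncollected data to missing data amounts to placing every data set on a shared time axis and marking absent timestamps with the same missing marker. A minimal sketch, assuming hypothetical data sets stored as timestamp-to-value mappings and `NaN` as the unified marker:

```python
import math

def align(series_by_name, timeline):
    """Place every series on a shared timeline; timestamps with no record become NaN."""
    return {
        name: [records.get(t, math.nan) for t in timeline]
        for name, records in series_by_name.items()
    }

# Hypothetical first data: D1 lost one value in transit, D2 started collecting late.
raw = {
    "D1": {0: 1.0, 1: 1.1, 2: math.nan, 3: 1.3},  # NaN: missing during collection
    "D2": {2: 2.2, 3: 2.3},                       # timestamps 0 and 1 uncollected
}
second_data = align(raw, timeline=range(4))
```

After alignment, the uncollected timestamps of `D2` and the originally missing value of `D1` carry the same marker, so later processing treats them identically.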

The processor 150 according to an embodiment of the present invention processes the second data based on data supplementation conditions provided to select data that needs supplementation (S930).

According to an embodiment of the present invention, the processor 150 may set data supplementation conditions based on at least one of the ratio, period, and degree of missing data included in the second data. At this time, the data supplementation condition may be applied to one data set among the data collected according to at least one characteristic. For example, in the case of data collected for multiple characteristics, it can be applied to the data set collected corresponding to each characteristic. Alternatively, in the case of data collected under two or more different conditions for one characteristic, it can be applied to the data set collected corresponding to each condition.

At this time, the processor 150 may receive user input for data supplementation conditions through the input unit 110 and set them, or may receive data on data supplementation conditions from an external device through the communication unit 120. In addition, the processor 150 may perform at least part of the data analysis, processing, and generation of result information for setting optimized data supplementation conditions for processing the collected data or the second data, using at least one of a rule-based algorithm or an artificial intelligence algorithm such as a machine learning, neural network, or deep learning algorithm.

At this time, processing the second data includes performing various data processing, such as selecting third data that satisfies the data supplementation conditions from the second data, deleting the second data or the selected third data, or interpolating it.

More specifically, looking at data supplementation conditions, the processor 150 may process the second data when the ratio of missing data included in the second data is higher than a predefined value.

The processor 150 may process the second data when the period of missing data included in the second data is higher than a predefined value. At this time, the period of missing data may refer to a period of consecutive missing data or a period of the sum of periods corresponding to missing data distributed in the second data.

The processor 150 may process the second data when the degree of missing data included in the second data is higher than a predefined value.
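The ratio and period conditions above can be sketched as threshold checks on one data set. The names and the sample row are hypothetical, and a strict ">" comparison is used because the text says "higher than a predefined value"; the "degree" condition is left out because the description does not define it separately.

```python
def missing_profile(values):
    """Summarise missing data: ratio, longest consecutive period, summed period."""
    longest = current = total = 0
    for v in values:
        if v is None:
            current += 1
            total += 1
            longest = max(longest, current)
        else:
            current = 0
    return {
        "ratio": total / len(values),
        "longest_period": longest,
        "total_period": total,
    }

def needs_supplement(values, max_ratio=None, max_period=None):
    """True when any configured limit is exceeded (limits set to None are skipped)."""
    profile = missing_profile(values)
    if max_ratio is not None and profile["ratio"] > max_ratio:
        return True
    if max_period is not None and profile["longest_period"] > max_period:
        return True
    return False

# Hypothetical data set: 4 of 8 values missing, longest run of 3.
row = [1.0, None, None, None, 2.0, 3.0, None, 4.0]
by_ratio = needs_supplement(row, max_ratio=0.4)   # True  (0.5 > 0.4)
by_period = needs_supplement(row, max_period=3)   # False (3 is not > 3)
```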

According to an embodiment of the present invention, the work is performed by selecting the data to be supplemented based on the situation of the missing data included in the data, rather than deleting or interpolating data in batches, so that more rational and higher-quality data processing is possible.

According to an embodiment of the present invention, only high-quality data can be used by efficiently selecting time series data even if it includes missing data based on the quality desired by the user.

Figures 10 to 13 sequentially show an embodiment of processing data collected according to the operation flow previously described in Figure 9. In this embodiment, data D1 to D7 collected for one characteristic are processed. However, the present invention is not limited to this embodiment and can process data collected for a plurality of characteristics, in which case the data shown in FIGS. 10 to 13 may exist for each characteristic, or D1 to D7 may each be a different characteristic.

Figure 10 is a diagram illustrating setting a section of first data according to a method according to an embodiment of the present invention. FIG. 10 describes step S910 of FIG. 9.

FIG. 10 shows data 1000 including missing data 1010 and uncollected data 1020. The processor 150 may set a section 1030 of the first data to be processed in the collected data 1000. According to an embodiment of the present invention, the processor 150 may set the first section 1030 of the first data among the plurality of sections of the first data in consideration of the total missing data 1010 and the uncollected data 1020.

For example, in the case of the currently set section 1030, the number of missing data and uncollected data is 7, whereas if the section is set one space ahead, the number of missing data and uncollected data is 9. In addition, it can be seen that the number of consecutive missing data increases to three, as shown in row D3, and the quality of the data further deteriorates.

According to an embodiment of the present invention, by setting the section of the first data among the collected data, it can contribute to further improving data quality as part of the preprocessing process of selecting data that satisfies data supplementation conditions.

FIG. 11 is a diagram illustrating generating second data according to a method according to an embodiment of the present invention. FIG. 11 describes step S920 of FIG. 9.

FIG. 11 shows second data 1100 generated by processing the first data previously set in FIG. 10. According to one embodiment of the present invention, the processor 150 generates the second data 1100 by resetting the missing data 1010 included in the section 1030 of the first data.

At this time, resetting the missing data means setting the uncollected data 1020 included in the first data section 1030 as missing data 1010. This is to unify the format of existing missing data (1010) and uncollected data (1020) so that they receive the same processing when processing data.

Figure 12 is a diagram illustrating processing of second data based on data supplementation conditions according to a method according to an embodiment of the present invention. Figure 13 is a diagram illustrating processing of second data according to a method according to an embodiment of the present invention. Figures 12 and 13 are described in relation to step S930 of Figure 9.

According to an embodiment of the present invention, the processor 150 may set data supplementation conditions based on at least one of the ratio, period, and degree of missing data 1010 included in the second data 1100.

More specifically, looking at the data supplementation conditions, the processor 150 can process the second data 1100 when the ratio of missing data 1010 included in the second data 1100 is higher than a predefined value.

The processor 150 may process the second data 1100 when the period of missing data 1010 included in the second data 1100 is higher than a predefined value. At this time, the period of the missing data 1010 may refer to a period of consecutive missing data 1010 or a period of the sum of the periods corresponding to the missing data 1010 distributed in the second data 1100.

The processor 150 may process the second data 1100 when the degree of missing data 1010 included in the second data 1100 is higher than a predefined value.

At this time, the processor 150 processing the second data 1100 includes selecting third data 1110 that satisfies the data supplementation conditions from the second data 1100.

For example, the data supplementation condition set for the second data 1100 shown in FIG. 12 is that the number of missing data 1010 is two or more, and the processor 150 can select the data that satisfies the data supplementation condition as third data 1110 that needs supplementation.

At this time, the data supplementation condition may be applied to one data set among the data collected according to at least one characteristic. For example, it is assumed that the second data 1100 is data measuring the amount of fine dust in each city, and rows D1 to D7 are data on the amount of fine dust collected in different cities. The data supplementation condition for identifying cities in which the number of missing data 1010 is two or more is applied to each of rows D1 to D7, so that the processor 150 can select the data in rows D3 and D5 of the second data 1100 as third data 1110 that needs supplementation.

The processor 150 according to an embodiment of the present invention may delete or interpolate the selected third data 1110. In this embodiment, the selected third data 1110 was deleted.

The processor 150 identifies missing data among the data remaining after the third data selection and processing as data 1310 requiring interpolation. The processor may perform interpolation on data 1310 that requires interpolation and perform analysis using the recovered data 1300.
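The selection, deletion, and interpolation-target steps described for FIGS. 12 and 13 can be sketched as follows. The values are hypothetical and are not the data of the figures; they are chosen only so that rows D3 and D5 meet the "two or more missing data" condition, as in the example above.

```python
def count_missing(values):
    """Number of missing (None) entries in one data set."""
    return sum(v is None for v in values)

def select_third_data(second_data, min_missing=2):
    """Names of data sets whose missing count meets the supplementation condition."""
    return [
        name for name, values in second_data.items()
        if count_missing(values) >= min_missing
    ]

# Hypothetical second data: one row per city, one column per time step.
second_data = {
    "D1": [1.0, 1.1, 1.2, 1.3],
    "D2": [2.0, None, 2.2, 2.3],
    "D3": [None, None, 3.2, 3.3],
    "D4": [4.0, 4.1, 4.2, 4.3],
    "D5": [5.0, None, 5.2, None],
    "D6": [6.0, 6.1, 6.2, 6.3],
    "D7": [7.0, 7.1, 7.2, None],
}

third = select_third_data(second_data)  # ["D3", "D5"]

# In this embodiment the selected third data is deleted.
remaining = {k: v for k, v in second_data.items() if k not in third}

# Missing entries left in the remaining rows are the targets for interpolation.
to_interpolate = [
    (name, i)
    for name, values in remaining.items()
    for i, v in enumerate(values)
    if v is None
]  # [("D2", 1), ("D7", 3)]
```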

According to an embodiment of the present invention, data requiring supplementation is selected based on data supplementation conditions, so high-quality data can be provided. In addition, since the analysis is based on the processed data of the selected data, unreasonable deletion or interpolation work can be avoided, allowing higher quality data analysis to be performed.

The processor 150 according to an embodiment of the present invention processes abnormal data among the collected data (S1410). The operation of the processor 150 in step S1410 may be an operation of processing abnormal data among the first data in relation to step S920 of FIG. 9.

The collected data is collected in time series for at least one characteristic information. For example, it may be temperature data collected from a temperature sensor. The processor 150 may receive data collected from an external device such as a server, but the data may be collected by the electronic device 100 and is not limited to any one.

According to an embodiment of the present invention, the processor 150 may process abnormal data among the collected data by replacing it with missing data, or may interpolate it into appropriate data using data collected before and after the abnormal data.
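Replacing abnormal data with missing data can be sketched as a range check. The function name, the sample readings, and the allowable range of -40 to 85 are hypothetical values for illustration; the description does not state a sensor range.

```python
import math

def mask_abnormal(values, low, high):
    """Replace readings outside the allowable range [low, high] with NaN.

    Values that are already NaN fail the comparison and remain NaN.
    """
    return [v if low <= v <= high else math.nan for v in values]

# Hypothetical temperature readings with two out-of-range values.
temps = [21.5, 22.0, 999.0, 21.8, -80.0]
cleaned = mask_abnormal(temps, low=-40.0, high=85.0)
```

After this step the abnormal readings are indistinguishable from ordinary missing data, so a single missing-data procedure can handle both.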

In one embodiment, the processor 150 identifies information about missing data, including the processed abnormal data, among the collected data (S1420). In relation to step S930 of FIG. 9, the operation of the processor 150 in step S1420 may be an operation of identifying information about missing data, including the processed abnormal data, among the first data, and of processing the missing data included in the second data using at least one missing data processing method.

According to an embodiment of the present invention, the collected data may include missing data as well as abnormal data. According to an embodiment of the present invention, the missing data includes missing data replaced from abnormal data in step S1410 and missing data already included in the collected data.

According to an embodiment of the present invention, information about missing data includes at least one of information about the location of the missing data and information about the continuity of the missing data. According to an embodiment of the present invention, information about the location of missing data includes, for example, information about the row and column where the missing data is located in data in a table format. In addition, information about the continuity of missing data includes information about the degree (time) of continuous missing data and information that can identify trends or patterns of missing data, such as the distribution pattern of missing data.

Accordingly, the processor 150 may identify information about the missing data that includes at least one of information about the location of the missing data and information about the continuity of the missing data.
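
As a non-limiting illustration, information about the location and continuity of missing data in a single column may be derived as follows. The sketch assumes that missing data is marked with None; the function name missing_runs and the (start index, length) representation are hypothetical.

```python
from typing import List, Optional, Tuple


def missing_runs(values: List[Optional[float]]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive missing values."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if v is None and start is None:
            start = i                        # a new run begins here
        elif v is not None and start is not None:
            runs.append((start, i - start))  # the run ended at index i - 1
            start = None
    if start is not None:                    # a run reaches the end of the data
        runs.append((start, len(values) - start))
    return runs


data = [1.0, None, None, 4.0, None, 6.0, None, None, None]
runs = missing_runs(data)
print(runs)                                  # location and length of each run
print(max(length for _, length in runs))     # longest run of missing data
```

The start indices correspond to the location information, and the run lengths correspond to the degree of continuous missing data.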

The processor 150 according to an embodiment of the present invention processes missing data using at least one missing data processing method based on information about the missing data (S1430).

The processor 150 according to an embodiment of the present invention may supplement missing data based on information about the location of the missing data and/or information about the continuity of the missing data.

At this time, the processor 150 may identify the at least one missing data processing method to process the missing data corresponding to at least one section based on information about the missing data. The processor 150 may supplement the missing data by considering parameter information that adjusts the degree of processing of the missing data according to the information about the missing data. Parameter information according to this embodiment may include information about a section containing missing data, information about a missing data processing method, missing data processing conditions, etc.

As an example, a section containing 10 consecutive pieces of missing data can be processed by applying one missing data processing method. As another example, a section containing 10 consecutive pieces of missing data can be divided into three sections and processed by applying different missing data processing methods to each section. Additionally, by applying multiple missing data processing methods to each section, the final supplemented data value can be derived by applying the average value or a certain ratio of the supplemented data values according to each processing method.
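
As a non-limiting illustration, deriving a final supplemented value from several processing methods by an average or a certain ratio may be sketched as follows. The two fill methods (nearest previous value and nearest following value), the function names, and the weights are assumptions chosen for the example; the embodiment does not prescribe them.

```python
from typing import Callable, List, Optional, Sequence

Series = List[Optional[float]]  # None marks missing data (assumption of this sketch)
Filler = Callable[[Series, int], Optional[float]]


def forward_fill(values: Series, i: int) -> Optional[float]:
    """Nearest valid value before index i, or None if there is none."""
    return next((v for v in reversed(values[:i]) if v is not None), None)


def backward_fill(values: Series, i: int) -> Optional[float]:
    """Nearest valid value after index i, or None if there is none."""
    return next((v for v in values[i + 1:] if v is not None), None)


def blend_fill(values: Series, fillers: Sequence[Filler],
               weights: Sequence[float]) -> Series:
    """Fill each missing value with a weighted combination of several methods."""
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        estimates = [f(values, i) for f in fillers]
        if any(e is None for e in estimates):
            continue  # a method has no estimate: leave the value missing
        out[i] = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return out


data = [10.0, None, None, 20.0]
print(blend_fill(data, [forward_fill, backward_fill], [1.0, 1.0]))  # average
print(blend_fill(data, [forward_fill, backward_fill], [3.0, 1.0]))  # 3:1 ratio
```

Equal weights give the average of the methods, while unequal weights apply a certain ratio to the values supplemented by each method.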

At this time, the processor 150 may process the missing data based on conditions that determine whether to process the missing data, that is, conditions that determine whether to supplement the data. For example, missing data may be handled under conditions such as performing supplementation only when missing data is less than 20% of the total data, only when missing data does not exceed 30% of the total data, or only for runs of 10 or fewer consecutive pieces of missing data.
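
As a non-limiting illustration, a data supplementation condition of this kind may be checked as follows. The sketch combines two of the example conditions (missing data less than 20% of the total, and no more than 10 consecutive pieces of missing data); the function names and the decision to require both conditions at once are assumptions.

```python
from typing import List, Optional


def longest_missing_run(values: List[Optional[float]]) -> int:
    """Length of the longest run of consecutive missing (None) values."""
    longest = current = 0
    for v in values:
        current = current + 1 if v is None else 0
        longest = max(longest, current)
    return longest


def should_supplement(values: List[Optional[float]],
                      max_missing_ratio: float = 0.2,
                      max_consecutive: int = 10) -> bool:
    """Return True only if both supplementation conditions are met."""
    if not values:
        return False
    missing_ratio = sum(v is None for v in values) / len(values)
    return (missing_ratio < max_missing_ratio
            and longest_missing_run(values) <= max_consecutive)


sparse = [1.0] * 9 + [None]            # 10% missing, run of 1
dense = [1.0] * 5 + [None] * 5         # 50% missing
long_gap = [1.0] * 100 + [None] * 11   # about 10% missing, run of 11
print(should_supplement(sparse), should_supplement(dense), should_supplement(long_gap))
```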

According to one embodiment of the present invention, the processor 150 may perform at least part of the data analysis, processing, and generation of result information for adjusting the degree of processing of missing data according to the information about the missing data using at least one of a rule-based algorithm or an artificial intelligence algorithm such as machine learning, a neural network, or deep learning.

In addition, in order to adaptively supplement missing data according to the user's request, the processor 150 may receive, through the input unit 110, a user input regarding at least one missing data processing method for processing the missing data corresponding to at least one section. Accordingly, the processor 150 may supplement the missing data by applying at least one missing data processing method according to parameter information defined by the user.

According to an embodiment of the present invention, more rational and high-quality data processing is possible by applying and supplementing an optimized method according to the state of the section containing missing data.

According to an embodiment of the present invention, different interpolation and replacement methods can be applied depending on the purpose of data use, so higher quality data supplementation can be performed.

Figure 15 is a diagram showing the operation of an electronic device according to an embodiment of the present invention. In this embodiment, the process 1500 of processing missing data is described, and since content overlapping with that described in FIG. 14 is applied in the same manner as in FIG. 14, detailed description thereof will be omitted.

The processor 150 according to an embodiment of the present invention processes abnormal data (b) among the collected data (hereinafter referred to as collected data (a)) (1510).

More specifically, the abnormal data (b) includes certain abnormal data (b1) and uncertain abnormal data (b2). Certain abnormal data (b1) refers to data that is clearly determined to be an error, such as a value outside the minimum-maximum range that the collected data (a) can take. Uncertain abnormal data (b2) refers to data that is not a clear error but for which it is uncertain whether it is abnormal, such as a value that differs markedly from the data obtained before and after it.

The processor 150 identifies abnormal data (b), including certain abnormal data (b1) and uncertain abnormal data (b2), among the collected data (a), and processes the certain abnormal data (b1) and the uncertain abnormal data (b2) respectively. As an example, the processor 150 may replace certain abnormal data (b1) among the collected data (a) with missing data, and may either replace uncertain abnormal data (b2) with missing data or interpolate it to an appropriate value using the data collected before and after it. At this time, the processor 150 may receive a user input for determining the value of the uncertain abnormal data (b2) through the input unit 110.
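
As a non-limiting illustration, the distinction between certain abnormal data (b1) and uncertain abnormal data (b2) may be sketched as follows. The range [lo, hi], the threshold max_jump, and the rule that a value is uncertain only when it differs from both valid neighbours by more than max_jump are assumptions made for the example; the embodiment does not fix a specific criterion.

```python
from typing import List, Optional, Tuple


def classify_abnormal(values: List[Optional[float]], lo: float, hi: float,
                      max_jump: float) -> Tuple[List[int], List[int]]:
    """Return index lists (certain, uncertain) of abnormal data."""
    certain: List[int] = []
    uncertain: List[int] = []

    def in_range(x: Optional[float]) -> bool:
        return x is not None and lo <= x <= hi

    for i, v in enumerate(values):
        if v is None:
            continue                      # already missing
        if not in_range(v):
            certain.append(i)             # b1: outside the possible range
            continue
        # Compare only with neighbours that exist and are themselves in range.
        neighbours = [values[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(values) and in_range(values[j])]
        if len(neighbours) == 2 and all(abs(v - n) > max_jump for n in neighbours):
            uncertain.append(i)           # b2: in range, but a sharp jump
    return certain, uncertain


readings = [20.0, 500.0, 20.5, 21.0, 35.0, 21.5, 22.0]
b1, b2 = classify_abnormal(readings, lo=-40.0, hi=60.0, max_jump=5.0)
print(b1, b2)
```

The indices in b1 could then be replaced with missing data, while those in b2 could be replaced, interpolated, or presented to the user for a decision.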

The processor 150 according to an embodiment of the present invention identifies information about missing data (c) including processed abnormal data among the collected data (a) (1520).

The processor 150 according to an embodiment of the present invention processes the missing data c using at least one missing data processing method based on information about the missing data c (1530). As a result, processed data (d) obtained by processing the collected data (a) is obtained.

According to an embodiment of the present invention, abnormal data can be processed more precisely by distinguishing and processing abnormal data into certain abnormal data and uncertain abnormal data.

Figure 16 is a diagram showing the operation of an electronic device according to another embodiment of the present invention. The operation of FIG. 16 explains a method 1600 of integrating a plurality of processed data (d) obtained by separately processing a plurality of collected data (a).

According to an embodiment of the present invention, in order to integrate a plurality of collected data (a) including Data1, Data2, ..., DataN, the data processing 1500 described in FIGS. 14 and 15 must first be performed on each collected data. The processed data (d) obtained for each collected data (a) through the data processing 1500 are Data1', Data2', ..., DataN'.

The processor 150 according to an embodiment of the present invention combines the obtained processing data (d) (1610).

Let's look at the process of combining processed data (d) in detail by referring to the data in Table 2. Data 1, Data 2, and Data 3 shown in Table 2 are assumed to be processed data (d) for which data processing (1500) has been individually completed.

Table 2
Data	Collection period	Measurement interval
Data 1	1/1 00:00 ~ 1/10 24:00	Measured in 1-minute increments
Data 2	1/1 3:00 ~ 1/10 23:00	Measured in 1-hour increments
Data 3	1/1 0:00 ~ 1/11 24:00	Measured in 3-hour increments

According to one embodiment of the present invention, the processor 150 may set the combining section of the plurality of processed data d as shown in Table 3.

Table 3
Combining section	Period
Combining section 1	1/1 3:00 ~ 1/10 23:00
Combining section 2	1/1 00:00 ~ 1/10 24:00

According to one embodiment of the present invention, the processor 150 may reset missing data according to the combining section. Here, resetting missing data means that, when the combining section extends beyond the time period in which data was collected, the uncollected data is set as missing data. The purpose is to give existing missing data and uncollected data the same format so that they receive the same processing when the data is processed. For example, when the combining section is set to combining section 1, part of data 1, all of data 2, and part of data 3 are used, so there is no need to reset additional missing data.

However, when the combining section is set to combining section 2, all of data 1 and part of data 3 are used, so resetting missing data is unnecessary for them, whereas data 2 has no data from 0:00 to before 3:00 on 1/1 or after 23:00 up to 24:00 on 1/10, so missing data must be reset for the uncollected data corresponding to those times.
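
As a non-limiting illustration, resetting missing data for a combining section may be sketched as follows. The sketch assumes that a series is held as a mapping from timestamp to value and that missing data is marked with None; the function name reset_missing, the sample values, and the shortened 00:00 to 08:00 section are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


def reset_missing(series: Dict[datetime, float], start: datetime, end: datetime,
                  step: timedelta) -> Dict[datetime, Optional[float]]:
    """Re-express a series on every timestamp of the combining section.

    Timestamps with no collected value are set to None, the same marker used
    for ordinary missing data, so both are handled alike afterwards.
    """
    out: Dict[datetime, Optional[float]] = {}
    t = start
    while t <= end:
        out[t] = series.get(t)  # None when nothing was collected at t
        t += step
    return out


# Hourly data collected only from 03:00 to 06:00 (hypothetical values).
day = datetime(2022, 1, 1)
data2 = {day + timedelta(hours=h): float(h) for h in range(3, 7)}

# Combining section from 00:00 to 08:00 on the same day.
aligned = reset_missing(data2, day, day + timedelta(hours=8), timedelta(hours=1))
print([aligned[day + timedelta(hours=h)] for h in range(9)])
```

After this step, the uncollected times before 03:00 and after 06:00 are indistinguishable from ordinary missing data and can be passed to the same missing data processing.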

According to one embodiment of the present invention, the processor 150 may combine data based on the data collection cycle of the plurality of processed data d. As an example, the processor 150 may reindex the data based on the data collection cycle of the plurality of processed data d. More specifically, the processor 150 may upsample or downsample each of the plurality of processed data d based on the data collection cycle of the plurality of processed data d and combine them.

For example, when the combining cycle is set to 1 minute, upsampling is required for data 2 and data 3, and when the combining cycle is set to 1 hour, downsampling is required for data 1 and upsampling is required for data 3.

At this time, downsampling can utilize well-known statistical calculation methods such as the average, whereas upsampling has very diverse processing methods whose data restoration effects differ greatly, so upsampling may be performed by applying at least one of the missing data processing methods described in FIG. 16 above. However, this is only an example, and methods of performing upsampling and downsampling can be applied without limitation.
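
As a non-limiting illustration, downsampling by averaging and upsampling by interpolation may be sketched as follows. Linear interpolation is only one possible upsampling method chosen for the example, and the function names and the integer resampling factor are assumptions.

```python
from typing import List


def downsample_mean(values: List[float], factor: int) -> List[float]:
    """Average each block of `factor` samples (a trailing partial block is dropped)."""
    return [sum(values[i:i + factor]) / factor
            for i in range(0, len(values) - factor + 1, factor)]


def upsample_linear(values: List[float], factor: int) -> List[float]:
    """Insert factor - 1 linearly interpolated points between neighbouring samples."""
    if not values:
        return []
    out: List[float] = []
    for a, b in zip(values, values[1:]):
        out.extend(a + (b - a) * k / factor for k in range(factor))
    out.append(values[-1])
    return out


three_hourly = [0.0, 3.0, 6.0]            # e.g. data measured every 3 hours
hourly = upsample_linear(three_hourly, 3)  # re-expressed on a 1-hour cycle
print(hourly)
print(downsample_mean(hourly, 3))          # back to one value per 3 hours
```

Once every series is expressed on the same combining cycle in this way, the series can be joined on their common timestamps.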

After combining the data, the processor 150 may perform data processing 1620 again on the combined data. At this time, data processing 1620 may be the same as data processing 1500, and data processing 1620 and data processing 1500 may be performed on the same processor or on different processors. More specifically, the processor 150 may process each of the plurality of collected data to obtain a plurality of processed data, combine the plurality of processed data, process abnormal data among the combined data, identify information on missing data including the processed abnormal data among the combined data, and process the missing data using at least one missing data processing method based on the information on the missing data. By processing the missing data in this way, the processor 150 may integrate the data (1630).

According to an embodiment of the present invention, the data processing can also be applied to data obtained by combining a plurality of individual data, so that high-quality data supplementation can be performed even when data are combined.

According to one embodiment of the present invention, quality verification is performed on time series data with periodic characteristics and defective data is processed, so that high-quality data can be used for learning and analysis, and the overall performance of the results can be improved.

The embodiments of the present invention described above may be implemented as a program (or application) and stored in a medium in order to be executed in conjunction with a server, which is hardware.

In order for the computer to read the program and execute the methods implemented as the program, the above-mentioned program may include code written in a computer language such as C, C++, JAVA, or machine language that the processor (CPU) of the computer can read through the device interface of the computer. This code may include functional code related to functions that define the functions necessary for executing the methods, and may include control code related to the execution procedure necessary for the computer's processor to execute those functions according to a predetermined procedure. In addition, this code may further include memory-reference code indicating at which location (address) in the computer's internal or external memory the additional information or media required for the computer's processor to execute the functions should be referenced. In addition, if the computer's processor needs to communicate with another remote computer or server in order to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the computer's communication module, and what information or media should be transmitted and received during communication.

The storage medium refers to a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data for a short period of time, such as a register, cache, or memory. Specifically, examples of the storage medium include ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc., but are not limited thereto. That is, the program may be stored in various recording media on various servers that the computer can access or on various recording media on the user's computer. Additionally, the medium may be distributed to computer systems connected to a network, and computer-readable code may be stored in a distributed manner.

The steps of the method or algorithm described in connection with embodiments of the present invention may be implemented directly in hardware, as a software module executed by hardware, or as a combination thereof. The software module may reside in RAM (Random Access Memory), ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other type of computer-readable recording medium well known in the art to which the present invention pertains.

Above, embodiments of the present invention have been described with reference to the attached drawings, but those skilled in the art will understand that the present invention can be implemented in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood in all respects as illustrative and not restrictive.

Claims

A quality verification method for time series data, performed by an electronic device, the method comprising:

refining time series data collected for predetermined characteristic information based on a predetermined reference period;

dividing the refined time series data according to a predetermined division cycle;

verifying the quality of data for the time series data divided according to the division cycle; and

selecting the verified time series data and processing the data according to data supplementation conditions,

wherein verifying the quality of data for the time series data divided according to the division cycle comprises:

calculating, for the divided time series data, at least one of the degree of continuous missing data and the degree of total missing data in the corresponding division cycle; and

determining the divided time series data of the corresponding division cycle to be defective data when the calculated degree exceeds the degree set according to a reference parameter.

The quality verification method for time series data of claim 1,

wherein refining the time series data collected for the predetermined characteristic information based on the predetermined reference period comprises:

setting the reference period based on inference from the characteristic information of the collected time series data or based on external parameters.

The quality verification method for time series data of claim 1,

wherein dividing the refined time series data according to the predetermined division cycle comprises:

dividing the time series data according to a division cycle calculated by applying, to the reference period, a predetermined weight determined by reflecting the characteristic information.

The quality verification method for time series data of claim 3,

wherein dividing the refined time series data according to the predetermined division cycle comprises:

deleting, from the time series data, time series data that does not satisfy the division cycle.

The quality verification method for time series data of claim 1,

wherein dividing the refined time series data according to the predetermined division cycle comprises:

dividing the refined time series data based on a first division cycle; and

re-dividing the divided time series data based on a second division cycle.

The quality verification method for time series data of claim 5,

wherein verifying the quality of data for the time series data divided according to the division cycle further comprises:

recursively verifying, based on the second division cycle, the quality of the verified time series data determined not to be defective data.

The quality verification method for time series data of claim 6,

wherein verifying the quality of data for the time series data divided according to the division cycle further comprises:

adjusting the degree of continuous missing data and the degree of total missing data set in the reference parameter by varying the number of time series data divided according to the second division cycle.

The quality verification method for time series data of claim 1,

wherein, when the time series data is multivariate data including a plurality of characteristic information, the multivariate data is arranged into columns and rows according to each characteristic information and time information groups based on the division cycle, and

wherein verifying the quality of data for the time series data divided according to the division cycle comprises:

checking, for each time information group of the multivariate data, whether missing data exists for each characteristic information;

adding a first count if missing data exists in a time information group, and, if consecutive missing data exists in a plurality of time information groups adjacent to the time information group to which the first count was added, adding a second count based on the number of time information groups in which the missing data is consecutive; and

calculating the degree of the continuous missing data for each time information group according to the division cycle by summing the first count and the second count.

The quality verification method for time series data of claim 8,

wherein adding the second count comprises:

if missing data is consecutive, with respect to a given characteristic information, in a plurality of time information groups adjacent to the time information group to which the first count was added, adding the second count based on the number of time information groups in which the missing data is consecutive with respect to that characteristic information.

The quality verification method for time series data of claim 8,

wherein adding the second count comprises:

if, among the time information group to which the first count was added and a plurality of time information groups adjacent to it, there are consecutive time information groups in which all characteristic information is missing, adding the second count based on the number of those consecutive time information groups.

An electronic device comprising:

a processor configured to divide time series data collected for predetermined characteristic information according to a predetermined division cycle, verify the quality of data for the time series data divided according to the division cycle, and then select the verified time series data and process the data according to data supplementation conditions.