WO2023090510A1

WO2023090510A1 - Electronic device for performing data selection based on data supplementation condition, and executing method thereof

Info

Publication number: WO2023090510A1
Application number: PCT/KR2021/017884
Authority: WO
Inventors: 문재원; 금승우; 오승택; 유미선; 황지수
Original assignee: 한국전자기술연구원
Priority date: 2021-11-18
Filing date: 2021-11-30
Publication date: 2023-05-25

Abstract

An electronic device according to an embodiment of the present invention comprises a processor which: configures a period of first data to be processed among data collected for at least one property; reconfigures missing data included in the period of first data to generate second data; and processes the second data on the basis of a data supplementation condition provided for selection of data which needs to be supplemented.

Description

Electronic device for performing data selection based on data complement conditions and method for performing the same

The present invention relates to an electronic device for performing data selection and processing missing data and a method for performing the same.

With the development of industrial technology and information and communication technology, the amount of data is increasing explosively, and the performance of data utilization technologies that rely on it, such as data mining and machine learning, continues to improve. To obtain good results with these technologies, the data must be free of defects. In real environments, however, missing or abnormal data occurs frequently for various reasons.

The processing of data containing missing or outlier data can significantly affect the conclusions that can be drawn from the data.

As a method of handling missing data, when each row of tabular data is independent, collectively deleting the rows that contain missing data is the most widely used approach and is easy to apply. However, for time series data, the time at which each value was acquired matters, so arbitrarily deleting specific rows makes it difficult to guarantee data continuity. Therefore, in the case of time series data, it is preferable to delete all data before and after the point in time at which the missing data occurs rather than deleting only the missing part.

When using this method of deleting missing data in batches, the amount of data to be deleted varies depending on the location of the missing data, and in some cases, a large amount of data may be deleted.
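The effect of batch deletion can be illustrated with a minimal Python sketch. The table values and the helper name `drop_rows_with_missing` are assumptions made for illustration and do not come from the embodiment; missing data is represented as NaN.

```python
import math

nan = math.nan

def drop_rows_with_missing(rows):
    """Batch deletion: keep only the rows in which no value is missing (NaN)."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]

# Hypothetical table: 5 time steps (rows) x 3 characteristics (columns).
table = [
    [1.0, 2.0, 3.0],
    [1.1, nan, 3.1],
    [nan, 2.2, 3.2],
    [1.3, 2.3, nan],
    [1.4, 2.4, 3.4],
]

kept = drop_rows_with_missing(table)
print(f"{len(kept)} of {len(table)} rows remain")
```

In this illustrative table, three missing values out of fifteen remove three of the five rows, which shows how the amount deleted depends on where the missing data falls.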

Therefore, in general, for time series data, a method of eliminating missing values by interpolating the missing data as much as possible is applied. However, this method also produces low-quality data due to unreasonable interpolation if the time-series data includes missing data in an amount exceeding a certain threshold, and thus the meaning of recovery may be lowered.

In addition, batch deletion and batch interpolation do not take into account the missing data that inevitably appears when a plurality of different data are combined, so a flexible method for processing missing data that arises from combining data is required.

An object of the present invention is to provide a method and apparatus for selecting data using data supplementation conditions to variably determine the degree of utilization of missing data.

An object of the present invention is to provide a data selection method and apparatus capable of recovering and utilizing data more efficiently by selecting time series data according to a quality desired by a user, even when the time series data includes missing data.

An object of the present invention is to provide an electronic device and method for processing missing data in consideration of the purpose of utilizing data or the quantity and quality of data.

An object of the present invention is to provide an electronic device and a method for processing the same, which can selectively apply different missing data preprocessing techniques according to missing data situations since the purposes of using data are different.

An object of the present invention is to provide an application method for single data as well as data in which a plurality of single data are combined.

An electronic device according to an embodiment of the present invention includes a processor that sets a section of first data to be processed among data collected for at least one characteristic, generates second data by resetting missing data included in the section of the first data, and processes the second data based on a data supplementation condition prepared to select data requiring supplementation.

The processor may set the data supplementation condition based on at least one of the ratio, period, and number of missing data included in the second data, and may select, from among the second data, third data that satisfies the data supplementation condition.

The processor may process the second data when a ratio of missing data included in the second data is higher than a predefined value.

The processor may process the second data when a period of missing data included in the second data is higher than a predefined value.

The processor may process the second data when the number of missing data included in the second data is higher than a predefined value.

The processor may set a first section of the first data based on the number of missing data included in each section among a plurality of sections of the first data.

The processor may set the first section of the first data based on the number of consecutive missing data included in the first section or the total number of missing data included in the first section.

A method for performing data selection based on a data complement condition according to an embodiment of the present invention, comprising: setting a section of first data to be processed among data collected for at least one characteristic; generating second data by resetting missing data included in the section of the first data; and processing the second data based on data supplementation conditions prepared to select data requiring supplementation.

The processing of the second data may include setting the data supplementation condition based on at least one of a ratio, period, and number of missing data included in the second data; and selecting third data that satisfies the data supplementation condition from among the second data.

The processing of the second data may include processing the second data when a ratio of missing data included in the second data is higher than a predefined value.

The processing of the second data may include processing the second data when a period of missing data included in the second data is higher than a predefined value.

The processing of the second data may include processing the second data when the number of missing data included in the second data is higher than a predefined value.

Setting the section of the first data may include setting a first section of the first data based on the number of missing data included in each section among a plurality of sections of the first data.

The setting of the first section may include setting the first section of the first data based on the number of consecutive missing data included in the first section or the total number of missing data included in the first section.

An electronic device according to an embodiment of the present invention includes a processor that processes abnormal data among collected data, identifies information on missing data, including the processed abnormal data, among the collected data, and processes the missing data using at least one missing data processing method based on the information on the missing data.

The processor may identify information about the missing data including at least one of information about a location of the missing data and information about continuity of the missing data.

The processor may identify abnormal data including certain abnormal data and uncertain abnormal data among the collected data, and process the certain abnormal data and uncertain abnormal data, respectively.

The processor may identify at least one missing data processing method to process missing data corresponding to at least one section based on the missing data information.
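As a hedged illustration of how location and continuity information might drive the choice of processing method, the following Python sketch lists each run of consecutive missing values and assigns a method per run. The function names, the method labels, and the threshold `max_interpolable` are hypothetical and are not specified by the embodiment.

```python
import math

nan = math.nan

def missing_runs(values):
    """Return (start_index, length) for every run of consecutive NaN values."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if math.isnan(v):
            if start is None:
                start = i  # a new run of missing data begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # run that reaches the end of the series
        runs.append((start, len(values) - start))
    return runs

def choose_method(run_length, max_interpolable=2):
    """Assumed rule: short runs are interpolated, longer runs are left missing."""
    return "interpolate" if run_length <= max_interpolable else "leave_missing"

series = [1.0, nan, nan, 4.0, nan, nan, nan, 8.0]
plan = [(start, length, choose_method(length)) for start, length in missing_runs(series)]
print(plan)
```

A user input received through the input unit could replace `choose_method` in such a scheme, so that the method for each section is selected manually rather than by a fixed rule.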

The electronic device may further include an input unit, and the processor may receive a user input related to at least one missing data processing method to process missing data corresponding to the at least one section through the input unit.

The processor may obtain a plurality of processed data by respectively processing a plurality of collected data, combine the plurality of processed data, process abnormal data among the combined data, identify information on missing data, including the processed abnormal data, among the combined data, and process the missing data using at least one missing data processing method based on the information on the missing data.

The processor may perform upsampling or downsampling of each of the plurality of processed data and combine them based on a data collection period of the plurality of processed data.
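A minimal Python sketch of aligning data with different collection periods is given below. It assumes that one collection period is an integer multiple of the other, that downsampling averages the available values in each block, and that upsampling repeats each value; these choices and the function names are illustrative assumptions, not the method of the embodiment.

```python
import math

nan = math.nan

def downsample_mean(values, factor):
    """Average each block of `factor` samples, ignoring missing values (NaN)."""
    out = []
    for start in range(0, len(values), factor):
        block = [v for v in values[start:start + factor] if not math.isnan(v)]
        out.append(sum(block) / len(block) if block else nan)
    return out

def upsample_repeat(values, factor):
    """Repeat each sample `factor` times to match a shorter collection period."""
    return [v for v in values for _ in range(factor)]

fast = [1.0, 3.0, nan, 5.0, nan, nan]  # hypothetical data, collected every 1 minute
slow = [10.0, 20.0, 30.0]              # hypothetical data, collected every 2 minutes

# Combine on the slower period: one row per 2-minute step.
combined = list(zip(downsample_mean(fast, 2), slow))
print(combined)
```

A block that contains only missing values remains missing after downsampling, so missing data can still appear in the combined data and must be handled afterwards.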

The processor may set a combining section for combining the plurality of processed data, and reset missing data included in the combined section of each processed data according to the combining section.

A method for processing missing data according to an embodiment of the present invention, comprising: processing abnormal data among collected data; identifying information about missing data including the processed abnormal data among the collected data; and processing the missing data using at least one missing data processing method based on the information on the missing data.

The identifying of the information on the missing data may include identifying information on the missing data including at least one of information about a location of the missing data and information about continuity of the missing data.

The processing of the abnormal data may include identifying abnormal data, including certain abnormal data and uncertain abnormal data, among the collected data; and processing the certain abnormal data and the uncertain abnormal data, respectively.

The processing of the missing data may include identifying the at least one missing data processing method to process the missing data corresponding to at least one section based on the information on the missing data.

The identifying of the at least one missing data processing method may include receiving a user input regarding at least one missing data processing method to process the missing data corresponding to the at least one section.

The processing of the missing data may include obtaining a plurality of processed data by respectively processing a plurality of collected data, and the method may further include: combining the plurality of processed data; processing abnormal data among the combined data; identifying information about missing data, including the processed abnormal data, among the combined data; and processing the missing data using at least one missing data processing method based on the information on the missing data.

The combining of the plurality of processed data may include upsampling or downsampling each of the plurality of processed data based on a data collection period of the plurality of processed data and combining the plurality of processed data.

The combining of the plurality of processed data may include setting a combining section for combining the plurality of processed data, and resetting missing data included in the combining section of each of the processed data according to the combining section.

The generating of the second data may include processing abnormal data among the first data, and the processing of the second data may include: identifying information about missing data including the processed abnormal data; and processing the missing data included in the second data using at least one missing data processing method based on the information on the missing data.

According to an embodiment of the present invention, since data to be supplemented is selected based on the situation of missing data included in the data and the task is performed, more rational and high-quality data processing is possible.

According to an embodiment of the present invention, since high-quality data is provided based on data supplementation conditions, unreasonable deletion or interpolation operations can be avoided, and thus higher-quality data analysis can be performed.

According to an embodiment of the present invention, more reasonable and high-quality data processing is possible by applying and supplementing an optimized method according to the state of a section including missing data.

According to an embodiment of the present invention, since interpolation and substitution methods may be differently applied according to data utilization purposes, higher quality data supplementation may be performed.

According to an embodiment of the present invention, it can be applied to data in which a plurality of single data are combined, so that high-quality data supplementation can be performed even when combining data.

FIG. 1 is a diagram illustrating data including missing data.

FIG. 2 is a block diagram showing the configuration of an electronic device according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating an operation of an electronic device according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating how to set a section of first data according to a method according to an embodiment of the present invention.

FIG. 5 is a diagram showing how to generate second data according to a method according to an embodiment of the present invention.

FIG. 6 is a diagram illustrating processing of second data based on a data complement condition according to a method according to an embodiment of the present invention.

FIG. 7 is a diagram illustrating processing of second data according to a method according to an embodiment of the present invention.

FIG. 8 is a flowchart illustrating an operation of an electronic device according to an embodiment of the present invention.

FIG. 9 is a diagram illustrating an operation of an electronic device according to an embodiment of the present invention.

FIG. 10 is a diagram illustrating an operation of an electronic device according to another embodiment of the present invention.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings. The detailed description set forth below in conjunction with the accompanying drawings is intended to describe exemplary embodiments of the present invention and is not intended to represent the only embodiments in which the present invention may be practiced. In order to clearly describe the present invention in the drawings, parts irrelevant to the description may be omitted, and the same reference numerals may be used for the same or similar components throughout the specification.

FIG. 1 is a diagram showing data 1 including missing data.

Data (1) of FIG. 1 is a table of data collected according to time (T) for each feature (N), and is composed of 10 different features and 10 times. For example, when climate change in a specific city is analyzed, temperature, humidity, precipitation, traffic volume, and population density of the specific city over time may be the characteristics. Alternatively, when comparing the amount of fine dust in each city, Seoul, Busan, Cheongju, etc. may correspond to the characteristics.

Data analysis presupposes data integrity, but in the process of collecting actual data, missing or abnormal data occurs frequently for various reasons. Missing data according to an embodiment of the present invention is defined broadly as data that cannot be converted into or displayed in any form, such as numbers or letters, and data that cannot be defined or does not exist. It refers to cases in which no data was collected at a given time, or in which data was collected but lost while being transmitted to a device such as a server. The value of missing data can be expressed in various ways, such as with an extreme value such as "-999" or with a predetermined string such as "NaN" or "NA". However, when missing data is recorded in a non-standardized notation, it can be difficult to clearly distinguish normal data from abnormal data afterwards. Therefore, representative data processing libraries mark missing data as "NaN" or "NA" for reasons of simplicity and functionality.
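The unification of missing-data notation can be sketched in Python as follows. The sets of markers and the function name `normalize_missing` are assumptions for illustration; which markers a real data source uses has to be checked for each source.

```python
import math

# Assumed non-standard markers for missing data; adjust per data source.
MISSING_TEXT = {"NaN", "NA", ""}
MISSING_NUMBERS = {-999.0}

def normalize_missing(raw_values):
    """Convert raw readings to floats, mapping every missing marker to NaN."""
    normalized = []
    for raw in raw_values:
        text = str(raw).strip()
        if text in MISSING_TEXT:
            normalized.append(math.nan)
            continue
        try:
            value = float(text)
        except ValueError:
            value = math.nan  # unparseable content is treated as missing
        if value in MISSING_NUMBERS:
            value = math.nan  # extreme sentinel value such as -999
        normalized.append(value)
    return normalized

readings = normalize_missing([21.5, "-999", "NA", "NaN", " 18 ", -999])
print(readings)
```

After this step every missing value carries the same marker, so later steps only need to test for NaN.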

Abnormal data is data that adversely affects the result when the collected data is analyzed. For example, it refers to erroneous data, such as collected values that are abnormal or that exceed the allowable measurement range of the sensor that collects the data. In the present invention, abnormal data among collected data may be replaced with missing data and processed, or may be interpolated with appropriate data using data collected before and after the abnormal data. In the present invention, abnormal data is marked as "NaN" or "NA" and replaced with missing data.
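Replacing abnormal data with missing data can be sketched as a simple range check. The sensor limits of -40 to 85 and the function name `mask_out_of_range` are hypothetical values chosen for illustration only.

```python
import math

nan = math.nan

def mask_out_of_range(values, low, high):
    """Replace readings outside [low, high] with NaN.

    Comparisons with NaN are always False, so values that are already
    missing stay missing.
    """
    return [v if low <= v <= high else nan for v in values]

# Hypothetical temperature readings from a sensor rated for -40 to 85.
raw = [20.0, 150.0, nan, -60.0, 25.0]
cleaned = mask_out_of_range(raw, -40.0, 85.0)
print(cleaned)
```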

If batch deletion is used to handle the missing data 11, a complete data set free from contamination by missing data can be obtained, but depending on the location of the missing data the amount deleted can be large, and the remaining data may be insufficient for use. For example, if rows including missing data 11 are collectively deleted from data 1, only rows T1 and T10 remain, which may be insufficient to obtain useful information from data 1.

Alternatively, when using a method of batch interpolating data to process the missing data 11, data can be preserved as much as possible by recovering the missing data arbitrarily based on adjacent data or past data of the missing data. However, since the recovered data is not accurate data, excessive interpolation may contaminate the results of analysis and learning due to poor data quality.

For example, when rows including missing data 11 in data 1 are interpolated in batches, the data in column N3 is interpolated using only the data obtained in rows T1 and T10, so the quality of the data may be degraded. Also, for the data in columns N7, N8, and N10, interpolation accuracy cannot be guaranteed because missing data occurs irregularly.
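The limitation of batch interpolation can be seen in a minimal Python sketch that uses linear interpolation, one common interpolation technique chosen here as an assumption; the embodiment does not prescribe a specific one. The sample values are illustrative.

```python
import math

nan = math.nan

def interpolate_linear(values):
    """Fill gaps between known values on a straight line; edge gaps stay NaN."""
    result = list(values)
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

short_gap = interpolate_linear([1.0, nan, 3.0])
# Like column N3: only the first and last values are known.
long_gap = interpolate_linear([0.0, nan, nan, nan, 8.0])
print(short_gap, long_gap)
```

With a long gap, every recovered value is determined by the two end points alone, so any variation that actually occurred inside the gap is lost.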

Accordingly, a method for determining whether the data of columns N3, N7, N8, and N10 can be restored or whether restoring the data improves data quality is required.

Hereinafter, the present invention proposes an electronic device and method that determine the degree to which data including missing data can be recovered, select recoverable data, and process the data.

An electronic device 100 according to an embodiment of the present invention includes an input unit 110, a communication unit 120, a display unit 130, a memory 140, and a processor 150.

The input unit 110 generates input data in response to a user input of the electronic device 100 . The user input may include user input regarding data to be processed by the electronic device 100, user input regarding data complement conditions, and user input regarding at least one missing data processing method to process missing data.

The input unit 110 includes at least one input means. The input unit 110 may include a keyboard, a key pad, a dome switch, a touch panel, a touch key, a mouse, a menu button, and the like.

The communication unit 120 communicates with an external device such as a server or a data collection device to receive data. To this end, the communication unit 120 may perform communication such as 5th generation communication (5G), long term evolution-advanced (LTE-A), long term evolution (LTE), and wireless fidelity (Wi-Fi).

The display unit 130 displays display data according to the operation of the electronic device 100. The display unit 130 may display data necessary for selecting data based on the data complementation conditions, for example, a screen for setting data complementation conditions and a screen for displaying data processing results. Alternatively, the display unit 130 may display data required to process missing data, for example, a screen for processing abnormal data among collected data, a screen for identifying information on missing data, a screen for receiving user input, and a screen for displaying data processing results. The display unit 130 may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic LED (OLED) display, a micro electro mechanical systems (MEMS) display, or an electronic paper display. The display unit 130 may be combined with the input unit 110 and implemented as a touch screen.

The memory 140 stores operating programs of the electronic device 100. The memory 140 includes non-volatile storage that can retain data (information) regardless of whether power is provided, and volatile memory into which data to be processed by the processor 150 is loaded and which cannot retain data when power is not provided. The storage includes flash memory, a hard disk drive (HDD), a solid-state drive (SSD), read-only memory (ROM), and the like, and the memory includes a buffer, random access memory (RAM), and the like.

The memory 140 may store data collected from an external device, data on data complement conditions, information on abnormal data, information on how to process missing data, and the like. In addition, the memory 140 may store information about a model trained to set a section of first data to be processed according to the quality of data, or about a model trained to identify at least one missing data processing method based on information on missing data.

The processor 150 may execute software such as a program to control at least one other component (eg, a hardware or software component) of the electronic device 100 and perform various data processing or calculations.

Meanwhile, the processor 150 sets a section of the first data to be processed among data collected for at least one characteristic, resets missing data included in the section of the first data to generate second data, and processes the second data based on a data supplementation condition prepared to select data requiring supplementation. At least part of the data analysis, processing, and result information generation for these operations may be performed using a rule-based algorithm or, as an artificial intelligence (AI) algorithm, at least one of machine learning, a neural network, and a deep learning algorithm.

The processor 150 processes abnormal data among the collected data, identifies information on missing data, including the processed abnormal data, among the collected data, and processes the missing data using at least one missing data processing method based on the information on the missing data. At least part of the data analysis, processing, and result information generation for these operations may be performed using a rule-based algorithm or, as an artificial intelligence algorithm, at least one of machine learning, a neural network, and a deep learning algorithm. Examples of the neural network may include models such as a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), and a Recurrent Neural Network (RNN).

The processor 150 according to an embodiment of the present invention sets a section of first data to be processed among data collected for at least one characteristic (S310).

According to one embodiment of the present invention, as described above with respect to FIG. 1, the characteristics refer to the contents of collected data, and the collected data is time-sequentially collected for at least one characteristic.

The processor 150 may receive the collected data from an external device such as a server, or the data may be collected by the electronic device 100 itself, and the source of the data is not limited thereto.

The processor 150 may set a section of the first data based on a necessary time section. At this time, the first data becomes a target to be processed among the collected data.

When analyzing the collected data, for example when classifying data patterns using clustering, excluding data that contains a large amount of missing data from the analysis can improve performance. However, for data that includes missing data only to a certain extent, performance can be improved by recovering the data using interpolation or the like and then utilizing it as much as possible. In other words, a criterion is needed to determine the extent to which data including missing data is allowed and selected. Therefore, properly setting the first data can contribute to improving the processing quality of the collected data and can lead to correct results.

According to an embodiment of the present invention, the processor 150 may set a first section of the first data based on the number of missing data included in each section among a plurality of sections of the first data. For example, when setting a time interval using collected data, a plurality of intervals that can be set as the first data may exist. If the number of missing data included in a specific section among a plurality of sections is small, it can be evaluated that the quality of data is good compared to other sections. Accordingly, the processor 150 may set a section having the smallest number of missing data among a plurality of sections of the first data as the first section of the first data.

In addition, the processor 150 may set the first section of the first data based on the number of consecutive missing data included in the first section or the summed number of missing data included in the first section. For example, between a section that includes three consecutive missing data and a section that includes three missing data that are scattered and can therefore be supplemented by interpolation, the latter section contains more valid data and is more likely to be set as the first section.
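The section selection described above can be sketched in Python as a sliding window over the time axis. This is only one possible reading: the window width, the ordering of the two criteria (total number of missing data first, longest consecutive run second), and the function names are assumptions made for illustration.

```python
import math

nan = math.nan

def longest_missing_run(values):
    """Length of the longest run of consecutive NaN values."""
    longest = current = 0
    for v in values:
        current = current + 1 if math.isnan(v) else 0
        longest = max(longest, current)
    return longest

def pick_section(rows, width):
    """Return (start, end) of the time window with the least missing data.

    Windows are compared by total missing count, then by the longest
    consecutive run found in any row; the earliest window wins ties.
    """
    best_key, best_span = None, None
    for start in range(len(rows[0]) - width + 1):
        windows = [row[start:start + width] for row in rows]
        total = sum(math.isnan(v) for w in windows for v in w)
        worst_run = max(longest_missing_run(w) for w in windows)
        key = (total, worst_run)
        if best_key is None or key < best_key:
            best_key, best_span = key, (start, start + width)
    return best_span

# Hypothetical data: two rows collected over six time steps.
rows = [
    [nan, 1.0, 2.0, 3.0, nan, 5.0],
    [nan, nan, nan, 4.0, 5.0, 6.0],
]
section = pick_section(rows, 4)
print(section)
```

Here the window starting at index 2 contains two scattered missing values, whereas the window starting at index 0 contains four, three of them consecutive.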

In another embodiment, the processor 150 may identify the total number of missing data in the collected data and set, as the first section of the first data, a section in which the number of missing data is small relative to that total.

The processor 150 according to an embodiment of the present invention resets the missing data included in the section of the first data to generate second data (S320).

The section of the first data may include uncollected data as well as missing data. Uncollected data refers to points at which no data was collected at all, as distinct from data omitted during collection. It arises when different data are arranged in chronological order and their collection start times or collection end times differ.

According to an embodiment of the present invention, resetting missing data means setting uncollected data included in the section of the first data as missing data. This gives the existing missing data and the uncollected data the same format so that both receive the same processing.
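Resetting missing data can be sketched in Python by laying each data stream onto the time steps of the selected section. The representation of a stream as a mapping from time step to value, and the function name `reset_missing`, are assumptions for illustration.

```python
import math

nan = math.nan

def reset_missing(samples, section):
    """Align one stream to the section's time steps.

    Time steps with no entry at all (uncollected data) are set to NaN, so
    they share one format with values that were already marked as missing.
    """
    row = []
    for t in section:
        value = samples.get(t)  # None when nothing was collected at time t
        row.append(nan if value is None else value)
    return row

# Hypothetical stream whose collection started at time step 2.
d1 = {2: 1.0, 3: nan, 4: 1.2}
second_row = reset_missing(d1, range(0, 5))
print(second_row)
```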

The processor 150 according to an embodiment of the present invention processes the second data based on a data supplementation condition prepared to select data requiring supplementation (S330).

According to an embodiment of the present invention, the processor 150 may set a data complement condition based on at least one of the ratio, period, and number of missing data included in the second data. In this case, the data complement condition may be applied to one data set among data collected according to at least one characteristic. For example, in the case of data collected for a plurality of characteristics, it may be applied to a data set corresponding to each characteristic. Alternatively, in the case of data collected under two or more different conditions for one characteristic, it may be applied to data sets collected corresponding to each condition.

At this time, the processor 150 may receive and set a user input for the data supplementation condition through the input unit 110, or may receive data on the data supplementation condition from an external device through the communication unit 120. In addition, the processor 150 may perform at least part of the data analysis, processing, and result information generation for setting optimized data complement conditions for processing the collected data or the second data using a rule-based algorithm or, as an artificial intelligence algorithm, at least one of machine learning, a neural network, and a deep learning algorithm.

At this time, processing the second data includes performing various data processing, such as selecting third data that satisfies the data complement condition from the second data, or deleting or interpolating the second data or the selected third data.

More specifically, looking at the data complement condition, the processor 150 may process the second data when the ratio of missing data included in the second data is higher than a predefined value.

The processor 150 may process the second data when the period of missing data included in the second data is higher than a predefined value. In this case, the period of the missing data may mean a period for consecutive missing data or a period obtained by summing the periods corresponding to the missing data scattered in the second data.

The processor 150 may process the second data when the number of missing data included in the second data is higher than a predefined value.
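The three criteria above can be sketched together in Python. The sketch adopts one possible reading, in which a data set whose missing data exceeds any configured threshold is excluded and the remaining data sets are selected as third data; the class name, field names, and threshold values are hypothetical, and the period is counted as the longest run of consecutive missing samples.

```python
import math
from dataclasses import dataclass
from typing import Optional

nan = math.nan

@dataclass
class SupplementCondition:
    """Hypothetical thresholds; None means the criterion is not applied."""
    max_ratio: Optional[float] = None
    max_period: Optional[int] = None  # longest run of consecutive missing samples
    max_count: Optional[int] = None

def missing_stats(values):
    """Return (ratio, longest consecutive run, count) of missing values."""
    count = sum(math.isnan(v) for v in values)
    longest = current = 0
    for v in values:
        current = current + 1 if math.isnan(v) else 0
        longest = max(longest, current)
    return count / len(values), longest, count

def exceeds(cond, values):
    """True when any configured threshold is exceeded."""
    ratio, period, count = missing_stats(values)
    return (
        (cond.max_ratio is not None and ratio > cond.max_ratio)
        or (cond.max_period is not None and period > cond.max_period)
        or (cond.max_count is not None and count > cond.max_count)
    )

def select_third_data(second_data, cond):
    """Keep the data sets that stay within every configured threshold."""
    return {name: v for name, v in second_data.items() if not exceeds(cond, v)}

second_data = {
    "D1": [1.0, nan, 3.0, 4.0, 5.0],
    "D2": [nan, nan, nan, 4.0, 5.0],
    "D3": [1.0, nan, nan, 4.0, 5.0],
}
cond = SupplementCondition(max_ratio=0.5, max_period=2)
third_data = select_third_data(second_data, cond)
print(sorted(third_data))
```

In this example D2 is excluded because its missing ratio and its consecutive missing period both exceed the thresholds, while D1 and D3 remain candidates for interpolation.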

According to an embodiment of the present invention, data to be supplemented is selected based on the situation of missing data included in the data, rather than data being deleted or interpolated in batches, and thus more rational and higher-quality data processing is possible.

According to an embodiment of the present invention, even if the time series data includes missing data, only good-quality data can be used by efficiently selecting the time series data based on the quality desired by the user.

FIGS. 4 to 7 sequentially illustrate one embodiment of processing the collected data according to the operation flow described in FIG. 3 above. In this embodiment, data D1 to D7 collected for one characteristic are processed. However, the present invention is not limited to this embodiment, and may process data collected for a plurality of characteristics. In this case, the data shown in FIGS. 4 to 7 may exist for each characteristic, or D1 to D7 may each correspond to a different characteristic.

FIG. 4 is a diagram illustrating how to set a section of first data according to a method according to an embodiment of the present invention. FIG. 4 is described in relation to S310 of FIG. 3.

FIG. 4 shows data 400 including missing data 410 and uncollected data 420. The processor 150 may set a section 430 of the first data to be processed in the collected data 400. According to an embodiment of the present invention, the processor 150 may set a first section 430 of the first data among a plurality of sections of the first data in consideration of all of the missing data 410 and the uncollected data 420.

For example, in the case of the currently set section 430, the number of missing data and uncollected data is 7, whereas if the section is set forward by one column, the number of missing data and uncollected data is 9. In addition, it can be seen that the number of consecutive missing data increases to three, such as the D3 row, and the quality of the data is further deteriorated.

According to an embodiment of the present invention, setting a section of the first data among the collected data, as part of a preprocessing step for selecting data that satisfies the data supplementation condition, can contribute to further improving data quality.
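The section choice described above can be sketched as a sliding-window search. This is only an illustration under stated assumptions: the data is held as rows of values with `None` marking a missing or uncollected value, the names `longest_gap` and `best_section` are hypothetical, and the scoring rule (fewest gaps, then shortest consecutive gap) is one possible reading of the description rather than the claimed method.

```python
from typing import List, Optional, Tuple

Row = List[Optional[float]]  # None marks a missing or uncollected value


def longest_gap(row: Row) -> int:
    """Length of the longest run of consecutive None values in a row."""
    best = run = 0
    for value in row:
        run = run + 1 if value is None else 0
        best = max(best, run)
    return best


def best_section(rows: List[Row], width: int) -> Tuple[int, int]:
    """Return (start, end) of the window with the fewest gaps.

    Ties are broken by the shortest longest-consecutive gap, then by
    the earliest start. This scoring rule is an assumption.
    """
    n_cols = len(rows[0])
    scored = []
    for start in range(n_cols - width + 1):
        window = [row[start:start + width] for row in rows]
        total = sum(v is None for row in window for v in row)
        worst_run = max(longest_gap(row) for row in window)
        scored.append((total, worst_run, start))
    total, worst_run, start = min(scored)
    return start, start + width


# Hypothetical rows, not the data of FIG. 4.
rows = [
    [None, 1.0, 2.0, 3.0, 4.0],
    [None, None, 2.5, None, 4.5],
    [None, 1.2, 2.2, 3.2, None],
]
print(best_section(rows, width=3))  # (1, 4)
```

The window starting one column later has the same number of gaps, so the earliest start is kept.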

FIG. 5 is a diagram showing how second data is generated by a method according to an embodiment of the present invention. FIG. 5 is described in relation to S320 of FIG. 3.

FIG. 5 shows second data 500 generated by processing the first data previously set in FIG. 4. According to an embodiment of the present invention, the processor 150 generates the second data 500 by resetting the missing data 410 included in the section 430 of the first data.

At this time, resetting the missing data means setting the uncollected data 420 included in the section 430 of the first data as missing data 410. This unifies the existing missing data 410 and the uncollected data 420 into the same format so that they receive the same processing.
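A minimal sketch of this resetting step, assuming each row is stored as a mapping from time slot to value, that missing data is already marked as NaN, and that uncollected data simply has no entry. The function name `reset_missing` and the integer time slots are hypothetical.

```python
import math
from typing import Dict, List


def reset_missing(collected: Dict[int, float], section: List[int]) -> List[float]:
    """Return one value per time slot in the section.

    Slots absent from `collected` (uncollected data) are written as NaN,
    the same marker already used for missing data.
    """
    return [collected.get(t, math.nan) for t in section]


# Hypothetical row: slot 2 is missing (NaN), slots 4 and 5 were never collected.
collected = {0: 10.0, 1: 11.0, 2: math.nan, 3: 13.0}
second = reset_missing(collected, section=list(range(6)))
print(second)  # [10.0, 11.0, nan, 13.0, nan, nan]
```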

FIG. 6 is a diagram illustrating processing of second data based on a data complement condition by a method according to an embodiment of the present invention. FIG. 7 is a diagram illustrating processing of second data by a method according to an embodiment of the present invention. FIGS. 6 and 7 are described in relation to S330 of FIG. 3.

According to an embodiment of the present invention, the processor 150 may set a data complement condition based on at least one of the ratio, period, and number of missing data 410 included in the second data 500 .

More specifically, with regard to the data complement condition, the processor 150 may process the second data 500 when the ratio of the missing data 410 included in the second data 500 is higher than a predefined value.

The processor 150 may process the second data 500 when the period of the missing data 410 included in the second data 500 is higher than a predefined value. In this case, the period of the missing data 410 may mean the period of consecutive missing data 410, or a period obtained by summing the periods corresponding to the missing data 410 scattered across the second data 500.

The processor 150 may process the second data 500 when the number of missing data 410 included in the second data 500 is higher than a predefined value.

At this time, processing the second data 500 by the processor 150 includes selecting third data 510 that satisfies a data complement condition from the second data 500 .

For example, the data complement condition set for the second data 500 shown in FIG. 6 is that the number of missing data 410 is two or more, and the processor 150 may select third data 510 that satisfies the data complement condition and therefore requires supplementation.

In this case, the data complement condition may be applied to one data set among data collected according to at least one characteristic. For example, it is assumed that the second data 500 is data measuring the amount of fine dust for each city, and rows D1 to D7 are data for the amount of fine dust collected in different cities. The data complement condition for identifying cities in which the number of missing data 410 is two or more is applied to each of rows D1 to D7, so that the processor 150 may select the data in rows D3 and D5 of the second data 500 as the third data 510 that requires supplementation.
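The selection by count can be sketched as follows. The helper also returns the ratio and the longest consecutive run, since the description names ratio, period, and number as possible bases for the condition. The function names and the sample values are hypothetical and do not reproduce the data of FIG. 6.

```python
import math
from typing import Dict, List


def missing_stats(row: List[float]) -> Dict[str, float]:
    """Count, ratio, and longest consecutive run of NaN values in a row."""
    count = longest = run = 0
    for value in row:
        if math.isnan(value):
            count += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return {"count": count, "ratio": count / len(row), "longest_run": longest}


def select_third_data(second: Dict[str, List[float]], min_count: int) -> List[str]:
    """Names of rows whose number of missing values is min_count or more."""
    return [name for name, row in second.items()
            if missing_stats(row)["count"] >= min_count]


nan = math.nan
second = {  # hypothetical values, one row per city
    "D1": [1.0, 2.0, 3.0, 4.0],
    "D2": [1.0, nan, 3.0, 4.0],
    "D3": [nan, nan, 3.0, 4.0],
    "D4": [1.0, 2.0, 3.0, nan],
    "D5": [nan, 2.0, nan, 4.0],
}
print(select_third_data(second, min_count=2))  # ['D3', 'D5']
```

A ratio-based or period-based condition would compare `ratio` or `longest_run` against its own threshold in the same way.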

The processor 150 according to an embodiment of the present invention may delete or interpolate the selected third data 510 . In this embodiment, the selected third data 510 is deleted.

The processor 150 identifies the missing data that remains after the third data has been selected and processed as data 710 requiring interpolation. The processor 150 may perform interpolation on the data 710 requiring interpolation, and may perform analysis using the restored data 700.

According to an embodiment of the present invention, since data requiring supplementation is selected based on data supplementation conditions, high-quality data can be provided. In addition, since analysis is performed on data processed after this selection, unreasonable deletion or interpolation can be avoided, enabling higher-quality data analysis.
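As an illustration of deleting the selected rows and then interpolating what is left, the sketch below uses plain linear interpolation. The interpolation method is an assumption made for the example (the description allows other methods), and the names `interpolate_linear` and `restore` are hypothetical.

```python
import math
from typing import Dict, List


def interpolate_linear(row: List[float]) -> List[float]:
    """Fill interior NaN runs linearly; edge runs copy the nearest value."""
    known = [i for i, v in enumerate(row) if not math.isnan(v)]
    if not known:
        return list(row)  # nothing to interpolate from
    out = list(row)
    for i, v in enumerate(row):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = row[right]
        elif right is None:
            out[i] = row[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = row[left] + weight * (row[right] - row[left])
    return out


def restore(second: Dict[str, List[float]], third: List[str]) -> Dict[str, List[float]]:
    """Drop the selected rows, then interpolate what is still missing."""
    return {name: interpolate_linear(row)
            for name, row in second.items() if name not in third}


nan = math.nan
second = {"D2": [1.0, nan, 3.0, 4.0], "D3": [nan, nan, 3.0, 4.0]}
print(restore(second, third=["D3"]))  # {'D2': [1.0, 2.0, 3.0, 4.0]}
```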

The processor 150 according to an embodiment of the present invention processes abnormal data among the collected data (S810). The operation of the processor 150 in step S810 may be an operation of processing abnormal data among the first data in relation to the step S320 of FIG. 3 .

The collected data is collected in time series for at least one characteristic. For example, it may be temperature data collected from a temperature sensor. The processor 150 may receive the collected data from an external device such as a server, or the data may be collected by the electronic device 100 itself; the source is not limited thereto.

Data analysis presupposes data integrity, but in the process of collecting actual data, missing or abnormal data frequently occurs for various reasons.

Abnormal data is data that adversely affects the result of analyzing the collected data. For example, it refers to error data, such as collected data that has an abnormal value or that exceeds the allowable measurement range of the sensor that collects the data.

According to an embodiment of the present invention, the processor 150 may replace abnormal data among collected data with missing data for processing, or may interpolate appropriate data using data collected before and after the abnormal data.

Missing data according to an embodiment of the present invention is comprehensively defined as data that cannot be converted and displayed in any form, such as numbers or letters, and data that cannot be defined or does not exist. That is, it refers to the case where no data was collected at a given time, or where data was collected but lost in the process of transmission to a device such as a server.

In general, the value of missing data can be expressed in various ways, such as an extreme value like "-999" or a predetermined character string like "NaN" or "NA". However, when data is recorded with non-standardized missing data notation, it can be difficult to clearly distinguish normal data from abnormal data. Therefore, in the present invention, abnormal data is marked as "NaN" or "NA" and thereby replaced with missing data.
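A minimal sketch of unifying the notation, assuming raw readings arrive as strings or numbers, that the listed markers are the non-standard notations in use, and that the sensor's allowable range is known. The names `SENTINELS` and `to_missing` and the range values are hypothetical.

```python
import math
from typing import List, Union

SENTINELS = {"-999", "NaN", "NA", ""}  # assumed non-standard missing markers


def to_missing(raw: List[Union[str, float]], low: float, high: float) -> List[float]:
    """Map sentinel markers and out-of-range readings to NaN."""
    cleaned = []
    for item in raw:
        text = str(item).strip()
        if text in SENTINELS:
            cleaned.append(math.nan)
            continue
        value = float(text)
        # Readings outside the sensor's allowable range are treated as abnormal.
        cleaned.append(value if low <= value <= high else math.nan)
    return cleaned


readings = ["21.5", "-999", "NA", 22.0, "180.0"]
print(to_missing(readings, low=-40.0, high=85.0))  # [21.5, nan, nan, 22.0, nan]
```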

The processor 150 according to an embodiment of the present invention identifies information about missing data including processed abnormal data among the collected data (S820). In relation to step S330 of FIG. 3, the operation of the processor 150 in step S820 may be an operation of identifying information on missing data including the processed abnormal data among the first data and, based on the identified information on the missing data, processing the missing data included in the second data using at least one missing data processing method.

According to an embodiment of the present invention, collected data may include missing data as well as abnormal data. According to an embodiment of the present invention, the missing data includes missing data substituted from abnormal data in step S810 and missing data previously included in collected data.

According to an embodiment of the present invention, information on missing data includes at least one of information about a location of missing data and information about continuity of missing data. According to an embodiment of the present invention, the information about the location of missing data includes, for example, information about rows and columns where missing data is located in tabular data. In addition, the information on the continuity of the missing data includes information on the degree (time) of the continuity of the missing data and information capable of identifying the tendency or pattern of the missing data, such as the distribution of the missing data.

Accordingly, the processor 150 may identify information about the missing data including at least one of information about the location of the missing data and information about continuity of the missing data.
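The location and continuity information can be illustrated by listing each run of consecutive missing values with its start position and length. This is a sketch for a single row with NaN as the missing marker; the function name `missing_runs` is hypothetical.

```python
import math
from typing import List, Tuple


def missing_runs(row: List[float]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive NaN values."""
    runs = []
    start = None
    for i, value in enumerate(row):
        if math.isnan(value):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # run that reaches the end of the row
        runs.append((start, len(row) - start))
    return runs


nan = math.nan
row = [1.0, nan, nan, 4.0, nan, 6.0, nan]
print(missing_runs(row))  # [(1, 2), (4, 1), (6, 1)]
```

For tabular data, applying the function per row yields the row and column positions together with the degree of continuity.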

The processor 150 according to an embodiment of the present invention processes the missing data using at least one missing data processing method based on information on the missing data (S830).

The processor 150 according to an embodiment of the present invention may supplement missing data based on information about the location of the missing data and/or information about the continuity of the missing data.

In this case, the processor 150 may identify at least one missing data processing method to process the missing data corresponding to at least one section based on the missing data information. The processor 150 may complement the missing data by considering parameter information for adjusting the processing degree of the missing data according to information on the missing data. Parameter information according to the present embodiment may include information on a section including missing data, information on a method for processing missing data, conditions for processing missing data, and the like.

For example, a section including 10 consecutive missing data may be processed by applying one missing data processing method. As another example, a section including 10 consecutive pieces of missing data may be divided into three sections, and different missing data processing methods may be applied to each section for processing. In addition, a plurality of missing data processing methods are applied to each section, and the final supplemented data value may be derived by applying an average value or a predetermined ratio of supplemented data values according to each processing method.

In this case, the processor 150 may process the missing data based on a condition for determining whether to process the missing data, that is, a condition for determining whether to supplement data. For example, the missing data may be handled such that complementation is performed only when the missing data is 20% or less of the total data, or only for runs of 10 or fewer consecutive missing data when the missing data does not exceed 30% of the total data.
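The second example condition above can be sketched as a simple predicate. The default thresholds are taken from that example; the function name `should_supplement` and the choice to evaluate the condition per row are assumptions.

```python
import math
from typing import List


def should_supplement(row: List[float],
                      max_ratio: float = 0.30,
                      max_run: int = 10) -> bool:
    """True when the row is sparse enough in gaps to be worth supplementing."""
    missing = longest = run = 0
    for value in row:
        if math.isnan(value):
            missing += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return missing / len(row) <= max_ratio and longest <= max_run


nan = math.nan
print(should_supplement([1.0, nan, 3.0, 4.0, 5.0]))  # True: 20% missing
print(should_supplement([1.0, nan, nan, 4.0, 5.0]))  # False: 40% missing
```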

According to an embodiment of the present invention, the missing data processing methods include, for example, "mean", "median", "frequent", "ffill", "bfill", "linear_interpolation", "spline_interpolation", "stineman_interpolation", "KNN", "ARIMA", "Randomforest", "NAOMI", "BRITS", and the like, but are not limited thereto.

According to an embodiment of the present invention, the processor 150 may perform at least a part of the data analysis, processing, and result information generation for adjusting the processing degree of missing data according to the information on the missing data by using at least one of a rule-based algorithm or an artificial intelligence algorithm such as a machine learning, neural network, or deep learning algorithm.

In addition, in order to supplement missing data adaptively to a user's request, the processor 150 may receive, through the input unit 110, a user input regarding at least one missing data processing method for processing missing data corresponding to at least one section. Accordingly, the processor 150 may supplement missing data by applying at least one missing data processing method according to parameter information defined by a user.

According to an embodiment of the present invention, since missing data is supplemented by applying a method suited to the state of the section that includes it, more reasonable and high-quality data processing is possible.
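To illustrate how several of the named methods might be selected by name and how their results could be averaged, the sketch below implements only "mean", "median", "ffill", and "bfill" in plain Python. The `METHODS` table, the helper names, and the equal-weight average in `blend` are assumptions made for the example; the other listed methods are not implemented here.

```python
import math
import statistics
from typing import Callable, Dict, List

Series = List[float]


def _known(row: Series) -> Series:
    """Observed (non-NaN) values of a row."""
    return [v for v in row if not math.isnan(v)]


def fill_constant(row: Series, value: float) -> Series:
    """Replace every NaN with the given constant."""
    return [value if math.isnan(v) else v for v in row]


def ffill(row: Series) -> Series:
    """Carry the last observed value forward; leading NaN stay NaN."""
    out, last = [], math.nan
    for v in row:
        last = last if math.isnan(v) else v
        out.append(last)
    return out


def bfill(row: Series) -> Series:
    """Carry the next observed value backward; trailing NaN stay NaN."""
    return ffill(row[::-1])[::-1]


METHODS: Dict[str, Callable[[Series], Series]] = {
    "mean": lambda r: fill_constant(r, statistics.mean(_known(r))),
    "median": lambda r: fill_constant(r, statistics.median(_known(r))),
    "ffill": ffill,
    "bfill": bfill,
}


def blend(row: Series, names: List[str]) -> Series:
    """Average the outputs of several methods, position by position."""
    filled = [METHODS[name](row) for name in names]
    return [sum(values) / len(values) for values in zip(*filled)]


nan = math.nan
row = [2.0, nan, nan, 8.0]
print(blend(row, ["ffill", "bfill"]))  # [2.0, 5.0, 5.0, 8.0]
```

A predetermined ratio instead of a plain average would replace the division in `blend` with a weighted sum.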

FIG. 9 is a diagram illustrating an operation of an electronic device according to an embodiment of the present invention. In this embodiment, the process 900 of processing missing data will be described; contents overlapping with those described with reference to FIG. 8 apply in the same manner, so a detailed description thereof is omitted.

The processor 150 according to an embodiment of the present invention processes abnormal data 20 among the collected data (hereinafter referred to as collected data 10) (910).

More specifically, the abnormal data 20 includes certain abnormal data 21 and uncertain abnormal data 22. Certain abnormal data 21 means error data that can be clearly determined to be abnormal, such as data having a value outside the minimum-maximum range that the value of the collected data 10 can have. Uncertain abnormal data 22 refers to data that is not a clear error but is suspected of being abnormal, for example because it shows a marked difference from the data acquired before and after it.

The processor 150 identifies abnormal data 20 including certain abnormal data 21 and uncertain abnormal data 22 among the collected data 10, and processes the certain abnormal data 21 and the uncertain abnormal data 22 respectively. For example, the processor 150 may replace the certain abnormal data 21 of the collected data 10 with missing data, or may replace the uncertain abnormal data 22 with missing data or interpolate it to an appropriate value using the data collected before and after the uncertain abnormal data 22. At this time, the processor 150 may receive a user input for determining a value of the uncertain abnormal data 22 through the input unit 110.
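One possible way to separate the two kinds of abnormal data is sketched below. The range check follows the description of certain abnormal data; the neighbour-difference rule for uncertain abnormal data, the `max_jump` parameter, and the function name `classify` are assumptions chosen for illustration, and the input is assumed to contain no NaN values.

```python
from typing import List


def classify(values: List[float], low: float, high: float,
             max_jump: float) -> List[str]:
    """Label each value 'ok', 'certain', or 'uncertain'.

    'certain'   : outside the [low, high] range the value can take.
    'uncertain' : in range, but differs from every in-range neighbour
                  by more than max_jump (an assumed heuristic).
    """
    in_range = [low <= v <= high for v in values]
    labels = []
    for i, v in enumerate(values):
        if not in_range[i]:
            labels.append("certain")
            continue
        neighbours = [values[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(values) and in_range[j]]
        jumpy = bool(neighbours) and all(abs(v - n) > max_jump for n in neighbours)
        labels.append("uncertain" if jumpy else "ok")
    return labels


temps = [20.0, 20.5, 35.0, 21.0, 21.2, 900.0, 21.5]
print(classify(temps, low=-40.0, high=85.0, max_jump=10.0))
# ['ok', 'ok', 'uncertain', 'ok', 'ok', 'certain', 'ok']
```

Values labelled "certain" would then be replaced with missing data, while values labelled "uncertain" could be replaced, interpolated, or referred to the user.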

The processor 150 according to an embodiment of the present invention identifies information about missing data 30 including processed abnormal data among the collected data 10 (920).

The processor 150 according to an embodiment of the present invention processes the missing data 30 using at least one missing data processing method based on information on the missing data 30 (930). As a result, processed data 40 obtained by processing the collected data 10 is obtained.

According to an embodiment of the present invention, abnormal data can be processed more precisely because abnormal data is classified into certain abnormal data and uncertain abnormal data.

FIG. 10 is a diagram illustrating an operation of an electronic device according to another embodiment of the present invention. FIG. 10 describes a method 1000 for integrating a plurality of processed data 40 obtained by respectively processing a plurality of collected data 10.

According to one embodiment of the present invention, in order to integrate a plurality of collected data 10 including Data1, Data2, ..., DataN, the data processing 900 described with reference to FIGS. 8 and 9 should first be performed on each collected data 10. The processed data 40 obtained through the data processing 900 for each collected data 10 includes Data1', Data2', ..., DataN'.

The processor 150 according to an embodiment of the present invention combines the acquired processed data 40 (1010).

The process of combining the processed data 40 will be described in detail with reference to the data in Table 1. It is assumed that data 1, data 2, and data 3 shown in Table 1 are processed data 40 for which data processing 900 has been completed individually.

Table 1
Data	Collection period	Measurement interval
Data 1	January 1, 0:00 to January 10, 24:00	1 minute
Data 2	January 1, 3:00 to January 10, 23:00	1 hour
Data 3	January 1, 0:00 to January 11, 24:00	3 hours

According to an embodiment of the present invention, the processor 150 may set a combination period of the plurality of processed data 40 as shown in Table 2.

Table 2
Combination period	Time range
Combination period 1	January 1, 3:00 to January 10, 23:00
Combination period 2	January 1, 0:00 to January 10, 24:00

According to an embodiment of the present invention, the processor 150 may reset missing data according to the combination period. According to an embodiment of the present invention, resetting the missing data means setting uncollected data as missing data when uncollected data arises outside the time period in which the data was collected. This unifies the existing missing data and the uncollected data into the same format so that they receive the same processing.

For example, when the combination period is set to combination period 1, some data of data 1, all data of data 2, and some data of data 3 are used, so resetting additional missing data is unnecessary.

However, when the combination period is set to combination period 2, all of data 1 and part of data 3 are used, so no missing data needs to be reset for them, whereas data 2 has no data between 0:00 and 3:00 on January 1 or between 23:00 and 24:00 on January 10, so the uncollected data corresponding to those times must be reset as missing data.
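The case of data 2 and combination period 2 can be sketched as follows. The year is chosen arbitrarily because none is given, 24:00 on January 10 is represented as 0:00 on January 11, the constant value 1.0 stands in for real measurements, and the names `time_slots` and `align` are hypothetical.

```python
import math
from datetime import datetime, timedelta
from typing import Dict, List


def time_slots(start: datetime, end: datetime, step: timedelta) -> List[datetime]:
    """All slot timestamps from start to end inclusive."""
    slots = []
    current = start
    while current <= end:
        slots.append(current)
        current += step
    return slots


def align(series: Dict[datetime, float], slots: List[datetime]) -> List[float]:
    """Value per slot; slots the series never covered become NaN."""
    return [series.get(t, math.nan) for t in slots]


hour = timedelta(hours=1)
# Data 2: hourly, January 1 3:00 to January 10 23:00 (year chosen arbitrarily).
data2 = {t: 1.0 for t in time_slots(datetime(2021, 1, 1, 3),
                                    datetime(2021, 1, 10, 23), hour)}
# Combination period 2: January 1 0:00 to January 10 24:00 (= January 11 0:00).
period2 = time_slots(datetime(2021, 1, 1, 0), datetime(2021, 1, 11, 0), hour)
aligned = align(data2, period2)
print(sum(math.isnan(v) for v in aligned))  # 4 slots reset to missing
```

The four reset slots are 0:00, 1:00, and 2:00 on January 1 and 24:00 on January 10.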

According to an embodiment of the present invention, the processor 150 may combine data based on a data collection period of the plurality of processed data 40 . For example, the processor 150 may reindex data based on a data collection period of the plurality of processed data 40 . More specifically, the processor 150 may perform upsampling or downsampling of each of the plurality of processed data 40 and combine them based on the data collection period of the plurality of processed data 40 .

For example, if the data are combined at a period of 1 minute, data 2 and data 3 need to be upsampled, and if the data are combined at a period of 1 hour, data 1 needs to be downsampled and data 3 needs to be upsampled.

At this time, downsampling can use a well-known statistical calculation such as an average, whereas upsampling can be performed in a wide variety of ways whose data restoration effects differ considerably, so upsampling may be performed by applying at least one of the missing data processing methods described above. However, since this is only an example, the methods of performing upsampling and downsampling are not limited thereto.

After combining the data, the processor 150 may perform data processing 1020 again on the combined data. In this case, the data processing 1020 may be the same as the data processing 900, and the data processing 1020 and the data processing 900 may be performed by the same processor or by different processors. More specifically, the processor 150 may process a plurality of collected data respectively to obtain a plurality of processed data, combine the plurality of processed data, process abnormal data among the combined data, identify information on missing data including the processed abnormal data among the combined data, and process the missing data using at least one missing data processing method based on the information on the missing data. The processor 150 may process the missing data and integrate the data (1030).
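A minimal sketch of both directions, assuming evenly spaced samples and an integer ratio between the two periods. Averaging is used for downsampling as the text suggests; linear interpolation is only one of the possible choices for upsampling. The function names are hypothetical.

```python
import math
from typing import List


def downsample_mean(values: List[float], factor: int) -> List[float]:
    """Average each block of `factor` samples, ignoring NaN inside a block."""
    out = []
    for i in range(0, len(values), factor):
        block = [v for v in values[i:i + factor] if not math.isnan(v)]
        out.append(sum(block) / len(block) if block else math.nan)
    return out


def upsample_linear(values: List[float], factor: int) -> List[float]:
    """Insert factor - 1 linearly interpolated points between neighbours."""
    out = []
    for left, right in zip(values, values[1:]):
        for k in range(factor):
            out.append(left + (right - left) * k / factor)
    out.append(values[-1])
    return out


three_hourly = [0.0, 3.0, 9.0]
print(upsample_linear(three_hourly, factor=3))
# [0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
samples = [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
print(downsample_mean(samples, factor=3))  # [2.0, 7.0]
```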

Claims

An electronic device comprising:

a processor configured to:

set a section of first data to be processed among data collected for at least one characteristic;

reset missing data included in the section of the first data to generate second data; and

process the second data based on a data supplementation condition provided for selecting data requiring supplementation.
The electronic device of claim 1,

wherein the processor:

sets the data complement condition based on at least one of the ratio, period, and number of missing data included in the second data; and

selects third data that satisfies the data complement condition from among the second data.
The electronic device of claim 2,

wherein the processor

processes the second data when a ratio of missing data included in the second data is higher than a predefined value.
The electronic device of claim 2,

wherein the processor

processes the second data when a period of missing data included in the second data is higher than a predefined value.
The electronic device of claim 2,

wherein the processor

processes the second data when the number of missing data included in the second data is higher than a predefined value.
The electronic device of claim 2,

wherein the processor

sets a first section of the first data, among a plurality of sections of the first data, based on the number of missing data included in each section.
The electronic device of claim 6,

wherein the processor

sets the first section of the first data based on the number of consecutive missing data included in the first section or the total number of missing data included in the first section.
The electronic device of claim 1,

wherein the processor:

processes abnormal data among the collected data;

identifies information about missing data including the processed abnormal data among the collected data; and

processes the missing data by using at least one missing data processing method based on the information on the missing data.
The electronic device of claim 8,

wherein the processor

identifies information about the missing data including at least one of information about a location of the missing data and information about continuity of the missing data.
The electronic device of claim 8,

wherein the processor:

identifies abnormal data including certain abnormal data and uncertain abnormal data among the collected data; and

processes the certain abnormal data and the uncertain abnormal data, respectively.
The electronic device of claim 1,

wherein the processor

identifies, based on information on the missing data, at least one missing data processing method for processing missing data corresponding to at least one section.
The electronic device of claim 11,

further comprising an input unit,

wherein the processor

receives, through the input unit, a user input related to at least one missing data processing method for processing the missing data corresponding to the at least one section.
The electronic device of claim 8,

wherein the processor:

acquires a plurality of processed data by respectively processing a plurality of collected data included in the collected data;

combines the plurality of processed data;

processes abnormal data among the combined data;

identifies information about missing data including the processed abnormal data among the combined data; and

processes the missing data by using at least one missing data processing method based on the information on the missing data.
A method for performing data selection based on a data supplementation condition, the method comprising:

setting a section of first data to be processed among data collected for at least one characteristic;

generating second data by resetting missing data included in the section of the first data; and

processing the second data based on a data supplementation condition provided for selecting data requiring supplementation.
The method of claim 14,

wherein the generating of the second data comprises

processing abnormal data among the first data, and

wherein the processing of the second data comprises:

identifying information about missing data including the processed abnormal data; and

processing missing data included in the second data using at least one missing data processing method based on the information on the missing data.