WO2023182661A1

WO2023182661A1 - Electronic device for analyzing big data and operation method thereof

Info

Publication number: WO2023182661A1
Application number: PCT/KR2023/002189
Authority: WO
Inventors: 김정곤; 도기석; 배국진
Original assignee: Samsung Electronics Co., Ltd.
Priority date: 2022-03-24
Filing date: 2023-02-15
Publication date: 2023-09-28
Also published as: KR20230138604A

Abstract

An electronic device according to various embodiments comprises a memory and at least one processor. The at least one processor can be set to: crawl and acquire a cost policy of a big data providing device that provides big data to be analyzed; acquire information about the big data; select a portion of the big data as data to be analyzed; perform an analysis task on the selected data; predict the analysis task execution time in a resource environment of a first condition on the basis of the cost policy, the information about the big data, and the result of the analysis of the selected data; and calculate, on the basis of the analysis task execution time in the resource environment of the first condition, the analysis task execution time and the billing cost in a resource environment having a plurality of second conditions different from the first condition. Various other embodiments may be provided.

Description

Electronic device for analyzing big data and method of operation thereof

Various embodiments relate to the analysis of big data, and more specifically, to an electronic device that analyzes big data based on cost and time efficiency and a method of operating the same.

Recently, as the use of big data has increased, big data service providers collect and utilize various data from individuals, institutions, etc. Data users can search and store desired data using big data services provided by big data service providers through the big data analysis server.

When operating the server, the big data analysis server can retrieve the data desired by the user from the big data platform and perform analysis. At this time, costs are incurred depending on the resources and time corresponding to the amount of big data read from the big data platform, and therefore a method to minimize these costs is required.

Various embodiments of the present disclosure can provide an apparatus and operating method for automatically predicting the integrated cost and time for analyzing big data based on the pricing policies of various big data platforms.

The technical problems to be solved by the various embodiments are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those skilled in the art to which the various embodiments of the present disclosure belong.

An electronic device according to various embodiments includes a memory and at least one processor, wherein the at least one processor may be set to: crawl and obtain a cost policy of a big data providing device that provides big data to be analyzed; obtain information about the big data; select some of the big data as data to be analyzed; perform an analysis task on the selected data; predict an analysis task execution time in a resource environment of a first condition based on the cost policy, the information about the big data, and an analysis result of the selected data; and calculate, based on the analysis task execution time in the resource environment of the first condition, an analysis task execution time and a billing cost in a resource environment having a plurality of second conditions different from the first condition.

A method of operating an electronic device according to various embodiments includes: an operation of crawling and obtaining a cost policy of a big data providing device that provides big data to be analyzed; an operation of obtaining information about the big data; an operation of selecting some data from the big data as data to be analyzed; an operation of performing an analysis task on the selected data; an operation of predicting an analysis task execution time in a resource environment of a first condition based on the cost policy, the information about the big data, and an analysis result of the selected data; and an operation of calculating an analysis task execution time and cost in a resource environment having a plurality of second conditions different from the first condition, based on the analysis task execution time in the resource environment of the first condition.

Devices and methods for analyzing and providing big data according to various embodiments of the present disclosure can automatically predict the integrated cost and time for analyzing big data based on the pricing policies of various big data platforms.

The apparatus and method for analyzing and providing big data according to various embodiments of the present disclosure predict cost and time based on the pricing policies of various big data platforms, allowing users to perform big data analysis according to a desired cost and time.

FIG. 1 is a diagram illustrating a system for analyzing and providing big data according to various embodiments.

Figure 2 is a block diagram illustrating a big data analysis device according to various embodiments.

Figure 3 is a diagram for explaining the operation of a big data analysis device according to various embodiments.

Figure 4 is a diagram for explaining the operation of a big data analysis device according to various embodiments.

Figure 5 is a diagram for explaining the operation of a big data analysis device according to various embodiments.

Figure 6 is a diagram showing an example of a three-step query of a big data device according to various embodiments.

FIG. 7 is a diagram illustrating an electronic device implementing a big data analysis device according to various embodiments.

The various embodiments of the present disclosure and the terms used herein are not intended to limit the technical features described in the present disclosure to specific embodiments, and should be understood to include various changes, equivalents, or replacements of the embodiments. In connection with the description of the drawings, similar reference numbers may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of the items, unless the relevant context clearly indicates otherwise. In the present disclosure, each of phrases such as “A or B”, “at least one of A and B”, “at least one of A or B”, “A, B or C”, “at least one of A, B and C”, and “at least one of A, B, or C” may include any one of the items listed together in the corresponding phrase, or any possible combination thereof. Terms such as “first” and “second” may be used simply to distinguish one component from another, and do not limit the components in other respects (e.g., importance or order). When one (e.g., first) component is referred to as being “coupled” or “connected” to another (e.g., second) component, with or without the terms “functionally” or “communicatively”, it means that the component can be connected to the other component directly (e.g., by wire), wirelessly, or through a third component.

The term “module” used in various embodiments of the present disclosure may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be an integrally formed part, or a minimum unit of the part or a portion thereof, that performs one or more functions. For example, according to one embodiment, the module may be implemented in the form of an application-specific integrated circuit (ASIC).

Various embodiments of the present disclosure may be implemented as software including one or more instructions stored in a storage medium that can be read by a machine. For example, the processor of the device may call at least one instruction among the one or more instructions stored in the storage medium and execute it. This allows the device to be operated to perform at least one function according to the at least one instruction called. The one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. A storage medium that can be read by a device may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' only means that the storage medium is a tangible device and does not contain signals (e.g., electromagnetic waves); the term does not distinguish between cases where data is stored semi-permanently in the storage medium and cases where it is stored temporarily.

According to one embodiment, methods according to various embodiments of the present disclosure may be included and provided in a computer program product. Computer program products are commodities and can be traded between sellers and buyers. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or distributed (e.g., downloaded or uploaded) online through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be at least temporarily stored or temporarily created in a machine-readable storage medium, such as the memory of a manufacturer's server, an application store server, or a relay server.

According to various embodiments, each component (e.g., module or program) of the above-described components may include a single entity or a plurality of entities, and some of the plurality of entities may be separately placed in other components. According to various embodiments, one or more of the components or operations described above may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each component of the plurality of components in the same or similar manner as those performed by the corresponding component of the plurality of components prior to the integration. According to various embodiments, operations performed by a module, program, or other component may be executed sequentially, in parallel, iteratively, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.

Referring to FIG. 1, a system for analyzing and providing big data according to various embodiments may include a big data analysis device 110, a big data consumption device 120, and a big data provision device 130.

The big data providing device 130 may include a big data platform and database that collects and manages big data based on a database cloud service. In one embodiment, the big data providing device 130 may include Google™ BigQuery provided by Google™ Cloud, AWS (Amazon Web Services), Microsoft™ Azure, or Kubernetes.

The big data analysis device 110 may read big data from the big data providing device 130 at the request of the big data consuming device 120 and provide it to the big data consuming device 120.

The big data consumption device 120 may be used by a big data consumer (e.g., a user). The big data consumption device 120 can check a cost table indicating time and resources for each data analysis method provided by the big data analysis device 110. The big data consumption device 120 can select a desired big data analysis method based on the checked cost table for big data analysis. In one embodiment, if a big data consumer pre-registers a desired condition, for example, a condition that prioritizes cost over time, in the big data consumption device 120, the big data consumption device 120 can automatically select the big data analysis method that matches the big data consumer's condition.

The big data consumption device 120 may request the big data analysis device 110 to perform a big data analysis task according to the selected big data analysis method.

The big data analysis device 110 can collect, analyze, and store big data based on the big data analysis method selected by the big data consumption device 120. The big data analysis device 110 may provide the big data analysis results to the big data consumption device 120.

The big data analysis device 110 can collect big data from the big data providing device 130. The big data analysis device 110 may request big data from the big data providing device 130 and receive big data transmitted from the big data providing device 130 upon request.

The big data analysis device 110 may be allocated resources for big data collection from the big data provision device 130. The big data analysis device 110 can manage the resources allocated from the big data providing device 130 and the time required to collect big data from the big data providing device 130. The cost for collecting big data may vary depending on the cost policy of the big data providing device 130. For example, when using Google Big Query as a data source, data is stored in column units, and charges can be made based on the size of the data stored and the columns of data scanned when querying the data. For example, when using the first platform (e.g., Kubernetes) as a platform, the amount of resources required may vary depending on the configuration of containers, nodes, pods, etc., and the cost charged may vary accordingly. The big data analysis device 110 can secure optimized costs and resources by predicting the time and cost for each job in advance based on the cost policy for each platform and data source of the big data provision device 130.
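As a rough illustration of column-based charging, the following sketch estimates a query charge from the stored size of the columns that a query scans. The function name, the column sizes, and the price per TiB are hypothetical placeholders and are not taken from any actual cost policy; in the described device the price would come from the crawled cost policy.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def estimate_scan_cost(column_bytes, scanned_columns, price_per_tib):
    """Estimate the charge for a query that scans only some columns.

    column_bytes: mapping of column name -> stored size in bytes
                  (assumed to be known from table metadata).
    price_per_tib: hypothetical unit price from a crawled cost policy.
    """
    scanned = sum(column_bytes[name] for name in scanned_columns)
    return scanned / TIB * price_per_tib


# Hypothetical table with three columns of different sizes.
sizes = {"user": 2 * TIB, "name": 1 * TIB, "model_id": TIB // 2}

# Scanning two of the three columns is charged for 3 TiB, not 3.5 TiB.
print(estimate_scan_cost(sizes, ["user", "name"], price_per_tib=5.0))
```

Because only the scanned columns contribute to the charge, excluding columns from the analysis directly lowers the estimate.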

The big data analysis device 110 can check resources and tasks that affect costs based on the data structure of the big data provided from the big data provision device 130. In one embodiment, when the big data providing device 130 uses Google Big Query as a data source, the big data analysis device 110 can check the number of queries and the amount of data per query. In another embodiment, when the big data provision device 130 uses the Kubernetes platform, the big data analysis device 110 may check the number of pods, the live time of each pod, the resources of each pod, etc.
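For the Kubernetes case, the cost factors listed above can be combined in a simple way. The sketch below assumes, purely for illustration, that every pod uses the same resource type and is charged at a flat hourly rate; the function name and the rate are hypothetical.

```python
def estimate_pod_cost(num_pods, live_hours, hourly_rate_per_pod):
    """Cost grows with the number of pods, the time each pod stays alive,
    and the (hypothetical) hourly rate for the pod's resource type."""
    return num_pods * live_hours * hourly_rate_per_pod


# Four pods, each alive for 2.5 hours at a hypothetical rate of 0.5 per hour.
print(estimate_pod_cost(4, 2.5, 0.5))
```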

The big data analysis device 110 may obtain cost-related information by crawling the cost policies for each platform/data source of the big data provision device 130.

The big data analysis device 110 may obtain information about the analysis target of the big data providing device 130. In one embodiment, the data structure of Google Big Query may have a logical structure of project, data set, table, and job, and when using Google Big Query, the big data analysis device 110 can obtain a data set list, a table list, and table schema information (for example, number of columns, number of rows, etc.). One project may include multiple data sets, and a data set may be a table set with multiple tables. Each table stores actual data, and a job can refer to a command given to data, such as a query, data loading, or deletion.

In one embodiment, when the big data provision device 130 uses the Kubernetes platform, the big data analysis device 110 may obtain the number of containers, the number of pods, and the resource type of each pod. A pod is the most basic deployment unit in Kubernetes and can contain one or more containers, and disk volumes can be shared between containers deployed within the pod.

The big data analysis device 110 can query and retrieve big data from the big data provision device 130, and can perform a three-step data selection process of sampling, filtering, and randomizing tables to minimize query costs. The three-step data selection process will be described in detail later with reference to the attached drawings.

The big data analysis device 110 can perform a three-step data selection process and infer the total task execution time in the basic resource environment based on the selected big data. In one embodiment, the big data analysis device 110 calculates the maximum task execution time for one task in the basic resource environment set by the big data consumer and, based on this, can predict the total task execution time for the total data to be acquired from the platform that performs the task.

The big data analysis device 110 can compare the relative processing speed according to the resources required for big data collection and predict the cost expected to be charged for big data collection.

The big data analysis device 110 may store the final predicted cost-related information in a table. For example, when using the Kubernetes platform, the big data analysis device 110 can output the number of containers, the number of pods, the resource type of each pod, the total execution time of the task, and the total cost in a table.

FIG. 2 is a block diagram illustrating a big data analysis device 110 according to various embodiments.

Referring to FIG. 2, the big data analysis device 110 may include a cost management unit 210, a big data management unit 220, and a system control unit 230.

The cost management unit 210 may perform operations to predict the time and cost required for big data analysis.

The cost management unit 210 may be allocated resources for big data collection from the big data providing device 130. The big data analysis device 110 can manage the resources allocated from the big data providing device 130 and the time required to collect big data from the big data providing device 130. The cost for collecting big data may vary depending on the cost policy of the big data providing device 130. For example, when using Google Big Query as a data source, data is stored in column units, and charges can be made based on the size of the data stored and the column of data scanned when querying the data. For example, when using Kubernetes as a platform, the amount of resources required may vary depending on the configuration of containers, nodes, pods, etc., and the costs required may also vary accordingly. The cost management unit 210 can secure optimized costs and resources by predicting the time and cost for each job in advance based on the cost policy for each platform and data source of the big data provision device 130.

The cost management unit 210 can identify resources and tasks that affect costs based on the data structure of the big data provided from the big data providing device 130. In one embodiment, when the big data providing device 130 uses Google Big Query as a data source, the cost management unit 210 can check the number of queries and the amount of data per query. In another embodiment, when the big data provision device 130 uses the Kubernetes platform, the cost management unit 210 may check the number of pods, the live time of the pod, and the resources of the pod.

The cost management unit 210 may obtain cost-related information by crawling the cost policies for each platform/data source of the big data provision device 130.

The cost management unit 210 may obtain information about the analysis target of the big data providing device 130. In one embodiment, the data structure of Google Big Query may have a logical structure of project, data set, table, and job, and when using Google Big Query, the cost management unit 210 can obtain a data set list, a table list, and table schema information (for example, number of columns, number of rows, etc.).

In one embodiment, when the big data providing device 130 uses the Kubernetes platform, the cost management unit 210 may obtain the number of containers, the number of pods, and the resource type of each pod.

The cost management unit 210 can query and retrieve big data from the big data providing device 130, and can perform a three-stage data selection process of sampling, filtering, and randomizing tables to minimize query costs. The three-stage data selection process will be described in detail later with reference to the attached drawings.

The cost management unit 210 can perform a three-step data selection process to infer the total task execution time in the basic resource environment based on the selected big data. In one embodiment, the cost management unit 210 calculates the maximum task execution time for one task in the basic resource environment set by the big data consumer and, based on this, can predict the total task execution time for the total data to be acquired from the platform performing the task. In one embodiment, the cost management unit 210 can compare relative processing speeds according to resources required for big data collection and predict expected costs.

The cost management unit 210 can secure the final predicted cost-related information in a table. For example, when using the Kubernetes platform, the cost management unit 210 can output the number of containers, the number of pods, the resource type of each pod, the total execution time of the task, and the total cost in a table.

The cost management unit 210 can set conditions regarding big data analysis costs. Big data analysis cost conditions may include information related to the limited cost of big data analysis provided by the big data consumption device 120 and big data analysis time. The cost management unit 210 may receive priorities for predicting big data analysis costs from the big data consumption device 120. For example, based on the request of the big data consumption device 120, the big data analysis cost may be set to have priority over the big data analysis time.

In one embodiment, the cost management unit 210 may select a big data analysis method in which big data analysis cost is prioritized over big data analysis time based on a request from the big data consumption device 120.
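One way to picture this selection is as a filter over the predicted options followed by an ordering on the prioritized criterion. The sketch below is only an illustration under assumed inputs: the `Option` class, the option names, and all numbers are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    hours: float  # predicted analysis time
    cost: float   # predicted billing cost


def choose_option(options, max_cost=None, max_hours=None, prioritize="cost"):
    """Pick the best option within the limits set by the consumer.

    The prioritized criterion is compared first; the other one breaks ties.
    Returns None when no option satisfies the limits.
    """
    allowed = [
        o for o in options
        if (max_cost is None or o.cost <= max_cost)
        and (max_hours is None or o.hours <= max_hours)
    ]
    if not allowed:
        return None
    if prioritize == "cost":
        return min(allowed, key=lambda o: (o.cost, o.hours))
    return min(allowed, key=lambda o: (o.hours, o.cost))


options = [
    Option("type-1", hours=1.0, cost=10.0),
    Option("type-3", hours=0.5, cost=18.0),
    Option("type-5", hours=0.05, cost=60.0),
]
print(choose_option(options, max_cost=50.0, prioritize="cost").name)  # type-1
print(choose_option(options, max_cost=50.0, prioritize="time").name)  # type-3
```

With a cost limit of 50, the most expensive option is excluded first, and the priority then decides between the remaining two.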

The big data management unit 220 may include a data resource management unit 221, a data management unit 223, a data storage 225, and a data transmission unit 227.

The data resource management unit 221 can manage information about data resources. The data resource management unit 221 may collect data from the big data providing device 130. The data resource management unit 221 may request data from the big data providing device 130 and receive data transmitted from the big data providing device 130 upon request.

The data management unit 223 may review the data and, if modification to the data is required, may modify the data.

The data management unit 223 may secure data stored in the big data analysis device 110 and data provided from the big data providing device 130.

The data management unit 223 may perform data analysis tasks based on the big data analysis cost conditions set by the big data consumption device 120.

The data management unit 223 may store data provided from the big data providing device 130 and/or analyzed data in the data storage 225.

The data transmission unit 227 may transmit the analyzed data to the big data consumption device 120.

The system control unit 230 may manage and control a processing process for executing instructions between the cost management unit 210 and the big data management unit 220. The system control unit 230 can process commands between the big data analysis device 110 and the big data consumption device 120. The system control unit 230 may be at least one hardware processor.

According to one embodiment, the cost management unit 210, the big data management unit 220, the system control unit 230, the data resource management unit 221, the data management unit 223, the data storage 225, and the data transmission unit 227 may be program modules and may communicate with an external device or system. Program modules may be included in the big data provision device 100 in the form of an operating system, application program module, and other program modules.

FIG. 3 is a diagram for explaining the operation of the big data analysis device 110 according to various embodiments.

In this disclosure, for convenience of explanation, the case where the big data providing device 130 is Google Big Query or Kubernetes is taken as an example, but it will be apparent to those skilled in the art that the same technical idea can be applied to other devices providing big data.

Referring to FIG. 3, in operation 310, the big data analysis device may obtain a cost policy for each platform and/or data source of the big data providing device. Big data provision devices using Google Big Query, Kubernetes, etc. disclose cost policies, and big data analysis devices can obtain cost-related information by crawling the cost policies provided by big data provision devices.

In one embodiment, when using Google Big Query as a data source, the data structure of Big Query may have a logical structure of project, data set, table, and job, and the big data analysis device 110 can obtain a data set list, a table list, and table schema information (for example, number of columns, number of rows, etc.). One project may include multiple data sets, and a data set may be a table set with multiple tables. Tables store actual data, and jobs can refer to commands issued to data, such as queries, data loading, or deletion.
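The logical structure described here (project, data set, table) and the schema information collected from it can be modeled with a few plain data classes. This is a minimal sketch for illustration; the class and field names are assumptions and do not correspond to any client library.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    num_rows: int
    columns: dict  # column name -> column type


@dataclass
class Dataset:
    name: str
    tables: list = field(default_factory=list)


@dataclass
class Project:
    name: str
    datasets: list = field(default_factory=list)

    def summary(self):
        """Return (data set, table, number of columns, number of rows) per table."""
        return [
            (d.name, t.name, len(t.columns), t.num_rows)
            for d in self.datasets
            for t in d.tables
        ]


# Hypothetical project with one data set holding two tables.
project = Project("demo", [
    Dataset("dataset_1", [
        Table("users", 1000, {"user": "string", "name": "string"}),
        Table("devices", 50, {"model_id": "string", "name": "string", "count": "int"}),
    ]),
])
for row in project.summary():
    print(row)
```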

In operation 320, the big data analysis device may obtain information about the big data to be analyzed. The big data analysis device can identify resources and tasks that affect costs based on the data structure of the big data provided from the big data provision device 130. In one embodiment, when the big data providing device uses Google Big Query as a data source, the big data analysis device can check the number of queries and the amount of data per query. In another embodiment, when the big data providing device uses the Kubernetes platform, the big data analysis device can check the number of pods, live time of the pods, resources for each pod, etc.

In operation 330, the big data analysis device may acquire arbitrary data for big data analysis. The big data analysis device can query and retrieve big data from the big data providing device 130, and can perform a three-step data selection process of sampling, filtering, and randomizing the table to minimize query costs. The three-stage data selection process will be described in detail later with reference to the attached drawings.

In operation 340, the big data analysis device may predict the task execution time in the basic resource environment based on the information obtained in operations 310 to 330. In one embodiment, the big data analysis device calculates the maximum task execution time for one task in the basic resource environment set by the big data consumer and, based on this, can predict the total task execution time for the total data to be acquired from the platform performing the task. In one embodiment, the big data analysis device can determine whether to skip analysis of a column based on the column name/type, and calculate the total filtering time based on system performance. In one embodiment, the big data analysis device can infer the time required to perform the three-step data selection process depending on the amount of data, and can additionally consider network conditions.
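The prediction in operation 340 might be sketched as follows, assuming that tasks run one after another and that no task takes longer than the slowest one measured on the selected data. Both assumptions, as well as the function name and the numbers, are illustrative and are not stated in the disclosure.

```python
def predict_total_time(sample_task_seconds, num_tasks):
    """Upper-bound estimate of the total execution time.

    sample_task_seconds: execution times measured for individual tasks
                         on the selected (sampled) data.
    num_tasks: number of tasks needed to cover all data to be analyzed.
    """
    if not sample_task_seconds:
        raise ValueError("at least one measured task time is required")
    return max(sample_task_seconds) * num_tasks


# Three measured tasks; 40 tasks (e.g., columns not skipped) in total.
print(predict_total_time([12.0, 20.0, 15.0], num_tasks=40))  # 800.0
```

Using the maximum rather than the mean yields a conservative estimate, which suits a prediction that is later used to bound cost.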

In operation 350, the big data analysis device may compare relative processing speeds according to resources required to collect big data and calculate an expected cost. In one embodiment, since the cost of providing a node or pod in a big data platform such as Kubernetes may vary for each resource, the big data analysis device can predict performance and cost according to resource changes when performing a task. For example, if a big data analysis device performs a query using a relatively large memory, the processing speed can be higher than when querying using a small memory. Therefore, the big data analysis device can calculate the processing speed and cost for each resource provided by each platform, and accordingly select the resource with the desired cost or allow the consumer to choose.
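A simple way to derive the time and cost for other resource types from the time predicted in the basic resource environment is to scale by a relative speed factor. The sketch below assumes such factors and hourly prices are available; the resource names, factors, and prices are hypothetical.

```python
def scale_to_resources(base_seconds, resource_types):
    """Derive (seconds, cost) per resource type from a baseline time.

    resource_types: name -> (relative_speed, price_per_hour), where a
    relative_speed of 1.0 corresponds to the basic resource environment.
    """
    table = {}
    for name, (speed, price_per_hour) in resource_types.items():
        seconds = base_seconds / speed
        table[name] = (seconds, seconds * price_per_hour / 3600)
    return table


# Hypothetical resource types: faster resources cost more per hour.
resources = {"low": (1.0, 2.0), "mid": (2.0, 3.0), "high": (4.0, 12.0)}
for name, (seconds, cost) in scale_to_resources(3600, resources).items():
    print(name, seconds, cost)
```

In this made-up example the "mid" type is both faster and cheaper overall than "low", which shows why the comparison has to be made on total cost rather than on unit price.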

In operation 360, the big data analysis device can secure the final predicted cost-related information in a table. For example, when using the Kubernetes platform, the big data analysis device can output the number of containers, number of pods, resource type of each pod, total execution time of the task, and total cost in a table.

[Table 1] below is an example showing the processing speed for one job and the resulting cost when there are five resource types depending on the combination of CPU (central processing unit) processing speed, memory capacity, disk size, and network bandwidth.

Resource type	CPU	Memory	Disk size	Network bandwidth	…	Processing speed for one job	Cost
1	low	64GB	30GB	1Mbps	…	1H 10MIN	￦XXX
2	mid	64GB	30GB	1Mbps	…	1H	￦XXX
3	mid	128GB	100GB	10Mbps	…	30MIN	￦XXX
4	high	128GB	100GB	1Gbps	…	10MIN	￦XXX
5	high	256GB	1TB	1Gbps	…	3MIN	￦XXX

Figure 4 is a diagram for explaining the operation of a big data analysis device according to various embodiments. In one embodiment, the big data analysis device may be a device that is substantially the same as the big data analysis device 110 of FIG. 1 or the big data analysis device 110 of FIG. 2. Referring to FIG. 4, when the big data providing device uses Google Big Query as a data source, Big Query 410 may include a plurality of data sets (Dataset #1 and Dataset #N) 401 and 402. Each of the plurality of data sets may include a plurality of tables 420. Data sources other than BigQuery may also have a data structure similar to that of FIG. 4.

A big data analysis device can obtain a data set list and table and column list information for each data set from a plurality of data sets.

The big data analysis device may read data in column units 421 for big data analysis based on the table 420 included in the data set, check information about the table 420, for example, the number of rows and the number of columns, and determine whether to skip the analysis of a column based on the column name/type. In one embodiment, the big data analysis device can be set to skip a column without analyzing it if the column name includes a specific character. In another embodiment, the big data analysis device may be set to skip a column without analyzing it if the column type is "int". In another embodiment, according to a model inference technique, the big data analysis device may exclude analysis targets based on the combination of columns. For example, if user information is included in the analysis target but device information is not, the big data analysis device may infer, for the combination of the columns "user" and "name", that "name" is a user name and include it in the analysis target, and may infer, for the combination of "model_id" and "name", that "name" is device information and exclude it from the analysis target.
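The three kinds of skip decisions described above can be illustrated with a small rule-based function. This is only a sketch: the substring `"_id"` stands in for the unspecified "specific character", and a fixed rule stands in for the model inference technique, which the disclosure does not detail.

```python
def should_skip(column_name, column_type, table_columns,
                skip_substring="_id", skip_types=("int",)):
    """Return True when a column should be left out of the analysis."""
    # Rule 1: the column name contains a configured character sequence.
    if skip_substring in column_name:
        return True
    # Rule 2: the column type is on the skip list.
    if column_type.lower() in skip_types:
        return True
    # Rule 3 (stand-in for model inference): "name" next to "model_id",
    # without a "user" column, is treated as device information.
    if (column_name == "name" and "model_id" in table_columns
            and "user" not in table_columns):
        return True
    return False


print(should_skip("name", "string", ["user", "name"]))      # False
print(should_skip("name", "string", ["model_id", "name"]))  # True
print(should_skip("age", "INT", ["user", "age"]))           # True
```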

Figure 5 is a diagram for explaining the operation of a big data analysis device according to various embodiments. In one embodiment, the big data analysis device may be a device that is substantially the same as the big data analysis device 110 of FIG. 1 or the big data analysis device 110 of FIG. 2.

When a big data analysis device reads big data from a big data providing device for big data analysis, reading the entire data requires a lot of resources and time because the amount of data is very large, and the cost increases accordingly. Therefore, for efficient big data analysis, the big data analysis device queries and retrieves only some of the data, reading it randomly so that it represents the entire column data. According to various embodiments, a big data analysis device may perform a three-step query to analyze big data at minimal cost.

Referring to FIG. 5, in operation 501, the big data analysis device may perform table sampling.

The big data analysis device can perform table sampling based on the physical configuration of the data source of the big data providing device. For example, if the data of Google BigQuery consists of blocks of 1 GB, and the column data of the table to be analyzed is 3 GB, the column data can be stored in three 1 GB blocks. In this case, the big data analysis device can reduce costs by sampling and querying only one of the three blocks containing the table to be analyzed, that is, 1 GB of data.
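The block arithmetic in this example can be checked with a short sketch. The function names and the decimal definition of a gigabyte are assumptions made for illustration; only the 1 GB block size and the 3 GB column size come from the text.

```python
GB = 10**9  # assumed decimal gigabyte, for illustration only


def blocks_for(column_bytes, block_bytes):
    """Number of storage blocks needed to hold the column data."""
    return -(-column_bytes // block_bytes)  # integer ceiling division


def sampled_scan_bytes(column_bytes, block_bytes, blocks_to_read=1):
    """Bytes scanned when only some of the blocks are sampled."""
    total_blocks = blocks_for(column_bytes, block_bytes)
    read = min(blocks_to_read, total_blocks)
    return min(read * block_bytes, column_bytes)


column_bytes = 3 * GB
block_bytes = 1 * GB
scanned = sampled_scan_bytes(column_bytes, block_bytes)
print(blocks_for(column_bytes, block_bytes), "blocks in total")
print(scanned / column_bytes, "of the column data is scanned")
```

If billing is proportional to the amount of data read, sampling one block out of three would reduce the read cost to roughly one third in this example.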

In operation 502, the big data analysis device may perform filtering based on the distribution number so that data retrieved from the sampled table is representative. For example, when performing filtering based on frequency (ordering), the big data analysis device can filter only the top n data from the sampled table.

In operation 503, the big data analysis device may finally randomly select up to N pieces of data among the n pieces of data filtered based on the distribution number in order to minimize data analysis time and/or cost.
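Operations 501 to 503 can be sketched in memory as follows. This is an illustration under stated assumptions: blocks are modeled as plain Python lists of column values, "top n data" is read as the n most frequent values, and the function and parameter names are hypothetical.

```python
import random
from collections import Counter


def three_step_select(blocks, top_n, max_n, seed=None):
    """Pick up to max_n representative values from block-organized data."""
    rng = random.Random(seed)
    # Operation 501: table sampling, read a single storage block.
    sampled_block = rng.choice(blocks)
    # Operation 502: keep only the top_n most frequent values.
    frequent = [value for value, _ in Counter(sampled_block).most_common(top_n)]
    # Operation 503: randomly select up to max_n of the filtered values.
    return rng.sample(frequent, min(max_n, len(frequent)))


blocks = [
    ["a", "a", "b", "c", "c", "c", "d"],
    ["a", "b", "b", "b", "e", "e", "f"],
]
print(three_step_select(blocks, top_n=3, max_n=2, seed=0))
```

Each step shrinks the data handed to the next one, so the final analysis runs on at most `max_n` values regardless of the size of the source table.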

In one embodiment, a big data analysis device can predict the time and resources needed to analyze one column of data by performing a three-step query. In one embodiment, based on the time and resources required to analyze one column of data, the big data analysis device can predict the resources, task execution time, and cost required to actually perform the desired tasks, and can provide the prediction results in a table. For example, when using the Kubernetes platform, the big data analysis device can output the number of containers, number of pods, resource type of each pod, total execution time of the task, and total cost in a table. In one embodiment, the number of containers may vary depending on the environment in which pods can share resources. In one embodiment, the resource type of each pod may vary depending on CPU type, memory, disk space, network allocation, etc.
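A prediction table of this kind might be produced as in the sketch below. The text does not specify the prediction model, so the sketch assumes ideal linear scaling with the number of pods and a flat price per pod-hour; the `ResourceOption` fields, the prices, and the speed factors are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceOption:
    pods: int
    resource_type: str
    relative_speed: float      # assumed speed of one pod vs. the baseline pod
    price_per_pod_hour: float  # assumed price, arbitrary currency unit


def estimate(per_column_seconds, n_columns, option):
    """Return (total seconds, total cost) under ideal linear scaling."""
    baseline_seconds = per_column_seconds * n_columns
    seconds = baseline_seconds / (option.pods * option.relative_speed)
    cost = option.pods * option.price_per_pod_hour * seconds / 3600
    return seconds, cost


options = [
    ResourceOption(1, "small", 1.0, 1.0),
    ResourceOption(2, "small", 1.0, 1.0),
    ResourceOption(2, "large", 2.0, 3.0),
]
print("pods  type   seconds  cost")
for option in options:
    seconds, cost = estimate(per_column_seconds=36.0, n_columns=100, option=option)
    print(f"{option.pods:<5} {option.resource_type:<6} {seconds:<8.0f} {cost:.2f}")
```

Under these assumptions, adding identical pods shortens the execution time without changing the cost, while a faster but more expensive pod type trades cost for time. Real workloads rarely scale this cleanly, so the table should be read as a planning aid.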

Figure 6 is a diagram showing an example of a three-step query of a big data analysis device according to various embodiments.

Referring to FIG. 6, the big data analysis device can perform table sampling through the “TABLESAMPLE SYSTEM (50 PERCENT)” query (601).

The big data analysis device can filter XX of the tables sampled in 601 based on the frequency (order) through “Column> ORDER BY 2 DESC LIMIT XX” (602).

For the XX tables filtered at 602, the big data analysis device can perform randomization to select 20% of the tables that passed the filter at 602 (603).
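The three steps of FIG. 6 could be combined into a single query string as sketched below. Only the `TABLESAMPLE SYSTEM (50 PERCENT)` and `ORDER BY 2 DESC LIMIT XX` fragments come from the text. The rest is an assumption: the `GROUP BY`/`COUNT(*)` frequency count, the `RAND() < 0.2` filter standing in for the 20% randomization, and the table and column names are hypothetical, and the query has not been run against a real service here.

```python
def build_three_step_query(table, column, top_n,
                           sample_percent=50, keep_fraction=0.2):
    """Build a BigQuery-style SQL string for the three-step query.

    The identifiers are interpolated directly, so this sketch must only be
    used with trusted table and column names.
    """
    return (
        "SELECT value, freq FROM (\n"
        f"  SELECT {column} AS value, COUNT(*) AS freq\n"
        # Step 601: block-level table sampling.
        f"  FROM {table} TABLESAMPLE SYSTEM ({sample_percent} PERCENT)\n"
        "  GROUP BY 1\n"
        # Step 602: keep the top_n most frequent values.
        "  ORDER BY 2 DESC\n"
        f"  LIMIT {top_n}\n"
        ")\n"
        # Step 603 (assumed form): keep roughly keep_fraction of the rows.
        f"WHERE RAND() < {keep_fraction}"
    )


print(build_three_step_query("my_dataset.my_table", "device_model", top_n=100))
```

A row-wise `RAND()` filter keeps approximately, not exactly, the requested fraction; an exact count would need a different final step such as ordering by `RAND()` with a `LIMIT`.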

Referring to FIG. 7, the electronic device 700 may be a general-purpose computer system that operates as a big data analysis device 110. The electronic device 700 may include at least a portion of a processing unit 710, a communication unit 720, a memory 730, a storage 740, and a bus 790. Components of the electronic device 700, such as the processing unit 710, the communication unit 720, the memory 730, and the storage 740, may communicate with each other through the bus 790.

The processing unit 710 may be a semiconductor device that executes processing instructions stored in the memory 730 or storage 740. For example, the processing unit 710 may be at least one hardware processor. The processing unit 710 can process tasks required for the operation of the electronic device 700. The processing unit 710 may execute the code of the operation or step of the processing unit 710 described in the embodiments. The processing unit 710 may generate, store, and output information described in the embodiments, and may perform operations performed in the electronic device 700. In one embodiment, the processing unit 710 may perform the same or similar operations as the system control unit 230 of FIG. 2.

The communication unit 720 may be connected to the network 799. The communication unit 720 may receive data or information required for the operation of the electronic device 700, and may transmit data or information required for the operation of the electronic device 700. The communication unit 720 may transmit data to another device (e.g., the big data consuming device 120 and/or the big data providing device 130 of FIG. 1) through the network 799, and may receive data from another device (e.g., the big data consuming device 120 and/or the big data providing device 130 of FIG. 1) through the network 799.

Memory 730 and storage 740 may be various types of volatile or non-volatile storage media. For example, the memory 730 may include at least one of ROM 731 and RAM 732. Storage 740 may include a built-in storage medium such as RAM, flash memory, and hard disk, and may include a removable storage medium such as a memory card.

A function or operation of the electronic device 700 may be performed as the processing unit 710 executes at least one program module. Memory 730 and/or storage 740 may store at least one program module. At least one program module may be configured to be executed by the processing unit 710.

The electronic device 700 may further include a user interface (UI) input device 750 and a UI output device 760. The UI input device 750 may receive user input required for operation of the electronic device 700. The UI output device 760 may output information or data according to the operation of the electronic device 700.

So far, the present invention has been examined focusing on its preferred embodiments. A person skilled in the art to which the present invention pertains will understand that the present invention may be implemented in a modified form without departing from the essential characteristics of the present invention. Therefore, the disclosed embodiments should be considered from an illustrative rather than a restrictive perspective. The scope of the present invention is indicated in the claims rather than the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Claims

An electronic device comprising:

a memory; and

at least one processor,

wherein the at least one processor is configured to:

obtain, by crawling, a cost policy of a big data providing device that provides big data to be analyzed,

obtain information about the big data,

select a portion of the big data as data to be analyzed,

perform an analysis task on the selected data,

predict, based on the cost policy, the information about the big data, and an analysis result of the selected data, an analysis task execution time in a resource environment of a first condition, and

calculate, based on the analysis task execution time in the resource environment of the first condition, an analysis task execution time and a billing cost in a resource environment having a plurality of second conditions different from the first condition.
The electronic device of claim 1,

wherein the cost policy includes information on time, resources, and billing costs according to the amount of data required to analyze the big data.
The electronic device of claim 1,

wherein the big data includes a plurality of data sets, and each of the plurality of data sets includes a plurality of tables, and

wherein the information about the big data

includes a list of the data sets included in the big data, a table for each data set, and column list information within the table.
The electronic device of claim 3,

wherein the information about the big data

further includes the number of queries required to obtain the big data and the amount of data for each query.
The electronic device of claim 1,

wherein the big data includes a plurality of data sets, and each of the plurality of data sets includes a plurality of tables, and

wherein the at least one processor is configured to:

sample the plurality of tables constituting the big data,

filter the sampled tables, and

randomly select a predetermined number of tables from the filtered tables, thereby selecting a portion of the big data as the data to be analyzed.
The electronic device of claim 5,

wherein the at least one processor is configured to:

sample, based on a physical data unit of the big data providing device, a table of data corresponding to the data unit,

filter, based on a distribution number of a predetermined criterion, the tables that meet the predetermined criterion among the sampled tables, and

select the predetermined number of tables based on the time required to perform the analysis task.
The electronic device of claim 5,

wherein the at least one processor is configured to:

determine, based on the name and/or type of a column constituting the table, whether to perform analysis on the column, and

predict the analysis task execution time in the resource environment of the first condition by additionally considering whether analysis is performed on the column.
The electronic device of claim 1,

wherein the at least one processor is configured to:

calculate the analysis task execution time and the billing cost in the resource environment having the plurality of second conditions based on a combination of the processing speed of a central processing unit (CPU) of the electronic device, memory capacity, disk size, and network bandwidth.
A method of operating an electronic device, the method comprising:

crawling and obtaining a cost policy of a big data providing device that provides big data to be analyzed;

obtaining information about the big data;

selecting a portion of the big data as data to be analyzed;

performing an analysis task on the selected data;

predicting an analysis task execution time in a resource environment of a first condition based on the cost policy, the information about the big data, and an analysis result of the selected data; and

calculating an analysis task execution time and a billing cost in a resource environment having a plurality of second conditions different from the first condition, based on the analysis task execution time in the resource environment of the first condition.
The method of claim 9,

wherein the cost policy includes information on time, resources, and billing costs according to the amount of data required to analyze the big data.
The method of claim 9,

wherein the big data includes a plurality of data sets, and each of the plurality of data sets includes a plurality of tables, and

wherein the information about the big data

includes a list of the data sets included in the big data, a table for each data set, column list information in the table, the number of queries required to obtain the big data, and the amount of data for each query.
The method of claim 9,

wherein the big data includes a plurality of data sets, and each of the plurality of data sets includes a plurality of tables, and

wherein the selecting of a portion of the big data as the data to be analyzed comprises:

sampling the plurality of tables constituting the big data;

filtering the sampled tables; and

randomly selecting a predetermined number of tables from the filtered tables.
The method of claim 12,

wherein the sampling of the tables constituting the big data comprises

sampling, based on a physical data unit of the big data providing device, a table of data corresponding to the data unit,

wherein the filtering of the sampled tables comprises

filtering, based on a distribution number of a predetermined criterion, the tables that meet the predetermined criterion among the sampled tables, and

wherein the randomly selecting of the predetermined number of tables from the filtered tables comprises

selecting the predetermined number of tables based on the time required to perform the analysis task.
The method of claim 12, further comprising:

determining, based on the name and/or type of a column constituting the table, whether to perform analysis on the column,

wherein the predicting of the analysis task execution time in the resource environment of the first condition further considers whether analysis is performed on the column.
The method of claim 9,

wherein the calculating of the analysis task execution time and the billing cost in the resource environment having the plurality of second conditions comprises

calculating the analysis task execution time and the billing cost in the plurality of resource environments based on a combination of the processing speed of a central processing unit (CPU) of the electronic device, memory capacity, disk size, and network bandwidth.