TW201909002A

TW201909002A - Data set transaction and computing resource integration method

Info

Publication number: TW201909002A
Application number: TW106124238A
Authority: TW
Inventors: 丁詠倫; 張峯偉
Original assignee: 中華電信股份有限公司
Priority date: 2017-07-20
Filing date: 2017-07-20
Publication date: 2019-03-01
Also published as: TWI629604B

Abstract

The present invention discloses a data set transaction and computing resource integration method built on a Hadoop platform. Building on the conventional resource rental model, the invention additionally provides a data set transaction mechanism, so that a user can not only rent resources in the usual way but also list data sets the user owns for sale on the data marketplace of the invention. Combined with the data dictionary of the invention, through which other users can query and purchase listed data sets, consumers can conveniently search for the data sets they need and obtain them by paid purchase.

Description

Data set transaction and computing resource integration method

The present invention relates to a data set transaction and computing resource integration method, and in particular to one in which a user who purchases a data set can connect it seamlessly to the platform resources the user has rented and run computations on it.

At present, the rise of big data analysis, with its combination and cross-analysis of data from different sources, has created supply and demand for data sets. However, current usage models for the distributed Hadoop platform (hereinafter simply "the Hadoop platform") are limited to users renting storage space and computing resources; no platform provides a procedural mechanism for data set transactions.

With the rise of big data analysis, more and more governments, enterprises, and organizations have set up open data platforms for public use, hoping that trend analysis of the data will support decision-making. In practice, however, analyzing open data comes with obstacles: the content is often messy and of uneven quality, and it is scattered across countless servers, which makes it hard to download.

Besides being unable to find the data sets they want efficiently on open data platforms, analysts must also handle tedious steps such as data ingestion, format unification, data cleaning, and data validation, which often delays the overall analysis and reduces efficiency.

Moreover, because the Hadoop platform currently offers no mechanism for selling and purchasing data sets, a carefully prepared data set can, apart from the preparer's own use, seemingly only be contributed for free. There is no trading channel through which those who prepare data sets can earn revenue by selling them.

In view of the above problems of the prior art, one object of the data set transaction and computing resource integration method of the present invention is to provide, on the basis of the conventional rental model of the Hadoop platform, an additional mechanism for users to trade data sets. In addition to renting storage space and computing resources, users can list the data sets they have prepared for sale on the data marketplace of the invention. This gives data consumers a purchasing channel for obtaining the data sets they need conveniently and legally, and gives data preparers a channel for earning revenue.

A second object of the present invention is to make it convenient to integrate data with the analysis environment. Using an improved access control design together with the characteristics of the Hadoop platform, the invention builds a buy-and-use environment in which a user who purchases data can immediately apply the computing resources rented on the platform to analyze it. This removes the inconvenience of moving data and greatly reduces the cost of data transfer and of building an analysis environment.

Another object of the present invention is to let data consumers query, browse, and purchase the data sets they need more efficiently through the disclosed data dictionary mechanism. The data dictionary classifies data sets systematically using tagging information and combines this with a popularity analysis optimization procedure to provide an accurate recommendation mechanism, so that data consumers can easily obtain the data sets they want. Data analysts can then build on well-prepared data sources to carry out in-depth analysis and exploration more conveniently and quickly, creating more value from big data.

The data set transaction and computing resource integration method of the present invention comprises the following steps: allocating storage space and computing resources on the Hadoop platform according to a resource rental request submitted by a user; running a data set trading platform on the Hadoop platform so that the user corresponding to the resource rental request can conduct data set transactions; and providing an authorization management agent module on the Hadoop platform, which integrates permissions according to the data set transaction request submitted by the user so that the user can access the traded data set directly on the Hadoop platform.

As described above, the data set transaction and computing resource integration method of the present invention may have one or more of the following advantages:

1. On top of the Hadoop platform, the invention provides a novel way to trade data sets. Besides renting the Hadoop platform for data storage and computation, a user can sell raw data for other users to analyze. The data provider earns revenue from this, and the revenue can offset the rental fees.

2. The invention uses a data dictionary to give users an efficient way to explore data. Combining real-time recommendation analysis with tag-based classified browsing, it helps users quickly find suitable data on the data marketplace and reduces the time cost of searching.

3. The invention provides a data set trading model on top of the Hadoop platform, improving the convenience, usability, updatability, and readability of source data sets so that data consumers can integrate and use them.

4. The invention provides a seamlessly integrated environment. After purchasing a data set and having the purchase integrated by the authorization mechanism of the invention, a data consumer can use the data set immediately and connect it directly to the resources and space rented on the Hadoop platform for analysis, saving the time and cost of setting up and maintaining a separate analysis environment.

S110~S140, S210~S224, S410~S430: steps

FIG. 1 is a flowchart of the data set transaction and computing resource integration method of the present invention.

FIG. 2 is another flowchart of the data set transaction and computing resource integration method of the present invention.

FIG. 3 is yet another flowchart of the data set transaction and computing resource integration method of the present invention.

Specific embodiments are described below to illustrate how the invention may be implemented; they are not intended to limit the scope of protection sought.

Referring to FIG. 1, a flowchart of the data set transaction and computing resource integration method of the present invention, the method comprises the following steps. S110: allocate storage space and computing resources on the Hadoop platform according to a resource rental request submitted by a user. S120: run a data set trading platform on the Hadoop platform so that the user corresponding to the resource rental request can conduct data set transactions. S130: provide an authorization management agent module on the Hadoop platform, which integrates permissions according to the data set transaction request submitted by the user so that the user can access the traded data set directly on the Hadoop platform. S140: calculate fees according to the user's data transactions and resource rental request.

Each of the above steps is described in detail below.

First, in step S110, renting platform resources, the platform allocates storage space on the Hadoop platform and configures dynamic resource allocation, namely the number of virtual cores (Vcores) and the memory size, according to the resources the user has requested. Step S120 provides a trading mechanism for listing, selling, and purchasing data sets and handles data set transactions between users. Its composition is shown in FIG. 2. Depending on user behavior, the main structure divides into the S210 listing procedure and the S220 purchase procedure, which together comprise six sub-procedures: S211, data set tagging and information registration; S212, data set pricing; S221, querying the data dictionary and purchasing data; S222, recording data set purchase information; S223, feeding the popularity analysis model back to the data dictionary in real time; and S224, setting data permissions.

When a user wants to sell a prepared data set on the data marketplace, the S210 listing procedure is carried out. First, the S211 data set tagging and information registration procedure tags and registers the data set according to the nature of its data. The registered information includes the necessary items such as attribute tags, data categories, and the format of the data set; attributes and categories are multi-valued. Registration records this information in the metadata file of the data marketplace.
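The registration step can be pictured with a small sketch. The record layout, the field names (`attribute_tags`, `categories`, `file_format`), and the JSON serialization are all assumptions made for illustration; the source only states that attribute tags, categories, and format are registered in a marketplace metadata file and that attributes and categories are multi-valued.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DatasetListing:
    """Hypothetical metadata record for one listed data set."""
    name: str
    file_format: str
    attribute_tags: list[str] = field(default_factory=list)  # multi-valued
    categories: list[str] = field(default_factory=list)      # multi-valued


def register_listing(metadata: dict, listing: DatasetListing) -> dict:
    """Return a copy of the marketplace metadata with the listing added."""
    updated = dict(metadata)
    updated[listing.name] = asdict(listing)
    return updated


listing = DatasetListing(
    name="traffic_2017",
    file_format="parquet",
    attribute_tags=["traffic", "hourly"],
    categories=["transportation", "government"],
)
metadata = register_listing({}, listing)
# Serialized form that could be written to a metadata file.
metadata_json = json.dumps(metadata, sort_keys=True)
```

Keeping tags and categories as lists lets one data set appear under several browse paths at once, which is what the multi-valued requirement implies.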

A prepared data set that is listed through the information registration in procedure S211 must conform to the data set specification of the Hadoop platform and use a file format that is common within the Hadoop platform. This makes data set formats consistent and provides purchasers with convenience, usability, updatability, and readability when using the data set.

The listing user then proceeds to step S212, data set pricing, which provides a settings interface through which the user prices the data set being sold.

In the present invention, purchasing a data set is defined as a time-limited license to use it, not a one-time buyout. The price set by the user is the price for a data consumer to be licensed to use the data set for a specific unit of time. This unit, called the data license period, is a minimum, indivisible time interval. A data consumer must purchase in integer multiples of this unit, and the platform likewise grants licenses in integer multiples of it. A data consumer may therefore keep using a purchased data set throughout the licensed period, and must purchase it again once the license expires. After the S210 listing procedure is complete, the data set can be sold on the data marketplace.
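A minimal sketch of the integer-multiple licensing rule follows. The function name `license_quote`, the use of days as the time unit, and the sample prices are assumptions; the source specifies only that purchases are made in whole multiples of an indivisible license period.

```python
from datetime import date, timedelta


def license_quote(unit_price: float, unit_days: int, units: int, start: date):
    """Price and expiry for a license bought in whole time units.

    unit_days is the length of one indivisible license period (assumed
    here to be measured in days); units must be a positive integer.
    """
    if not isinstance(units, int) or units < 1:
        raise ValueError("units must be a positive integer")
    total = unit_price * units
    expires = start + timedelta(days=unit_days * units)
    return total, expires


# Three 30-day periods starting on the filing date used as a sample.
total, expires = license_quote(
    unit_price=100.0, unit_days=30, units=3, start=date(2017, 7, 20)
)
```

Rejecting fractional units at the point of purchase keeps the later permission rule simple, because every license then ends on a period boundary.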

In the purchase procedure, a data consumer obtains the required data set by paying for it in step S221, querying the data dictionary and purchasing data. The data dictionary referred to in S221 comprises a purchase interface and a query interface, which together serve as the data consumer's entry point for querying and purchasing data sets.

The query interface of the data dictionary offers systematic classified browsing based on the tag attributes of the data, and also supports fast filtering by keyword or by keyword combined with tag attributes. The internal query engine of the data dictionary sorts query results according to the data model produced by popularity analysis, giving the user a list of results likely to be of interest, with the aim of reducing the time spent on searching and exploring.
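The query behavior described here, filtering by keyword and tag and then ordering by popularity, might look like the sketch below. The catalog layout, the `search` function, and the popularity scores are hypothetical; the source does not specify how the query engine is implemented.

```python
def search(catalog, popularity, keyword=None, tags=()):
    """Filter listings by keyword and required tags, then rank by popularity.

    catalog: list of dicts with "name" and "tags" keys (assumed layout).
    popularity: dict mapping data set name to a score from the
    popularity model; unknown data sets score 0.0.
    """
    keyword_lc = keyword.lower() if keyword else None
    required = set(tags)
    hits = []
    for item in catalog:
        if keyword_lc and keyword_lc not in item["name"].lower():
            continue
        if not required <= set(item["tags"]):
            continue
        hits.append(item)
    return sorted(hits, key=lambda d: popularity.get(d["name"], 0.0), reverse=True)


catalog = [
    {"name": "traffic_2017", "tags": ["traffic", "hourly"]},
    {"name": "traffic_accidents", "tags": ["traffic", "safety"]},
    {"name": "weather_daily", "tags": ["weather"]},
]
popularity = {"traffic_2017": 0.2, "traffic_accidents": 0.7, "weather_daily": 0.1}

by_keyword = [d["name"] for d in search(catalog, popularity, keyword="traffic")]
by_tag = [d["name"] for d in search(catalog, popularity, tags=["hourly"])]
```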

To make query results more informative, the information shown in the data dictionary includes the data set's field descriptions, data format, update frequency, number of records, file size, sale price, and the information supplied by the lister when tagging the data set.

The data dictionary is characterized by tag-based classified queries and keyword queries combined with popularity, which together give users a more efficient and informative data search interface. It helps users quickly find suitable data sets so that, after querying and evaluating, they can purchase the data sets they need.

After a purchase takes place, procedure S222, recording data set purchase information, is executed. This procedure collects information about the data sets that users purchase on the data marketplace. The information includes at least the data set name, the data set's tag attributes, the purchase date, the number of purchases, the purchase unit price, the storage space used by the data set, and a snapshot of the data set's entry in the data dictionary at that moment.

To make the query results of the data dictionary more accurate, the invention uses an analysis architecture that updates the data model in real time. Through procedure S223, feeding the popularity analysis model back to the data dictionary in real time, the query engine of the data dictionary is updated immediately with a data model generated from the latest purchase data. This avoids the inaccuracy that arises when a stale model no longer reflects current data.

The procedure works from the data set purchase information recorded in S222. The records are passed through the real-time streaming component of the Hadoop platform to procedure S223 for popularity analysis. The analysis consists of one or more data analysis algorithms, which can be adjusted, combined, or swapped as needed, and produces a popularity recommendation model from the latest sales information for data sets on the data marketplace and from users' purchase histories. The model is then fed back through the real-time streaming component of the Hadoop platform to the query engine of the data dictionary, which updates its model.
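As a rough illustration of an incrementally updated popularity model, the sketch below scores each data set by its share of purchased units. This share-of-purchases scoring is a stand-in chosen for simplicity, not the algorithm of the source, which leaves the analysis algorithms open; the class name and record fields are likewise assumptions, and the streaming component is replaced by a plain loop.

```python
from collections import Counter


class PopularityModel:
    """Toy popularity model updated one purchase record at a time."""

    def __init__(self):
        self._units = Counter()

    def update(self, purchase: dict) -> None:
        """Fold one purchase record (assumed keys: dataset, units) into the model."""
        self._units[purchase["dataset"]] += purchase.get("units", 1)

    def scores(self) -> dict:
        """Share of all purchased units per data set, in the range 0 to 1."""
        total = sum(self._units.values())
        if total == 0:
            return {}
        return {name: n / total for name, n in self._units.items()}


model = PopularityModel()
# In the described system these records would arrive over a streaming
# component; here they are simply iterated.
for record in [
    {"dataset": "a", "units": 2},
    {"dataset": "b", "units": 1},
    {"dataset": "a", "units": 1},
]:
    model.update(record)
scores = model.scores()
```

Because each record only increments a counter, the scores can be refreshed after every purchase, which matches the goal of keeping the query engine's model current.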

Next, S224, setting data permissions, is executed. This procedure is a design improvement that addresses shortcomings in the existing access control of the Hadoop platform, which cannot combine license time limits with permission settings for multiple groups of users.

Procedure S224 manages access permissions for data sets and produces a permission rule file used to configure the rules. Based on the user account and the number of time units purchased, it grants the user read permission on the purchased data set and sets the start and end of the licensed period. Once the date passes the end of the licensed period, the user can no longer use the data set.
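One way to picture the permission rule file is as a list of per-purchase rules. The JSON layout and the field names (`user`, `dataset`, `access`, `valid_from`, `valid_until`) below are assumptions for illustration; the source states only that the rule is derived from the user account and the purchased time units and carries a start and end date.

```python
import json
from datetime import date, timedelta


def build_rule(user: str, dataset: str, start: date, unit_days: int, units: int) -> dict:
    """Build one read-only rule covering the purchased license period."""
    end = start + timedelta(days=unit_days * units)
    return {
        "user": user,
        "dataset": dataset,
        "access": "read",
        "valid_from": start.isoformat(),
        "valid_until": end.isoformat(),
    }


rule = build_rule("alice", "traffic_2017", date(2017, 7, 20), unit_days=30, units=2)
# Hypothetical on-disk form of the permission rule file.
rule_file = json.dumps({"rules": [rule]}, indent=2)
```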

The improved permission management is achieved by installing an agent on every host of the Hadoop platform (a Hadoop platform has multiple hosts, and each must run the agent in order to enforce permissions). The agents control access using the permission rule file, which records each set of user authentication key, licensed data set items, and license validity dates. The permission rule file is delivered through an encrypted mechanism to the agent on every host of the Hadoop platform, and each agent uses it to authenticate users and grant access to the corresponding data sets.
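The agent's decision can be sketched as a lookup against the rule list. The rule fields, the `is_access_allowed` function, and the sample key are hypothetical, and the encrypted delivery of the rule file is not modeled here.

```python
import hmac
from datetime import date


def is_access_allowed(rules, user, key, dataset, today):
    """Return True if some rule authenticates the user for this data set today.

    rules: list of dicts with assumed keys user, auth_key, dataset,
    valid_from, valid_until (ISO dates).
    """
    for rule in rules:
        if rule["user"] != user or rule["dataset"] != dataset:
            continue
        # Constant-time comparison of the presented key with the stored one.
        if not hmac.compare_digest(rule["auth_key"], key):
            continue
        start = date.fromisoformat(rule["valid_from"])
        end = date.fromisoformat(rule["valid_until"])
        if start <= today <= end:
            return True
    return False


rules = [{
    "user": "alice",
    "auth_key": "k-alice-123",
    "dataset": "traffic_2017",
    "valid_from": "2017-07-20",
    "valid_until": "2017-09-18",
}]
within = is_access_allowed(rules, "alice", "k-alice-123", "traffic_2017", date(2017, 8, 1))
expired = is_access_allowed(rules, "alice", "k-alice-123", "traffic_2017", date(2017, 10, 1))
bad_key = is_access_allowed(rules, "alice", "wrong", "traffic_2017", date(2017, 8, 1))
```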

To guard against tampering with the permission file, the design uses a confirmation mechanism in which multiple parties compare the SHA256 code of the permission file. When an agent receives an access request, it first connects to the other agents to compare permission files: it computes the SHA256 code of its own permission file and compares it with the SHA256 codes of the permission files on the other hosts. If all codes match, the integrity check passes, indicating that the file has not been modified. If the code differs from that of any host, a majority-vote synchronization mechanism is triggered, which synchronizes the permission file whose SHA256 code is held by the majority to the other hosts to repair the permissions.
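The comparison and majority-vote repair can be sketched with the standard `hashlib` module. The in-memory `files` dictionary stands in for the per-host permission files, and requiring a strict majority (more than half of the hosts) is an assumption; the source does not say how ties are handled.

```python
import hashlib
from collections import Counter


def sha256_of(content: bytes) -> str:
    """Hex SHA-256 digest of a permission file's bytes."""
    return hashlib.sha256(content).hexdigest()


def verify_and_repair(files: dict[str, bytes]):
    """Compare digests across hosts; on mismatch, sync the majority copy.

    Returns (intact, files_after), where intact is True when every host
    already held an identical file.
    """
    digests = {host: sha256_of(content) for host, content in files.items()}
    if len(set(digests.values())) == 1:
        return True, dict(files)
    majority_digest, count = Counter(digests.values()).most_common(1)[0]
    if count <= len(files) / 2:
        # Assumption: without a strict majority there is no safe copy to trust.
        raise RuntimeError("no majority copy of the permission file")
    good = next(c for h, c in files.items() if digests[h] == majority_digest)
    return False, {host: good for host in files}


files = {"host1": b"rules-v1", "host2": b"rules-v1", "host3": b"tampered"}
intact, repaired = verify_and_repair(files)
```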

Referring to FIG. 1, procedure S130 uses the improved access control design and the characteristics of the Hadoop platform to build a convenient buy-and-use environment. After a user pays for a data set on the data marketplace, the permission rule file produced by procedure S224 is delivered to the agent on every host of the Hadoop platform to configure the permissions. The agent places a control layer in front of every Hadoop computing service interface and automatically applies the permission settings for the corresponding data set. Through these settings, together with the usage time limit in the permission rule file, it controls read access to the data sets in the data warehouse and to their underlying files.

Once the permission is activated, the user can access and read the data sets in the data warehouse and their underlying files. The purchaser can use the purchased data set seamlessly, without moving any data, and every computing service interface can likewise access the data sets or their underlying files in the data warehouse. Users can thus use the computing resources they have rented together with computing services on the Hadoop platform, such as Spark and Hive, to analyze, process, and use the data set directly on the platform without setting up a separate analysis environment, which achieves the integration of data sets and computing resources.

In each billing cycle, the platform (the Hadoop platform) carries out the S140 consolidated billing procedure, which combines the user's fees for renting resources on the platform with the user's income and expenses on the data marketplace to produce the user's bill for that billing cycle. Its composition is shown in FIG. 3. The S140 consolidated billing procedure comprises S410, the platform resource billing procedure; S420, the data marketplace billing procedure; and S430, producing the user's income and expense statement from the billing items. The details are explained below.

For storage resources, the storage used by a user is handled by S411, the Hadoop Distributed File System (HDFS) rented space collection procedure. This procedure collects information about the user's rented HDFS space, including usage figures such as the size of the rented space, the space occupied by the original data, the space occupied by data replicas, the number of replicas, and the remaining available space, then consolidates the data and writes it to the platform's billing database.

For computing resources, after a user runs data analysis on the Hadoop platform, the resource usage is recorded in log files by the platform's resource management system. Procedures S412, virtual core (Vcore) computing resource usage statistics, and S413, memory computing resource usage statistics, analyze the resource usage log files generated for each user to obtain the user's resource usage and usage time on the platform.

Usage is quantified as follows: Vcore usage is the number of Vcores used in the job multiplied by the actual execution time, and memory usage is the memory size allocated to the job multiplied by the actual execution time. The collected usage is consolidated and written to the platform's billing database.
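The two products can be written out directly. The units chosen here (seconds and megabytes) and the function name `job_usage` are assumptions; the source gives only the rule of multiplying the allocated amount by the actual execution time.

```python
def job_usage(vcores: int, memory_mb: int, runtime_seconds: int) -> dict:
    """Quantify one job as resource amount multiplied by execution time."""
    return {
        "vcore_seconds": vcores * runtime_seconds,
        "mb_seconds": memory_mb * runtime_seconds,
    }


# A job holding 4 Vcores and 8192 MB for 10 minutes.
usage = job_usage(vcores=4, memory_mb=8192, runtime_seconds=600)
```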

In the implementation of the usage statistics in procedures S412 and S413, the invention adopts a micro-batch analysis strategy to reduce the execution time of analyzing the large volume of resource usage log files within a billing cycle. The billing cycle is divided into multiple usage collection cycles. In each collection cycle, the log records that fall within that cycle are analyzed and summarized, and a file read pointer records the current position after each analysis to speed up the next read.
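The read-pointer idea can be sketched as a function that resumes from a saved byte offset and returns the new one. The log format (`user,vcores,seconds` per line) and the function name are hypothetical; real resource manager logs are structured differently.

```python
import os
import tempfile
from collections import defaultdict


def analyze_increment(log_path: str, offset: int, totals) -> int:
    """Aggregate Vcore-seconds from log lines after offset; return new offset.

    Only complete lines are consumed, so a line still being written is
    picked up in the next collection cycle.
    """
    with open(log_path, "rb") as log:
        log.seek(offset)
        for raw in log:
            if not raw.endswith(b"\n"):
                break
            user, vcores, seconds = raw.decode().strip().split(",")
            totals[user] += int(vcores) * int(seconds)
            offset += len(raw)
    return offset


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "usage.log")
    totals = defaultdict(int)
    with open(path, "w", newline="\n") as f:
        f.write("alice,4,600\nbob,2,100\n")
    offset = analyze_increment(path, 0, totals)       # first collection cycle
    with open(path, "a", newline="\n") as f:
        f.write("alice,1,50\n")
    offset = analyze_increment(path, offset, totals)  # second cycle reads only new data
    result = dict(totals)
```

Each cycle touches only the bytes appended since the previous one, so the cost of a pass depends on new log volume rather than on the size of the whole file.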

This design uses and releases analysis resources efficiently and spreads out the time cost of a single usage analysis, so that the Vcore and memory usage of each user on the platform can be tallied more efficiently.

The information collected above is passed to S414, the rented resource billing procedure, which calculates the user's resource rental fees in each configured billing cycle. It takes the usage obtained from S411 (HDFS rented space collection), S412 (Vcore usage statistics), and S413 (memory usage statistics), applies the rental unit prices to calculate each fee, and sums the three as the user's platform resource rental fee for the billing cycle.
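A sketch of this three-item fee calculation follows. The usage keys, units, and unit prices are invented for illustration; the source says only that each usage figure is combined with its rental unit price and that the three fees are summed.

```python
def resource_charge(usage: dict, rates: dict):
    """Multiply each usage figure by its unit price and total the three fees."""
    keys = ("hdfs_gb", "vcore_hours", "memory_gb_hours")  # assumed billing items
    items = {key: usage[key] * rates[key] for key in keys}
    return items, sum(items.values())


usage = {"hdfs_gb": 500, "vcore_hours": 120, "memory_gb_hours": 480}
rates = {"hdfs_gb": 2, "vcore_hours": 5, "memory_gb_hours": 1}
items, rental_fee = resource_charge(usage, rates)
```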

The accounts for the data marketplace are handled by S420, the data marketplace billing procedure. First, S421, the data marketplace transaction record collection procedure, uses the user purchase records stored by procedure S222 to calculate, for each user, the total amount the user spent on purchases in the current billing cycle and, for each data set, the total sales of that data set in the current period. It passes the summarized transaction records to S422, the data marketplace billing procedure, for account calculation.

Next, S422, the data marketplace billing procedure, calculates the data marketplace accounts for the billing cycle configured on the platform. The resulting accounts have two parts. The first is the purchase price payable by the user as a data consumer, obtained by multiplying the number of units purchased of each data set by that data set's unit sale price. The second is the revenue earned by the user as a data provider: the platform takes a fixed percentage of the data set's total sales as a listing fee, and the remainder is the data provider's revenue from the data marketplace for the billing cycle.
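The two account items can be sketched as below, using `Decimal` to avoid floating-point rounding in monetary amounts. The record fields, the function names, and the 15 percent commission are assumptions; the source states only that the platform deducts a fixed percentage of total sales as a listing fee.

```python
from decimal import Decimal


def buyer_charge(purchases) -> Decimal:
    """Sum of purchased units multiplied by unit sale price for each purchase."""
    return sum(
        (Decimal(p["units"]) * Decimal(p["unit_price"]) for p in purchases),
        Decimal(0),
    )


def seller_revenue(gross_sales: Decimal, commission_rate: Decimal) -> Decimal:
    """Total sales minus the platform's percentage listing fee."""
    listing_fee = gross_sales * commission_rate
    return gross_sales - listing_fee


purchases = [
    {"units": 3, "unit_price": "100"},
    {"units": 1, "unit_price": "250"},
]
charge = buyer_charge(purchases)
revenue = seller_revenue(Decimal("1000"), Decimal("0.15"))  # assumed 15% fee
```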

Finally, S430, the consolidated billing procedure, settles each platform user's bill for the billing cycle configured on the platform. It combines the resource rental fees provided by S414, rented resource billing, with the data set purchase and sales accounts provided by S422, data marketplace billing, to produce the user's bill for the billing cycle.
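The final settlement might be sketched as a single statement combining the three figures. Netting sales revenue against fees is an assumption based on the stated advantage that revenue can offset rental fees; the statement layout and field names are hypothetical.

```python
def statement(resource_fee: int, purchase_fee: int, sales_revenue: int) -> dict:
    """Combine rental fees, purchases, and sales revenue into one bill."""
    return {
        "resource_fee": resource_fee,
        "purchase_fee": purchase_fee,
        "sales_revenue": sales_revenue,
        # A negative value would mean the platform owes the user.
        "net_payable": resource_fee + purchase_fee - sales_revenue,
    }


bill = statement(resource_fee=2080, purchase_fee=550, sales_revenue=850)
```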

An application example is given next to describe concretely how the data set transaction and computing resource integration method of the present invention operates in practice.

Built on the Hadoop platform, the invention lets users rent resources and establishes a channel for data set transactions between users. Users can use resources on the Hadoop platform to process data sets and may choose to list the resulting data sets for sale on the data marketplace provided by the platform.

Common components of a Hadoop platform include the distributed file system (HDFS, Hadoop Distributed File System), the resource management system (YARN), the distributed data warehouse (Hive), and in-memory analytics (Spark). HDFS stores file data in a distributed manner. YARN handles resource management and scheduling for the Hadoop cluster. Hive and Spark can be used for data queries, data analysis, and data exploration, and the analysis queries they run have their resources allocated and scheduled by YARN.

FIG. 1 is the flowchart of the invention. Through procedure S110, the user rents resources, including HDFS space and platform computing resources, in which prepared or analyzed data can be stored. According to the resources the user requested, the platform allocates storage space on the Hadoop platform and configures dynamic resource allocation for the number of Vcores and the memory size. Data marketplace transactions are then conducted in procedure S120.

After a data set transaction is completed, procedure S130 integrates the platform computing resources with the purchased data set. The present invention provides buy-and-use integration and convenience: as soon as a user purchases a data set, authorization to use it is granted through the authorization mechanism within the distributed Hadoop platform. Because the data sets offered in the data market have already been prepared and imported into the distributed data warehouse database, they present a uniform data source interface. After purchasing a data set and being authorized to use it, the user can run analyses with the analysis tools provided on the Hadoop platform that connect to the distributed data warehouse database interface or the distributed file system interface, for example Hive and Spark. The analysis results can also be stored in the rented HDFS space, which saves the time and cost of setting up an environment and loading data.

Procedure S140, integrated billing, charges the user according to the user's transaction results in the data market and the user's usage of storage space and computing resources on the Hadoop platform. A practical use case is described below.

In the following embodiment, users who rent resources on the Hadoop platform are defined, according to their roles on the supply and demand sides, as "data providing users" and "data consumer users".

In procedure S110, the "data providing user" rents distributed file system (HDFS) storage space from the Hadoop platform to store data, and loads that data into the data warehouse, for example into a table of the distributed data warehouse database (Hive). The loaded data is called a "data set". The "data providing user" can apply in the data market to list the data sets that it has loaded into the data warehouse.

FIG. 2 is a flowchart of the data market transaction of the present invention, which is divided into the listing procedure S210 and the purchase procedure S220. The data set listing procedure for the "data providing user" and the data set purchase procedure for the "data consumer user" are described in turn below.

An example of the data set listing procedure for the "data providing user" is as follows. The data providing user has a prepared data set of the stop locations and ridership of Taipei City buses in 2015. Through S211, the data set labeling and information registration procedure, the user labels and classifies the data set by attribute, tagging it with labels such as traffic, bus, Taipei City, ridership, and location, and registers through the interface basic information such as the data set name, original data format, data set field definitions, and a sample data file.

Within S211 the data set to be listed is converted into the data warehouse storage format provided by the platform. Next, through the user interface of S212, the data set pricing procedure, the user sets the unit price per data usage authorization time unit for the data set. In this embodiment the authorization time unit is set to 1 month, and the listed price of 1000 yuan is the fee for using the data set for 1 month. The Hadoop platform writes the data set unit price set by the "data providing user" into the database. Once listed, the data set is displayed in the data dictionary for sale.

The data set purchase procedure for the "data consumer user" is as follows. In procedure S221, querying the data dictionary and purchasing data, the user can browse by category and search through the query interface of the data dictionary to select the data set to purchase. Based on its labels, the data set in this example can be found under the traffic and bus categories. It can also be found by keyword query or combined keyword query. For example, a keyword search for "bus" returns results ranked by the popularity and relevance fed back by procedure S223: the more relevant and more popular a data set is, the higher it appears, so that data consumer users can find data efficiently.
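As an illustrative sketch only, the keyword search with relevance and popularity ordering described above could look like the following Python. The `DatasetEntry` type, the relevance weights, and the `popularity` field are assumptions made for this example; the specification does not define how relevance or popularity is scored.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """Hypothetical data dictionary entry."""
    name: str
    labels: set[str]
    popularity: float = 0.0  # assumed score fed back by procedure S223


def search(entries, keyword):
    """Return matching entries, most relevant first, then most popular."""
    keyword = keyword.lower()
    hits = []
    for entry in entries:
        label_match = keyword in {label.lower() for label in entry.labels}
        name_match = keyword in entry.name.lower()
        # Assumed relevance: a label match counts 2, a name match counts 1.
        relevance = 2 * label_match + name_match
        if relevance:
            hits.append((relevance, entry.popularity, entry))
    hits.sort(key=lambda hit: (hit[0], hit[1]), reverse=True)
    return [entry for _, _, entry in hits]


catalog = [
    DatasetEntry("Taipei bus ridership 2015", {"traffic", "bus"}, 5.0),
    DatasetEntry("Bus routes", {"traffic"}, 9.0),
    DatasetEntry("Weather", {"climate"}, 20.0),
]
results = search(catalog, "bus")
```

Here relevance takes priority over popularity, so a data set labeled "bus" ranks above one that only mentions the word in its name. A real data dictionary might weight the two differently.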

The query results displayed by the data dictionary include, for each data set, its name, original data format, field definitions, sample data file, view count, download count, and popularity analysis result, for the "data consumer user" to browse.

When purchasing a data set, the "data consumer user" must decide the usage period to purchase. The purchased period must be an integer multiple of the data usage authorization time unit. In this embodiment the data consumer user purchases a usage period of 2 months.
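The integer-multiple constraint above can be sketched as a small validation step. This is a minimal illustration, assuming periods are expressed in whole months; the function name `units_for_period` is hypothetical.

```python
def units_for_period(period_months: int, unit_months: int = 1) -> int:
    """Return how many authorization time units a requested period covers.

    Raises ValueError when the period is not a positive integer multiple
    of the authorization time unit.
    """
    if period_months <= 0 or unit_months <= 0:
        raise ValueError("period and unit must be positive")
    units, remainder = divmod(period_months, unit_months)
    if remainder:
        raise ValueError("period must be an integer multiple of the unit")
    return units


# The embodiment: a 1-month unit and a 2-month purchase give 2 units.
purchased_units = units_for_period(2, unit_months=1)
```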

When the purchase is completed, procedure S222, recording data set purchase information, is started. The Hadoop platform records the purchase made by the "data consumer user". The record includes the data set name, data set label attributes, purchase date, number of purchases, purchase unit price, purchased usage period, data transfer volume, and the data set's information in the data dictionary.

Each time a purchase occurs, procedure S223 feeds the popularity analysis model back to the data dictionary in real time. A real-time streaming component sends the purchase record to the analysis framework, which updates the popularity model; every new purchase record is added to the base data for analysis. The analysis algorithm may, for example, combine collaborative filtering with association rule analysis, but is not limited to that combination. The popularity recommendation model is computed in real time, and the latest model is fed back through the streaming component to the data dictionary to optimize queries, so that when other "data consumer users" query the data dictionary, the recommendations and the ranking of results reflect the latest sales.
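The per-purchase update can be illustrated with a deliberately simple stand-in. The sketch below does not implement collaborative filtering or association rule mining; it only keeps purchase counts and co-purchase counts that are updated incrementally as each record arrives, which is the incremental-update idea the paragraph describes. The class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class PopularityModel:
    """Toy incremental model: purchase counts plus co-purchase counts."""

    def __init__(self):
        self.purchases = Counter()                # dataset -> purchase count
        self.by_user = defaultdict(set)           # user -> datasets bought
        self.co_purchases = defaultdict(Counter)  # dataset -> co-bought counts

    def record_purchase(self, user_id, dataset_id):
        """Update the model with one purchase record (one stream event)."""
        self.purchases[dataset_id] += 1
        owned = self.by_user[user_id]
        if dataset_id in owned:
            return  # a repeat purchase adds no new co-purchase pairs
        for other in owned:
            self.co_purchases[dataset_id][other] += 1
            self.co_purchases[other][dataset_id] += 1
        owned.add(dataset_id)

    def most_popular(self, n=3):
        return [d for d, _ in self.purchases.most_common(n)]

    def also_bought(self, dataset_id, n=3):
        return [d for d, _ in self.co_purchases[dataset_id].most_common(n)]


model = PopularityModel()
for user, dataset in [("u1", "bus"), ("u1", "mrt"), ("u2", "bus"),
                      ("u2", "weather"), ("u3", "bus"), ("u3", "mrt")]:
    model.record_purchase(user, dataset)
```

In a deployment along the lines of the specification, `record_purchase` would be driven by the streaming component, and the resulting rankings would be pushed back to the data dictionary.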

According to the authorization time units purchased by the user, the Hadoop platform generates, in procedure S224 (data permission setting), a permission rule file for the data set purchased by the "data consumer user". The file contains the user identifier, data set identifier, data set type, data set path, and validity period. The permission rule file is passed to procedure S130, where an Agent embedded in each machine of the Hadoop platform performs the actual permission management, controlling read permission and time limits for the data set and its underlying files.

The Hadoop platform grants time-limited authorization for data sets purchased by the "data consumer user": a purchased data set is authorized for use within the usage period that the user purchased. In this embodiment the data consumer user may use the data set for 2 months. Once the usage period has passed, the platform automatically treats the authorization as expired, and the data set must be purchased again before it can be used again.
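A minimal sketch of a permission rule and the time-limited read check might look as follows. The field names mirror the items listed for the permission rule file, but the `PermissionRule` type, the 30-day month, and the path-prefix check are assumptions for illustration; the specification does not describe the file format or how the Agent evaluates it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PermissionRule:
    """Hypothetical in-memory form of one permission rule."""
    user_id: str
    dataset_id: str
    dataset_type: str
    dataset_path: str
    valid_from: datetime
    valid_until: datetime


def make_rule(user_id, dataset_id, dataset_type, dataset_path,
              purchased_at, units, days_per_unit=30):
    """Build a rule valid for `units` time units (assumed 30 days each)."""
    expiry = purchased_at + timedelta(days=units * days_per_unit)
    return PermissionRule(user_id, dataset_id, dataset_type,
                          dataset_path, purchased_at, expiry)


def may_read(rule, user_id, path, now):
    """Allow reads only for the buyer, under the data set path, in period."""
    root = rule.dataset_path.rstrip("/")
    under_root = path == root or path.startswith(root + "/")
    in_period = rule.valid_from <= now < rule.valid_until
    return user_id == rule.user_id and under_root and in_period


rule = make_rule("u1", "ds1", "hive_table", "/warehouse/bus_2015",
                 datetime(2017, 7, 20), units=2)
```

With this rule, a read by `u1` on 1 August 2017 is allowed, while the same read on 1 October 2017 is refused because the 60-day period has passed.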

Through S130, which integrates the platform computing resources with the purchased data set, authorization is granted immediately after purchase through the Hadoop platform authorization mechanism as improved by this patent. Once authorization is granted, the "data consumer user" can analyze and use the contents of the data set within the purchased 2-month usage period. The embedded Agent automatically performs integrated permission management through a control layer built in front of the interfaces of the analysis tools provided on the Hadoop platform, so that the Hadoop platform tools can read the data set in the warehouse and its underlying files. The purchasing user can therefore use the analysis tools provided on the Hadoop platform to connect to the distributed data warehouse database interface and run analyses directly, without copying or moving any data into the user's own space.

FIG. 3 is a flowchart of the integrated billing of the present invention. Billing consists of two parts: the platform resource billing procedure S410 and the data market billing procedure S420. S410 covers the rental fees for HDFS space and computing resources on the Hadoop platform, and S420 covers the fees arising from data set transactions between users. The operation of the two procedures is described below.

In procedure S410, S411 (HDFS rented space collection) collects the HDFS space rented by each user; in this embodiment the user rents 30 TB of HDFS space. While the user runs data analysis (for example Hive queries or Spark jobs), S412 (Vcore usage statistics) and S413 (Memory usage statistics) collect the Vcore and Memory computing resources used, by analyzing the resource usage log files that the resource management system generates for the user's computations. Usage is quantified as the number of Vcores used by a computation multiplied by its actual execution time, and as the Memory size allocated to a computation multiplied by its actual execution time.

In this embodiment the platform's usage data collection period is defined as 1 day: procedures S411, S412, and S413 are executed once a day to collect space and resource usage, which is continuously aggregated and stored in the platform's database.

User's 1-day Vcore usage = Σ(number of Vcores used by a computation * seconds the computation ran) = 4500

User's 1-day Memory usage = Σ(Memory size (MB) allocated to a computation * seconds the computation ran) = 3200000
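The two daily sums above can be sketched in Python as follows. The `JobRecord` type and the two sample jobs are invented for illustration; they are chosen only so that the totals reproduce the 4500 and 3200000 of this embodiment, and they stand in for whatever the resource usage logs actually contain.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-computation record parsed from a usage log."""
    vcores: int      # Vcores used by the computation
    memory_mb: int   # Memory allocated to the computation, in MB
    seconds: int     # actual execution time in seconds


def daily_usage(jobs):
    """Return (Vcore-seconds, MB-seconds) summed over one day's jobs."""
    vcore_seconds = sum(job.vcores * job.seconds for job in jobs)
    mb_seconds = sum(job.memory_mb * job.seconds for job in jobs)
    return vcore_seconds, mb_seconds


# Two made-up jobs whose totals match the embodiment's daily figures.
jobs = [JobRecord(vcores=4, memory_mb=4096, seconds=500),
        JobRecord(vcores=5, memory_mb=2304, seconds=500)]
usage = daily_usage(jobs)  # (4500, 3200000)
```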

Procedure S414, rented resource billing, issues bills on a 1-month billing cycle. Its relation to the usage data collection period above is that a billing cycle consists of several usage data collection periods and must be an integer multiple of the collection period. Procedure S414 reads from the database the values written by procedures S411, S412, and S413, which are the user's resource usage on the platform for the month. For S412 and S413 the 1-month billing cycle is the basis of the statistics, and the usage within that month is retrieved.

User's Vcore usage in 1 month = Σ(user's 1-day Vcore usage) = 60000

User's Memory usage in 1 month = Σ(user's 1-day Memory usage) = 5000000

The usage record in the database contains the user's unique identifier, billing period number, rented HDFS space, total Vcore usage, and total Memory usage. Together with the HDFS space unit price, Vcore unit price, and Memory unit price set on the Hadoop platform, these give the fee for the resources the user used on the Hadoop platform. The formula is:

User resource fee = (Vcore unit price * Vcore usage in 1 month) + (Memory unit price * Memory usage in 1 month) + (HDFS unit price * rented HDFS space)

Continuing the example, if the system sets the Vcore unit price at 0.005 yuan per core-second, the Memory unit price at 0.0001 yuan per MB-second, and the HDFS unit price at 80 yuan per TB, the user's platform resource fee for that month is 3200 yuan, calculated as follows:

User resource fee = (60000 * 0.005) + (5000000 * 0.0001) + (30 * 80) = 300 + 500 + 2400 = 3200
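The resource fee formula can be checked with a short sketch. The function name and parameter names are hypothetical; the default unit prices are the ones given in this embodiment. `Decimal` is used so that fractional prices such as 0.005 are not subject to binary floating-point rounding.

```python
from decimal import Decimal


def resource_fee(vcore_seconds, mb_seconds, hdfs_tb,
                 vcore_price=Decimal("0.005"),
                 memory_price=Decimal("0.0001"),
                 hdfs_price=Decimal("80")):
    """Fee = Vcore usage, Memory usage, and rented HDFS space, each
    multiplied by its unit price and then summed."""
    return (vcore_price * vcore_seconds
            + memory_price * mb_seconds
            + hdfs_price * hdfs_tb)


# Monthly figures from the embodiment: 60000 Vcore-seconds,
# 5000000 MB-seconds, and 30 TB of rented HDFS space.
fee = resource_fee(60000, 5000000, 30)
```

The three terms evaluate to 300, 500, and 2400 yuan, which sum to the 3200 yuan stated in the example.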

In procedure S420, S421 (collecting data market transaction records) first aggregates the user purchase records for this billing cycle that were recorded by procedure S222 described above. Here the records show that the user purchased the data set of 2015 Taipei City bus stop locations and ridership, at a data set unit price of 1000 yuan, for a usage period of 2 months (two 1-month authorization time units). The aggregated transaction records are passed to S422, the data market billing procedure, for account calculation.

S422, the data market billing procedure, then computes the data market accounts for the platform's 1-month billing cycle. The accounts have two parts. The first is the purchase price payable by the "data consumer user": the number of data usage authorization time units purchased multiplied by the data set unit price listed in the data dictionary. The second is the revenue of the "data providing user" from selling the data set: the data set's total sales minus the fixed percentage of total sales that the platform takes as a handling fee, which is 10% in this embodiment. The remainder is the data providing user's revenue from the data market for that billing cycle. The formulas are:

Data consumer user's data market fee = data set unit price * number of units purchased n = 1000 * 2 = 2000

Data providing user's revenue = total sales of the listed data set * (1 - handling fee rate) = 2000 * 90% = 1800
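The two data market formulas can be expressed as a brief sketch. The function names are hypothetical, and the 10% handling fee rate is the value used in this embodiment.

```python
from decimal import Decimal


def consumer_market_fee(unit_price, units):
    """Purchase price: data set unit price times units purchased."""
    return unit_price * units


def provider_revenue(total_sales, fee_rate=Decimal("0.10")):
    """Provider revenue: total sales minus the platform handling fee."""
    return total_sales * (1 - fee_rate)


# Embodiment: 1000 yuan per 1-month unit, 2 units purchased.
market_fee = consumer_market_fee(Decimal("1000"), 2)
revenue = provider_revenue(market_fee)
```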

The fee for the user's platform resource usage obtained from procedure S414, and the user's spending (data consumer user) or earnings (data providing user) in the data market obtained from procedure S422, together produce S430, the integrated bill. This is the basis for charging, and a statement of income and expenses is sent to the user. The Hadoop platform operator collects from each user the resource fee and the handling fee on data market transactions. The formulas and an example are:

Data consumer user's integrated bill = resource fee + data market fee = 3200 + 2000 = 5200

Hadoop platform operator's revenue = resource fees + (total data market sales * handling fee rate) = 3200 + (2000 * 10%) = 3400
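As a final illustrative sketch, the integrated bill and the operator's revenue can be combined as below. The function names are hypothetical, and the inputs are the figures from this embodiment (a 3200 yuan resource fee, 2000 yuan of data market sales, and a 10% handling fee rate).

```python
from decimal import Decimal


def integrated_bill(resource_fee, market_fee):
    """Amount charged to a data consumer user for one billing cycle."""
    return resource_fee + market_fee


def operator_revenue(resource_fees, market_sales, fee_rate=Decimal("0.10")):
    """Operator income: resource fees plus the handling fee on sales."""
    return resource_fees + market_sales * fee_rate


bill = integrated_bill(Decimal("3200"), Decimal("2000"))
income = operator_revenue(Decimal("3200"), Decimal("2000"))
```

Of the 5200 yuan paid by the data consumer user, the operator keeps 3400 yuan and the remaining 1800 yuan goes to the data providing user.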

In summary, the data set transaction and computing resource integration method of the present invention lets those who need data pay to obtain the required data sets and immediately run analyses on the platform, which greatly reduces the inconvenience of building an analysis environment, brings revenue to data providers, and meets the needs of both the supply and demand sides of data.

The above description is illustrative only and not limiting. Any equivalent modification or change that does not depart from the spirit and scope of the present invention shall be included in the scope of the appended claims.

Claims

A data set transaction and computing resource integration method, comprising: allocating storage space and computing resources on a Hadoop platform according to a resource rental request made by a user; running a data set trading platform on the Hadoop platform so that the user corresponding to the resource rental request can perform a data set transaction; and providing an authorization management agent module on the Hadoop platform, which performs permission integration according to the data set transaction request submitted by the user, so that the user directly accesses, on the Hadoop platform, the data set obtained through the data set transaction.

The data set transaction and computing resource integration method of claim 1, further comprising the steps of: calculating a fee according to the data set transaction of the user and the resource rental request; and transmitting the fee to the user.

The data set transaction and computing resource integration method according to claim 1, wherein the step of running the data set trading platform on the Hadoop platform so that the user corresponding to the resource rental request can perform the data set transaction further comprises the following steps: providing a data dictionary as a query and purchase interface, wherein the query and purchase interface further recommends data sets to the user according to category browsing, keyword query, and a popularity analysis; and setting a usage right for the data set according to the data set selected by the user.

The data set transaction and computing resource integration method according to claim 3, wherein the step of the authorization management agent module performing permission integration according to the data set transaction request submitted by the user further comprises the following steps: performing permission integration according to the usage right; and analyzing and computing the data set on the platform.

The data set transaction and computing resource integration method of claim 2, further comprising the steps of: charging according to the result of the data set transaction and a resource rental status on the Hadoop platform; wherein the resource rental status includes a plurality of usage log files.

The data set transaction and computing resource integration method according to claim 3, further comprising the steps of: determining whether the data set transaction is a data set listing procedure or a data set purchase procedure; and, if it is the data set listing procedure, performing the following steps: labeling the data set; registering the data set; and pricing the data set.

The data set transaction and computing resource integration method according to claim 6, wherein, if it is the data set purchase procedure, the following steps are performed: querying the data dictionary; purchasing the data set; recording data set purchase information; providing the popularity analysis to the data dictionary according to the data set purchase information; and setting the usage right.

The data set transaction and computing resource integration method according to claim 5, wherein the charging according to the resource rental status on the Hadoop platform further comprises performing a platform resource billing procedure, which comprises the following steps: obtaining a system space of a distributed file system rented by the user; obtaining a computation amount of a virtual core; obtaining a resource usage amount of a memory; and calculating the fee according to the system space, the computation amount, and the resource usage amount.

The data set transaction and computing resource integration method according to claim 5, wherein the charging according to the result of the data set transaction further comprises performing a data set transaction billing procedure, which comprises the following steps: obtaining a transaction result of the data set trading platform; and calculating the fee according to the transaction result.