TWI684147B

TWI684147B - Cloud self-service analysis platform and analysis method thereof

Info

Publication number: TWI684147B
Application number: TW107124789A
Authority: TW
Inventors: 陳昱全; 范登凱
Original assignee: 中華電信股份有限公司
Priority date: 2018-07-18
Filing date: 2018-07-18
Publication date: 2020-02-01
Also published as: TW202006617A

Abstract

The present invention is a cloud self-service analysis platform and an analysis method thereof. The analysis method includes: searching the existing data extraction rules applicable to a data analysis task in a shared resource pool, wherein the shared resource pool stores multiple existing data extraction rules; analyzing whether there is a conflict between the data analysis task and the existing data extraction rules, so as to generate a final data extraction rule; generating an acquisition script according to the final data extraction rule to perform the required data collection; and allocating an analysis algorithm and the collected data to a computing resource pool for calculation according to the priority of the analysis task and generating the final analysis result.

Description

Cloud self-service analysis platform and its analysis method

The present invention relates to multi-user collaborative information-sharing technology and, more specifically, to a cloud self-service analysis platform and its analysis method.

Generally speaking, the files in a data analysis project are large, and a single file often contains many different data types or attributes. When data analysts encounter data for the first time, they often set data extraction rules by trial and error to obtain a meaningful profile of the data, for example by counting how many times attribute value A appears each day, or how many times a certain field value exceeds a set value each day. Conversely, if the analyst is familiar with the file structure or has access to experience shared by predecessors, the analysis task becomes easy; achieving efficient analysis by classifying and organizing data has therefore become a common goal. In current cloud self-service analysis, duplicated data is an important issue. Put simply, one data analyst establishes an analysis criterion, and another analyst with a different research focus could still apply that criterion with only minor changes. Without a concept of multi-user collaborative sharing, however, the two analysts each establish their own set of criteria even though the two sets are clearly similar and interchangeable, which leads to duplicated data. Consequently, in current cloud self-service analysis, both the raw analysis data and the derived data generated through data extraction rules are often stored repeatedly for lack of an information-sharing mechanism, which wastes computing resources and lowers the efficiency of self-service analysis.

As can be seen from the above, finding a multi-user collaborative information-sharing technology, in particular one suited to a cloud self-service analysis platform, that lets data analysts consult the analysis criteria established by their predecessors when performing data analysis, and thereby improves analysis efficiency and avoids wasting computing resources, has become a technical problem that those skilled in the art are eager to solve.

The purpose of the present invention is to provide a mechanism and service that improves the efficiency of cloud self-service analysis through multi-user collaborative information sharing. By consulting the data extraction specifications shared by previous users, it improves analysis efficiency and avoids wasting computing resources on duplicate data.

To achieve the above and other objectives, the present invention provides a cloud self-service analysis platform comprising: a shared resource pool that stores multiple existing data extraction rules; a data extraction rule setting module, connected to the shared resource pool, that receives an externally input data analysis task and uses it to search the shared resource pool for the existing data extraction rules applicable to that task; a data extraction rule analysis module, connected to the data extraction rule setting module, that analyzes whether the data analysis task conflicts with the existing data extraction rules obtained by the setting module and generates a final data extraction rule accordingly; a data set collection module, connected to the data extraction rule analysis module and the shared resource pool, that generates an extraction script according to the final data extraction rule in order to collect the required data; and an analysis task scheduler, connected to the data set collection module, that allocates an analysis algorithm and the collected data to a computing resource pool for computation according to the priority of the analysis tasks, so as to produce a final analysis result.

The above system further includes a visual presentation module, connected to the data extraction rule analysis module and the shared resource pool, that presents the statistical distribution of the data through charts or tables.

The above system further includes an analysis algorithm selection module, connected to the visual presentation module, that lets the user select the analysis algorithm and set its related parameters.

In an embodiment, when the required data already exists in the shared resource pool, the data set collection module obtains the resource set from the shared resource pool to serve as the required data.

In another embodiment, the data extraction rule analysis module generates a warning message when it determines that the data analysis task conflicts with the existing data extraction rule, and either the existing data extraction rule is retained or a new data extraction rule is created to serve as the final data extraction rule.

In yet another embodiment, when the data extraction rule setting module finds no existing data extraction rule applicable to the data analysis task, the user creates a new data extraction rule to serve as the final data extraction rule.

The present invention further provides a cloud self-service analysis method comprising the following steps: based on a received data analysis task, searching a shared resource pool that stores multiple existing data extraction rules for the existing data extraction rules applicable to the data analysis task; analyzing whether the data analysis task conflicts with the applicable existing data extraction rules so obtained, and generating a final data extraction rule accordingly; generating an extraction script according to the final data extraction rule in order to collect the required data; and allocating an analysis algorithm and the collected data to a computing resource pool for computation according to the priority of the analysis tasks, so as to produce a final analysis result.

In the above method, collecting the required data further includes, when the required data already exists in the shared resource pool, obtaining the resource set from the shared resource pool to serve as the required data.

In the above method, a warning message is generated when the data analysis task conflicts with the existing data extraction rule, and either the existing data extraction rule is retained or a new data extraction rule is created to serve as the final data extraction rule.

In the above method, when no existing data extraction rule applicable to the data analysis task is found, the user creates a new data extraction rule to serve as the final data extraction rule.

Compared with the prior art, when the cloud self-service analysis platform and analysis method of the present invention receive a data analysis task from a user, the data extraction rule setting module recommends highly relevant rules from the shared resource pool. Together with the visual presentation, this lets the user quickly grasp the profile of the data and improves the performance of cloud self-service analysis. After the user selects or creates a data extraction rule, the data extraction rule analysis module checks whether it conflicts with the existing data extraction rules. Finally, the data set collection module prepares the required data: it first compares against the data sets already in the shared resource pool and does not extract duplicate data again, which saves computing resources.

100‧‧‧Cloud self-service analysis platform

101‧‧‧Shared resource pool

102‧‧‧Data extraction rule setting module

103‧‧‧Data extraction rule analysis module

104‧‧‧Data set collection module

105‧‧‧Analysis task scheduler

106‧‧‧Computing resource pool

107‧‧‧Visual presentation module

108‧‧‧Analysis algorithm selection module

200‧‧‧Data analysis task

401~411‧‧‧Flows

501~509‧‧‧Flows

601~609‧‧‧Flows

701~707‧‧‧Flows

S301~S304‧‧‧Steps

Figure 1 is a system architecture diagram of the cloud self-service analysis platform of the present invention; Figure 2 is a system architecture diagram of a specific embodiment of the cloud self-service analysis platform of the present invention; Figure 3 is a step diagram of the cloud self-service analysis method of the present invention; Figure 4 is an execution flowchart of the cloud self-service analysis method in an embodiment of the present invention; Figure 5 is an execution flowchart of data extraction rule setting in the cloud self-service analysis method of the present invention; Figure 6 is an execution flowchart of data extraction rule analysis in the cloud self-service analysis method of the present invention; and Figure 7 is an execution flowchart of data set collection in the cloud self-service analysis method of the present invention.

The technical content of the present invention is described below through specific embodiments, from which those skilled in the art can readily understand the advantages and effects of the invention as disclosed in this specification. The present invention may also be implemented or applied through other, different embodiments.

Figure 1 shows the system architecture of the cloud self-service analysis platform of the present invention. As shown, data analysis experts can use a workstation equipped with a browser to submit a data analysis task 200 to the cloud self-service analysis platform 100. The platform assists with data collection rules so that the required data can be found while saving computing resources. The cloud self-service analysis platform 100 includes a shared resource pool 101, a data extraction rule setting module 102, a data extraction rule analysis module 103, a data set collection module 104, an analysis task scheduler 105, and a computing resource pool 106.

The shared resource pool 101 stores multiple existing data extraction rules, that is, the various data extraction rules that were set previously. The present invention thus makes earlier data extraction rules available for reference, which saves computing resources and avoids storing data repeatedly.

The data extraction rule setting module 102 is connected to the shared resource pool 101 and receives the externally input data analysis task 200, using it to search the shared resource pool 101 for the existing data extraction rules applicable to the data analysis task 200. After the cloud self-service analysis platform 100 receives the data analysis task 200, it triggers the data extraction rule setting module 102 to query the database and determine whether data extraction rules already exist for the file to be analyzed. If the file has been analyzed before, the existing data extraction rules can be recommended, or new data extraction rules can be defined: the user can set rules directly through the interface, or the platform can recommend applicable data extraction rules based on attribute types. In other words, when the data extraction rule setting module 102 finds no existing data extraction rule applicable to the data analysis task 200, the user creates a new data extraction rule to serve as the final data extraction rule.

The data extraction rule analysis module 103 is connected to the data extraction rule setting module 102 and analyzes whether the data analysis task 200 conflicts with the existing data extraction rules obtained by the data extraction rule setting module 102, generating the final data extraction rule accordingly. The data extraction rule analysis module 103 checks whether the data analysis task 200 conflicts with the existing data extraction rules and issues a warning if it does; the user can still decide in the end whether to add the new data extraction rule. In other words, when the data extraction rule analysis module 103 determines that the data analysis task 200 conflicts with an existing data extraction rule, a warning message is generated, and either the existing data extraction rule is retained or a new data extraction rule is created to serve as the final data extraction rule.

The data set collection module 104 is connected to the data extraction rule analysis module 103 and the shared resource pool 101, and generates an extraction script according to the final data extraction rule in order to collect the required data. The data set collection module 104 generates the extraction script from the input data extraction rules and submits the script to the analysis task scheduler 105 to carry out the data extraction task.

The analysis task scheduler 105 is connected to the data set collection module 104 and allocates the analysis algorithm and the collected data to the computing resource pool 106 for computation according to the priority of the analysis tasks, so as to produce the final analysis result. The analysis task scheduler 105 is triggered when data collection completes. It allocates the analysis algorithm and the data set to the computing resource pool 106 according to the priority of the analysis tasks, and the result is finally stored in the shared resource pool 101. Besides the final analysis result, the shared resource pool 101 also stores the information produced in the preceding steps, such as the data set and the data extraction rules.
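The priority-based dispatch described for the analysis task scheduler 105 can be sketched as follows. This is a minimal illustration only, not the patented implementation: the names `AnalysisTask` and `TaskScheduler`, the convention that a lower number means higher priority, and the use of a plain dictionary as the shared resource pool are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(order=True)
class AnalysisTask:
    # Only priority and seq take part in ordering (hypothetical structure).
    priority: int                       # assumed: lower value runs earlier
    seq: int                            # tie-breaker that keeps submission order
    algorithm: Callable[[list], Any] = field(compare=False)
    dataset: list = field(compare=False)


class TaskScheduler:
    """Toy stand-in for the analysis task scheduler 105."""

    def __init__(self) -> None:
        self._queue: list = []
        self._seq = 0

    def submit(self, priority: int, algorithm: Callable[[list], Any], dataset: list) -> None:
        heapq.heappush(self._queue, AnalysisTask(priority, self._seq, algorithm, dataset))
        self._seq += 1

    def run_all(self, shared_pool: dict) -> list:
        """Run tasks in priority order and store results back in the pool."""
        results = []
        while self._queue:
            task = heapq.heappop(self._queue)
            # Stand-in for dispatching to the computing resource pool 106.
            results.append(task.algorithm(task.dataset))
        shared_pool.setdefault("results", []).extend(results)
        return results
```

A real scheduler would dispatch to separate workers rather than run tasks inline; the heap only illustrates how task priority can decide the execution order.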

In an embodiment, if the required data already exists in the shared resource pool 101, the data set collection module 104 can obtain the resource set from the shared resource pool 101 to serve as the required data.

In this way, a data analysis expert can send the data analysis task 200 to the cloud self-service analysis platform 100, which searches for applicable data extraction rules based on the task content and presents them to the expert for reference. The expert can reuse an existing data extraction rule or create a new one, and the platform determines whether a conflict exists. Once the data extraction rule has been selected, the data set collection module 104 collects the data, the analysis task scheduler 105 schedules the work taking into account task priority, the analysis algorithm, and so on, and the computation is performed in the computing resource pool 106 to obtain the analysis result.

Figure 2 shows the system architecture of a specific embodiment of the cloud self-service analysis platform of the present invention. As shown, the shared resource pool 101, data extraction rule setting module 102, data extraction rule analysis module 103, data set collection module 104, analysis task scheduler 105, and computing resource pool 106 of the cloud self-service analysis platform 100 are the same as described for Figure 1 and are not described again. In this embodiment, the cloud self-service analysis platform 100 further includes a visual presentation module 107 and an analysis algorithm selection module 108.

The visual presentation module 107 is connected to the data extraction rule analysis module 103 and the shared resource pool 101, and presents the statistical distribution of the data through charts or tables. Specifically, the visual presentation module 107 can present the statistical distribution through different charts (for example, pie charts and bar charts) and tables, so that the user can easily see information such as the content of the data extraction rules available for selection and their probability ranking.

The analysis algorithm selection module 108 is connected to the visual presentation module 107 and lets the user select the analysis algorithm to be used and set its related parameters. Specifically, once the data extraction rule has been determined, the user picks an analysis algorithm (for example, support vector machine, decision tree, or regression analysis) through the analysis algorithm selection module 108 and sets the related parameters. The selected analysis algorithm is then scheduled by the analysis task scheduler 105 and executed in the computing resource pool 106.

In addition to the component modules above, the cloud self-service analysis platform 100 also handles functions such as submitting, pausing, resuming, and deleting analysis tasks, and finally reports the results of the analysis tasks back to the data analysis expert.

Figure 3 shows the steps of the cloud self-service analysis method of the present invention. As shown, in step S301, based on the received data analysis task, a shared resource pool storing multiple existing data extraction rules is searched for the existing data extraction rules applicable to the data analysis task. Specifically, when a data analysis task is received, the shared resource pool is first searched for applicable existing data extraction rules; if any are found, the user can choose to apply them, which reduces the waste of computing resources.

In an embodiment, when no existing data extraction rule applicable to the data analysis task is found, the user creates a new data extraction rule to serve as the final data extraction rule.

In step S302, it is analyzed whether the data analysis task conflicts with the applicable existing data extraction rules so obtained, and a final data extraction rule is generated accordingly. In short, whether an existing data extraction rule is applied or a new one is created, it must be analyzed against the existing data extraction rules to determine whether a conflict exists, so that later computations do not produce inconsistent results.

In an embodiment, a warning message is generated when the data analysis task conflicts with the existing data extraction rule, and either the existing data extraction rule is retained or a new data extraction rule is created to serve as the final data extraction rule.

In step S303, an extraction script is generated according to the final data extraction rule in order to collect the required data. Data collection begins once the final data extraction rule has been selected. In particular, if the required data already exists in the shared resource pool when collection is performed, the resource set is obtained from the shared resource pool to serve as the required data. Finally, the collected data is sent to the scheduler for computation scheduling.

In step S304, the analysis algorithm and the collected data are allocated to the computing resource pool for computation according to the priority of the analysis tasks, so as to produce the final analysis result. The scheduler allocates the selected analysis algorithm and the data set collected in the previous step to the computing resource pool according to task priority. In addition, the final analysis result from the computing resource pool and the related information (data sets, data extraction rules, and so on) are stored back into the shared resource pool so that they can be consulted and reused in later analyses.
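Steps S301 to S304 can be tied together in a short sketch. It is illustrative only and rests on several assumptions that are not in the source: a rule is modelled as a `(field, operator, value)` tuple, a "conflict" is simplified to "the proposed rule differs from the stored rule for the same file", the extraction script is a Python callable, and scheduling is reduced to a direct function call. All function names are hypothetical.

```python
import operator

# Comparison operators supported by this toy rule format (assumed).
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def search_rule(pool, task):
    """S301: look up an existing extraction rule for the task's file."""
    return pool["rules"].get(task["file"])


def decide_final_rule(task, existing, reuse_existing=True):
    """S302: return (final_rule, warning); conflict is simplified to inequality."""
    proposed = task["rule"]
    if existing is None or existing == proposed:
        return proposed, None
    warning = f"conflict: existing {existing} vs proposed {proposed}"
    return (existing if reuse_existing else proposed), warning


def make_script(rule):
    """S303: turn a (field, op, value) rule into a callable extraction script."""
    field, op, value = rule
    return lambda rows: [r for r in rows if OPS[op](r[field], value)]


def run_task(pool, task, rows, algorithm):
    """Run S301 to S304 for one task and store the rule back in the pool."""
    existing = search_rule(pool, task)
    rule, warning = decide_final_rule(task, existing)
    data = make_script(rule)(rows)
    result = algorithm(data)                # S304, with scheduling omitted
    pool["rules"][task["file"]] = rule      # kept for later reuse
    return result, warning
```

For example, submitting a first task with the rule `("income", ">", 500)` stores that rule; a later task on the same file with `("income", ">", 800)` then produces a warning, and with `reuse_existing=True` the stored rule is the one that is applied.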

Figure 4 is an execution flowchart of the cloud self-service analysis method in an embodiment of the present invention and illustrates the core method flow of the invention. As shown, in flow 401, after selecting the data to analyze through the interface, the user begins setting data extraction rules. The user can then set rules in different ways: in flow 402, the user picks a data extraction rule recommended by the platform, which makes its recommendations based on data attributes; in flow 403, when no recommended rule is suitable, the user can create a data extraction rule.

Once the data rule set has been configured, flow 404 submits the data extraction rules to the data extraction rule analysis module for analysis. In flow 405, the data extraction rule analysis module determines, based on similarity, whether any conflicting data extraction rule exists. If a conflicting rule exists, flow 406 follows, in which the user can try to resolve the conflict according to the platform's prompts. If no conflicting rule exists, flow 407 follows, in which the user can submit the data extraction rule set to the data set collection module for the collection task.

In flow 408, the data set collection module first checks whether a corresponding data set exists in the shared resource pool. If it does, flow 409 obtains the data set directly from the shared resource pool. If it does not, flow 410 starts the data set collection task, and flow 411 then stores the data set in the shared resource pool, which completes the data preparation task.
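Flows 408 to 411 amount to a look-up-before-collect pattern, sketched below under stated assumptions: the shared resource pool is a plain dictionary, the key identifying a data set (for instance, a combination of file and rule) is chosen by the caller, and `prepare_dataset` and `collect` are hypothetical names.

```python
def prepare_dataset(pool, key, collect):
    """Return (dataset, reused) for the given key.

    pool    -- dict standing in for the shared resource pool
    key     -- hashable identifier of the data set (assumed, e.g. file + rule)
    collect -- zero-argument callable that performs the actual collection
    """
    if key in pool:              # flow 408: does a matching data set exist?
        return pool[key], True   # flow 409: reuse it without collecting again
    data = collect()             # flow 410: run the collection task
    pool[key] = data             # flow 411: store it for later tasks
    return data, False
```

The second return value only makes the reuse visible to the caller; the point of the pattern is that `collect` runs at most once per key.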

Figure 5 is an execution flowchart of data extraction rule setting in the cloud self-service analysis method of the present invention, that is, it illustrates the operation of the data extraction rule setting module 102 in Figures 1 and 2. The user can upload the data to analyze (the data analysis task) through a web interface or select data that already exists on the cloud self-service analysis platform. At this point, the data extraction rule setting module recommends highly relevant data extraction rules to the user and also lets the user create rules.

In flow 501, after the data to analyze has been selected, the user interacts with the data extraction rule setting module: based on the selected data file, the shared resource pool is queried for corresponding data extraction rules, and if any exist, the user can select an existing rule directly. Next, in flow 502, regardless of whether an existing data extraction rule was selected in the previous step, the user can specify any data field attribute (called the target attribute ta) to receive related rule recommendations.

In flow 503, data extraction rules are recommended based on the data field attribute. A synonym dictionary is first used to find words close to the target attribute name, and these words together with the target attribute itself form the candidate word set C. For example, if the target attribute name is “home location” and the synonym dictionary lists “address”, “residential address”, “location”, and “postal code” as synonyms, then the candidate word set C is {“home location”, “address”, “residential address”, “location”, “postal code”}. In flow 504, synonyms whose data type differs from that of the target attribute are filtered out. For example, the data type of the target attribute “home location” is “string”, whereas the synonym “location” has the data type “integer”; because the types differ, “location” is removed from the candidate word set C, which now contains {“home location”, “address”, “residential address”, “postal code”}. Next, in flow 505, the candidate word set C from the previous step is filtered to remove synonyms whose attribute value domains are mutually exclusive and independent, as shown in Table 1 below:

In Table 1 above, the value domains of “postal code” and the target attribute “home address” are mutually exclusive and independent, so “postal code” is removed from the candidate word set C, leaving only {“home location”, “address”, “residential address”}. Finally, in flow 506, recommendations are ranked by the probability that the synonym's attribute value domain intersects that of the target attribute, as shown in Table 2 below:

Apart from the original target attribute “home address”, the order is “address” followed by “residential address”. Flow 506 retrieves the data extraction rules corresponding to the synonyms from the shared resource pool and sorts them by probability. Next, in flow 507, the user can pick a suitable data extraction rule. If the user finds no suitable rule, a data extraction rule can be created through flow 508. The user can repeat flows 502 to 508 to establish several rules for one or more target attributes. Finally, flow 509 sends the data extraction rules produced by flows 507 and 508 to the data extraction rule analysis module, which completes the definition of the data extraction rules.
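The candidate filtering and ranking of flows 503 to 506 can be sketched as below. The source does not give the formula behind the ranking, so the overlap measure used here (the share of the target attribute's values that also occur in the candidate's values) is an assumption, as are the function name `recommend`, the data layout, and the sample values.

```python
def recommend(target, synonyms, attrs):
    """Rank attributes related to `target`.

    synonyms -- dict: attribute name -> list of synonym names (assumed dictionary)
    attrs    -- dict: attribute name -> (data_type, set_of_values)
    """
    t_type, t_values = attrs[target]

    # Flow 503: candidate set C is the target plus its known synonyms.
    candidates = [target] + [w for w in synonyms.get(target, []) if w in attrs]

    # Flow 504: drop candidates whose data type differs from the target's.
    candidates = [c for c in candidates if attrs[c][0] == t_type]

    scored = []
    for c in candidates:
        # Assumed measure: share of target values also present in the candidate.
        overlap = len(attrs[c][1] & t_values) / (len(t_values) or 1)
        if overlap > 0:          # flow 505: drop disjoint value domains
            scored.append((c, overlap))

    # Flow 506: highest overlap first (sorted() keeps ties in input order).
    return sorted(scored, key=lambda item: item[1], reverse=True)


attrs = {
    "home location": ("str", {"A Rd", "B Rd", "C Rd", "D Rd"}),
    "address": ("str", {"A Rd", "B Rd", "C Rd"}),
    "residential address": ("str", {"A Rd", "E Rd"}),
    "location": ("int", {1, 2}),
    "postal code": ("str", {"100", "220"}),
}
synonyms = {"home location": ["address", "residential address", "location", "postal code"]}
ranking = recommend("home location", synonyms, attrs)
```

With these made-up values, “location” is dropped for its data type, “postal code” is dropped because its values never overlap with the target's, and the remaining attributes are ordered by overlap.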

FIG. 6 is a flowchart of data extraction rule analysis in the cloud self-service analysis method of the present invention, that is, the operation of the data extraction rule analysis module 103 in FIGS. 1 and 2. The main purpose of this module is to analyze whether conflicting rules exist. After a data extraction rule is submitted, in process 601 the user interacts with the data extraction rule analysis module, which calculates the distance (similarity) between the rule to be added and the existing rules; the approach is to model the input data extraction rule. In process 602, each attribute is represented mainly through a vector space model, and processes 603 and 604 calculate the similarity. For example, as shown in Table 3 below, suppose the shared resource pool contains three data extraction rules, each with four attributes on which conditions can be set, {"home address", "annual income", "gender", "house construction date"}, whose data types are "string", "integer", "Boolean", and "date" respectively. The rule to be added requires that the extracted data satisfy all of the following: "home address" contains [Zhongzheng Road or Zhongshan Road], "annual income" is greater than 500, "gender" is male, and "house construction date" is 2002.

Process 603 first calculates the vector distance for the numeric, date, time, Boolean, and bit data types, computing the distance between the rule to be added and each existing rule according to the following criteria: (1) take the absolute difference of the two values and then its square root; (2) if the comparison operators of the two rules differ, enlarge the above value, here by taking the maximum value and adding 1. If the operators are the same, record whether the coverage condition holds; for example, the rule ">400" covers the rule ">500". After process 603, the similarity between the existing rules and the rule to be added is as shown in Table 4 below:

In the table above, the comparison operator of existing rule 3 differs from that of the rule to be added, so the vector distance between the two rules is enlarged. Process 604 then calculates the vector distance for the character, string, enumeration, and text data types, computing the distance between the rule to be added and each existing rule according to the following criteria: (1) determine whether the two texts are identical, taking 0 if identical and 1 otherwise; (2) if the comparison operators of the two rules differ, add 1 to the value from the previous step; otherwise, additionally record whether the coverage condition holds. After process 604, the similarity between the existing rules and the rule to be added is as shown in Table 5 below:

Finally, process 605 sorts and displays the rules by vector distance, and process 606 determines whether a conflicting data extraction rule exists. A distance of 0 means that a data extraction rule with exactly the same conditions already exists. A rule is also regarded as conflicting when all of its coverage conditions hold; for example, although existing rule 1 is far from the new rule, its conditions all cover the settings of the new rule. When a conflicting rule is found, process 607 displays a data extraction rule conflict warning, and the user can resolve the conflict through process 608 by adopting the existing rule, modifying the new rule, or deleting the new rule. If no conflicting rule appears, process 609 is entered and the data extraction rule is established.
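The distance and conflict checks of processes 603 to 606 can be sketched as follows. This is only an illustration under assumptions: the `Condition` type, the function names, and the sample rules are hypothetical (Tables 3 to 5 are not reproduced here), the numeric distance uses the square root of the absolute difference as stated for process 603, an operator mismatch is simplified to a flat penalty of 1, and coverage is implemented only for the operators ">", "<", and "==".

```python
import math
from typing import NamedTuple, Union


class Condition(NamedTuple):
    """Hypothetical single condition of a rule, e.g. annual income > 500."""
    op: str                          # comparison operator: ">", "<" or "=="
    value: Union[int, float, str]


def covers(existing: Condition, new: Condition) -> bool:
    """True if every record matching `new` also matches `existing`."""
    if existing.op != new.op:
        return False
    if existing.op == ">":
        return existing.value <= new.value    # ">400" covers ">500"
    if existing.op == "<":
        return existing.value >= new.value
    return existing.value == new.value


def condition_distance(existing: Condition, new: Condition) -> float:
    """Distance for one attribute; both conditions must hold the same value type."""
    if isinstance(new.value, str):
        base = 0.0 if existing.value == new.value else 1.0   # process 604
    else:
        base = math.sqrt(abs(existing.value - new.value))     # process 603
    penalty = 1.0 if existing.op != new.op else 0.0           # operators differ
    return base + penalty


def compare_rules(existing: dict, new: dict):
    """Return (total distance, conflict flag) for two rules over the same attributes."""
    dist = sum(condition_distance(existing[a], new[a]) for a in new)
    all_covered = all(covers(existing[a], new[a]) for a in new)
    return dist, dist == 0 or all_covered                     # process 606


# Illustrative rules only.
new_rule = {"annual income": Condition(">", 500), "gender": Condition("==", "male")}
rule_1 = {"annual income": Condition(">", 400), "gender": Condition("==", "male")}
rule_3 = {"annual income": Condition("<", 500), "gender": Condition("==", "female")}

print(compare_rules(rule_1, new_rule))
print(compare_rules(rule_3, new_rule))
```

In this sample, `rule_1` is at a non-zero distance yet still flagged as conflicting because each of its conditions covers the new rule, while `rule_3` is not flagged because its operator and values differ.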

FIG. 7 is a flowchart of data set collection in the cloud self-service analysis method of the present invention, that is, the operation of the data set collection module 104 in FIGS. 1 and 2. The main purpose of this module is to generate the corresponding data set according to the rule settings after receiving a combination of data extraction rules. In process 701, the combination of data extraction rules is received. In process 702, the module checks whether the shared resource pool already holds a data set corresponding to the data extraction rules. If so, process 703 obtains the data set directly from the shared resource pool, without generating the data again. If not, process 704 is entered and the data set collection module generates a data extraction script. The script can be executed directly from the command line and is stored in a standard format such as xml, json, or yml. Its main content includes (1) the data source; (2) the data extraction rules; (3) the data set name; (4) the data set storage location; and (5) the priority weight. Its meta information includes the creator, creation time, version, and so on.
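The reuse-or-generate logic of processes 702 to 706 and the content of an extraction script can be sketched as follows. This is a sketch under assumptions: the field names, the JSON layout, the pool structure, and the `run_script` callback are hypothetical and only mirror the five items and the meta information listed above.

```python
import json
from datetime import datetime, timezone


def rule_key(rules):
    """Canonical string for a rule combination, used as the pool lookup key."""
    return json.dumps(rules, sort_keys=True, ensure_ascii=False)


def build_script(source, rules, dataset_name, storage, priority, creator, version="1"):
    """Process 704: assemble the script content as a JSON-serializable dict."""
    return {
        "data_source": source,
        "extraction_rules": rules,
        "dataset_name": dataset_name,
        "storage_location": storage,
        "priority_weight": priority,
        "meta": {
            "creator": creator,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "version": version,
        },
    }


def collect(source, rules, dataset_name, pool, run_script, creator="user_a"):
    """Return (data set, newly_generated) for a rule combination."""
    key = rule_key(rules)
    if key in pool:                                       # 702 -> 703: reuse
        return pool[key]["dataset"], False
    script = build_script(source, rules, dataset_name,
                          f"shared_pool/{dataset_name}", 1, creator)
    dataset = run_script(script)                          # 705: execute script
    pool[key] = {"dataset": dataset, "script": script}    # 706: store both
    return dataset, True


# Illustrative usage with a stand-in for the real extraction step.
executed = []


def fake_run(script):
    executed.append(script["dataset_name"])
    return ["u1", "u7"]


pool = {}
rules = [{"attribute": "complaints_per_month", "op": ">", "value": 3}]
first = collect("customer_complaints", rules, "m1", pool, fake_run)
second = collect("customer_complaints", rules, "m1", pool, fake_run)
script_text = json.dumps(pool[rule_key(rules)]["script"], indent=2)
print(script_text)
```

The second call finds the data set in the pool, so the extraction step runs only once.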

Once the script has been created, process 705 is entered: the scheduler executes the data extraction scripts in order, according to the currently available computing resources and the priority of each script, to generate the data sets. Next, in process 706, the generated data set and the script-related information are stored in the shared resource pool. Finally, process 707 is entered and the data set preparation task is complete.
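The priority-based scheduling of process 705 can be sketched with a priority queue. This is a sketch under assumptions: the `ScriptScheduler` class is hypothetical, a larger priority weight is assumed to run first, and the available computing resources are modeled simply as a number of slots per round.

```python
import heapq
import itertools


class ScriptScheduler:
    """Hands out queued extraction scripts, highest priority weight first."""

    def __init__(self, slots):
        self.slots = slots                  # scripts that may run per round
        self._heap = []
        self._order = itertools.count()     # tie-breaker keeps submission order

    def submit(self, name, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._order), name))

    def next_round(self):
        """Pop up to `slots` scripts for execution in this round."""
        batch = []
        while self._heap and len(batch) < self.slots:
            _, _, name = heapq.heappop(self._heap)
            batch.append(name)
        return batch


scheduler = ScriptScheduler(slots=2)
scheduler.submit("script_m1", priority=1)
scheduler.submit("script_m2", priority=5)
scheduler.submit("script_m3", priority=3)
round_1 = scheduler.next_round()
round_2 = scheduler.next_round()
print(round_1, round_2)
```

The submission counter is used as a tie-breaker so that scripts with equal priority run in the order they were submitted.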

The following embodiment illustrates how the mechanism and service of the present invention use multi-person collaborative information sharing to improve the efficiency of cloud self-service analysis. A Customized Customer Journey platform is built as described herein, with telecommunications data used for illustration. The platform provides machine learning for use by marketing departments, data analysts, and others. The aim is to help customer service and marketing departments improve the customer service experience, for example by detecting that a customer intends to leave the network or has frequently received competitors' text messages, so that a promotional offer or a related retention strategy can be offered before the customer actually leaves, increasing sales opportunities. With the architecture proposed herein, data analysis experts can integrate different data sources, define custom event attributes, and share machine learning models and analysis data.

The following description refers to FIGS. 1, 5, and 7 together. Assume that the data to be analyzed have been prepared as structured raw data sets of telecommunications subscribers, originating from different services and data sources (data channels). Examples of the data sets include customer mobile phone call records (outgoing and incoming), landline call records (outgoing and incoming), customer video rental records, customer complaint records from customer service, customer fixed-line broadband subscription data, and 4G LTE network quality, as shown in Table 6 below:

Each data set carries a unique user identification attribute (unique identifier) for identifying the same customer across different data sources, so that records can be consolidated per user. Each data set has its own attributes: for example, the customer complaint data carries a "complaint reason" attribute, the 4G LTE network quality data carries a "base station location" attribute, and the customer received-SMS data carries an "SMS content length" attribute. These raw data sets are prepared and stored as structured data in the shared resource pool 101.

Once the raw data sets are prepared, data analysis tasks can be carried out. Below, "data analysis expert" refers to a marketer or user. To perform an analysis, the data analysis expert needs to combine the attributes required for it; these attributes are referred to here as event extraction rules. The event extraction rules are derived from the selected analysis data, which come from different data sources (see data analysis task 200). As shown in FIG. 1, the data set is selected through the data extraction rule setting module 102. Referring next to FIG. 5, in process 501 the user interacts with the data extraction rule setting module to select the data fields for the event extraction rules to be combined. For example, three different data sources are selected, namely the customer mobile phone call data set, the 4G LTE network quality data set, and the customer complaint data set, and the new data extraction rule to be created is named "4G mobile user off-network intention r1". The data field is specified in process 502: the target attribute "marketing hotline" is selected in the customer mobile phone call data set. In process 503 the system uses the synonym dictionary to select attributes with similar names, that is, similar data extraction rules such as "accepts competitor marketing calls", "promotional call", "marketing hotline", "actively dials competitor marketing numbers", "marketing time", and "suspected marketing SMS". The system then filters out data extraction rules of a different data type in process 504 and, in process 505, removes those whose value domain is mutually exclusive with and independent of the target attribute, as shown in Table 7 below:

The candidate word set is {"promotional call", "marketing hotline", "suspected marketing SMS number"}. In process 506, recommendations are ranked by the probability of intersection of the attribute value domains. The value domains of the candidate words are shown in Table 8 below:

The recommendation probability is calculated from the intersection of the attribute value domains. The calculation method is shown in Equation 1 below, and an example calculation for the "promotional call" attribute is shown in Equation 2.

Finally, the recommendations ranked by intersection probability are obtained for each attribute, as shown in Table 9 below. The following data extraction rules are recommended to the data analysis expert, in order: marketing hotline, promotional call, suspected marketing SMS. With the recommendation probabilities, the data analysis expert can quickly add or remove rules, or pick the most suitable data extraction rules.
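The ranking step of process 506 can be sketched as follows. Equations 1 and 2 and Tables 8 and 9 are not reproduced here, so the scoring formula in this sketch is an assumption: it uses the Jaccard overlap of two value domains purely as a stand-in for the recommendation probability. The function names and the sample phone numbers are likewise hypothetical.

```python
def overlap_probability(target_values, candidate_values):
    """Assumed stand-in for Equation 1: Jaccard overlap of two value domains."""
    target, candidate = set(target_values), set(candidate_values)
    union = target | candidate
    return len(target & candidate) / len(union) if union else 0.0


def rank_candidates(target_values, domains):
    """Process 506: sort candidate attributes by descending overlap score."""
    scored = [(name, overlap_probability(target_values, values))
              for name, values in domains.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Illustrative value domains only.
domains = {
    "promotional call": {"0800-111", "0800-222", "0800-999"},
    "marketing hotline": {"0800-111", "0800-222", "0800-333"},
    "suspected marketing SMS": {"0800-333", "0911-000", "0922-000"},
}
ranking = rank_candidates(domains["marketing hotline"], domains)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these sample domains the order is marketing hotline, promotional call, suspected marketing SMS, matching the order given in the text; a different overlap measure would slot into `overlap_probability` without changing the ranking code.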

Referring again to FIG. 5, the data analysis expert selects suitable data extraction rules in process 507 (the user selects data extraction rules), for example choosing "promotional call" and "suspected marketing SMS" from the recommendations of process 506. Alternatively, the expert directly specifies existing data fields in process 502 and creates data rules, entering process 508: for example, fields such as call duration and the proportion of called parties on external networks are selected from the customer mobile phone call data set; attributes such as connection quality score and number of reconnections due to frequent poor quality are selected from the 4G LTE network quality data set; and attributes such as complaint event and number of complaints are selected from the customer complaint data set. These come from different data sources or data extraction rules, and records for the same user are linked by the unique user identification attribute. After selecting data extraction rules in process 507 or 508, the data analysis expert sets a threshold (filter value range) for each attribute, for example a proportion of called parties on external networks greater than 50%, a connection quality below 50 points, or more than 3 complaints per month, as shown in Table 10 below. Once the data extraction rules and filter values are set, the new rule, named "4G mobile user off-network intention r1", and its data extraction rules are submitted in process 509 and sent to the data extraction rule analysis module 103.
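Linking data sets by the unique user identifier and applying the attribute thresholds can be sketched as follows. This is an illustration under assumptions: the attribute names, the sample records, and the helper functions are hypothetical, and the conditions of a rule are assumed to be combined with a logical AND.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

# Hypothetical thresholds mirroring the examples in the text.
RULE_R1 = [
    ("external_call_ratio", ">", 0.5),
    ("connection_quality", "<", 50),
    ("complaints_per_month", ">", 3),
]


def join_by_user(*datasets):
    """Merge per-user records from several data sets on the unique user ID."""
    merged = {}
    for dataset in datasets:
        for user_id, fields in dataset.items():
            merged.setdefault(user_id, {}).update(fields)
    return merged


def apply_rule(records, rule):
    """Return the IDs of users whose merged record satisfies every condition."""
    return sorted(
        uid for uid, rec in records.items()
        if all(attr in rec and OPS[op](rec[attr], value) for attr, op, value in rule)
    )


# Illustrative records only.
calls = {"u1": {"external_call_ratio": 0.7}, "u2": {"external_call_ratio": 0.2}}
lte = {"u1": {"connection_quality": 40}, "u2": {"connection_quality": 45}}
complaints = {"u1": {"complaints_per_month": 4}, "u2": {"complaints_per_month": 5}}

matched = apply_rule(join_by_user(calls, lte, complaints), RULE_R1)
print(matched)
```

Users missing any attribute named in the rule are excluded rather than treated as matching.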

Returning to FIG. 1, the data extraction rule analysis module 103 represents the data rule to be added in a vector space model, in which each dimension represents an attribute and its filter value range. Then, referring to process 602 of FIG. 6, the module checks whether the rule conflicts with existing rules, as shown in Table 11 below.

Attribute types fall into two categories. The first category comprises the numeric, date, time, Boolean, and bit types. Using the concept of vector similarity, the distance between the rule to be added and each existing rule is calculated in process 603 as follows: (1) compute the difference between the two values and square it; (2) if the comparison operators of the two rules differ, add 1 to the squared value. The second category comprises the character, string, enumeration, and text types. In process 604 the similarity is calculated as follows: (1) determine whether the two texts are identical, taking 0 if identical and 1 otherwise; (2) if the comparison operators of the two rules differ, add 1 to the value from the previous step; otherwise, additionally record whether the coverage condition holds. After both categories have been processed, a similarity distance matrix is obtained and the rule similarity is calculated:

In process 605, the rules are sorted and displayed by the sum of the values from processes 603 and 604. In process 606, it is determined whether any rule is highly similar or conflicting; a distance of 0 means that a data extraction rule with exactly the same conditions already exists. If there is no conflict, process 609 is entered: the new data rule is established and rule r1 is stored in the shared resource pool 101, the stored information including the attribute values and information about the data analysis expert who created the rule. Conversely, if there is a rule conflict, processes 607 and 608 are entered: the system issues a rule conflict warning and prompts modification of the conflicting data extraction rule, so that the data analysis expert can revise it.

Referring to FIG. 7, in process 701, after the rules have been built, the system receives the combination of data extraction rules. In process 702, the system checks whether the shared resource pool 101 holds a data set corresponding to the data extraction rules. If the data set already exists in the shared resource pool 101, process 703 is entered: the system extracts the data according to the filter value range of each attribute of the event rule, stores the extracted data in the shared resource pool 101, and enters process 707, completing the data set preparation task. Conversely, if the data set does not exist in the shared resource pool 101, process 704 is entered and the data analysis expert can generate a data extraction script. Then, in process 705, the extraction from the different data sources is scheduled according to the system resources available to the data analysis expert. Next, in process 706, the system extracts the data according to the filter value range of each attribute and stores it in the resource pool. Finally, process 707 is entered and the data set preparation task is complete.

Referring also to FIG. 2, after creating the data rules, the data analysis expert can perform data analysis through a graphical interface: in the visual presentation module 107, the expert selects either a raw data set or a rule-conforming data set generated by the data rules above, and selects a suitable algorithm through the analysis algorithm selection module 108, which can integrate commonly used learning models such as support vector machines, decision trees, random forest classification, K-means clustering, neural networks, regression analysis, PCA, and frequent pattern mining. After the model and data set are selected, the data set collection module 104 retrieves the selected raw data set or rule-conforming data set from the shared resource pool 101. Once the data retrieval succeeds, the data analysis expert uses the analysis task scheduler 105 to set the number of training iterations and adjust the model's analysis parameters, which starts the scheduling of the training task and submits the model to the computing resource pool 106. The model is then trained using the computing resources available to the data analysis expert, and results are produced. The computing resource pool 106 stores the training data set, the data analysis expert's user data, the selected model, the training parameters, the analysis results, and related information in the shared resource pool 101.

The platform described herein provides project management. Referring to FIG. 1, through the visual presentation module 107, users in the same project can retrieve data rules, raw data sets, and prepared rule-conforming data sets created by other users. For example, user a uses data rules r1, r2, and r3 to generate rule-conforming data sets m1, m2, and m3. User b can select data suitable for the analysis in the visual presentation module 107, for example m2 generated by r2 together with other raw data; the system retrieves m2 and the related data from the shared resource pool 101 into the computing resource pool 106, submits the algorithm, and allocates computing resources so that user b can train a model with system resources. Alternatively, user b consults the event rules created by user a and finds that r1 and r2 resemble the event rule to be created. User b can then take copies of r1 and r2 and modify the attributes in the rules; once done, the data extraction rule analysis module 103 evaluates them, r1' and r2' belonging to user b are created, and the data sets m1' and m2' conforming to these event rules are extracted. For example, user a above creates the off-network intention rule r1, which includes attributes such as accepting competitor marketing calls, call duration, proportion of called parties on external networks, connection quality score, complaint event, and number of complaints, along with their value ranges. User b wants to create an event rule "customer loyalty" and can refer to r1 and modify it to assemble suitable attributes: for instance, selecting the monthly tariff plan attribute from the customer tariff information, setting its value range to greater than 1300 yuan and adding it to r1, deleting the call duration attribute, and lowering the value range of the proportion of called parties on external networks to 20%, as shown in Table 12 below. After completing the changes, user b submits the rule to the data extraction rule analysis module 103 for evaluation. If no conflict with other event rules is found, the new rule r2 can be created and the data set m2 conforming to it extracted, and user a, who belongs to the same project, can in turn share r2 and m2 created by user b.
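Deriving a new rule from a shared one, as user b does with r1, can be sketched as follows. This is an illustration under assumptions: the dictionary layout, the attribute names, the `derive_rule` helper, and the call duration threshold are hypothetical, and the lowered proportion is assumed to keep the ">" operator.

```python
import copy


def derive_rule(base, add=None, remove=(), change=None):
    """Copy `base` and apply removals, value changes, and additions to the copy."""
    rule = copy.deepcopy(base)
    for attr in remove:
        rule.pop(attr, None)
    rule.update(change or {})
    rule.update(add or {})
    return rule


# Rule r1 of user a; the call duration threshold is a made-up placeholder.
r1 = {
    "call_duration": (">", 10),
    "external_call_ratio": (">", 0.5),
    "connection_quality": ("<", 50),
    "complaints_per_month": (">", 3),
}

# User b's "customer loyalty" rule built from a copy of r1.
loyalty = derive_rule(
    r1,
    add={"monthly_tariff": (">", 1300)},
    remove=["call_duration"],
    change={"external_call_ratio": (">", 0.2)},
)
print(loyalty)
```

Working on a deep copy leaves user a's r1 in the shared pool unchanged, so both rules remain available to the project.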

In addition, after user a builds a model, user b can select that model and its chosen parameters in the visual presentation module 107 and use them; that is, users can work collaboratively through the data extraction rule analysis module 103 to speed up model optimization. The project management function also supports sharing of analysis data, allowing different users to use data indirectly. For example, user a builds a model predicting mobile internet customers leaving for competitors, which yields a data set of customers predicted to have a high probability of leaving for competitors; this data set is stored in the shared resource pool 101. When user b later wants to build a model of future fixed-line broadband customers leaving for Far EasTone broadband, user b can select that high-probability data set in the visual presentation module 107. This reduces the time spent analyzing the whole customer base directly, or allows a comparison of how the prediction results of the fixed-line model differ between the whole customer base and the high-probability customers, helping to reduce both the time analysts spend preparing data and the problem of model rework.

The analysis results can be presented visually. The visual presentation module 107 can render common statistical charts, including bar charts, scatter plots, pie charts, bubble charts, and heat maps, to help users analyze large amounts of data. For example: (1) analysis of raw data, where the data analysis expert can produce from the raw data a pie chart of users browsing a particular website or a bar chart of users making or receiving particular calls; (2) statistics on rule-conforming data sets, such as using the off-network intention rule of the embodiment above with a line chart to observe how users' intention to leave the network changes over time, so that the data analysis expert can devise suitable promotional offers to improve the customer retention rate; (3) results produced by a model, such as training a crowd movement prediction model from event rules on mobile positioning, customer data, and originating base stations, and presenting the locations customers frequently move to and the number of customers as a map heat map, providing an effective reference for choosing new store locations and increasing opportunities for customer contact marketing.

As described above, the present invention discloses a mechanism and service that improve the efficiency of cloud self-service analysis through multi-person collaborative information sharing. The service mainly comprises data extraction rule setting, data extraction rule analysis, and data set collection, which together support the core workflow of cloud self-service analysis: (1) the user submits the file to be analyzed to the cloud self-service analysis platform, or selects an existing remote file; (2) a training data set is generated; (3) a machine learning algorithm is selected and its parameters are set; (4) the model training task is submitted; (5) the model training task runs in the cloud environment; (6) the trained model is stored on the cloud platform or downloaded locally. In existing self-service analysis, by contrast, data from the same source are often uploaded repeatedly during model training merely because experts' domain viewpoints differ, and the lack of an effective information sharing mechanism makes self-service analysis inefficient. Moreover, it is a benefit if different analysis applications can draw on each other's strengths. The present invention introduces the concept of a shared resource pool into the cloud self-service analysis system, combined with the data extraction rule setting module and the data extraction rule analysis module, which filter out rules that already exist and also suggest related rules; finally, the data set collection module prepares the required data, improving self-service analysis performance. If the cloud self-service analysis platform already holds the data set for a rule, it need not be prepared again, which saves computing resources.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be as set forth in the claims below.

101‧‧‧shared resource pool

104‧‧‧data set collection module

105‧‧‧analysis task scheduler

106‧‧‧computing resource pool

200‧‧‧data analysis task

Claims

A cloud self-service analysis platform, comprising: a shared resource pool storing a plurality of existing data extraction rules; a data extraction rule setting module, connected to the shared resource pool, for receiving an externally input data analysis task and searching the shared resource pool with the data analysis task to obtain the existing data extraction rules applicable to the data analysis task; a data extraction rule analysis module, connected to the data extraction rule setting module, for analyzing whether a new data extraction rule of the data analysis task conflicts with the existing data extraction rules obtained by the data extraction rule setting module, so as to generate a final data extraction rule accordingly; a data set collection module, connected to the data extraction rule analysis module and the shared resource pool, for generating an extraction script according to the final data extraction rule to collect the required data; and an analysis task scheduler, connected to the data set collection module, for allocating an analysis algorithm and the collected required data to a computing resource pool for computation according to the priority order of analysis tasks, so as to generate a final analysis result.

The cloud self-service analysis platform according to claim 1, further comprising a visual presentation module connected to the data extraction rule analysis module and the shared resource pool, for presenting the statistical distribution of data through charts or tables.

The cloud self-service analysis platform according to claim 2, further comprising an analysis algorithm selection module connected to the visual presentation module, for a user to select the analysis algorithm and set related parameters.

The cloud self-service analysis platform of claim 1, wherein, when the required data already exists in the shared resource pool, the data collection module obtains a resource set from the shared resource pool as the required data.

The cloud self-service analysis platform of claim 1, wherein, when the data extraction rule analysis module determines that the new data extraction rule of the data analysis task conflicts with the existing data extraction rule, a warning message is generated, and one of continuing to use the existing data extraction rule and creating the new data extraction rule is selected to provide the final data extraction rule.

The cloud self-service analysis platform of claim 1, wherein, when the data extraction rule setting module does not obtain an existing data extraction rule applicable to the data analysis task, the new data extraction rule created by the user serves as the final data extraction rule.

A cloud self-service analysis method, comprising the following steps: according to a received data analysis task, searching a shared resource pool that stores a plurality of existing data extraction rules for the existing data extraction rules applicable to the data analysis task; analyzing whether a new data extraction rule of the data analysis task conflicts with the obtained existing data extraction rules applicable to the data analysis task, and generating a final data extraction rule accordingly; generating an extraction script according to the final data extraction rule to collect the required data; and allocating an analysis algorithm and the collected required data to a computing resource pool for computation according to an analysis task priority order, so as to generate a final analysis result.

The cloud self-service analysis method of claim 7, wherein collecting the required data further comprises, when the required data already exists in the shared resource pool, obtaining a resource set from the shared resource pool as the required data.

The cloud self-service analysis method of claim 7, wherein a warning message is generated when the new data extraction rule of the data analysis task conflicts with the existing data extraction rule, and one of continuing to use the existing data extraction rule and creating the new data extraction rule is selected to provide the final data extraction rule.

The cloud self-service analysis method of claim 7, wherein, when no existing data extraction rule applicable to the data analysis task is obtained, the new data extraction rule created by the user serves as the final data extraction rule.
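
The four steps of the claimed method (rule search, conflict analysis, data collection with reuse from the shared resource pool, and priority-ordered computation) can be illustrated with a minimal Python sketch. It is not the patented implementation: every name in it (`ExtractionRule`, `SharedResourcePool`, `resolve_rule`, `collect`, `run_by_priority`) is hypothetical, a rule is assumed to be a simple threshold filter on one column, "applicable" is assumed to mean the same data set and column, and a lower priority number is assumed to run first. An in-memory filter stands in for the generated extraction script.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical rule: keep rows whose `column` value exceeds `threshold`."""
    dataset: str
    column: str
    threshold: float


@dataclass
class SharedResourcePool:
    """Stores existing rules and the resource sets already collected under them."""
    rules: list = field(default_factory=list)
    collected: dict = field(default_factory=dict)

    def find_rule(self, dataset, column):
        # Assumption: a rule is "applicable" if it targets the same data set and column.
        for rule in self.rules:
            if rule.dataset == dataset and rule.column == column:
                return rule
        return None


def resolve_rule(pool, new_rule, keep_existing):
    """Return (final_rule, warning); warning is None when there is no conflict."""
    existing = pool.find_rule(new_rule.dataset, new_rule.column)
    if existing is None:
        pool.rules.append(new_rule)  # nothing applicable: the user's new rule is final
        return new_rule, None
    if existing == new_rule:
        return existing, None  # identical rule: reuse it
    warning = (f"conflict: existing threshold {existing.threshold}, "
               f"new threshold {new_rule.threshold}")
    if keep_existing:
        return existing, warning
    pool.rules.append(new_rule)
    return new_rule, warning


def collect(pool, rule, source_rows):
    """Return (rows, reused); reuse the pooled resource set when one exists."""
    if rule in pool.collected:
        return pool.collected[rule], True
    rows = [r for r in source_rows if r[rule.column] > rule.threshold]
    pool.collected[rule] = rows
    return rows, False


def run_by_priority(tasks):
    """tasks: iterable of (priority, name, algorithm, data); lower number runs first."""
    heap = [(prio, i, name, algo, data)
            for i, (prio, name, algo, data) in enumerate(tasks)]
    heapq.heapify(heap)  # the index i breaks ties without comparing functions
    results = []
    while heap:
        _, _, name, algo, data = heapq.heappop(heap)
        results.append((name, algo(data)))
    return results


# Usage: an existing rule conflicts with the new one, and the user keeps the existing rule.
pool = SharedResourcePool(rules=[ExtractionRule("traffic", "bytes", 100.0)])
rows = [{"bytes": 50.0}, {"bytes": 150.0}, {"bytes": 500.0}]
final_rule, warning = resolve_rule(
    pool, ExtractionRule("traffic", "bytes", 200.0), keep_existing=True)
data, reused = collect(pool, final_rule, rows)
results = run_by_priority([
    (2, "count", len, data),
    (1, "max", lambda d: max(r["bytes"] for r in d), data),
])
```

Keying the collected resource sets by the rule itself is one simple way to model the reuse described in claims 4 and 8: a second analysis task that resolves to the same final rule gets the stored set back instead of collecting the data again.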