TWI695327B

TWI695327B - Device and method for managing predictive models

Info

Publication number: TWI695327B
Application number: TW107144497A
Authority: TW
Inventors: 陳慧玲
Original assignee: 中華電信股份有限公司
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2020-06-01
Also published as: TW202022715A

Abstract

A method for managing predictive models is provided, including: storing a first predictive model and a second predictive model, wherein the second predictive model corresponds to a historical data stream; loading a data stream and recording a real outcome corresponding to the data stream; calculating a predictive outcome according to a current predictive model, wherein the current predictive model is the first predictive model; calculating an accuracy of the predictive outcome according to the real outcome; calculating a correlation between the data stream and the historical data stream in response to the accuracy being lower than a first threshold; and switching the current predictive model from the first predictive model to the second predictive model in response to the correlation being higher than a second threshold.

Description

Device and method for managing prediction models

The present invention relates to a method of using a replacement algorithm, and in particular to a device and method for managing prediction models.

With the development of artificial intelligence, prediction models produced by machine learning are being applied in a wide range of fields, such as image recognition, speech recognition, semantic analysis, language translation, and forecasting. A model produced by a good machine learning algorithm can achieve very high prediction (or classification) accuracy. However, when the input data of a prediction model changes substantially, a model generated from a fixed training data set often loses prediction accuracy because the input data differs too much from the label data. How to adjust the prediction model dynamically according to the input data, so as to maintain good prediction accuracy, is therefore one of the goals pursued by those skilled in the art.

The invention provides a device for managing prediction models, including a storage unit and a processor. The storage unit stores a plurality of modules. The processor is coupled to the storage unit, and accesses and executes the modules. The modules include a data dictionary module, an extraction module, a calculation module, and a selection module. The data dictionary module stores a first prediction model and a second prediction model, where the second prediction model corresponds to a historical data stream. The extraction module loads a data stream and records a real result corresponding to the data stream. The calculation module calculates a prediction result of the data stream according to a current prediction model, and calculates an accuracy of the prediction result according to the real result, where the current prediction model is the first prediction model. The selection module calculates a correlation between the data stream and the historical data stream in response to the accuracy being lower than a first threshold, and switches the current prediction model from the first prediction model to the second prediction model in response to the correlation being higher than a second threshold.

The invention provides a method for managing prediction models, including: storing a first prediction model and a second prediction model, where the second prediction model corresponds to a historical data stream; loading a data stream and recording a real result corresponding to the data stream; calculating a prediction result of the data stream according to a current prediction model, where the current prediction model is the first prediction model; calculating an accuracy of the prediction result according to the real result; calculating a correlation between the data stream and the historical data stream in response to the accuracy being lower than a first threshold; and switching the current prediction model from the first prediction model to the second prediction model in response to the correlation being higher than a second threshold.

Based on the above, when the accuracy of the prediction model currently in use is poor, the invention can select the best prediction model according to the correlation between the data stream and the historical data streams. In this way, however the input data stream changes, the invention can dynamically select the prediction model best suited to it, thereby maintaining high prediction accuracy. In addition, when the current prediction model is suitable for the current data stream, the invention can record the relationship between the current data stream and the current prediction model as a reference for selecting prediction models in the future.

To make the above features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

In the prior art, a static data dictionary created together with a database cannot be adjusted promptly as time passes, which impairs the predictive ability of intelligent-application machine learning models that use the static data dictionary. In view of this, the invention proposes a method that dynamically adjusts the data dictionary over time, thereby helping the user select a better prediction model.

FIG. 1 is a schematic diagram of a device 10 for managing prediction models according to an embodiment of the invention. The device 10 may include a processor 100 and a storage unit 300. The storage unit 300 may store a plurality of modules. The processor 100 is coupled to the storage unit 300, and accesses and executes the modules stored in the storage unit 300, where the modules include an extraction module 310, a calculation module 320, a selection module 330, a training module 340, and a data dictionary module 350.

The processor 100 may be, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, an application specific integrated circuit (ASIC), another similar component, or a combination of the above; the invention is not limited thereto.

The storage unit 300 may be, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid state drive (SSD), a similar component, or a combination of the above; the invention is not limited thereto.

The data dictionary module 350 may store multiple prediction models, each of which corresponds to a different historical data stream. Specifically, the data dictionary module 350 may record each associated prediction model in a model metadata table corresponding to that prediction model. For example, the model metadata table of prediction model B, which corresponds to the historical data stream with sequence number "201709", may be as shown in Table 1 below (but the invention is not limited thereto).

Table 1

The data dictionary module 350 may also record each associated pair of prediction model and historical data stream in a relationship table, as shown in Table 2 below. For example, suppose the data dictionary module 350 stores "prediction model A", "prediction model B", and "prediction model C". If the prediction result calculated for "historical data stream 201709" is most accurate when "prediction model B" is used, the data dictionary module 350 may associate "prediction model B" with "historical data stream 201709" and record the associated pair in the relationship table shown in Table 2. In Table 2, "prediction model B" corresponds to "historical data stream 201709", "prediction model A" corresponds to "historical data stream 201708", and "prediction model C" corresponds to "historical data stream 201707".

Table 2
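The association described for Table 2 can be sketched as a small lookup structure. This is only an illustration: the function name `associate_best_model`, the dictionary layout, and the accuracy values are hypothetical and do not come from the patent.

```python
# Illustrative (made-up) accuracies of each stored model on historical data stream 201709.
accuracy_on_201709 = {"A": 0.72, "B": 0.91, "C": 0.65}


def associate_best_model(stream_id, accuracies, relation_table):
    """Record the most accurate model for a historical stream in the relation table."""
    best_model = max(accuracies, key=accuracies.get)
    relation_table[stream_id] = best_model
    return best_model


# Relation table keyed by historical data stream, valued by prediction model.
relation_table = {"201708": "A", "201707": "C"}
best = associate_best_model("201709", accuracy_on_201709, relation_table)
print(best, relation_table)
```

A plain dictionary is used here for brevity; the text itself leaves the storage backend open.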

The data dictionary module 350 may be implemented using, for example, a cloud database such as HBase, Live, Cassandra, or MongoDB, or a relational database such as Teradata or SQL Server; the invention is not limited thereto.

The extraction module 310 may load a data stream and extract metadata from the data stream to generate a metadata table. For example, the metadata table of the data stream with sequence number "201710" may be as shown in Table 3 below (but the invention is not limited thereto). In addition, the extraction module 310 may record the real result corresponding to the data stream. For example, assume that "data stream 201710" relates to the set temperature of an air conditioner. If the user adjusts the set temperature of the air conditioner to "25.85 degrees", the extraction module 310 may take "25.85 degrees" as the real result and record it in the metadata table corresponding to "data stream 201710", as noted at "*4" in Table 3. In one embodiment, the data stream extracted by the extraction module 310 (and its corresponding metadata table) may be recorded in the data dictionary module 350 as a historical data stream (and its corresponding historical metadata table).

Table 3

The loading and extraction performed by the extraction module 310 may be implemented using programming languages such as shell script, Python, Java, R, or SparkSQL, or software tools such as Tableau or Highcharts; the invention is not limited thereto.

The calculation module 320 may calculate the prediction result of the data stream according to the current prediction model, and calculate the accuracy of the prediction result according to the real result. For example, assume that the real result of the set temperature corresponding to "data stream 201710" is "25.85 degrees" (as shown in Table 3), and that the current prediction model used by the calculation module 320 is "prediction model A". Suppose the calculation module 320, using "prediction model A" as the current prediction model, calculates a prediction result of "22 degrees" for "data stream 201710"; that is, the device 10 recommends that the user adjust the set temperature to "22 degrees". The calculation module 320 can then calculate the accuracy of the prediction made by "prediction model A" for "data stream 201710" from the real result "25.85 degrees" and the prediction result "22 degrees". The closer the prediction result is to the real result, the higher the accuracy of the prediction. The method of calculating the accuracy may be adjusted by the user of the device 10 as needed; the invention is not limited thereto.
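Because the accuracy measure is left to the user, the following is only one possible choice, sketched under the assumption that accuracy is defined as one minus the relative absolute error. The function name `relative_accuracy` is hypothetical.

```python
def relative_accuracy(predicted: float, actual: float) -> float:
    """Accuracy as 1 minus the relative absolute error, floored at 0 (an assumed definition)."""
    if actual == 0:
        raise ValueError("relative accuracy is undefined when the real result is 0")
    error = abs(predicted - actual) / abs(actual)
    return max(0.0, 1.0 - error)


# Values from the air-conditioner example: prediction 22 degrees, real result 25.85 degrees.
accuracy = relative_accuracy(predicted=22.0, actual=25.85)
print(round(accuracy, 4))
```

Under this particular definition the example yields an accuracy of roughly 0.85; a different definition (for example, one based on an absolute tolerance in degrees) would give a different value.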

The selection module 330 may calculate the correlation between the data stream and the historical data streams in response to the accuracy of the prediction being lower than a threshold. For example, assume that the accuracy threshold corresponding to "prediction model A" is 80%, as shown in Table 1. When the calculation module 320 uses "prediction model A" to calculate the accuracy from the prediction result "22 degrees" and the real result "25.85 degrees" corresponding to "data stream 201710", and the calculated accuracy is lower than the 80% accuracy threshold of "prediction model A", the selection module 330 may begin calculating the correlation between "data stream 201710" and each historical data stream. Taking the historical data streams recorded in Table 2 as an example, the selection module 330 may separately calculate the correlation between "data stream 201710" and "historical data stream 201707", the correlation between "data stream 201710" and "historical data stream 201708", and the correlation between "data stream 201710" and "historical data stream 201709".

In one embodiment, different prediction models may have the same or different accuracy thresholds. For example, if the accuracy threshold corresponding to "prediction model A" is 80%, the accuracy threshold corresponding to "prediction model B" may be equal to or different from 80%.

In one embodiment, the selection module 330 may calculate the correlation between the data stream and a historical data stream according to the metadata table of the data stream and the historical metadata table of the historical data stream. For example, the selection module 330 may calculate the correlation between "data stream 201710" and "historical data stream 201709" according to at least one of the data field type, the data field distribution type, and the zero-value ratio recorded in the metadata table of "data stream 201710", and at least one of the data field type, the data field distribution type, and the zero-value ratio recorded in the historical metadata table of "historical data stream 201709".

The correlation may be calculated from the distance between data groups. For example, the selection module 330 may calculate the correlation between the data stream and a historical data stream using Euclidean distance, Manhattan distance, Mahalanobis distance, cosine distance, correlation distance, information entropy, or the like; the invention is not limited thereto.
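Three of the listed distances can be sketched in plain Python as follows. The sketch assumes that each metadata table has already been reduced to a numeric vector (for instance, zero-value ratios per field); that encoding, and the `distance_to_correlation` mapping from a distance to a similarity score, are assumptions made here for illustration and are not specified by the patent.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_to_correlation(distance):
    """Map a non-negative distance to a score in (0, 1]; smaller distance, higher score."""
    return 1.0 / (1.0 + distance)


print(euclidean([0.0, 0.0], [3.0, 4.0]), manhattan([0.0, 0.0], [3.0, 4.0]))
```

Any monotonically decreasing mapping would serve the same purpose as `distance_to_correlation`; the one above is chosen only because it is bounded and simple.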

After calculating the correlation between the data stream and a historical data stream, the selection module 330 may, in response to the calculated correlation being higher than a threshold, switch the current prediction model from one prediction model to another prediction model that corresponds to the historical data stream. For example, if the calculation module 320 uses "prediction model A" as the current prediction model to calculate the prediction result of "data stream 201710", the selection module 330 may, in response to the correlation between "data stream 201710" and "historical data stream 201709" being higher than the correlation threshold, switch the current prediction model from "prediction model A" to "prediction model B", which corresponds to "historical data stream 201709". Because "data stream 201710" and "historical data stream 201709" are highly correlated, "prediction model B", which suits "historical data stream 201709", also suits "data stream 201710" better than the other prediction models do. Using "prediction model B" to calculate the prediction result of "data stream 201710" will therefore be more accurate than using "prediction model A".

In one embodiment, if multiple historical data streams have a correlation with the data stream that is higher than the correlation threshold, the selection module 330 may select, as the current prediction model, the prediction model corresponding to the historical data stream that has the highest correlation with the data stream.
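The highest-correlation rule of this embodiment might be sketched as follows. The function name `select_model`, the dictionary shapes, and the correlation values are hypothetical; returning `None` when no historical stream exceeds the threshold is also an assumption, since the text does not say what happens in that case.

```python
def select_model(correlations, relation_table, threshold):
    """Return the model of the most-correlated historical stream above the threshold.

    correlations: historical stream id -> correlation with the current data stream
    relation_table: historical stream id -> prediction model
    Returns None if no stream exceeds the threshold (assumed behaviour).
    """
    candidates = {sid: c for sid, c in correlations.items() if c > threshold}
    if not candidates:
        return None
    best_stream = max(candidates, key=candidates.get)
    return relation_table[best_stream]


relation_table = {"201707": "C", "201708": "A", "201709": "B"}
correlations = {"201707": 0.40, "201708": 0.75, "201709": 0.90}  # illustrative values
chosen = select_model(correlations, relation_table, threshold=0.70)
print(chosen)
```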

In one embodiment, if the accuracy of the prediction result calculated by the calculation module 320 from a prediction model and the real result of a data stream is higher than a threshold, the selection module 330 may, in response to the accuracy being higher than the threshold, record in the data dictionary module 350 that the data stream is associated with the prediction model. For example, if the accuracy of the prediction result calculated by the calculation module 320 from "prediction model B" and the real result of "data stream 201710" is higher than the accuracy threshold, the selection module 330 may record in the data dictionary module 350 that "data stream 201710" is associated with "prediction model B", so that Table 2 is modified as shown in Table 4 below.

Table 4
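The recording step can be illustrated with a short sketch. The function name `record_if_accurate`, the dictionary-based relation table, and the numeric values are assumptions for illustration only.

```python
def record_if_accurate(stream_id, model_name, accuracy, threshold, relation_table):
    """Associate the stream with the model only when the accuracy exceeds the threshold."""
    if accuracy > threshold:
        relation_table[stream_id] = model_name
        return True
    return False


# Relation table before the update: historical data stream -> prediction model.
relation_table = {"201709": "B", "201708": "A", "201707": "C"}
recorded = record_if_accurate("201710", "B", accuracy=0.93, threshold=0.80,
                              relation_table=relation_table)
print(recorded, relation_table)
```

Once recorded, "201710" can serve as a historical data stream in later correlation checks.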

The training module 340 may train a prediction model corresponding to a data stream according to feature data and label data of the data stream. Specifically, if the metadata table corresponding to a data stream includes many different kinds of data, the training module 340 may select particular data as the feature data for training the prediction model on the basis that the degree of association between that data and the label data is greater than a threshold, and then train the prediction model corresponding to the data stream according to the feature data and the label data.

Take the metadata table of "data stream 201710" recorded in Table 3 as an example. Given that the problem to be solved by the device 10 is "recommending a suitable set temperature", the training module 340 may designate "set temperature" as the label data for training the prediction model. The training module 340 may then designate "indoor temperature" and "outdoor temperature" as the feature data for training the prediction model, on the basis that the degree of association between the content of those fields and the content of the "set temperature" field is greater than a threshold. For example, the training module 340 may calculate the degree of association between "set temperature" and "indoor temperature", and between "set temperature" and "outdoor temperature", based on the "mean" values recorded for "set temperature", "indoor temperature", and "outdoor temperature". Because these "mean" values are very close to one another, the training module 340 can determine that both "indoor temperature" and "outdoor temperature" are highly associated with "set temperature". The training module 340 therefore designates "indoor temperature" and "outdoor temperature" as the feature data for training the prediction model.

The selection module 330 may calculate the degree of association using methods such as the coefficient of correlation, the coefficient of determination, regression analysis, or the least squares principle; the invention is not limited thereto.
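As one illustration of threshold-based feature selection, the sketch below uses the Pearson correlation coefficient computed over raw column values. This is an assumption: the patent's own example compares the recorded "mean" values, and the column names and sample readings here are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def select_features(columns, label_name, threshold):
    """Keep every non-label column whose |correlation| with the label exceeds the threshold."""
    label = columns[label_name]
    return [name for name, values in columns.items()
            if name != label_name and abs(pearson(values, label)) > threshold]


# Illustrative readings only; not taken from Table 3.
columns = {
    "set_temperature": [24.0, 25.0, 26.0, 27.0],
    "indoor_temperature": [25.0, 26.0, 27.0, 28.0],
    "outdoor_temperature": [30.0, 31.0, 33.0, 34.0],
    "humidity": [60.0, 40.0, 60.0, 40.0],
}
features = select_features(columns, "set_temperature", threshold=0.8)
print(features)
```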

In one embodiment, the training module 340 may determine the algorithm used by the prediction model according to the data field distribution types of the feature data and the label data. Specifically, if the data field distribution types of the label data and the feature data are continuous, the training module 340 may decide to train the prediction model with a regression algorithm. If, on the other hand, the data field distribution types of the label data and the feature data are discrete, the training module 340 may decide to train the prediction model with a classification algorithm. In one embodiment, if the label data of a data stream is not recorded in the metadata table of the data stream, the training module 340 may train the prediction model with a clustering algorithm.
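The algorithm choice above amounts to a small dispatch. The sketch below keys the decision on the label field's distribution type only, which is a simplification; the function name `choose_algorithm` and the string values are hypothetical, and the text does not say how mixed continuous and discrete fields should be handled.

```python
from typing import Optional


def choose_algorithm(label_distribution: Optional[str]) -> str:
    """Pick an algorithm family from the label field's distribution type.

    None means the metadata table records no label data for the stream.
    """
    if label_distribution is None:
        return "clustering"
    if label_distribution == "continuous":
        return "regression"
    if label_distribution == "discrete":
        return "classification"
    raise ValueError(f"unknown distribution type: {label_distribution!r}")


print(choose_algorithm("continuous"), choose_algorithm("discrete"), choose_algorithm(None))
```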

FIG. 2 is a schematic diagram of a method for managing prediction models according to an embodiment of the invention, where the method may be implemented by the device 10 shown in FIG. 1. In step S21, a first prediction model and a second prediction model are stored, where the second prediction model corresponds to a historical data stream. In step S22, a data stream is loaded and the real result corresponding to the data stream is recorded. In step S23, the prediction result of the data stream is calculated according to the current prediction model, where the current prediction model is the first prediction model. In step S24, the accuracy of the prediction result is calculated according to the real result. In step S25, it is determined whether the accuracy is lower than a first threshold. If the accuracy is lower than the first threshold, the method proceeds to step S26. If the accuracy is higher than or equal to the first threshold, the method returns to step S22 to load a new data stream and record the real result corresponding to the new data stream. In step S26, the correlation between the data stream and the historical data stream is calculated. In step S27, it is determined whether the correlation is higher than a second threshold. If the correlation is higher than the second threshold, the method proceeds to step S28. If the correlation is lower than or equal to the second threshold, the method returns to step S26 to calculate the correlation between the data stream and another historical data stream. In step S28, the current prediction model is switched from the first prediction model to the second prediction model.
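The decision part of the flow (steps S25 to S28) can be sketched as a single function. The name `manage_step`, the argument shapes, and the example values are hypothetical. Keeping the current model when no historical data stream exceeds the second threshold is an assumption, because the flow does not state what happens once every historical stream has been checked.

```python
def manage_step(current_model, accuracy, accuracy_threshold,
                correlations, relation_table, correlation_threshold):
    """One pass through S25-S28; returns the model to use next."""
    # S25: if the accuracy is acceptable, keep the current model (flow returns to S22).
    if accuracy >= accuracy_threshold:
        return current_model
    # S26/S27: check historical streams one at a time.
    for stream_id, correlation in correlations.items():
        if correlation > correlation_threshold:
            # S28: switch to the model associated with the matching historical stream.
            return relation_table[stream_id]
    # Assumed fallback: no historical stream is similar enough, so nothing changes.
    return current_model


relation_table = {"201707": "C", "201708": "A", "201709": "B"}
correlations = {"201707": 0.30, "201708": 0.50, "201709": 0.90}  # illustrative values
next_model = manage_step("A", accuracy=0.60, accuracy_threshold=0.80,
                         correlations=correlations, relation_table=relation_table,
                         correlation_threshold=0.70)
print(next_model)
```

This sketch follows the flowchart and stops at the first historical stream that exceeds the threshold, whereas the embodiment that handles several matching streams picks the one with the highest correlation.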

In summary, the invention can periodically store data streams and their metadata, and dynamically adjust the prediction model in use according to changes in the data stream. If the accuracy of the prediction model currently in use is poor, the invention can select the best prediction model according to the correlation between the data stream and the historical data streams. In addition, the training module of the invention can select the algorithm used by the prediction model based on the data field distribution type of the data stream, so the trained prediction model will have higher accuracy. Furthermore, when the current prediction model is suitable for the current data stream, the invention can record the relationship between the current data stream and the current prediction model as a reference for selecting prediction models in the future.

Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone with ordinary knowledge in the technical field may make minor changes and refinements without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the appended claims.

10: device for managing prediction models

100: processor

300: storage unit

310: extraction module

320: calculation module

330: selection module

340: training module

350: data dictionary module

S21, S22, S23, S24, S25, S26, S27, S28: steps

FIG. 1 is a schematic diagram of a device for managing prediction models according to an embodiment of the invention. FIG. 2 is a schematic diagram of a method for managing prediction models according to an embodiment of the invention.


Claims

A device for managing prediction models, comprising: a storage unit storing a plurality of modules; and a processor coupled to the storage unit, the processor accessing and executing the plurality of modules, wherein the plurality of modules comprise: a data dictionary module that stores a first prediction model and a second prediction model, wherein the second prediction model corresponds to a historical data stream; an extraction module that loads a data stream, records a real result corresponding to the data stream, and extracts a metadata table of the data stream, wherein the metadata table includes feature data, label data, and first data; a calculation module that calculates a prediction result of the data stream according to a current prediction model and calculates an accuracy of the prediction result according to the real result, wherein the current prediction model is the first prediction model; a selection module that calculates a correlation between the data stream and the historical data stream in response to the accuracy being lower than a first threshold, and switches the current prediction model from the first prediction model to the second prediction model in response to the correlation being higher than a second threshold; and a training module that selects the first data as the feature data based on a degree of association between the first data and the label data being greater than a third threshold, and trains the first prediction model corresponding to the data stream according to the feature data and the label data.

The device according to claim 1, wherein the metadata table includes a data field distribution type, and the training module determines, according to the data field distribution type, an algorithm used by the first prediction model.

The device according to claim 2, wherein the training module determines that the algorithm is a regression algorithm based on the data field distribution type being continuous, and determines that the algorithm is a classification algorithm based on the data field distribution type being discrete.

The device according to claim 1, wherein the selection module, in response to the accuracy being higher than a fourth threshold, records in the data dictionary module that the data stream is associated with the first prediction model.

The device according to claim 1, wherein the data dictionary module further stores a historical metadata table of the historical data stream, and the selection module further calculates the correlation according to the metadata table and the historical metadata table.

The device according to claim 5, wherein the selection module further calculates the correlation according to at least one of a data field type, a data field distribution type, and a zero-value ratio in the metadata table.

The device according to claim 1, wherein the first threshold corresponds to the first prediction model.

A method for managing prediction models, comprising: storing a first prediction model and a second prediction model, wherein the second prediction model corresponds to a historical data stream; loading a data stream, recording a real result corresponding to the data stream, and extracting a metadata table of the data stream, wherein the metadata table includes feature data, label data, and first data; calculating a prediction result of the data stream according to a current prediction model, wherein the current prediction model is the first prediction model; calculating an accuracy of the prediction result according to the real result; calculating a correlation between the data stream and the historical data stream in response to the accuracy being lower than a first threshold; switching the current prediction model from the first prediction model to the second prediction model in response to the correlation being higher than a second threshold; and selecting the first data as the feature data based on a degree of association between the first data and the label data being greater than a third threshold, and training the first prediction model corresponding to the data stream according to the feature data and the label data.