TWM622331U

TWM622331U - System and device for risk prediction therefor

Info

Publication number: TWM622331U
Application number: TW110213018U
Authority: TW
Inventors: 李政旺; 黃律翔; 賴辰瑜
Original assignee: 關貿網路股份有限公司
Priority date: 2021-11-04
Filing date: 2021-11-04
Publication date: 2022-01-11

Abstract

A system and a device for risk prediction are provided. The system and the device include: generating a plurality of models by using a plurality of data sets of a training group and a plurality of algorithms after training, wherein each data set includes a plurality of training data and each model is trained and generated by using one of the data sets and one of the algorithms; inputting each validation data of a validation group into each model in order to calculate a score of each data set based on an output value of each model for each validation data; selecting the data set with the highest score as the optimal set; and predicting a risk of a real-time data by using the models of the optimal set.

Description

Risk prediction system and device

The present creation relates to risk prediction technology based on artificial intelligence, and in particular to a system and a device for risk prediction.

For risk assessment, for example in certain product inspection fields, the original design relied on historical records: logical conditions were set to judge the level of risk and thereby determine the sampling inspection probability. For example, under normal circumstances the sampling probability is low, such as below 20%; if a sampled item has failed inspection in the historical record, the sampling probability is raised to between 20% and 50%; if a sampled item fails again, every batch is inspected, i.e. the sampling probability is 100%.

Therefore, an automated risk prediction technology is needed that automatically predicts the risk of real-time data based on various risk-related data, so as to improve efficiency and reduce cost.

The above technology can be used in most fields of risk prediction. Automatic risk prediction using artificial intelligence models already exists; however, existing model predictions mostly apply a single algorithm and ignore the possible influence of the data combination used to train the model. The technology therefore still suffers from low efficiency, low accuracy, and high cost.

To solve the above problems, the present creation provides a risk prediction system including a training unit, a score unit, a selection unit, and a prediction unit. The training unit generates a plurality of models by training with a plurality of data combinations in a training group and a plurality of algorithms, wherein each data combination includes a plurality of training data and each model is generated by training with one of the data combinations and one of the algorithms. The score unit is coupled to the training unit; after each piece of validation data in a validation group is input into each model, it calculates a score for each data combination from the output value of each model for each piece of validation data. The selection unit is coupled to the score unit and selects the data combination with the highest score as the optimal combination. The prediction unit is coupled to the selection unit and uses the models of the optimal combination to predict the risk of real-time data.

The present creation further provides a risk prediction device having a processor and storing instructions for executing the functions of the above units.

The present creation ensembles models from multiple different algorithms and evaluates multiple different data combinations to find the optimal data combination for risk prediction. This avoids the prediction bias of using only a single algorithm and addresses the neglected influence of the data combination itself, thereby reducing the cost of risk response or handling while maintaining the hit rate of risk prediction and improving overall efficiency.

300: Risk prediction system

310: Data unit

320: Training unit

330: Score unit

340: Selection unit

350: Prediction unit

S101~S112, S201~S205: Method steps

FIG. 1 is a flow chart of the model building procedure of a risk prediction method.

FIG. 2 is a flow chart of the prediction procedure of a risk prediction method.

FIG. 3 is a block diagram of a risk prediction system.

The following describes implementations of the present creation by way of specific embodiments; those of ordinary skill in the art can readily understand other advantages and effects of the present creation from the content disclosed in this specification.

In step S101, data for the field targeted by the risk prediction method are collected and consolidated. The field may be a product, commodity, service, or any other field involving risk. For example, if the field is electronic products, the risk may be whether an electronic product complies with safety regulations or other regulations; if the field is loan services provided by financial institutions, the risk may be whether a loan applicant can repay normally.

In addition, the data may include various data related to risk in the field, such as relevant data in the internal information systems of various organizations, customer data, information on the above products, commodities, services, or other fields, and external open data such as meteorological data. The organizations may be government agencies, private organizations such as foundations and incorporated foundations, or companies and firms involved in the field. The customers may be customers of those organizations or customers using the risk prediction method.

For example, in the field of whether a financial institution grants personal credit, the data may include an individual's gender, age, education level, monthly income, annual income, and area of residence, as well as the education level and number of public facilities of that area, and so on.

Furthermore, in step S101, multiple data sources are matched: data about the same object from each data source are merged into a single record, and all records are then consolidated into a unified format. For example, in the financial credit field each object may be a loan application; in the electronic products field each object may be a batch of electronic products to be inspected that was imported or shipped from the factory together, and so on.
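The matching of multiple data sources in step S101 can be sketched as follows. This is only an illustration under stated assumptions: the key name `object_id`, the field names, and the helper `merge_sources` are hypothetical and do not come from the specification.

```python
from collections import defaultdict


def merge_sources(sources, key="object_id"):
    """Merge records about the same object from several sources into one record.

    `key` is a hypothetical identifier shared by all sources.
    """
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            merged[record[key]].update(record)
    # Unified format: every record carries every field; missing values become None.
    fields = sorted({name for rec in merged.values() for name in rec})
    return [{name: rec.get(name) for name in fields} for _, rec in sorted(merged.items())]


# Hypothetical internal and external sources describing loan applications.
internal = [{"object_id": "A1", "monthly_income": 50000}]
external = [{"object_id": "A1", "region": "Taipei"}, {"object_id": "B2", "region": "Tainan"}]
records = merge_sources([internal, external])
```

Filling absent fields with `None` is one possible choice of unified format; the specification does not prescribe how missing values are represented.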

In step S102, the data are preprocessed, for example by correcting the Simplified Chinese form of a character to its Traditional Chinese form, or by simplifying the residential areas of Taiwan's 368 townships, cities, and districts into high-, medium-, and low-risk areas. In this step, the various pieces of information in the data that are needed to train the artificial intelligence models are also defined as features, i.e. variables that may influence the risk prediction result.
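A minimal sketch of the two preprocessing examples in step S102 is given below. The lookup tables, the district names, and the field names `note`, `region`, and `region_tier` are assumptions made for illustration; a real table would cover all 368 districts and a full character mapping.

```python
# Hypothetical lookup tables, deliberately tiny.
REGION_TIER = {"District A": "high", "District B": "medium", "District C": "low"}
SIMPLIFIED_TO_TRADITIONAL = str.maketrans({"风": "風", "险": "險"})


def preprocess(record):
    """Return a cleaned copy of one record; the input record is not modified."""
    cleaned = dict(record)
    # Normalize Simplified Chinese characters to their Traditional forms.
    cleaned["note"] = record.get("note", "").translate(SIMPLIFIED_TO_TRADITIONAL)
    # Collapse the residential district into a coarse risk tier.
    cleaned["region_tier"] = REGION_TIER.get(record.get("region"), "unknown")
    return cleaned
```

For example, `preprocess({"region": "District A", "note": "风险"})` yields a record whose `note` is `"風險"` and whose `region_tier` is `"high"`.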

In step S103, the data processed in steps S101 and S102 are split into a training group, a validation group, and a test group. The training group is used to train the models, the validation group is used to find the optimal data combination, and the test group is used to evaluate the benefit of the risk prediction method. In one embodiment, the three groups are distinguished by the usage time of each record (i.e. the time, such as a specific date, at which the record was used for a risk-related inspection or evaluation in the historical record); for example, the training group is the earliest, the validation group is in the middle, and the test group is the latest. The training group usually contains the most data; for example, the ratio of the amounts of data in the three groups may be 8:1:1. In addition, to emphasize or distinguish the data in each group, the data in the training, validation, and test groups may be called training data, validation data, and test data, respectively.
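The chronological split of step S103 might look like the sketch below. The field name `used_on` and the function name are hypothetical; the 8:1:1 default follows the example ratio in the text.

```python
def split_by_time(records, ratios=(8, 1, 1), time_key="used_on"):
    """Split records into (training, validation, test) groups, earliest first.

    `time_key` names a hypothetical field holding the usage time in a sortable
    form such as an ISO date string.
    """
    ordered = sorted(records, key=lambda r: r[time_key])
    total = sum(ratios)
    n = len(ordered)
    train_end = n * ratios[0] // total
    valid_end = n * (ratios[0] + ratios[1]) // total
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]
```

Sorting before slicing keeps every training record earlier than every validation record, and every validation record earlier than every test record, provided the usage times are distinct.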

The training data in the training group can be divided into a plurality of data combinations according to their usage time and resampling (oversampling) method, for example the four data combinations in Table 1 below. These correspond to the four combinations formed by two data usage times (starting from 2015 and starting from 2018) and two resampling methods (proportional enlargement and the Synthetic Minority Oversampling Technique, SMOTE).

Because data come from different sources, data completeness and reliability may cause different data combinations to yield different results with different algorithms, so data usage time is taken as one dimension of the data combinations. For example, an organization using the risk prediction method may have collected its own data since 2015, while most of the external data it obtained became relatively complete only after 2018; the data can therefore be separated by usage time into data starting from 2015 and data starting from 2018 (the data starting from 2015 include the data starting from 2018). Meanwhile, in the target variable, high risk is usually the minority; for example, the ratio of low-risk to high-risk records may be 10:1 or 8:1. Imbalanced-data processing is therefore needed to raise the proportion of high-risk data, for example to 5:5 or 7:3, so that the algorithms learn more effectively. The different resampling methods for handling imbalanced data, such as proportional enlargement and SMOTE, form the other dimension, giving the four training-group data combinations shown in Table 1.
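The rebalancing described above can be illustrated with a simple replication-based oversampler. This sketch assumes that "proportional enlargement" means replicating existing high-risk records until a target share is reached; the specification does not define the method in detail, and SMOTE, which synthesizes new minority records, is not shown. The field name `high_risk` and the function name are hypothetical.

```python
import math
import random


def oversample_by_replication(records, label_key="high_risk", minority_share=0.3, seed=0):
    """Replicate high-risk records until they make up at least `minority_share`.

    `minority_share` must lie strictly between 0 and 1; 0.3 corresponds to a
    low-risk to high-risk ratio of 7:3.
    """
    rng = random.Random(seed)
    minority = [r for r in records if r[label_key]]
    majority = [r for r in records if not r[label_key]]
    if not minority:
        return list(records)
    # Smallest minority count m with m / (m + len(majority)) >= minority_share.
    target = math.ceil(minority_share * len(majority) / (1 - minority_share))
    extra = [rng.choice(minority) for _ in range(max(0, target - len(minority)))]
    return majority + minority + extra
```

The random choice of which records to replicate is the kind of random component that motivates training several models per data combination and algorithm.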

The present creation is not limited to four training-group data combinations. For example, in another embodiment the training group may contain T*O data combinations, corresponding to the T*O different combinations formed by T different data usage times and O different resampling methods, and each data combination includes a plurality of training data, where T and O are both integers greater than one. In other words, each data combination consists of the training data starting from one of the T data usage times and is generated by one of the O resampling methods.
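Enumerating the T*O data combinations is a Cartesian product of the two dimensions. The sketch below uses the example values from the text (T = 2, O = 2); the variable names and the string labels for the resampling methods are illustrative only.

```python
from itertools import product

start_years = [2015, 2018]                      # T = 2 data usage times
resampling_methods = ["proportional", "SMOTE"]  # O = 2 resampling methods

# One (start_year, method) pair per data combination, T*O pairs in total.
data_combinations = list(product(start_years, resampling_methods))
```

Adding a third start year or a third resampling method to either list extends the enumeration without further changes.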

In step S104, feature screening is performed on the training group to determine the features included in model training. Feature screening removes features that are useless for model training; for example, if the risk of a product is unrelated to the capital of the manufacturing company, the company's capital is removed and not used in subsequent model training.

In step S105, a plurality of models are generated by training with the plurality of data combinations in the training group and a plurality of algorithms. In practice, because resampling has a random component, N models must be trained for each data combination and each algorithm to obtain reasonably stable prediction results.

In addition, different training data are randomly generated for each of the N models. Therefore, with S data combinations (S=T*O) and A algorithms, this step trains a total of S*A*N different models, where A and N are both integers greater than one. For example, in one embodiment S equals 4 (the four data combinations shown in Table 1), A equals 5, the five algorithms being logistic regression, the C5.0 decision tree, classification and regression tree (CART), the naive Bayes classifier, and random forest, and N equals 10, so step S105 trains a total of 200 different models.
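The count of S*A*N models can be checked by enumerating one training run per model. The sketch below only plans the runs and does not train anything; the labels and the idea of using the repeat index as a resampling seed are assumptions for illustration.

```python
from itertools import product


def plan_training_runs(data_combinations, algorithms, n_repeats):
    """Return one (combination, algorithm, seed) entry per model to train."""
    return [
        (combo, algo, seed)
        for combo, algo in product(data_combinations, algorithms)
        for seed in range(n_repeats)  # a distinct seed gives distinct resampled data
    ]


combos = ["2015+proportional", "2015+SMOTE", "2018+proportional", "2018+SMOTE"]
algorithms = ["logistic_regression", "C5.0", "CART", "naive_bayes", "random_forest"]
runs = plan_training_runs(combos, algorithms, n_repeats=10)  # S=4, A=5, N=10
```

With the example values, `len(runs)` equals 4 * 5 * 10 = 200, matching the total stated in the text.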

In step S106, the next piece of validation data in the validation group is read; if this step is executed for the first time, the first piece of validation data in the validation group is read.

In step S107, the validation data read is input into every model trained in step S105 to produce the output value of each model of each data combination for that validation data. For example, if S equals 4, A equals 5, and N equals 10, each of the 4 data combinations has 50 models across the 5 algorithms, and every one of these models receives the validation data and produces a corresponding output value.

In step S108, the prediction result of each algorithm of each data combination for the validation data is produced from the output values of the models of each data combination for that validation data. For each data combination and each algorithm, the prediction result is the average of the output values, for that validation data, of the models corresponding to that data combination and that algorithm. For example, if S equals 4, A equals 5, and N equals 10, each of the 4 data combinations has 50 models across the 5 algorithms, and the prediction result of each algorithm of each data combination is the average of the 10 output values produced by that algorithm's 10 models for the validation data.
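The averaging in step S108 can be sketched as a group-by over (data combination, algorithm). The dictionary layout and the function name below are assumptions for illustration, not the structure used in the specification.

```python
from collections import defaultdict
from statistics import mean


def average_by_algorithm(outputs):
    """Average model outputs per (combination, algorithm) for one validation record.

    `outputs` maps (combination, algorithm, model_index) to that model's output value.
    """
    grouped = defaultdict(list)
    for (combo, algo, _model_index), value in outputs.items():
        grouped[(combo, algo)].append(value)
    return {key: mean(values) for key, values in grouped.items()}
```

With N = 10, each averaged value summarizes the 10 models trained for one data combination and one algorithm, which dampens the randomness introduced by resampling.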

In step S109, a vote is taken on the prediction results of the algorithms of each data combination for the validation data, producing the voting result of each data combination for that validation data.

Specifically, for each data combination, if more than half of the prediction results of its algorithms for the validation data are greater than the corresponding thresholds, the voting result of that data combination for the validation data is high risk; otherwise it is low risk. For example, if S equals 4 and A equals 5, then for each of the 4 data combinations, if at least 3 of the 5 prediction results of its 5 algorithms are greater than the corresponding thresholds, the voting result of that data combination for the validation data is high risk; otherwise it is low risk. The thresholds corresponding to the algorithms may be the same or different. In another embodiment, the voting result may take at least two risk levels depending on the prediction results.
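The majority vote with per-algorithm thresholds can be sketched as follows; the function name and the dictionary layout are assumptions, and only the two-level (high/low) embodiment is shown.

```python
def vote(predictions, thresholds):
    """Return "high" if more than half of the predictions exceed their thresholds.

    Both arguments map an algorithm name to a number; each algorithm may have
    its own threshold.
    """
    exceed = sum(1 for algo, value in predictions.items() if value > thresholds[algo])
    return "high" if exceed > len(predictions) / 2 else "low"
```

With five algorithms, three predictions above their thresholds give `"high"` and two give `"low"`. With an even number of algorithms, an exact tie counts as `"low"`, because a tie is not more than half.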

In step S110, it is checked whether all validation data in the validation group have been processed. If unprocessed validation data remain, the flow returns to step S106; otherwise the flow proceeds to step S111.

In step S111, for each data combination, the score of that data combination is calculated from its voting results for every piece of validation data.

Specifically, each data combination is evaluated with a confusion matrix based on the historical record and the voting result of every piece of validation data. For example, the historical record of the inspection or evaluation of the object of each piece of validation data may be pass or fail, and pass and fail correspond to the low-risk and high-risk voting results, respectively. By using the confusion matrix to analyze the relationship between the historical records and the voting results, the accuracy, recall, specificity, positive predictive value, negative predictive value, and finally the F1 score (F1 measure) of each data combination can be calculated, as shown for example in Table 2 below, where the F1 score is the score of the data combination.
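The confusion-matrix metrics named above follow their standard definitions, with the F1 score being the harmonic mean of the positive predictive value and the recall. The sketch below treats "fail" in the historical record as the positive class; the function names and the choice to return 0.0 for an undefined ratio are assumptions.

```python
def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(actual_high, voted_high):
    """Compute metrics from parallel lists of booleans.

    actual_high[i] is True if record i failed inspection in the historical record;
    voted_high[i] is True if the data combination voted record i high risk.
    """
    pairs = list(zip(actual_high, voted_high))
    tp = sum(1 for a, v in pairs if a and v)
    fp = sum(1 for a, v in pairs if not a and v)
    fn = sum(1 for a, v in pairs if a and not v)
    tn = sum(1 for a, v in pairs if not a and not v)
    recall = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "recall": recall,
        "specificity": _ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * ppv * recall, ppv + recall),
    }
```

The `"f1"` entry is the value used as the score of the data combination.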

In step S112, the data combination with the highest score is selected as the optimal combination. For example, of the two fields shown in Table 2, in field X the data combination corresponding to 2015 and proportional enlargement has the highest score, 0.49, so it is selected as the optimal combination for field X. In field Y, the data combination corresponding to 2018 and proportional enlargement has the highest score, 0.45, so it is selected as the optimal combination for field Y. The optimal combination includes A*N different models in total. The evaluation results in Table 2 show that, with the same set of algorithms but different training-group data combinations, the effect differs between fields.
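Selecting the optimal combination in step S112 amounts to taking the highest-scoring key. In the sketch below the scores are placeholder values chosen for illustration; they are not the figures of Table 2, and the function name is hypothetical.

```python
def select_optimal(scores):
    """Return the data combination whose score is highest.

    On a tie, the combination that appears first in `scores` is returned.
    """
    return max(scores, key=scores.get)


# Placeholder scores, not taken from Table 2.
example_scores = {
    (2015, "proportional"): 0.40,
    (2015, "SMOTE"): 0.35,
    (2018, "proportional"): 0.42,
    (2018, "SMOTE"): 0.38,
}
optimal = select_optimal(example_scores)
```

Because the selection is repeated per field, each field can end up with a different optimal combination, as the text notes for fields X and Y.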

FIG. 2 is a flow chart of the prediction procedure of a risk prediction method; the prediction procedure follows the model building procedure of FIG. 1.

In step S201, real-time data are obtained. The real-time data are one record of risk-related data for an object to be predicted in the field to which the risk prediction method is applied, and their format and content correspond to those of the training data and validation data described above. For example, in the financial credit field, the object may be a loan application.

In step S202, the real-time data are preprocessed; the preprocessing in this step is the same as in step S102.

In step S203, the models of the optimal combination selected in step S112 are used to predict the risk of the real-time data.

Specifically, the real-time data are input into each model of the optimal combination to produce the output value of each model for the real-time data. Then, for each algorithm, the average of the output values of that algorithm's models in the optimal combination is calculated as the prediction result of that algorithm for the real-time data. A vote is then taken on the prediction results of the algorithms to predict the risk of the real-time data: if more than half of the prediction results are greater than the threshold, the real-time data are high risk; otherwise they are low risk.
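Step S203 can be sketched end to end by treating each trained model as a callable that maps a record to a numeric output. That representation, the function name, and the dictionary layout are assumptions made for illustration.

```python
from statistics import mean


def predict_risk(record, models_by_algorithm, thresholds):
    """Predict "high" or "low" risk for one real-time record.

    `models_by_algorithm` maps an algorithm name to the list of its models in the
    optimal combination; each model is a callable returning a number.
    """
    # Per-algorithm prediction result: the average output of that algorithm's models.
    predictions = {
        algo: mean([model(record) for model in models])
        for algo, models in models_by_algorithm.items()
    }
    # Majority vote across algorithms.
    exceed = sum(1 for algo, value in predictions.items() if value > thresholds[algo])
    return "high" if exceed > len(predictions) / 2 else "low"
```

In the example embodiment, `models_by_algorithm` would hold 5 algorithms with 10 models each, i.e. the A*N = 50 models of the optimal combination.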

In step S204, the voting result (high risk or low risk) of the real-time data and related records are written to a database.

In step S205, it is checked whether the voting result is high risk. If so, an alert is issued; for example, a text message or e-mail may be sent as an alert to notify management personnel, or the alert may be issued by sound, voice, and/or light. Otherwise, a general notification is issued or no action is taken.
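Steps S204 and S205 can be sketched with an in-memory SQLite database and a pluggable notifier. The table name `predictions`, its columns, and the `notify` callback are hypothetical; the specification names neither a database product nor a schema.

```python
import sqlite3


def record_and_alert(conn, object_id, result, notify):
    """Store one voting result and call `notify` when the result is high risk.

    `notify` is any callable taking a message string, for example a function
    that sends a text message or an e-mail. Returns True if an alert was issued.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS predictions (object_id TEXT, result TEXT)")
    conn.execute("INSERT INTO predictions VALUES (?, ?)", (object_id, result))
    conn.commit()
    if result == "high":
        notify(f"High-risk alert for {object_id}")
        return True
    return False
```

Passing the notifier as a parameter keeps the choice of channel (text message, e-mail, sound, or light) outside the recording logic.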

Various responses to the voting result of the real-time data can be made according to the characteristics of the field. For example, in the financial credit field, if the voting result is high risk, the loan application corresponding to the real-time data can be reviewed strictly; otherwise it can be reviewed leniently.

In addition, the models of the optimal combination can also be used to perform risk prediction on the test data in the test group to evaluate the benefit of the risk prediction method. For example, for product sampling inspection, one can evaluate how much the method reduces sampling inspection cost and whether it raises the hit rate of finding problems in the sampled products.

In another embodiment, the present creation further provides a risk prediction device that is applied to or installed in a computing device or computer. For example, the risk prediction device may have a memory, floppy disk, hard disk, or optical disc. In addition, the risk prediction device stores instructions for executing the risk prediction method or the functions of the units shown in FIG. 3.

In another embodiment, the present creation further provides a risk prediction system, such as the risk prediction system 300 shown in FIG. 3. The risk prediction system 300 includes a data unit 310, a training unit 320, a score unit 330, a selection unit 340, and a prediction unit 350. These units are coupled in series in the above order, and each unit may be implemented in hardware, firmware, or software.

The risk prediction system 300 executes the risk prediction method described above. The data unit 310 executes steps S101 to S104 in FIG. 1: data collection and consolidation, data preprocessing, splitting the data into the training, validation, and test groups, and feature screening. The training unit 320 executes step S105: training to generate the models. The score unit 330 executes steps S106 to S111: reading the next piece of validation data, producing the output value of each model of each data combination for the validation data, producing the prediction result of each algorithm of each data combination for the validation data, producing the voting result of each data combination for the validation data, checking whether all validation data have been processed, and calculating the score of each data combination. The selection unit 340 executes step S112: selecting the optimal combination. The prediction unit 350 executes steps S201 to S205 in FIG. 2: obtaining real-time data, data preprocessing, predicting the risk of the real-time data with the models of the optimal combination, writing to the database, and issuing an alert.

In summary, through a robust ensemble architecture of models from multiple different algorithms and a comprehensive cross-evaluation of multiple different data combinations, the present creation finds the optimal data combination for risk prediction. This improves model prediction performance, avoids the prediction bias of using only a single algorithm, and addresses the neglected influence of the data combination itself, thereby reducing the cost of risk response or handling while providing accurate risk prediction and improving overall efficiency.

The above embodiments merely illustrate the principles and effects of the present creation and are not intended to limit it. Anyone of ordinary skill in the art may modify and change the above embodiments without departing from the spirit and scope of the present creation. Therefore, the scope of protection of the present creation shall be as set out in the claims below.

Claims

1. A risk prediction system, comprising:

a training unit for generating a plurality of models by training with a plurality of data combinations in a training group and a plurality of algorithms, wherein each of the data combinations includes a plurality of training data, and each of the models is generated by training with one of the data combinations and one of the algorithms;

a score unit, coupled to the training unit, for inputting each piece of validation data in a validation group into each of the models and calculating a score of each of the data combinations according to the output value of each of the models for each piece of validation data;

a selection unit, coupled to the score unit, for selecting the data combination with the highest score among the data combinations as an optimal combination; and

a prediction unit, coupled to the selection unit, for predicting a risk of real-time data by using the models corresponding to the optimal combination.

2. The risk prediction system of claim 1, wherein each of the data combinations is composed of the training data starting from one of a plurality of data usage times and is generated according to one of a plurality of resampling methods.

3. The risk prediction system of claim 1, wherein, for each of the data combinations and each of the algorithms, the models include a plurality of sub-models corresponding to that data combination and that algorithm.

4. The risk prediction system of claim 1, wherein the score unit is further configured to:

generate, according to the output value of each of the models for each piece of validation data, a prediction result of each of the algorithms of each of the data combinations for each piece of validation data;

for each piece of validation data, vote on the prediction results of the algorithms of each of the data combinations for the validation data to generate a voting result of each of the data combinations for the validation data; and

for each of the data combinations, calculate the score of the data combination according to the voting results of the data combination for the validation data.

5. The risk prediction system of claim 4, wherein, for each of the data combinations, each of the algorithms, and each piece of validation data, the prediction result of the algorithm of the data combination for the validation data is the average of the output values, for the validation data, of the sub-models corresponding to the data combination and the algorithm.

6. The risk prediction system of claim 4, wherein, for each of the data combinations and each piece of validation data, if more than half of the prediction results of the algorithms of the data combination for the validation data are greater than a threshold, the voting result of the data combination for the validation data is high risk; otherwise it is low risk.

7. The risk prediction system of claim 1, wherein the prediction unit is further configured to:

input the real-time data into each of the models of the optimal combination to generate prediction results of the algorithms for the real-time data according to the output values of the models of the optimal combination for the real-time data; and

vote on the prediction results of the algorithms for the real-time data to predict the risk of the real-time data.

8. The risk prediction system of claim 7, wherein if more than half of the prediction results of the algorithms for the real-time data are greater than a threshold, the real-time data are high risk; otherwise the real-time data are low risk.

9. A risk prediction device having a processor and storing instructions to execute the functions of the units of the risk prediction system of any one of claims 1 to 8.