TWI808785B - Data splitting system and method for validating machine learning - Google Patents
Data splitting system and method for validating machine learning
- Publication number
- TWI808785B (application TW111121654A)
- Authority
- TW
- Taiwan
- Prior art keywords
- blood pressure
- subjects
- data
- pressure data
- categories
- Prior art date
Landscapes
- Measuring Pulse, Heart Rate, Blood Pressure Or Blood Flow (AREA)
- Hardware Redundancy (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
Description
The present invention relates to machine learning, and in particular to a data splitting system and method for validating machine learning.
Validation strategies are indispensable when tuning the hyperparameters of machine learning and deep learning models. For training and validation, data splitting or cross-validation (CV) techniques divide the data into a training dataset, a validation dataset, and a test dataset. These techniques indicate how accurate and how generalizable a model is on new, independent data. With an inappropriate splitting strategy, however, they can also cause over-fitting or bias.
Existing cross-validation methods such as leave-one-out, holdout, and K-fold CV are common practice in standard machine learning problems such as regression. Blood pressure estimation, however, is not a standard regression problem. First, the data points of a blood pressure dataset are not fully independent of one another: such datasets usually contain many segments from the same recording or subject, and those segments may carry very similar information. Second, blood pressure estimation has two targets, systolic blood pressure (SBP) and diastolic blood pressure (DBP), which makes it closer to a multi-task or multi-output regression problem. Finally, the SBP and DBP distributions are usually skewed; because extreme blood pressures are rare, blood pressure estimation is an imbalanced regression problem.
These differences must be taken into account to partition the data correctly during cross-validation. For example, random partitioning may place segments from the same subject in the training, validation, and test datasets at the same time. Because segments from the same subject carry similar information, the sets are no longer independent and the results become overly optimistic. In addition, because the distributions are skewed and imbalanced, random partitioning may produce different distributions or data shift between the sets, and may even leave the test dataset without minority cases.
In view of this, the present invention proposes a data splitting system and method for validating machine learning that avoid the above problems. Cross-validation for the blood pressure estimation task keeps the data of the same subject in the same set and, as far as possible, preserves the SBP and DBP distributions of the original dataset in each resulting dataset. In other words, in the datasets produced by the present invention, the trends shown by the different datasets are highly similar, for both SBP and DBP.
A data splitting method for validating machine learning according to an embodiment of the present invention is applicable to a blood pressure dataset. The blood pressure dataset includes a plurality of blood pressure data for each of a plurality of subjects, and the types of blood pressure data include systolic blood pressure and diastolic blood pressure. The method includes the following steps, performed by a computing device: dividing the measurement range of the systolic blood pressure into a plurality of first intervals, and dividing the measurement range of the diastolic blood pressure into a plurality of second intervals; generating a plurality of categories from the first intervals and the second intervals, each category including one first interval and one second interval; determining and recording how the blood pressure data of each subject match the categories, thereby generating a plurality of matching statuses corresponding to the subjects, where each matching status includes a plurality of flags corresponding to the categories, each flag has one of a first state and a second state, the first state indicates that the blood pressure data match the category corresponding to the flag, and the second state indicates that the blood pressure data do not match the category corresponding to the flag; and executing an allocation procedure according to the matching statuses to allocate the subjects into a plurality of sets.
A data splitting system for validating machine learning according to an embodiment of the present invention includes a measuring device, a storage device, and a computing device. The measuring device generates a blood pressure dataset, where the blood pressure dataset includes a plurality of blood pressure data for each of a plurality of subjects, and the types of blood pressure data include systolic blood pressure and diastolic blood pressure. The storage device is communicatively connected to the measuring device to receive and store the blood pressure dataset, and also stores a computer-readable recording medium. The computing device is communicatively connected to the storage device and runs the computer-readable recording medium to perform the following steps: dividing the measurement range of the systolic blood pressure into a plurality of first intervals, and dividing the measurement range of the diastolic blood pressure into a plurality of second intervals; generating a plurality of categories from the first intervals and the second intervals, each category including one first interval and one second interval; determining and recording how the blood pressure data of each subject match the categories, thereby generating a plurality of matching statuses corresponding to the subjects, where each matching status includes a plurality of flags corresponding to the categories, each flag has one of a first state and a second state, the first state indicates that the blood pressure data match the category corresponding to the flag, and the second state indicates that the blood pressure data do not match the category corresponding to the flag; and executing an allocation procedure according to the matching statuses to allocate the subjects into a plurality of sets.
In summary, the data splitting system and method for validating machine learning proposed by the present invention have the following contributions or effects. First, the proposed method keeps all samples from the same subject in the same set (training dataset, validation dataset, or test dataset). Second, the proposed method achieves similar blood pressure distributions across the training, validation, and test datasets. Third, the proposed method preserves the blood pressure distributions of the different datasets when several constraints are present (for example SBP and DBP, and optionally pulse rate, heartbeat, or any other blood-pressure-related constraint that affects model training); that is, the SBP distributions of the training/validation datasets and the test dataset are similar, and at the same time the DBP distributions of these datasets are similar.
The above description of the disclosure and the following description of the embodiments are intended to demonstrate and explain the spirit and principle of the present invention, and to provide further explanation of the scope of the patent claims.
100: data splitting system
10: measuring device
30: storage device
50: computing device
S1~S6, S61~S69: steps
D0: blood pressure dataset
D1: training dataset
D2: validation dataset
D3: test dataset
FIG. 1 is a schematic diagram of the present invention applied to machine learning; FIG. 2 is a block diagram of a data splitting system for validating machine learning according to an embodiment of the present invention; FIG. 3 is a flowchart of a data splitting method for validating machine learning according to an embodiment of the present invention; FIG. 4 is a schematic diagram of a blood pressure dataset; FIG. 5 is a detailed flowchart of a step in FIG. 3; FIG. 6 is a schematic diagram of the SBP and DBP distributions produced by a conventional data splitting method; and FIG. 7 is a schematic diagram of the SBP and DBP distributions produced by the data splitting method for validating machine learning according to an embodiment of the present invention.
The detailed features and characteristics of the present invention are described in the embodiments below. The description is sufficient for any person skilled in the relevant art to understand the technical content of the present invention and to implement it accordingly, and from the content disclosed in this specification, the claims, and the drawings, any person skilled in the relevant art can readily understand the related concepts and features of the present invention. The following embodiments further explain aspects of the present invention in detail, but do not limit the scope of the present invention in any respect.
FIG. 1 is a schematic diagram of the present invention applied to machine learning. As shown in FIG. 1, after the blood pressure dataset D0 has undergone data preprocessing, the data splitting system and method proposed by the present invention can be applied to split D0 into a training dataset D1, a validation dataset D2, and a test dataset D3. The training dataset D1 and the validation dataset D2 are used to train and validate the blood pressure prediction model. The test dataset D3 is used to test the blood pressure prediction model. In other application scenarios, a raw dataset may first undergo data preprocessing to produce the blood pressure dataset, which is then split. In other words, the present invention does not limit the order in which data preprocessing is performed.
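FIG. 2 is a block diagram of a data splitting system for validating machine learning according to an embodiment of the present invention. As shown in FIG. 2, the system 100 includes a measuring device 10, a storage device 30, and a computing device 50.
The measuring device 10 generates the blood pressure dataset, which includes the blood pressure data of each of a plurality of subjects. Each subject has multiple blood pressure records, and the types of each record include systolic blood pressure (SBP) and diastolic blood pressure (DBP). In one embodiment, the measuring device 10 is, for example, a wearable device with a pulse oximeter that uses photoplethysmography (PPG) to acquire a PPG signal, which a microprocessor built into the wearable device converts into blood pressure data. In another embodiment, the measuring device 10 is, for example, a wearable device with electrodes that uses electrocardiography (ECG) to acquire an ECG signal, which a microprocessor built into the wearable device converts into blood pressure data. In yet another embodiment, the measuring device 10 is, for example, an electronic sphygmomanometer or an arm-cuff sphygmomanometer.
The storage device 30 is communicatively connected to the measuring device 10 to receive and store the blood pressure dataset, and also stores a computer-readable recording medium. In one embodiment, the storage device 30 is, for example, volatile memory and/or non-volatile memory. Non-volatile memory includes read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable and programmable ROM (EEPROM), flash memory, phase-change random access memory (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), and/or ferroelectric RAM (FRAM). Volatile memory may include dynamic RAM (DRAM), static RAM (SRAM), and/or synchronous DRAM (SDRAM). In another embodiment, the storage device 30 is, for example, at least one of a hard disk drive (HDD), a solid-state drive (SSD), a compact flash (CF) card, a secure digital (SD) card, a micro SD card, a mini SD card, an extreme digital (xD) card, and a memory stick.
The computing device 50 is communicatively connected to the storage device 30. The computing device 50 runs the computer-readable recording medium to execute the data splitting method for validating machine learning according to an embodiment of the present invention. In one embodiment, the computing device 50 is, for example, a microprocessor such as a central processing unit (CPU), a graphics processing unit (GPU), and/or an application processor (AP), or a logic chip such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).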
Please refer to FIG. 3 and FIG. 4. FIG. 3 is a flowchart of a data splitting method for validating machine learning according to an embodiment of the present invention, and FIG. 4 is a schematic diagram of a blood pressure dataset. The method shown in FIG. 3 is applicable to a blood pressure dataset. In one embodiment, the blood pressure dataset includes the blood pressure data of each of a plurality of subjects, and the types of blood pressure data include, but are not limited to, systolic blood pressure and diastolic blood pressure.
The data structure of the blood pressure dataset is illustrated with actual numbers as follows. The measuring device 10 (for example an electronic sphygmomanometer) measures the blood pressure of each of 500 people for 60 minutes. After the measuring device 10 has obtained everyone's raw measurement data, preprocessing such as noise removal and signal sampling can be applied as needed. Assuming blood pressure data are generated from the raw measurement data at a sampling rate of one record every 2 minutes, each person contributes 30 records (60/2), where each record includes an SBP value and/or a DBP value. The blood pressure dataset of the present invention is the collection of all blood pressure data of these 500 people, 15,000 records in total (500*30).
FIG. 4 shows a schematic diagram of the blood pressure dataset. As FIG. 4 shows, both SBP and DBP have skewed distributions: more records fall in some value ranges, such as 60~70 mmHg, and fewer in others, such as 150~170 mmHg.
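The record counts in the numeric example above can be checked with a short sketch. The in-memory layout (a dictionary from subject identifier to a list of (SBP, DBP) readings), the names `readings_per_subject` and `total_readings`, and the placeholder reading values are assumptions made for illustration; the source does not prescribe a storage format.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: subject id -> list of (SBP, DBP) readings in mmHg.
Reading = Tuple[float, float]


def readings_per_subject(duration_min: int, minutes_per_reading: int) -> int:
    """Number of readings one subject contributes at the given sampling rate."""
    return duration_min // minutes_per_reading


def total_readings(dataset: Dict[str, List[Reading]]) -> int:
    """Total number of readings over all subjects."""
    return sum(len(readings) for readings in dataset.values())


n_subjects = 500
per_subject = readings_per_subject(60, 2)  # 60 minutes, one reading every 2 minutes

# Placeholder values only; real readings would come from the measuring device.
dataset: Dict[str, List[Reading]] = {
    f"subject_{i:03d}": [(120.0, 80.0)] * per_subject for i in range(n_subjects)
}
```

Keeping the readings grouped by subject is what later lets the allocation work on whole subjects rather than on individual records.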
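In step S1, the computing device 50 obtains the blood pressure dataset from the storage device 30.
In step S2, the computing device 50 divides the measurement range of the systolic blood pressure into a plurality of first intervals. In step S3, the computing device 50 divides the measurement range of the diastolic blood pressure into a plurality of second intervals. The present invention does not limit the order in which steps S2 and S3 are performed.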
In one embodiment, the division into first and second intervals is illustrated with actual numbers. For example, SBP can be divided into four first intervals: (1) below 100 millimeters of mercury (mmHg), (2) between 100 mmHg and 140 mmHg, (3) between 140 mmHg and 160 mmHg, and (4) above 160 mmHg. DBP can be divided into four second intervals: (1) below 60 mmHg, (2) between 60 mmHg and 80 mmHg, (3) between 80 mmHg and 100 mmHg, and (4) above 100 mmHg. These values are examples only and do not limit the present invention. In other words, the present invention limits neither the width of each first or second interval nor the number of first or second intervals. In other embodiments, in addition to the two blood pressure constraints shown in FIG. 3 (SBP and DBP), the method can include a third blood pressure constraint, such as pulse pressure, in which case the computing device 50 divides the measurement range of the third constraint into a plurality of third intervals.
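The interval lookup of steps S2 and S3 can be sketched with the standard-library `bisect` module. The edge lists reuse the example values above; treating each interval as closed on the left and open on the right (so that exactly 100 mmHg falls in the second SBP interval) is an assumption, since the text does not say which side of a boundary is inclusive. The names `SBP_EDGES`, `DBP_EDGES`, and `interval_index` are illustrative.

```python
from bisect import bisect_right
from typing import Sequence

# Interior boundaries in mmHg, taken from the example in the text.
SBP_EDGES = [100, 140, 160]
DBP_EDGES = [60, 80, 100]


def interval_index(value: float, edges: Sequence[float]) -> int:
    """Return the 1-based interval number of `value`.

    Assumption: intervals are half-open, lower <= value < upper,
    so a value equal to an edge belongs to the higher interval.
    """
    return bisect_right(edges, value) + 1
```

Changing the number or width of the intervals only requires editing the edge lists, which matches the statement that neither is fixed.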
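In step S4, the computing device 50 generates a plurality of categories from the first intervals and the second intervals. The number of categories is the number of combinations of a first interval with a second interval. Continuing the example, four first intervals and four second intervals combine into 16 categories (4*4), as shown in Table 1 below. Each of the 16 categories includes one of the four first intervals and one of the four second intervals. For example, category 6 represents 100 ≤ SBP < 140 and 60 ≤ DBP < 80.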
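A minimal sketch of the category construction in step S4 follows. The numbering rule (SBP interval major, DBP interval minor) is an assumption: it reproduces the one mapping the text states, category 6 as the pair of second intervals, but the full numbering of Table 1 is not available here. `category_id` and `CATEGORIES` are hypothetical names.

```python
from itertools import product

N_SBP_INTERVALS = 4
N_DBP_INTERVALS = 4


def category_id(sbp_interval: int, dbp_interval: int,
                n_dbp: int = N_DBP_INTERVALS) -> int:
    """Map a (first interval, second interval) pair to a 1-based category number.

    Assumed ordering: categories are numbered row by row over SBP intervals.
    """
    return (sbp_interval - 1) * n_dbp + dbp_interval


# category number -> (SBP interval, DBP interval)
CATEGORIES = {
    category_id(i, j): (i, j)
    for i, j in product(range(1, N_SBP_INTERVALS + 1),
                        range(1, N_DBP_INTERVALS + 1))
}
```

With a third constraint such as pulse pressure, the same idea extends to a triple of interval numbers and a three-way product.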
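In step S5, the computing device 50 determines and records how the blood pressure data of each subject match each category, thereby generating a plurality of matching statuses corresponding to all subjects. Each matching status includes a plurality of flags corresponding to the categories, and each flag has one of a first state and a second state. The first state means that, among the blood pressure records of one subject, at least one SBP value lies in the first interval of the category corresponding to the flag and, among the records of the same subject, at least one DBP value lies in the second interval of that category. The second state means that none of the subject's blood pressure data match the category corresponding to the flag. Table 2 below uses actual values to illustrate the matching statuses of several subjects over several categories.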
The example in Table 2 covers the matching statuses of nine subjects over three categories. In Table 2, each row represents one matching status; a flag of 1 represents the first state and a flag of 0 represents the second state. For example, the matching status of subject A is (1,0,1), which means that among the blood pressure records of subject A at least one record matches category 1, no record matches category 2, and at least one record matches category 3. Overall, step S5 generates the blood pressure category distribution of the subjects. The data structure of this distribution is a 0-1 matrix made up of the matching statuses: the number of rows equals the number of subjects and the number of columns equals the number of categories.
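The construction of the 0-1 matrix in step S5 can be sketched as follows. The sketch follows the first-state definition literally: a flag is 1 when some SBP value of the subject falls in the category's first interval and some DBP value of the same subject falls in its second interval. The half-open intervals, the three example categories, and the readings of subjects "A" and "B" are assumptions chosen for illustration, not values from Table 2.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple

SBP_EDGES = [100, 140, 160]   # example boundaries from the text, in mmHg
DBP_EDGES = [60, 80, 100]


def interval_index(value: float, edges: Sequence[float]) -> int:
    """1-based interval number, assuming lower <= value < upper."""
    return bisect_right(edges, value) + 1


def matching_status(readings: List[Tuple[float, float]],
                    categories: List[Tuple[int, int]]) -> List[int]:
    """Flags (1 = first state, 0 = second state) of one subject."""
    sbp_seen = {interval_index(sbp, SBP_EDGES) for sbp, _ in readings}
    dbp_seen = {interval_index(dbp, DBP_EDGES) for _, dbp in readings}
    return [1 if (i in sbp_seen and j in dbp_seen) else 0 for i, j in categories]


def matching_matrix(dataset: Dict[str, List[Tuple[float, float]]],
                    categories: List[Tuple[int, int]]) -> Dict[str, List[int]]:
    """One row of flags per subject, one column per category."""
    return {subject: matching_status(r, categories) for subject, r in dataset.items()}


# Hypothetical categories and readings, for illustration only.
example_categories = [(2, 2), (3, 3), (4, 4)]
example_dataset = {
    "A": [(120.0, 70.0), (170.0, 105.0)],
    "B": [(150.0, 90.0)],
}
example_matrix = matching_matrix(example_dataset, example_categories)
```

Because the definition checks SBP and DBP separately, a subject can match a category without any single record lying in both intervals; requiring both in the same record would be a stricter reading.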
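In step S6, the computing device 50 executes an allocation procedure according to the matching statuses of all subjects, to allocate all subjects into a plurality of sets. In one embodiment, each set corresponds to one fold in K-fold cross-validation. The present invention does not limit the number of sets. The allocation procedure must distribute each category as evenly as possible over the sets, which preserves the SBP and DBP distributions across sets. In addition, because the allocation procedure uses the subject as the unit of allocation, blood pressure records from the same subject are never allocated to different sets, which would otherwise break data independence.
Please refer to FIG. 5, which is a detailed flowchart of step S6 in FIG. 3.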
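In step S61, the computing device 50 calculates a subject demand for each set from the number of subjects and the number of sets. Continuing the example, assume there are three sets, labeled set 1, set 2, and set 3, and nine subjects, as in the matching status example of Table 2. The subject demand of set 1 is therefore 3, that of set 2 is 3, and that of set 3 is 3. In other words, the subject demand is the number of subjects divided by the number of sets. If the division leaves a remainder, the subjects not yet allocated are assigned to some of the sets by the procedure described below.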
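The subject demand of step S61 can be sketched in a few lines. Returning the integer quotient per set together with the remainder is one reading of the text, which only says that leftover subjects are placed by the later steps; the function name `subject_demand` is hypothetical.

```python
from typing import List, Tuple


def subject_demand(n_subjects: int, n_sets: int) -> Tuple[List[int], int]:
    """Per-set subject demand and the number of subjects left over.

    Assumption: each set initially asks for the integer quotient; the
    remainder is handled by the subsequent allocation loop.
    """
    quotient, remainder = divmod(n_subjects, n_sets)
    return [quotient] * n_sets, remainder
```

For the nine subjects and three sets of the running example this gives a demand of 3 per set with nothing left over.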
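In step S62, for each category, the computing device 50 counts the subjects whose flag for that category is in the first state, thereby obtaining a match count for each category. Step S62 can be regarded as a loop. The computing device 50 processes one category at a time (or N categories at a time, depending on its parallel processing capability) until all categories have been processed. Based on the matching status example of Table 2, executing step S62 yields the result shown in Table 3 below.
In step S63, the computing device 50 calculates a plurality of category demands from the match counts and the number of sets. In one embodiment, the category demand is the match count divided by the number of sets, that is, the average per set. Based on the match count example of Table 3, executing step S63 yields the result shown in Table 4 below. For example, the match count of category 1 is 4, so the category demand of each set is 1.3 (4/3, rounded to one decimal place).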
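Steps S62 and S63 amount to column sums followed by a division. The matrix `MATCHES` below is hypothetical: Table 2 is not available here, so the rows were chosen only to agree with the facts the text states (A is (1,0,1), E is (0,1,1), subjects C, E, and F match category 2, and the match counts lead to per-set demands of 1.3, 1, and 2.3).

```python
from typing import Dict, List

# HYPOTHETICAL matching matrix (subject -> flags for categories 1..3).
MATCHES: Dict[str, List[int]] = {
    "A": [1, 0, 1], "B": [1, 0, 0], "C": [0, 1, 1],
    "D": [1, 0, 0], "E": [0, 1, 1], "F": [0, 1, 1],
    "G": [1, 0, 1], "H": [0, 0, 1], "I": [0, 0, 1],
}


def match_counts(matches: Dict[str, List[int]]) -> List[int]:
    """Step S62: number of subjects in the first state, per category."""
    n_categories = len(next(iter(matches.values())))
    return [sum(flags[c] for flags in matches.values()) for c in range(n_categories)]


def category_demand(counts: List[int], n_sets: int) -> List[List[float]]:
    """Step S63: demand[set][category] = match count / number of sets."""
    return [[count / n_sets for count in counts] for _ in range(n_sets)]


counts = match_counts(MATCHES)
demand = category_demand(counts, 3)
# Rounded only for display; the unrounded values are kept for later updates.
rounded_first_set = [round(value, 1) for value in demand[0]]
```

Keeping the demands unrounded avoids accumulating rounding error when they are decremented later.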
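In step S64, the computing device 50 determines whether the blood pressure dataset still contains a subject that has not been allocated to a set. If so, steps S65~S69 are executed and the flow returns to step S64 for another check. Steps S65~S69 repeat until every subject has been allocated to some set; only then does the check in step S64 become negative, which ends the data splitting method according to an embodiment of the present invention.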
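In step S65, the computing device 50 selects one of the categories as the designated category. In one embodiment, the designated category is the one with the smallest match count. For example, in Table 4 the category demand of category 2 is the smallest (2.3>1.3>1), so in the first iteration the computing device 50 selects category 2 as the designated category. In other embodiments, the computing device 50 selects the designated category at random.
In step S66, the computing device 50 selects designated subjects from the blood pressure dataset. A designated subject is one whose flag for the designated category is in the first state. Continuing the example, in the first iteration the blood pressure dataset contains nine subjects, A~I, that have not yet been allocated. When the designated category is category 2, subjects C, E, and F are selected as designated subjects, because their flags for category 2 all have the value 1, which represents the first state. Because three subjects C, E, and F were selected, the flow from step S66 to step S69 is repeated three times; only after all designated subjects have been allocated does step S66 select new designated subjects in a later iteration.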
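Steps S65 and S66 can be sketched as two small functions. The matrix `MATCHES` is hypothetical, built only to agree with the facts stated in the text (C, E, and F match category 2). Counting matches over the subjects that are still unallocated, and skipping categories with no remaining match, is one interpretation of the selection rule; `pick_category` and `designated_subjects` are illustrative names, and category indices are 0-based in the code.

```python
from typing import Dict, List, Optional

# HYPOTHETICAL matching matrix (subject -> flags for categories 1..3).
MATCHES: Dict[str, List[int]] = {
    "A": [1, 0, 1], "B": [1, 0, 0], "C": [0, 1, 1],
    "D": [1, 0, 0], "E": [0, 1, 1], "F": [0, 1, 1],
    "G": [1, 0, 1], "H": [0, 0, 1], "I": [0, 0, 1],
}


def pick_category(matches: Dict[str, List[int]],
                  remaining: List[str]) -> Optional[int]:
    """Step S65: 0-based index of the category with the fewest remaining matches.

    Assumption: categories with no remaining matching subject are skipped;
    None is returned when no category has one.
    """
    n_categories = len(next(iter(matches.values())))
    counts = [sum(matches[s][c] for s in remaining) for c in range(n_categories)]
    live = [c for c in range(n_categories) if counts[c] > 0]
    return min(live, key=lambda c: counts[c]) if live else None


def designated_subjects(matches: Dict[str, List[int]],
                        remaining: List[str], category: int) -> List[str]:
    """Step S66: unallocated subjects whose flag for `category` is 1."""
    return [s for s in remaining if matches[s][category] == 1]


all_subjects = sorted(MATCHES)
first_category = pick_category(MATCHES, all_subjects)
first_subjects = designated_subjects(MATCHES, all_subjects, first_category)
```

Index 1 in the code corresponds to category 2 in the text, and the designated subjects come out as C, E, and F, in line with the first iteration described above.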
In step S67, the computing device 50 selects one of the sets as the designated set. Specifically, the computing device 50 checks whether a first condition is satisfied; if so, the designated set is produced. If not, the computing device 50 checks whether a second condition is satisfied; if so, the designated set is produced. If not, the computing device 50 randomly selects one of the sets that meet a third condition as the designated set.
The first condition is: among the category demands of all sets for the designated category, a first maximum is found, and exactly one set attains it. If the first condition is satisfied, the set with the first maximum is the designated set.
The second condition is: among the category demands of all sets for the designated category, a first maximum is found, and more than one set attains it; and among the subject demands, a second maximum is found, and exactly one set attains it. If the second condition is satisfied, the set with the second maximum is the designated set.
The third condition is: among the category demands of all sets for the designated category, a first maximum is found, and more than one set attains it; and among the subject demands, a second maximum is found, and more than one set attains it.
Table 4 is used below to illustrate the flow of step S67. The designated category is category 2, selected in step S65. The category demands of the three sets for category 2 are (1,1,1). The maximum is 1 and three sets attain it, so the first condition is not satisfied. The subject demands of set 1 to set 3 are (3,3,3). The maximum is 3 and three sets attain it, so the second condition is not satisfied. The sets that meet the third condition are set 1, set 2, and set 3, so the computing device 50 randomly selects one of these three sets as the designated set.
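The three conditions of step S67 can be expressed as a two-level tie-break. In this sketch the second comparison is restricted to the sets that tie on the category demand, which is an interpretation: the text compares subject demands without saying whether sets outside the tie take part. `choose_set` is a hypothetical name and set indices are 0-based.

```python
import random
from typing import List


def choose_set(category_demand: List[float], subject_demand: List[float],
               rng: random.Random) -> int:
    """Return the 0-based index of the designated set.

    category_demand[k]: demand of set k for the designated category.
    subject_demand[k]:  number of subjects set k still asks for.
    """
    first_max = max(category_demand)
    tied = [k for k, d in enumerate(category_demand) if d == first_max]
    if len(tied) == 1:                       # first condition
        return tied[0]
    # Assumption: only sets tied on the first maximum are compared further.
    second_max = max(subject_demand[k] for k in tied)
    tied = [k for k in tied if subject_demand[k] == second_max]
    if len(tied) == 1:                       # second condition
        return tied[0]
    return rng.choice(tied)                  # third condition: random pick


rng = random.Random(0)
first_round = choose_set([1, 1, 1], [3, 3, 3], rng)    # any of the three sets
second_round = choose_set([0, 1, 1], [2, 3, 3], rng)   # set 2 or set 3
```

Passing the random generator in explicitly makes the random pick reproducible when a fixed seed is wanted for a repeatable split.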
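In step S68, the computing device 50 allocates the designated subject to the designated set and removes the designated subject from the blood pressure dataset. Continuing the example, the designated subjects produced in step S66 are C, E, and F, and the candidate designated sets produced in step S67 are sets 1, 2, and 3. In one embodiment, when step S66 produces several designated subjects, the computing device 50 can randomly select one of them for step S68. For example, in step S68 the computing device 50 allocates subject C to set 1 and then removes the blood pressure data of subject C from the blood pressure dataset.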
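In step S69, the computing device 50 updates the category demands and the subject demands. Continuing the example, the updated result is shown in Table 5 below. Note that the category demand of set 1 for category 2 drops from 1 to 0 (1-1), because set 1 has received subject C, whose flag for category 2 is 1. In addition, the subject demand of set 1 drops from 3 to 2 (3-1), because set 1 has received one subject, C.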
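The update of step S69 subtracts the allocated subject's flags from the demands of the receiving set. The sketch below applies it to subject E with the matching status (0,1,1) given in the text and, for simplicity, starts from the initial demands rather than from the state after subject C; `update_demands` and the `demand[set][category]` layout are illustrative choices.

```python
from typing import List


def update_demands(cat_demand: List[List[float]], subj_demand: List[float],
                   target: int, flags: List[int]) -> None:
    """Step S69: lower the demands of set `target` after it receives a subject.

    Each category demand drops by the subject's flag (0 or 1) and the
    subject demand drops by one. Both lists are modified in place.
    """
    for category, flag in enumerate(flags):
        cat_demand[target][category] -= flag
    subj_demand[target] -= 1


# Initial demands of the running example: 4/3, 1 and 7/3 per set, 3 subjects per set.
cat_demand = [[4 / 3, 1.0, 7 / 3] for _ in range(3)]
subj_demand = [3.0, 3.0, 3.0]

# Subject E, matching status (0, 1, 1), goes to set 2 (index 1).
update_demands(cat_demand, subj_demand, 1, [0, 1, 1])
```

Only the receiving set changes, so the other sets keep asking for their full share of each category.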
After step S69 is completed, the flow returns to step S64. Because the blood pressure dataset still contains subjects A~B and D~I, the check in step S64 is positive and the second iteration of steps S65~S69 is executed.
In step S65, the category demand of category 2 is still the smallest (note: category demands below 1 are not considered), so in the second iteration the computing device 50 again selects category 2 as the designated category.
In step S66, the designated subjects selected by the computing device 50 are E and F.
In step S67, the category demands of set 2 and set 3 for category 2 are (1,1). The maximum is 1 and two sets attain it, so the first condition is not satisfied. The subject demands of set 1 to set 3 are (2,3,3). The maximum is 3 and two sets attain it, so the second condition is not satisfied. The sets that meet the third condition are set 2 and set 3, so the computing device 50 randomly selects one of these two sets as the designated set.
In step S68, for example, the designated subject E is allocated to the designated set 2 and removed from the blood pressure dataset.
In step S69, the updated result is shown in Table 6 below. Note that the category demands of set 2 are decremented according to the matching status (0,1,1) of subject E.
Continuing in the same way, each iteration of steps S65 to S69 allocates one designated subject to one designated set, so the total number of iterations equals the number of subjects in the blood pressure dataset. Table 7, Table 8, and Table 9 show the third iteration, the sixth iteration, and the last (ninth) iteration, respectively. For ease of understanding, this example assumes that "random selection" is implemented as alphabetical order for subjects and numerical order for sets.
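The whole allocation loop (steps S61 to S69) can be put together in one sketch. Several points are assumptions rather than statements of the source: the matrix `MATCHES` is hypothetical and only agrees with the facts quoted in the text; the designated category is the one with the fewest remaining matching subjects; ties are broken by alphabetical order of subjects and lowest set number, mirroring the simplification described above; and subjects that match no category fall back to the set with the largest subject demand. Indices are 0-based in the code.

```python
from typing import Dict, List

# HYPOTHETICAL matching matrix (subject -> flags for categories 1..3).
MATCHES: Dict[str, List[int]] = {
    "A": [1, 0, 1], "B": [1, 0, 0], "C": [0, 1, 1],
    "D": [1, 0, 0], "E": [0, 1, 1], "F": [0, 1, 1],
    "G": [1, 0, 1], "H": [0, 0, 1], "I": [0, 0, 1],
}


def split_subjects(matches: Dict[str, List[int]], n_sets: int) -> List[List[str]]:
    """Allocate whole subjects to sets while balancing category flags."""
    subjects = sorted(matches)
    n_cat = len(matches[subjects[0]])
    subj_demand = [len(subjects) / n_sets] * n_sets                        # S61
    totals = [sum(matches[s][c] for s in subjects) for c in range(n_cat)]  # S62
    cat_demand = [[t / n_sets for t in totals] for _ in range(n_sets)]     # S63
    sets: List[List[str]] = [[] for _ in range(n_sets)]
    remaining = list(subjects)

    while remaining:                                                       # S64
        counts = [sum(matches[s][c] for s in remaining) for c in range(n_cat)]
        live = [c for c in range(n_cat) if counts[c] > 0]
        if live:
            cat = min(live, key=lambda c: counts[c])                       # S65
            subject = next(s for s in remaining if matches[s][cat])        # S66
            # S67: largest category demand, then largest subject demand;
            # max() keeps the first maximum, i.e. the lowest set index.
            target = max(range(n_sets),
                         key=lambda k: (cat_demand[k][cat], subj_demand[k]))
        else:
            # Fallback (assumption): subject matches no category at all.
            subject = remaining[0]
            target = max(range(n_sets), key=lambda k: subj_demand[k])
        sets[target].append(subject)                                       # S68
        remaining.remove(subject)
        for c in range(n_cat):                                             # S69
            cat_demand[target][c] -= matches[subject][c]
        subj_demand[target] -= 1
    return sets


SETS = split_subjects(MATCHES, 3)
```

With these assumptions every set receives three subjects and exactly one of the three subjects that match category 2, which is the balancing behavior the procedure aims for. Replacing the deterministic tie-breaks with random choices gives the randomized variant described in the text.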
Please refer to FIG. 6 and FIG. 7. FIG. 6 is a schematic diagram of the SBP and DBP distributions produced by a conventional data splitting method, and FIG. 7 is a schematic diagram of the SBP and DBP distributions produced by the data splitting method for validating machine learning according to an embodiment of the present invention. In FIG. 6, the distributions of the training, validation, and test datasets are inconsistent. Taking the SBP distribution in FIG. 6 as an example, the training dataset has its largest amount of data at about 130 mmHg, whereas the validation dataset has its largest amount at about 125 mmHg. In addition, the validation and test datasets both contain a small amount of data at about 190 mmHg, but the training dataset has almost none there. As a result, the trained blood pressure estimation model cannot be used to estimate SBP values above 190 mmHg. In FIG. 7, for both the SBP distribution and the DBP distribution, the training, validation, and test datasets show similar distribution trends. For example, if one dataset has more data in blood pressure interval A and less data in blood pressure interval B, the other datasets have the same property. Splitting the dataset in this way therefore helps improve the generalizability and accuracy of the blood pressure prediction model.
In summary, the data splitting system and method for validating machine learning proposed by the present invention have the following contributions or effects. First, the proposed method keeps all samples from the same subject in the same set (training dataset or test dataset). Second, the proposed method achieves similar blood pressure distributions across the training, validation, and test datasets. Third, the proposed method preserves the blood pressure distributions of the different datasets when several constraints are present (for example SBP and DBP, and optionally pulse rate, heartbeat, or any other blood-pressure-related constraint that affects model training); that is, the SBP distributions of the training/validation datasets and the test dataset are similar, and at the same time the DBP distributions of these datasets are similar.
Although the present invention is disclosed by the foregoing embodiments, they are not intended to limit the present invention. Changes and modifications made without departing from the spirit and scope of the present invention fall within the scope of patent protection of the present invention. For the scope of protection defined by the present invention, please refer to the appended claims.
S1~S6: steps
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW111121654A TWI808785B (en) | 2022-06-10 | 2022-06-10 | Data splitting system and method for validating machine learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW111121654A TWI808785B (en) | 2022-06-10 | 2022-06-10 | Data splitting system and method for validating machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
TWI808785B true TWI808785B (en) | 2023-07-11 |
TW202349286A TW202349286A (en) | 2023-12-16 |
Family
ID=88149396
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW111121654A TWI808785B (en) | 2022-06-10 | 2022-06-10 | Data splitting system and method for validating machine learning |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI808785B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110246403A1 (en) * | 2005-08-26 | 2011-10-06 | Vanderbilt University | Method and System for Automated Supervised Data Analysis |
US20150238151A1 (en) * | 2014-02-25 | 2015-08-27 | General Electric Company | System and method for perfusion-based arrhythmia alarm evaluation |
CN107403072A (en) * | 2017-08-07 | 2017-11-28 | 北京工业大学 | A kind of diabetes B prediction and warning method based on machine learning |
TW201947465A (en) * | 2018-05-15 | 2019-12-16 | 美爾敦股份有限公司 | Self-learning data classification system and method |
-
2022
- 2022-06-10 TW TW111121654A patent/TWI808785B/en active
Also Published As
Publication number | Publication date |
---|---|
TW202349286A (en) | 2023-12-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10980429B2 (en) | Method and system for cuffless blood pressure estimation using photoplethysmogram features and pulse transit time | |
WO2021227511A1 (en) | Complication onset risk prediction method and system based on electronic medical record big data, and terminal and storage medium | |
US9147167B2 (en) | Similarity analysis with tri-point data arbitration | |
Narin et al. | Investigating the performance improvement of HRV Indices in CHF using feature selection methods based on backward elimination and statistical significance | |
Gries et al. | Variability-based neighbor clustering | |
CN110197492A (en) | A kind of cardiac MRI left ventricle dividing method and system | |
US20160073969A1 (en) | Medical Imaging System Providing Disease Prognosis | |
WO2021151295A1 (en) | Method, apparatus, computer device, and medium for determining patient treatment plan | |
US20220211283A1 (en) | Methods for blood pressure calibration selection and modeling methods thereof | |
CN109582975A (en) | It is a kind of name entity recognition methods and device | |
EP3685405A1 (en) | Subject clustering method and apparatus | |
CN117651523A (en) | Electrocardiogram analysis support device, program, electrocardiogram analysis support method, electrocardiogram analysis support system, peak estimation model generation method, and interval estimation model generation method | |
TWI808785B (en) | Data splitting system and method for validating machine learning | |
Kim et al. | Artificial intelligence predicts clinically relevant atrial high-rate episodes in patients with cardiac implantable electronic devices | |
CN117425431A (en) | Electrocardiogram analysis support device, program, electrocardiogram analysis support method, and electrocardiogram analysis support system | |
US20210390623A1 (en) | Data analysis method and data analysis device | |
US20210027895A1 (en) | Method and system for pressure autoregulation based synthesizing of photoplethysmogram signal | |
CN114424934B (en) | Apnea event screening model training method, device and computer equipment | |
US20230385380A1 (en) | Data splitting system and method for validating machine learning | |
CN117235466A (en) | Data splitting system for verifying machine learning and method thereof | |
JP2023076082A (en) | Intervention effect analysis system, and intervention effect analysis method | |
CN111710431B (en) | Method, device, equipment and storage medium for identifying synonymous diagnosis names | |
WO2022157872A1 (en) | Information processing apparatus, feature quantity selection method, teacher data generation method, estimation model generation method, stress level estimation method, and program | |
US10241970B2 (en) | Reduced memory nucleotide sequence comparison | |
US20210342641A1 (en) | Method and system for generating synthetic time domain signals to build a classifier |