
TWI808785B - Data splitting system and method for validating machine learning

Info

Publication number: TWI808785B
Application number: TW111121654A
Authority: TW
Inventors: 謝宛庭; 智平雷; 陳佩君
Original assignee: 英業達股份有限公司
Priority date: 2022-06-10
Filing date: 2022-06-10
Publication date: 2023-07-11
Also published as: TW202349286A

Abstract

A data splitting method for validating machine learning is adapted to a blood pressure dataset including blood pressure data of a plurality of subjects. The method includes: dividing the measurement ranges of the systolic blood pressure and the diastolic blood pressure into a plurality of first intervals and a plurality of second intervals, respectively; generating a plurality of classes according to the first intervals and the second intervals, each class including one of the first intervals and one of the second intervals; determining and recording a matching condition between each subject and the plurality of classes, where each matching condition includes a plurality of marks corresponding to the plurality of classes and each mark has one of a first state and a second state, the first state representing that the blood pressure data matches the class corresponding to the mark and the second state representing that the blood pressure data does not match the class corresponding to the mark; and distributing the plurality of subjects into a plurality of subsets by performing a distribution procedure.

Description

Data splitting system and method for validating machine learning

本發明關於機器學習，特別是一種驗證機器學習的資料拆分系統及其方法。 The present invention relates to machine learning, in particular to a data splitting system and method for verifying machine learning.

驗證策略在調整機器學習及深度學習模型的超參數時是不可或缺的。為了訓練和驗證的目的，可以用資料拆分或交叉驗證(cross validation,CV)技術將資料劃分為訓練資料集、驗證資料集和測試資料集。這些技術提供了關於模型在獨立且新資料集上的準確性和通用性的判斷力。然而，若是資料拆分的策略不當，這些技術也會導致模型過擬合(over-fitting)或偏差(bias)的問題。 Validation strategies are integral to tuning hyperparameters of machine learning and deep learning models. For training and validation purposes, data splitting or cross-validation (CV) techniques can be used to divide the data into training data sets, validation data sets, and test data sets. These techniques provide insights about the accuracy and generalizability of models on independent and new datasets. However, if the data splitting strategy is not appropriate, these techniques can also lead to the problem of over-fitting or bias of the model.

現有的交叉驗證方式如留一驗證(Leave-One-Out)、保留驗證(Holdout)及K折交叉驗證(K-fold CV)等，是標準機器學習問題(如回歸問題)中的常見做法。然而，血壓估計並非標準回歸問題。例如，血壓資料集的資料點並不完全相互獨立。也就是說，這些資料集通常有許多來自同一記錄或受測者的片段，它們可能包含非常相似的資訊。此外，血壓估計問題中有兩個目標，收縮壓(systolic blood pressure,SBP)和舒張壓(diastolic blood pressure,DBP)，這更類似於多任務或多輸出回歸問題。最後，收縮壓和舒張壓分布通常是偏斜(skewed)的，由於極端的血壓很少見，因此血壓估計成為一個不平衡回歸問題。 Existing cross-validation methods such as Leave-One-Out, Holdout and K-fold CV are common practices in standard machine learning problems (such as regression problems). However, blood pressure estimation is not a standard regression problem. For example, the data points of the blood pressure dataset are not completely independent of each other. That is, these data sets often have many segments from the same record or subject, which may contain very similar information. Furthermore, there are two targets in the blood pressure estimation problem, systolic blood pressure (SBP) and diastolic blood pressure (DBP), which are more similar to multi-task or multi-output regression problems. Finally, the systolic and diastolic distributions are often skewed, and since extreme blood pressures are rare, blood pressure estimation becomes an unbalanced regression problem.

為了在交叉驗證期間正確劃分資料，必須考慮上述差異。例如，隨機劃分資料可能導致同一受測者的資料片段同時出現在訓練資料集、驗證資料集和測試資料集中。由於來自同一受測者的資料片段攜帶相似的資訊，這將導致集合之間的獨立性崩潰並帶來過度樂觀的結果。此外，由於分布偏斜和不平衡，隨機資料劃分可能會導致每個集合之間的分布不同或資料移位等問題，甚至會導致測試資料集中缺乏少數案例。 In order to correctly partition the data during cross-validation, the above differences must be taken into account. For example, random partitioning of the data may result in segments of the same subject appearing in the training, validation, and test sets simultaneously. Since data segments from the same subject carry similar information, this will lead to a breakdown of independence between sets and lead to overly optimistic results. In addition, due to distribution skew and imbalance, random data partitioning may cause problems such as different distributions or data shifts between each set, or even lack of minority cases in the test data set.

有鑑於此，本發明提出一種驗證機器學習的資料拆分系統及其方法以避免上述問題，血壓估計任務的交叉驗證可以將同一受測者的資料保持在同一集合中，並儘可能保持原始資料集的收縮壓及舒張壓在不同資料集中的分布。換言之，在應用本發明產生的多個資料集中，無論是收縮壓或是舒張壓，其在多個資料集中呈現的多個趨勢之間具有高度相似性。 In view of this, the present invention proposes a data splitting system and method for verifying machine learning to avoid the above-mentioned problems. The cross-validation of the blood pressure estimation task can keep the data of the same subject in the same set, and keep the distribution of the systolic and diastolic blood pressure of the original data set in different data sets as much as possible. In other words, in multiple data sets generated by applying the present invention, whether it is systolic blood pressure or diastolic blood pressure, there is a high similarity among the multiple trends presented in multiple data sets.

依據本發明一實施例的一種驗證機器學習的資料拆分方法，適用於血壓資料集。血壓資料集包括多個受測者各自的多個血壓資料，這些血壓資料的類型包括收縮壓及舒張壓。所述方法包括以運算裝置執行以下步驟：將收縮壓的量測範圍劃分為多個第一區間，並將舒張壓的量測範圍劃分為多個第二區間，依據第一區間及第二區間產生多個類別，每一類別包括一個第一區間及一個第二區間，判斷並記錄每一受測者的血壓資料與類別的匹配狀況，進而產生對應於多個受測者的多個匹配狀況，每一匹配狀況包括對應於多個類別的多個標記，每一標記具有第一狀態及第二狀態中的一者，第一狀態代表血壓資料匹配標記對應的類別，第二狀態代表血壓資料未匹配標記對應的類別，以及依據匹配狀況執行分配程序以將多個受測者分配至多個集合中。 A data splitting method for verifying machine learning according to an embodiment of the present invention is applicable to a blood pressure data set. The blood pressure data set includes a plurality of blood pressure data for each of a plurality of subjects, and the types of the blood pressure data include systolic blood pressure and diastolic blood pressure. The method includes the following steps performed by a computing device: dividing the measurement range of the systolic blood pressure into a plurality of first intervals, and dividing the measurement range of the diastolic blood pressure into a plurality of second intervals; generating a plurality of categories according to the first intervals and the second intervals, each category including one first interval and one second interval; judging and recording the matching status between the blood pressure data of each subject and the categories, thereby generating a plurality of matching statuses corresponding to the plurality of subjects, where each matching status includes a plurality of marks corresponding to the plurality of categories, each mark has one of a first state and a second state, the first state represents that the blood pressure data matches the category corresponding to the mark, and the second state represents that the blood pressure data does not match the category corresponding to the mark; and executing an allocation procedure according to the matching statuses to allocate the plurality of subjects into a plurality of sets.

依據本發明一實施例的一種驗證機器學習的資料拆分系統，包括量測裝置、儲存裝置以及運算裝置。量測裝置用於產生血壓資料集，其中血壓資料集包括多個受測者各自的多個血壓資料，這些血壓資料的類型包括收縮壓及舒張壓。儲存裝置通訊連接量測裝置以接收並儲存血壓資料集，以及儲存電腦可讀取記錄媒體。運算裝置通訊連接儲存裝置，運算裝置用於運行電腦可讀取記錄媒體以執行以下步驟：將收縮壓的量測範圍劃分為多個第一區間，並將舒張壓的量測範圍劃分為多個第二區間，依據第一區間及第二區間產生多個類別，每一類別包括一個第一區間及一個第二區間，判斷並記錄每一受測者的血壓資料與類別的匹配狀況，進而產生對應於多個受測者的多個匹配狀況，每一匹配狀況包括對應於多個類別的多個標記，每一標記具有第一狀態及第二狀態中的一者，第一狀態代表血壓資料匹配標記對應的類別，第二狀態代表血壓資料未匹配標記對應的類別，以及依據匹配狀況執行分配程序以將多個受測者分配至多個集合中。 A data splitting system for verifying machine learning according to an embodiment of the present invention includes a measuring device, a storage device, and a computing device. The measuring device is used to generate a blood pressure data set, wherein the blood pressure data set includes a plurality of blood pressure data for each of a plurality of subjects, and the types of the blood pressure data include systolic blood pressure and diastolic blood pressure. The storage device is communicatively connected to the measuring device to receive and store the blood pressure data set, and stores a computer-readable recording medium. The computing device is communicatively connected to the storage device and is used to run the computer-readable recording medium to perform the following steps: dividing the measurement range of the systolic blood pressure into a plurality of first intervals, and dividing the measurement range of the diastolic blood pressure into a plurality of second intervals; generating a plurality of categories according to the first intervals and the second intervals, each category including one first interval and one second interval; judging and recording the matching status between the blood pressure data of each subject and the categories, thereby generating a plurality of matching statuses corresponding to the plurality of subjects, where each matching status includes a plurality of marks corresponding to the plurality of categories, each mark has one of a first state and a second state, the first state represents that the blood pressure data matches the category corresponding to the mark, and the second state represents that the blood pressure data does not match the category corresponding to the mark; and executing an allocation procedure according to the matching statuses to allocate the plurality of subjects into a plurality of sets.

綜上所述，本發明提出的驗證機器學習的資料拆分系統及其方法具有以下貢獻或效果：其一是所提出的方法將來自同一受測者的所有樣本保存在同一集合(訓練資料集、驗證資料集或測試資料集)中；其二是所提出的方法能夠在訓練資料集、驗證資料集和測試資料集上實現相似的血壓分布；其三是所提出的方法能夠在存在多個約束條件(例如收縮壓及舒張壓，此外可更包含脈搏率、心跳等任何與血壓相關因而影響模型訓練的約束條件)時保持不同資料集的血壓分布，這意味著訓練/驗證資料集和測試資料集的收縮壓分布相似，同時這些資料集的舒張壓分布也相似。 In summary, the data splitting system and method for verifying machine learning proposed by the present invention have the following contributions or effects. First, the proposed method keeps all samples from the same subject in the same set (training data set, validation data set, or test data set). Second, the proposed method can achieve similar blood pressure distributions across the training data set, validation data set, and test data set. Third, the proposed method can maintain the blood pressure distributions of the different data sets when multiple constraints are present (for example systolic and diastolic blood pressure, and optionally any other constraint that is related to blood pressure and therefore affects model training, such as pulse rate or heartbeat), which means that the systolic blood pressure distributions of the training/validation data sets and the test data set are similar, and the diastolic blood pressure distributions of these data sets are similar as well.

以上之關於本揭露內容之說明及以下之實施方式之說明係用以示範與解釋本發明之精神與原理，並且提供本發明之專利申請範圍更進一步之解釋。 The above description of the disclosure and the following description of the implementation are used to demonstrate and explain the spirit and principle of the present invention, and provide a further explanation of the patent application scope of the present invention.

100:資料拆分系統 100: Data splitting system

10:量測裝置 10: Measuring device

30:儲存裝置 30: storage device

50:運算裝置 50: computing device

S1~S6,S61~S69:步驟 S1~S6,S61~S69: steps

D0:血壓資料集 D0: blood pressure data set

D1:訓練資料集 D1: training data set

D2:驗證資料集 D2: Validation dataset

D3:測試資料集 D3: Test data set

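圖1是本發明應用於機器學習的示意圖；圖2是依據本發明一實施例的驗證機器學習的資料拆分系統的方塊架構圖；圖3是依據本發明一實施例的驗證機器學習的資料拆分方法的流程圖；圖4是血壓資料集的示意圖；圖5是圖3中步驟的細部流程圖；圖6是依據傳統資料拆分方法產生的收縮壓分布及舒張壓分布示意圖；以及圖7是依據本發明實施例的驗證機器學習的資料拆分方法產生的收縮壓分布及舒張壓分布示意圖。 FIG. 1 is a schematic diagram of the present invention applied to machine learning; FIG. 2 is a block diagram of a data splitting system for verifying machine learning according to an embodiment of the present invention; FIG. 3 is a flowchart of a data splitting method for verifying machine learning according to an embodiment of the present invention; FIG. 4 is a schematic diagram of a blood pressure data set; FIG. 5 is a detailed flowchart of a step in FIG. 3; FIG. 6 is a schematic diagram of the systolic and diastolic blood pressure distributions produced by a traditional data splitting method; and FIG. 7 is a schematic diagram of the systolic and diastolic blood pressure distributions produced by the data splitting method for verifying machine learning according to an embodiment of the present invention.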

以下在實施方式中詳細敘述本發明之詳細特徵以及特點，其內容足以使任何熟習相關技藝者了解本發明之技術內容並據以實施，且根據本說明書所揭露之內容、申請專利範圍及圖式，任何熟習相關技藝者可輕易地理解本發明相關之構想及特點。以下之實施例係進一步詳細說明本發明之觀點，但非以任何觀點限制本發明之範疇。 The detailed features and characteristics of the present invention are described in detail below in the embodiments, the content of which is sufficient for any person familiar with the relevant art to understand the technical content of the present invention and implement it accordingly, and according to the content disclosed in this specification, the scope of the patent application and the drawings, any person familiar with the relevant art can easily understand the concept and characteristics of the present invention. The following examples are to further describe the concept of the present invention in detail, but not to limit the scope of the present invention in any way.

圖1是本發明應用於機器學習的示意圖。如圖1所示，在血壓資料集D0經過資料前處理之後，可應用本發明提出的資料拆分系統及其方法，藉此將血壓資料集D0拆分為訓練資料集D1、驗證資料集D2及測試資料集D3。訓練資料集D1及驗證資料集D2用於訓練及驗證血壓預測模型。測試資料集D3則用於測試血壓預測模型。在其他應用場景中，也可以是原始資料集經過資料前處理之後產生血壓資料集，然後將血壓資料集進行拆分。換言之，本發明不限制資料前處理的執行順序。 FIG. 1 is a schematic diagram of the present invention applied to machine learning. As shown in FIG. 1, after the blood pressure data set D0 has been pre-processed, the data splitting system and method proposed by the present invention can be applied to split the blood pressure data set D0 into a training data set D1, a validation data set D2, and a test data set D3. The training data set D1 and the validation data set D2 are used to train and validate the blood pressure prediction model. The test data set D3 is used to test the blood pressure prediction model. In other application scenarios, the blood pressure data set may instead be generated by pre-processing an original data set, after which the blood pressure data set is split. In other words, the present invention does not limit the execution order of data pre-processing.

圖2是依據本發明一實施例的驗證機器學習的資料拆分系統的方塊架構圖。如圖2所示，此系統100包括量測裝置10、儲存裝置30以及運算裝置50。 FIG. 2 is a block diagram of a data splitting system for verifying machine learning according to an embodiment of the present invention. As shown in FIG. 2 , the system 100 includes a measuring device 10 , a storage device 30 and a computing device 50 .

量測裝置10用於產生血壓資料集，其中血壓資料集包括多個受測者(subject)各自的血壓資料。每一受測者包括多筆血壓資料，每筆血壓資料的類型包括收縮壓(SBP)及舒張壓(DBP)。在一實施例中，量測裝置10例如是具有脈搏血氧儀(pulse oximetry)的穿戴型裝置，其應用光體積變化描記圖法(Photoplethysmography,PPG)獲取PPG訊號，再透過穿戴型裝置內建的微處理器換算得到血壓資料。在另一實施例中，量測裝置10例如是具有電極的穿戴型裝置，其應用心電圖(Electrocardiography,ECG)技術獲取ECG訊號，再透過穿戴型裝置內建的微處理器換算得到血壓資料。在又一實施例中，量測裝置10例如是電子血壓計或臂帶式血壓計。 The measurement device 10 is used for generating a blood pressure data set, wherein the blood pressure data set includes blood pressure data of a plurality of subjects. Each subject includes multiple pieces of blood pressure data, and the types of each blood pressure data include systolic blood pressure (SBP) and diastolic blood pressure (DBP). In one embodiment, the measuring device 10 is, for example, a wearable device with a pulse oximeter (pulse oximetry), which uses photoplethysmography (PPG) to obtain the PPG signal, and then converts the blood pressure data through the built-in microprocessor of the wearable device. In another embodiment, the measurement device 10 is, for example, a wearable device with electrodes, which uses electrocardiography (ECG) technology to obtain ECG signals, and then converts the blood pressure data through the built-in microprocessor of the wearable device. In yet another embodiment, the measuring device 10 is, for example, an electronic sphygmomanometer or an armband sphygmomanometer.

儲存裝置30通訊連接量測裝置10以接收並儲存血壓資料集，以及儲存電腦可讀取記錄媒體。在一實施例中，儲存裝置30例如是揮發性記憶體及/或非揮發性記憶體。非揮發性記憶體包括唯讀記憶體(read-only memory，ROM)、可程式化ROM(programmable ROM，PROM)、電性可程式化ROM(electrically programmable ROM，EPROM)、電性可抹除及可程式化ROM(electrically erasable and programmable ROM，EEPROM)、快閃記憶體、相變隨機存取記憶體(phase-change random access memory，PRAM)、磁性RAM(magnetic RAM，MRAM)、電阻式RAM(resistive RAM，RRAM)及/或鐵電RAM(ferroelectric RAM，FRAM)。揮發性記憶體可包括動態RAM(dynamic RAM，DRAM)、靜態RAM(static RAM，SRAM)及/或同步DRAM(synchronous DRAM，SDRAM)。在另一實施例中，儲存裝置30例如是硬碟驅動機(hard disk drive，HDD)、固態驅動機(solid-state drive，SSD)、緊密型快閃(compact flash，CF)卡、安全數位(secure digital，SD)卡、微型SD卡、迷你SD卡、極端數位(extreme digital，xD)卡及記憶條中的至少一者。 The storage device 30 is communicatively connected to the measuring device 10 to receive and store the blood pressure data set, and stores a computer-readable recording medium. In one embodiment, the storage device 30 is, for example, a volatile memory and/or a non-volatile memory. The non-volatile memory includes read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable and programmable ROM (EEPROM), flash memory, phase-change random access memory (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM) and/or ferroelectric RAM (FRAM). The volatile memory may include dynamic RAM (DRAM), static RAM (SRAM) and/or synchronous DRAM (SDRAM). In another embodiment, the storage device 30 is, for example, at least one of a hard disk drive (HDD), a solid-state drive (SSD), a compact flash (CF) card, a secure digital (SD) card, a micro SD card, a mini SD card, an extreme digital (xD) card, and a memory stick.

運算裝置50通訊連接儲存裝置30。運算裝置50用於運行電腦可讀取記錄媒體以執行本發明一實施例的驗證機器學習的資料拆分方法。在一實施例中，運算裝置50例如是：微處理器，例如中央處理器單元(central processor unit，CPU)、圖形處理器單元(graphic processing unit)及/或應用處理器(application processor，AP)；邏輯晶片，例如現場可程式化閘陣列(field programmable gate array，FPGA)及特殊應用IC(application-specific integrated circuit，ASIC)。 The computing device 50 is communicatively connected to the storage device 30. The computing device 50 is used to run the computer-readable recording medium to execute the data splitting method for verifying machine learning according to an embodiment of the present invention. In one embodiment, the computing device 50 is, for example, a microprocessor, such as a central processing unit (CPU), a graphics processing unit and/or an application processor (AP); or a logic chip, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).

請參考圖3及圖4，圖3是依據本發明一實施例的驗證機器學習的資料拆分方法的流程圖，圖4是血壓資料集的示意圖。圖3所示方法適用於血壓資料集。在一實施例中，血壓資料集包括多個受測者各自的血壓資料，血壓資料的類型包括收縮壓及舒張壓，但不以此二者為限。 Please refer to FIG. 3 and FIG. 4. FIG. 3 is a flowchart of a data splitting method for verifying machine learning according to an embodiment of the present invention, and FIG. 4 is a schematic diagram of a blood pressure data set. The method shown in FIG. 3 is applicable to a blood pressure data set. In one embodiment, the blood pressure data set includes the respective blood pressure data of a plurality of subjects, and the types of the blood pressure data include, but are not limited to, systolic blood pressure and diastolic blood pressure.

以下舉實際數值為例，說明血壓資料集的資料結構：以量測裝置10(如電子血壓計)對500個人分別進行60分鐘的血壓量測，在量測裝置10取得所有人的原始量測資料之後，可根據需要選擇性地進行前處理程序，例如雜訊移除、訊號取樣等。假設以2分鐘擷取1筆血壓資料的取樣頻率，依據原始量測資料產生血壓資料，則每個人可以貢獻30筆血壓資料(60/2)，其中每筆血壓資料包括收縮壓數值及/或舒張壓數值。本發明所述的血壓資料集即為這500個人所有的血壓資料的集合，總計15000筆血壓資料(500*30)。 Actual values are used below to illustrate the data structure of the blood pressure data set. The measuring device 10 (such as an electronic sphygmomanometer) measures the blood pressure of each of 500 people for 60 minutes. After the measuring device 10 obtains everyone's raw measurement data, pre-processing such as noise removal and signal sampling can optionally be performed as needed. Assuming one piece of blood pressure data is extracted every 2 minutes and the blood pressure data is generated from the raw measurement data, each person contributes 30 pieces of blood pressure data (60/2), where each piece includes a systolic value and/or a diastolic value. The blood pressure data set described in the present invention is the collection of all blood pressure data of these 500 people, totaling 15,000 pieces of blood pressure data (500*30).

血壓資料集的示意圖如圖4所示，由圖4可看出收縮壓及舒張壓皆具有偏斜(skew)的分布特徵，例如在某一數值範圍如60~70毫米汞柱之間具有較多的資料筆數，在另一數值範圍如150~170毫米汞柱之間具有較少的資料筆數。 The schematic diagram of the blood pressure data set is shown in Figure 4. It can be seen from Figure 4 that both the systolic blood pressure and the diastolic blood pressure have skewed distribution characteristics. For example, there are more data items in a certain value range such as 60-70 mmHg, and there are fewer data items in another value range such as 150-170 mmHg.

在步驟S1中，運算裝置50從儲存裝置30取得血壓資料集。 In step S1 , the computing device 50 obtains a blood pressure data set from the storage device 30 .

在步驟S2中，運算裝置50將收縮壓的量測範圍劃分為多個第一區間。在步驟S3中，運算裝置50將舒張壓的量測範圍劃分為多個第二區間。本發明不限制步驟S2及步驟S3執行的先後順序。 In step S2, the computing device 50 divides the measurement range of the systolic blood pressure into a plurality of first intervals. In step S3, the computing device 50 divides the measurement range of the diastolic pressure into a plurality of second intervals. The present invention does not limit the order in which steps S2 and S3 are performed.

在一實施例中，舉實際數值為例說明第一區間及第二區間的劃分方式，例如收縮壓可被劃分為四個第一區間，這四個第一區間分別為：(1)低於100毫米汞柱(mmHg)，(2)介於100mmHg和140mmHg之間，(3)介於140mmHg和160mmHg之間，以及(4)超過160mmHg。例如舒張壓可被劃分為四個第二區間：(1)低於60mmHg，(2)在60mmHg和80mmHg之間，(3)在80mmHg和100mmHg之間，以及(4)超過100mmHg。上述數值僅為舉例說明而非用以限制本發明。換言之，本發明對於各個第一/第二區間的範圍大小、第一/第二區間的數量等皆不限制。在其他實施例中，除了圖3所示的兩個血壓約束條件(constraint)：收縮壓及舒張壓，本發明所述方法更可以加入第三個血壓約束條件，如脈壓差，而且運算裝置50將第三血壓約束條件的量測範圍劃分為多個第三區間。 In one embodiment, an actual numerical value is taken as an example to illustrate the division method of the first interval and the second interval. For example, the systolic blood pressure can be divided into four first intervals, and the four first intervals are: (1) lower than 100 millimeters of mercury (mmHg), (2) between 100 mmHg and 140 mmHg, (3) between 140 mmHg and 160 mmHg, and (4) exceeding 160 mmHg. For example diastolic blood pressure may be divided into four second intervals: (1) below 60mmHg, (2) between 60mmHg and 80mmHg, (3) between 80mmHg and 100mmHg, and (4) above 100mmHg. The above numerical values are for illustration only and not intended to limit the present invention. In other words, the present invention does not limit the size of the range of each first/second interval, the number of first/second intervals, and the like. In other embodiments, in addition to the two blood pressure constraints shown in FIG. 3: systolic blood pressure and diastolic blood pressure, the method of the present invention can further add a third blood pressure constraint, such as pulse pressure difference, and the computing device 50 divides the measurement range of the third blood pressure constraint into multiple third intervals.
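The interval division of steps S2 and S3 can be sketched as follows, using the example boundaries given above. This is only an illustrative sketch: the names `SBP_EDGES`, `DBP_EDGES`, and `interval_index` are hypothetical, and treating each interval as inclusive of its lower boundary is an assumption, since the text only says "between".

```python
from bisect import bisect_right

# Interior boundaries in mmHg, taken from the example values in the text.
# Assumption: a value equal to a boundary belongs to the upper interval.
SBP_EDGES = [100, 140, 160]
DBP_EDGES = [60, 80, 100]

def interval_index(value, edges):
    """Return the 1-based number of the interval that `value` falls into."""
    return bisect_right(edges, value) + 1

print(interval_index(95, SBP_EDGES))   # below 100 mmHg -> interval 1
print(interval_index(120, SBP_EDGES))  # between 100 and 140 mmHg -> interval 2
print(interval_index(105, DBP_EDGES))  # above 100 mmHg -> interval 4
```

Changing the number or width of the intervals only requires editing the edge lists, which matches the statement that the interval sizes and counts are not limited.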

在步驟S4中，運算裝置50依據多個第一區間及多個第二區間產生多個類別。類別的數量為第一區間的數量及第二區間的數量的組合數。承前例，依據四個第一區間及四個第二區間可組合產生16個類別(4*4)，如下方表格1所示。這16個類別的每一者包括四個第一區間中的一者及四個第二區間中的一者。例如類別6代表的是100≤SBP<140以及60≤DBP<80。 In step S4, the computing device 50 generates a plurality of categories according to the plurality of first intervals and the plurality of second intervals. The number of categories is the number of combinations of the first intervals and the second intervals, that is, the product of the two interval counts. Following the previous example, four first intervals and four second intervals combine into 16 categories (4*4), as shown in Table 1 below. Each of the 16 categories includes one of the four first intervals and one of the four second intervals. For example, category 6 represents 100 ≤ SBP < 140 and 60 ≤ DBP < 80.
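As a rough sketch of step S4, the categories can be enumerated as the Cartesian product of the first and second intervals. The row-major numbering used here is an assumption (Table 1 is not reproduced in this text); it is at least consistent with category 6 pairing the second systolic interval with the second diastolic interval. The names `SBP_INTERVALS`, `DBP_INTERVALS`, and `CATEGORIES` are hypothetical.

```python
from itertools import product

# (lower, upper) bounds in mmHg; None marks an open end.
SBP_INTERVALS = [(None, 100), (100, 140), (140, 160), (160, None)]
DBP_INTERVALS = [(None, 60), (60, 80), (80, 100), (100, None)]

# Each category pairs one SBP interval with one DBP interval.
# Assumption: categories are numbered from 1 in row-major order.
CATEGORIES = {
    number: pair
    for number, pair in enumerate(product(SBP_INTERVALS, DBP_INTERVALS), start=1)
}

print(len(CATEGORIES))  # 16
print(CATEGORIES[6])    # ((100, 140), (60, 80))
```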

在步驟S5中，運算裝置50判斷並記錄每個受測者的血壓資料與每個類別的匹配狀況，進而產生對應於所有受測者的多個匹配狀況。每個匹配狀況包括對應於多個類別的多個標記，每個標記具有第一狀態及第二狀態中的一者，第一狀態代表在一個受測者的多筆血壓資料中，至少一筆收縮壓資料位於標記對應的類別包含的第一區間，且在同一受測者的多筆血壓資料中，至少一筆舒張壓資料位於標記對應的類別包含的第二區間，第二狀態代表所有的血壓資料中都沒有匹配標記對應的類別。下方表格二以實際數值為例，說明多個受測者在多個類別的多個匹配狀況。 In step S5, the computing device 50 judges and records the matching status of each subject's blood pressure data and each category, and then generates a plurality of matching statuses corresponding to all the subjects. Each matching condition includes a plurality of marks corresponding to a plurality of categories, and each mark has one of a first state and a second state. The first state represents that among the multiple pieces of blood pressure data of a subject, at least one piece of systolic blood pressure data is located in the first interval included in the category corresponding to the mark, and among the multiple pieces of blood pressure data of the same subject, at least one piece of diastolic blood pressure data is located in the second interval included in the category corresponding to the mark. The second state represents that there is no matching category in all the blood pressure data. Table 2 below uses actual values as an example to illustrate multiple matching conditions of multiple testees in multiple categories.

在表格2所示的範例中，包括九個受測者在三個類別的匹配狀況。在表格2中，每一列代表一個匹配狀況，匹配狀況中的標記為1代表第一狀態，標記為0代表第二狀態。例如：受測者A的匹配狀況為(1,0,1)，代表在受測者A的多筆血壓資料中，至少有一筆血壓資料匹配類別1，無任何血壓資料匹配類別2，至少有一筆血壓資料匹配類別3。總體而言，步驟S5用於產生多個受測者的血壓類別分布，此分布的資料結構為多個匹配狀況組成的0-1矩陣，矩陣的列數等於受測者數量，矩陣的行數等於類別數量。 The example shown in Table 2 includes the matching statuses of nine subjects across three categories. In Table 2, each row represents one matching status; a mark of 1 represents the first state, and a mark of 0 represents the second state. For example, the matching status of subject A is (1,0,1), which means that among the multiple pieces of blood pressure data of subject A, at least one piece matches category 1, none matches category 2, and at least one piece matches category 3. In general, step S5 generates the blood pressure category distribution of the subjects. The data structure of this distribution is a 0-1 matrix composed of the matching statuses: the number of rows equals the number of subjects, and the number of columns equals the number of categories.
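The matching status of step S5 can be sketched as one 0/1 row per subject. The sketch below follows the rule as worded in step S5: a mark is 1 when at least one systolic reading of the subject lies in the category's first interval and at least one diastolic reading of the same subject lies in its second interval. The readings, the function names, the lower-inclusive boundary handling, and the row-major category order are all assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import product

SBP_EDGES = [100, 140, 160]  # example boundaries from the text (mmHg)
DBP_EDGES = [60, 80, 100]

def interval_index(value, edges):
    """0-based interval index; lower boundary inclusive (assumption)."""
    return bisect_right(edges, value)

# (sbp_interval, dbp_interval) per category, row-major order (assumption).
CATEGORIES = list(product(range(len(SBP_EDGES) + 1), range(len(DBP_EDGES) + 1)))

def matching_status(readings):
    """readings: list of (sbp, dbp) tuples of one subject -> list of 0/1 marks."""
    sbp_hits = {interval_index(sbp, SBP_EDGES) for sbp, _ in readings}
    dbp_hits = {interval_index(dbp, DBP_EDGES) for _, dbp in readings}
    # As worded in step S5, the SBP hit and the DBP hit may come from
    # different readings of the same subject.
    return [int(i in sbp_hits and j in dbp_hits) for i, j in CATEGORIES]

# Hypothetical readings, not taken from the source.
dataset = {
    "A": [(118, 72), (145, 85)],
    "B": [(95, 55)],
}
matrix = {subject: matching_status(r) for subject, r in dataset.items()}
print(matrix["B"])  # only the first category is marked
```

Each value of `matrix` is one row of the 0-1 matrix, so the matrix has one row per subject and one column per category.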

在步驟S6中，運算裝置50依據所有受測者的多個匹配狀況執行分配程序，以將所有受測者分配至多個集合中。在一實施例中，每一個集合相當於K折交叉驗證中的一個折(fold)。本發明不限制集合的數量大小。分配程序必須考量到每個類別都盡可能地被平均分配到每個集合中，這樣可以保持不同集合中的SBP和DBP分布。另外，由於分配程序是以受測者作為分配單位，從而避免來自同一受測者的多筆血壓資料被分配到不同的集合中，導致資料獨立性的崩壞。 In step S6, the computing device 50 executes an allocation procedure according to the matching statuses of all the subjects, so as to allocate all the subjects into a plurality of sets. In one embodiment, each set corresponds to one fold in K-fold cross-validation. The present invention does not limit the number of sets. The allocation procedure must ensure that each category is distributed as evenly as possible across the sets, so that the SBP and DBP distributions are preserved in the different sets. In addition, because the allocation procedure uses the subject as the unit of allocation, it prevents multiple pieces of blood pressure data from the same subject from being allocated to different sets, which would break the independence between the sets.

請參考圖5，圖5是圖3中步驟S6的細部流程圖。 Please refer to FIG. 5 , which is a detailed flowchart of step S6 in FIG. 3 .

在步驟S61中，運算裝置50依據受測者的數量及集合的數量計算對應於集合的多個受測者需求數量。承前例，假設集合數量為3，標記為集合1、集合2及集合3。受測者數量為9，如表格2的分配狀況範例所示。因此，集合1對應的受測者需求數量為3，集合2對應的受測者需求數量為3、集合3對應的受測者需求數量為3。換言之，受測者需求數量為受測者數量除以集合數量。若出現無法整除的情況，則將剩餘未被分配的受測者按照後續介紹的流程分配到某幾個集合之中。 In step S61, the computing device 50 calculates a required number of subjects for each set according to the number of subjects and the number of sets. Continuing the previous example, assume there are three sets, labeled set 1, set 2, and set 3, and nine subjects, as in the example of Table 2. The required number of subjects is therefore 3 for set 1, 3 for set 2, and 3 for set 3. In other words, the required number of subjects is the number of subjects divided by the number of sets. If the division leaves a remainder, the remaining unassigned subjects are assigned to some of the sets according to the procedure described later.

在步驟S62中，對於每個類別，運算裝置50在多個受測者中計算此類別具有第一狀態的匹配數量，進而得到對應於多個類別的多個匹配數量。本步驟S62可視為一個迴圈程序。運算裝置50每次處理一個類別(取決運算裝置50的平行處理能力，也可以每次處理N個類別)，直到所有類別都被處理完成。基於表格2的匹配狀況範例，運算裝置50執行步驟S62後可得到如下方表格3的結果。 In step S62 , for each category, the computing device 50 calculates the matching numbers of the category having the first status among the multiple subjects, and then obtains a plurality of matching numbers corresponding to the multiple categories. This step S62 can be regarded as a loop procedure. The computing device 50 processes one category at a time (depending on the parallel processing capability of the computing device 50 , it can also process N categories at a time), until all the categories are processed. Based on the example of the matching status in Table 2, the calculation device 50 can obtain the results in Table 3 below after executing step S62.

在步驟S63中，運算裝置50依據匹配數量及集合數量計算多個匹配類別需求數量。在一實施例中，匹配類別需求數量為匹配數量除以集合數量得到的平均值。基於表格3的匹配數量範例，運算裝置50執行步驟S63後可得到如下方表格4的結果。例如類別1對應的匹配數量為4，因此每一個集合的匹配類別需求數量為1.3(4/3四捨五入取至小數點後一位)。 In step S63, the computing device 50 calculates a plurality of required quantities of matching categories according to the matching quantity and the set quantity. In an embodiment, the required number of matching categories is an average value obtained by dividing the number of matches by the number of collections. Based on the matching quantity example in Table 3, the calculation device 50 can obtain the results in Table 4 below after executing step S63. For example, the number of matches corresponding to category 1 is 4, so the required number of matching categories for each set is 1.3 (4/3 is rounded to one decimal place).
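Steps S61 to S63 amount to a few lines of arithmetic, sketched below. Tables 2 to 4 are not reproduced in this text, so the matrix `MATCH` is hypothetical: only the row for subject A, the fact that subjects C, E, and F match category 2, and the per-category totals (4, 3, 7, implied by the required numbers 1.3, 1, and 2.3) are taken from the description; every other entry is made up for illustration.

```python
# Hypothetical 0-1 matching matrix: one row per subject, one column per category.
MATCH = {
    "A": [1, 0, 1],
    "B": [0, 0, 1],
    "C": [1, 1, 0],
    "D": [0, 0, 1],
    "E": [0, 1, 0],
    "F": [1, 1, 1],
    "G": [0, 0, 1],
    "H": [1, 0, 1],
    "I": [0, 0, 1],
}
N_SETS = 3

# Step S61: required number of subjects for each set.
subject_demand = [len(MATCH) / N_SETS] * N_SETS

# Step S62: number of subjects whose mark is 1, per category (column sums).
match_counts = [sum(column) for column in zip(*MATCH.values())]

# Step S63: required number of matching subjects, per category and per set.
category_demand = [[count / N_SETS] * N_SETS for count in match_counts]

print(subject_demand)                                 # [3.0, 3.0, 3.0]
print(match_counts)                                   # [4, 3, 7]
print([round(row[0], 1) for row in category_demand])  # [1.3, 1.0, 2.3]
```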

在步驟S64中，運算裝置50判斷血壓資料集中是否存在尚未被分配至集合中的受測者。若判斷為是，則執行步驟S65~S69的流程，然後返回步驟S64重新判斷。步驟S65~S69的流程將重複進行，直到所有受測者皆被分配到某個集合之後，步驟S64的判斷才會變成否，並結束本發明一實施例的資料拆分方法。 In step S64 , the computing device 50 determines whether there are subjects in the blood pressure data set that have not been assigned to the set. If it is judged as yes, then execute the process of steps S65-S69, and then return to step S64 for re-judgment. The process of steps S65-S69 will be repeated until all the subjects are assigned to a certain set, then the judgment of step S64 will become negative, and the data splitting method according to an embodiment of the present invention will end.

在步驟S65中，運算裝置50從多個類別中選擇一者作為指定類別。在一實施例中，指定類別的匹配數量為最小值。例如在表格4中，類別2對應的匹配類別需求數量最小(2.3>1.3>1)，因此在第一輪迭代流程中，運算裝置50選擇類別2作為指定類別。在其他實施例中，運算裝置50隨機選擇指定類別。 In step S65, the computing device 50 selects one of the categories as the designated category. In one embodiment, the designated category is the category whose number of matches is the smallest. For example, in Table 4, category 2 has the smallest required number of matching categories (2.3 > 1.3 > 1), so in the first iteration the computing device 50 selects category 2 as the designated category. In other embodiments, the computing device 50 selects the designated category at random.

在步驟S66中，運算裝置50從血壓資料集中選擇指定受測者。指定受測者的指定類別的標記為第一狀態。承前例，在第一輪迭代流程中，血壓資料集中包括A~I共9個受測者尚未被分配出去。當指定類別為類別2時，受測者C,E,F被選擇作為指定受測者，因為這三個受測者C,E,F在類別2中的標記皆為代表第一狀態的數值1。此外，由於三個受測者C,E,F被選為指定受測者，步驟S66至步驟S69的流程將被重複執行三次，直到所有指定受測者都被分配出去，步驟S66才會在下次迭代時選擇新的指定受測者。 In step S66, the computing device 50 selects a designated subject from the blood pressure dataset. The flag of the specified category of the specified subject is the first state. Following the previous example, in the first round of iterative process, 9 subjects including A~I in the blood pressure data set have not been assigned yet. When the designated category is category 2, subjects C, E, and F are selected as specified subjects, because the marks of these three subjects C, E, and F in category 2 are all value 1 representing the first state. In addition, since the three subjects C, E, and F are selected as designated subjects, the process from step S66 to step S69 will be repeated three times until all designated subjects are assigned, and step S66 will select a new designated subject in the next iteration.
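Steps S65 and S66 can be sketched together as below. The matrix `MATCH` is the same kind of hypothetical stand-in for Table 2 (only subject A's row, the C/E/F matches for category 2, and the column totals come from the text). Counting matches only among subjects that are still unassigned, and skipping categories with no remaining match, are assumptions about how later iterations behave; the function names are hypothetical.

```python
# Hypothetical 0-1 matching matrix: one row per subject, one column per category.
MATCH = {
    "A": [1, 0, 1], "B": [0, 0, 1], "C": [1, 1, 0],
    "D": [0, 0, 1], "E": [0, 1, 0], "F": [1, 1, 1],
    "G": [0, 0, 1], "H": [1, 0, 1], "I": [0, 0, 1],
}

def pick_designated_category(match, unassigned):
    """Step S65: 0-based index of the category with the fewest matches."""
    n_categories = len(next(iter(match.values())))
    counts = {
        c: sum(match[s][c] for s in unassigned) for c in range(n_categories)
    }
    # Assumption: categories without any unassigned match are skipped.
    # min() raises ValueError if no category has a remaining match.
    candidates = {c: n for c, n in counts.items() if n > 0}
    return min(candidates, key=candidates.get)

def designated_subjects(match, unassigned, category):
    """Step S66: unassigned subjects whose mark for the category is 1."""
    return [s for s in sorted(unassigned) if match[s][category] == 1]

unassigned = set(MATCH)
category = pick_designated_category(MATCH, unassigned)
print(category + 1)                                     # 2
print(designated_subjects(MATCH, unassigned, category)) # ['C', 'E', 'F']
```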

在步驟S67中，運算裝置50在多個集合中選擇一者作為指定集合。詳言之，運算裝置50判斷第一條件是否滿足。若判斷為是，則產生指定集合。若判斷為否，則運算裝置50判斷第二條件是否滿足。若判斷為是，則產生指定集合。若判斷為否，則運算裝置50從符合第三條件的多個集合中隨機選擇一者作為指定集合。 In step S67, the computing device 50 selects one of the multiple sets as the designated set. In detail, the computing device 50 judges whether the first condition is satisfied. If the judgment is yes, then generate the specified set. If the determination is negative, the computing device 50 determines whether the second condition is satisfied. If the judgment is yes, then generate the specified set. If the determination is negative, the computing device 50 randomly selects one of the multiple sets meeting the third condition as the specified set.

The first condition is: among all the matching-category demands covered by the designated category, a first maximum is found, and the first maximum occurs exactly once. If the first condition is satisfied, the set corresponding to the first maximum is the designated set.

The second condition is: among all the matching-category demands covered by the designated category, a first maximum is found, and the first maximum occurs more than once; and among all the subject demands, a second maximum is found, and the second maximum occurs exactly once. If the second condition is satisfied, the set corresponding to the second maximum is the designated set.

The third condition is: among all the matching-category demands covered by the designated category, a first maximum is found, and the first maximum occurs more than once; and among all the subject demands, a second maximum is found, and the second maximum occurs more than once.

The following uses Table 4 as an example to illustrate step S67. The designated category is category 2, selected in step S65. The matching-category demands covered by category 2 are (1, 1, 1). Because the maximum is 1 and it occurs three times, the first condition is not satisfied. The subject demands of sets 1 to 3 are (3, 3, 3). Because the maximum is 3 and it occurs three times, the second condition is not satisfied. The sets meeting the third condition are set 1, set 2, and set 3, so the computing device 50 randomly selects one of these three sets as the designated set.
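The three conditions of step S67 can be read as a two-level tie-break followed by a random pick. The Python sketch below is one possible reading, not a definitive implementation. The function name and the dictionary layout are hypothetical, and it assumes that the second maximum is searched only among the sets tied on the first maximum; the text does not state this restriction explicitly, although both worked examples are consistent with it.

```python
import random

def select_designated_set(category_demand, subject_demand, rng=random):
    """Pick a set for the designated subject (sketch of step S67).

    category_demand: {set_id: demand for the designated category}
    subject_demand:  {set_id: number of subjects the set still needs}
    """
    first_max = max(category_demand.values())
    tied = [s for s, d in category_demand.items() if d == first_max]
    if len(tied) == 1:          # first condition: unique first maximum
        return tied[0]
    # Assumption: only sets tied on the first maximum remain candidates.
    second_max = max(subject_demand[s] for s in tied)
    tied = [s for s in tied if subject_demand[s] == second_max]
    if len(tied) == 1:          # second condition: unique second maximum
        return tied[0]
    return rng.choice(tied)     # third condition: random pick among ties

# First-iteration inputs from the Table 4 example.
first_round = select_designated_set({1: 1, 2: 1, 3: 1}, {1: 3, 2: 3, 3: 3})
```

Because every set ties on both levels in this example, the result is a random choice among sets 1, 2, and 3, as in the text.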

In step S68, the computing device 50 assigns the designated subject to the designated set and removes the designated subject from the blood pressure dataset. Continuing the example, the designated subjects produced in step S66 are subjects C, E, and F, and the candidate designated sets produced in step S67 are sets 1, 2, and 3. In one embodiment, when step S66 yields several designated subjects, the computing device 50 may randomly pick one of them for step S68. For example, in step S68 the computing device 50 assigns subject C to set 1 and then removes the blood pressure data of subject C from the blood pressure dataset.

In step S69, the computing device 50 updates the matching-category demands and the subject demands. Continuing the example, the updated result is shown in Table 5 below. Note that the matching-category demand of set 1 for category 2 drops from 1 to 0 (1 - 1), because set 1 has been assigned subject C, whose mark for category 2 is 1. In addition, the subject demand of set 1 drops from 3 to 2 (3 - 1), because set 1 has been assigned one subject, C.

After step S69 is completed, the flow returns to step S64. Because subjects A, B, and D to I remain in the blood pressure dataset, the determination in step S64 is yes, and the second iteration of steps S65 to S69 is executed.

In step S65, the matching-category demand corresponding to category 2 is still the smallest (note: matching-category demands below 1 are not considered), so in the second iteration the computing device 50 again selects category 2 as the designated category.

In step S66, the designated subjects selected by the computing device 50 are E and F.

In step S67, the matching-category demands of set 2 and set 3 for category 2 are (1, 1). Because the maximum is 1 and it occurs twice, the first condition is not satisfied. The subject demands of sets 1 to 3 are (2, 3, 3). Because the maximum is 3 and it occurs twice, the second condition is not satisfied. The sets meeting the third condition are set 2 and set 3, so the computing device 50 randomly selects one of these two sets as the designated set.

In step S68, for example, designated subject E is assigned to designated set 2, and designated subject E is removed from the blood pressure dataset.

In step S69, the updated result is shown in Table 6 below. Note that the matching-category demands of set 2 are decremented according to the matching condition (0, 1, 1) of subject E.
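The update of step S69 amounts to subtracting the subject's marks from the receiving set's matching-category demands and lowering its subject demand by one. The Python sketch below shows this for subject E and set 2. The function name is hypothetical, and the starting demands (2.3, 1.0, 1.3) assume the same category order as the Table 4 values quoted earlier.

```python
def apply_assignment(category_demand, subject_demand, marks):
    """Return the demands of a set after it receives one subject."""
    # Each mark is 1 (first state) or 0 (second state).
    new_category_demand = [d - m for d, m in zip(category_demand, marks)]
    return new_category_demand, subject_demand - 1

# Set 2 before the second iteration (assumed category order), then
# subject E with matching condition (0, 1, 1) is assigned to it.
updated, remaining_slots = apply_assignment([2.3, 1.0, 1.3], 3, (0, 1, 1))
```

Only the categories that subject E matches are decremented, so the demand for category 1 is unchanged while the demands for categories 2 and 3 each drop by 1.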

By analogy with the above, each iteration of steps S65 to S69 assigns one designated subject to one designated set, so the total number of iterations equals the number of subjects in the blood pressure dataset. Table 7, Table 8, and Table 9 show the third iteration, the sixth iteration, and the last (ninth) iteration respectively. For ease of understanding, this example assumes that "random selection" is implemented by following alphabetical order and numerical order.
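The whole loop of steps S64 to S69 can be sketched in Python as follows. This is an illustrative reading under several assumptions, not the implementation of the embodiment. The function `split_subjects` and its arguments are hypothetical. The marks are invented except for the facts stated in the text (only C, E, and F match category 2, and subject E has the matching condition (0, 1, 1)). The designated category is taken to be the one with the fewest still-unassigned matching subjects, every subject is assumed to match at least one category, and "random selection" is replaced by taking the first candidate in order, as the example above does. `fractions.Fraction` keeps the tie comparisons exact.

```python
from fractions import Fraction

def split_subjects(marks, n_sets, choose=lambda options: options[0]):
    """marks: {subject: tuple of 0/1 marks}. Returns {set_id: [subjects]}."""
    remaining = sorted(marks)
    n_cats = len(marks[remaining[0]])
    sets = list(range(1, n_sets + 1))
    # Initial demands: every set asks for an equal share.
    subject_demand = {s: Fraction(len(remaining), n_sets) for s in sets}
    totals = [sum(marks[p][c] for p in remaining) for c in range(n_cats)]
    category_demand = {s: [Fraction(t, n_sets) for t in totals] for s in sets}
    result = {s: [] for s in sets}
    while remaining:                                   # S64
        # S65: category with the fewest unassigned matches (assumption).
        counts = {c: sum(marks[p][c] for p in remaining) for c in range(n_cats)}
        live = {c: n for c, n in counts.items() if n > 0}
        if not live:
            raise ValueError("a subject matches no category")
        cat = min(live, key=live.get)
        # S66: one subject whose mark for that category is 1.
        person = choose([p for p in remaining if marks[p][cat] == 1])
        # S67: tie-break on category demand, then on subject demand.
        first_max = max(category_demand[s][cat] for s in sets)
        tied = [s for s in sets if category_demand[s][cat] == first_max]
        if len(tied) > 1:
            second_max = max(subject_demand[s] for s in tied)
            tied = [s for s in tied if subject_demand[s] == second_max]
        target = choose(tied)
        # S68: assign the subject and remove it from the dataset.
        result[target].append(person)
        remaining.remove(person)
        # S69: update both kinds of demand for the receiving set.
        subject_demand[target] -= 1
        for c in range(n_cats):
            category_demand[target][c] -= marks[person][c]
    return result

# Hypothetical marks for subjects A to I over categories 1 to 3.
example_marks = {
    "A": (1, 0, 0), "B": (1, 0, 1), "C": (1, 1, 0),
    "D": (1, 0, 0), "E": (0, 1, 1), "F": (0, 1, 1),
    "G": (1, 0, 1), "H": (1, 0, 0), "I": (1, 0, 0),
}
assignment = split_subjects(example_marks, n_sets=3)
```

With these hypothetical marks, the first two iterations reproduce the assignments described above (subject C to set 1, then subject E to set 2), and each set ends up with three subjects and exactly one subject matching category 2. Passing a different `choose` callable, for example `random.choice`, restores the random selection of the embodiment.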

Please refer to FIG. 6 and FIG. 7. FIG. 6 is a schematic diagram of the systolic and diastolic blood pressure distributions produced by a conventional data splitting method, and FIG. 7 is a schematic diagram of the systolic and diastolic blood pressure distributions produced by the data splitting method for validating machine learning according to an embodiment of the present invention. In FIG. 6, the distributions of the training dataset, the validation dataset, and the test dataset are not consistent. Taking the systolic blood pressure distribution in FIG. 6 as an example, the training dataset has its largest amount of data at about 130 mmHg, whereas the validation dataset has its largest amount of data at about 125 mmHg. In addition, the validation dataset and the test dataset both contain a small amount of data at about 190 mmHg, but the training dataset has almost no data there. As a result, the trained blood pressure estimation model cannot be used to estimate systolic blood pressure values above 190 mmHg. In FIG. 7, for both the systolic and the diastolic blood pressure distributions, the training dataset, the validation dataset, and the test dataset show similar distribution trends. For example, if one dataset has more data in blood pressure interval A and less data in blood pressure interval B, the other datasets have the same characteristic. This way of splitting the dataset therefore helps to improve the generalizability and accuracy of the blood pressure prediction model.

In summary, the data splitting system and method for validating machine learning proposed by the present invention have the following contributions or effects. First, the proposed method keeps all samples from the same subject in the same set (the training dataset or the test dataset). Second, the proposed method achieves similar blood pressure distributions across the training dataset, the validation dataset, and the test dataset. Third, the proposed method preserves the blood pressure distributions of the different datasets when multiple constraints are present (for example, systolic blood pressure and diastolic blood pressure, and optionally any other constraint that is related to blood pressure and therefore affects model training, such as pulse rate or heartbeat). This means that the systolic blood pressure distributions of the training/validation dataset and the test dataset are similar, and the diastolic blood pressure distributions of these datasets are similar as well.

Although the present invention is disclosed above by the foregoing embodiments, they are not intended to limit the present invention. Changes and modifications made without departing from the spirit and scope of the present invention fall within the scope of patent protection of the present invention. For the scope of protection defined by the present invention, please refer to the appended claims.

S1~S6: steps

Claims

A data splitting system for verifying machine learning, comprising: a measurement device used to generate a blood pressure data set, wherein the blood pressure data set includes blood pressure data of each of a plurality of subjects, and the types of the blood pressure data include systolic blood pressure and diastolic blood pressure; a storage device communicating with the measurement device to receive and store the blood pressure data set, and storing a computer-readable recording medium; and a computing device communicating with the storage device. dividing the measurement range of the diastolic blood pressure into a plurality of second intervals; generating a plurality of categories according to the first intervals and the second intervals, each of the categories including one of the first intervals and one of the second intervals; judging and recording the matching status of the blood pressure data of each of the subjects with the categories, and then generating a plurality of matching conditions corresponding to the subjects, wherein each of the matching conditions includes a plurality of marks corresponding to the categories, and each of the marks has one of a first state and a second state, the first state represents that the systolic blood pressure data in the blood pressure data is located in the first interval included in the category corresponding to the mark and the diastolic blood pressure data in the blood pressure data is located in the second interval included in the category corresponding to the mark, and the second state represents that the blood pressure data do not match the category corresponding to the mark;

The data splitting system for verifying machine learning as described in claim 1, wherein the computing device running the computer-readable recording medium is further used to perform the following steps: calculating a subject demand corresponding to each of the sets according to the number of the subjects and the number of the sets; for each of the categories, counting the subjects whose mark for the category has the first state, thereby obtaining a plurality of matching quantities corresponding to the categories; calculating a plurality of matching-category demands according to the matching quantities and the number of the sets; and when one or more of the subjects remain in the blood pressure data set, performing the following steps: selecting one of the categories as a specified category, wherein the matching quantity of the specified category is a minimum value; selecting a specified subject from the blood pressure data set, wherein the mark of the specified subject for the specified category is the first state; selecting one of the sets as a specified set according to the specified category; and assigning the specified subject to the specified set and removing the specified subject from the blood pressure data set.

The data splitting system for verifying machine learning as described in claim 2, wherein selecting one of the sets as the specified set by the computing device running the computer-readable recording medium includes: finding a maximum value among the matching-category demands corresponding to the sets; and when the number of occurrences of the maximum value is equal to 1, using the set corresponding to the maximum value as the specified set.

The data splitting system for verifying machine learning as described in claim 3, wherein the computing device running the computer-readable recording medium is further used to perform the following steps: when the number of occurrences of the maximum value is greater than 1, finding a maximum subject demand among the subject demands corresponding to the sets; and when the number of occurrences of the maximum subject demand is equal to 1, using the set corresponding to the maximum subject demand as the specified set.

The data splitting system for verifying machine learning as described in claim 4, wherein the computing device running the computer-readable recording medium is further used to perform the following step: when the number of occurrences of the maximum subject demand is greater than 1, randomly selecting one of the sets corresponding to the maximum subject demand as the specified set.