TW201303618A

TW201303618A - Method and computer program product of producing a model for use in predicting to an event, apparatus for determining time-to-event predictive information, method and computer program product for determining a predictive diagnosis for a patient

Info

Publication number: TW201303618A
Application number: TW101124423A
Authority: TW
Inventors: Olivier Saidi; David A Verbel
Original assignee: Aureon Biosciences Corp
Priority date: 2003-11-18
Filing date: 2004-11-18
Publication date: 2013-01-16
Also published as: TW200532489A

Abstract

A method of producing a model for use in predicting time to an event includes obtaining multi-dimensional, non-linear vectors of information indicative of status of multiple test subjects, at least one of the vectors being right-censored, lacking an indication of a time of occurrence of the event with respect to the corresponding test subject, and performing regression using the vectors of information to produce a kernel-based model to provide an output value related to a prediction of time to the event based upon at least some of the information contained in the vectors of information, where for each vector comprising right-censored data, a censored-data penalty function is used to affect the regression, the censored-data penalty function being different than a non-censored-data penalty function used for each vector comprising non-censored data.

Description

Method and computer program product for producing a model for use in predicting time to an event, apparatus for determining time-to-event predictive information, and method and computer program product for determining a predictive diagnosis for a patient

The present invention relates to time-to-event analysis and, more specifically, to time-to-event analysis of right-censored data.

In many situations it is desirable to predict the likelihood that an event will occur (initially and/or again) within a given period, and/or the amount of time until the event is likely to occur. In the medical field, for example, it is useful to predict whether, and if so when, a disease will recur in a patient who has been treated for it. Mathematical models have been developed to make such time-to-event predictions from data obtained from actual cases. In the example above, such a predictive model can be developed by studying a cohort of patients who were treated for a particular disease and identifying the common features that distinguish patients whose disease recurred from those whose disease did not. By taking into account the actual times of recurrence within the cohort, features and feature values can also be identified that correlate with recurrence at particular times. These features can then be used to predict the time to recurrence for a future patient based on that patient's individual feature profile. Such time-to-event predictions can help a treating physician assess and plan treatment for the occurrence of the event.

A distinctive characteristic of time-to-event data is that the event of interest (in this example, disease recurrence) may not yet have been observed. This happens, for example, when a patient in the cohort visits the doctor but the disease has not recurred. Data from such a visit is called "right-censored" because part of the desired information, the time of the event, is missing (i.e., the event of interest, such as disease recurrence, has not yet occurred). Although censored data by definition lacks some information, it is useful to take it into account when developing a predictive model because it provides more data points for fitting the model's parameters. Indeed, time-to-event data, and right-censored time-to-event data in particular, is among the most common types of data in clinical, pharmaceutical, and biomedical research.

When forming or training a mathematical model, it is generally desirable to incorporate as much data from as many sources as possible. For medical time-to-event prediction, for example, it is generally desirable to have data from as many patients as possible and as much relevant data from each patient as possible. With such large amounts of data, however, it is difficult to process all of the available information. Various models exist, but none is fully satisfactory for processing high-dimensional, heterogeneous data sets that include right-censored data. For example, the Cox proportional hazards model is a well-known model for analyzing censored data that identifies differences in outcome due to patient features, within a framework that assumes that the failure rates of any two patients are proportional and that the patients' independent features affect the hazard in various ways. Although the Cox model handles right-censored data properly, it is not ideal for analyzing high-dimensional data sets because it is limited by the total regression degrees of freedom in the model and requires a sufficient number of patients when the model is complex. Support vector machines (SVMs), on the other hand, perform well on high-dimensional data sets but are not well suited to censored data.

In general, in one aspect, the invention provides a method of producing a model for use in predicting time to an event, the method comprising: obtaining multi-dimensional, non-linear vectors of information indicative of status of multiple test subjects, at least one of the vectors being right-censored, lacking an indication of a time of occurrence of the event with respect to the corresponding test subject; and performing regression using the vectors of information to produce a kernel-based model that provides an output value related to a prediction of time to the event based upon at least some of the information contained in the vectors of information, wherein for each vector comprising right-censored data, a censored-data penalty function is used to affect the regression, the censored-data penalty function being different from a non-censored-data penalty function used for each vector comprising non-censored data.

Implementations of the invention may include one or more of the following features. The regression comprises support vector machine regression. The censored-data penalty function has a larger positive slack variable than the non-censored-data penalty function. Performing the regression includes using penalty functions that include linear functions of a difference between a predicted value of the model and a target value for the predicted value, and a first slope of the linear function for positive differences between the predicted and target values for the censored-data penalty function is lower than a second slope of the linear function for positive differences between the predicted and target values for the non-censored-data penalty function. The first slope is substantially equal to a third slope of the linear function for negative differences between the predicted and target values for the censored-data penalty function and to a fourth slope of the linear function for negative differences between the predicted and target values for the non-censored-data penalty function, and the positive and negative slack variables of the non-censored-data penalty function are substantially equal to the negative slack variable of the censored-data penalty function.
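The relationships between the two penalty functions can be illustrated with a minimal Python sketch. It assumes that each penalty is piecewise linear in the difference between predicted and target values, with a zero-cost band bounded by the positive and negative slack variables. The names `Penalty` and `svrc_penalty` and every numeric value are hypothetical, chosen only so that the relationships listed in the preceding paragraph hold; they are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Penalty:
    """Piecewise-linear penalty on e = predicted - target (assumed form)."""
    eps_pos: float    # positive slack: tolerated over-prediction
    eps_neg: float    # negative slack: tolerated under-prediction
    slope_pos: float  # slope applied beyond the positive slack
    slope_neg: float  # slope applied beyond the negative slack

    def __call__(self, predicted: float, target: float) -> float:
        e = predicted - target
        if e > self.eps_pos:
            return self.slope_pos * (e - self.eps_pos)
        if e < -self.eps_neg:
            return self.slope_neg * (-e - self.eps_neg)
        return 0.0  # inside the zero-cost band


# Illustrative values only (assumptions, not values from the source).
# Censored: larger positive slack, lower positive slope ("first slope").
CENSORED = Penalty(eps_pos=2.0, eps_neg=0.5, slope_pos=0.5, slope_neg=0.5)
# Non-censored: steeper positive slope ("second slope"), equal slacks.
NON_CENSORED = Penalty(eps_pos=0.5, eps_neg=0.5, slope_pos=1.0, slope_neg=0.5)


def svrc_penalty(predicted: float, target: float, censored: bool) -> float:
    """Pick the penalty function according to the censoring flag."""
    penalty = CENSORED if censored else NON_CENSORED
    return penalty(predicted, target)
```

With these assumed values, predicting a time later than the target of a right-censored vector costs less than the same over-prediction for a non-censored vector, which reflects the fact that the true event time of a censored subject is known only to lie at or beyond the censoring time.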

Implementations of the invention may also include one or more of the following features. The data of the vectors are associated with categories based on at least one characteristic of the data that relates to the data's ability to help the model provide the output value such that the output value helps predict time to the event, and the method further comprises performing the regression using the data from the vectors in sequence, from the category with data most likely to help the model to the category with data least likely to help the model provide the output value such that the output value helps predict time to the event. The at least one characteristic is at least one of reliability and predictive power. The regression is performed in a greedy-forward manner in accordance with features of the data to select features to be used in the model. The method further comprises performing a greedy-backward procedure on the features of the vectors, after performing the regression, to further select features to be used in the model. The regression is performed in the greedy-forward manner with respect to only a portion of the features of the vectors. The vectors include categories of data of clinical/histopathological data, biomarker data, and bioimaging data, and the regression is performed in the greedy-forward manner with respect to only the biomarker data and the bioimaging data of the vectors. The vectors of information are indicative of status of test subjects that are at least one of living, previously living, and inanimate.
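The greedy-forward and greedy-backward procedures mentioned above can be sketched generically in Python. This is an illustration under stated assumptions, not the patented procedure: the caller supplies a hypothetical `score` function (higher is better) that would, in practice, train and evaluate a model on the given feature subset, and a feature is added or dropped only when doing so strictly improves the score. The table `_TOY_SCORES` is made-up data used purely to exercise the functions.

```python
def forward_greedy(features, score):
    """Add, one at a time, the feature that most improves the score."""
    selected, best = [], score([])
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in selected:
                continue
            s = score(selected + [f])
            if s > best:
                best, pick, improved = s, f, True
        if improved:
            selected.append(pick)
    return selected


def backward_greedy(selected, score):
    """Drop, one at a time, the feature whose removal most improves the score."""
    selected = list(selected)
    best = score(selected)
    improved = True
    while improved and selected:
        improved = False
        for f in selected:
            s = score([g for g in selected if g != f])
            if s > best:
                best, drop, improved = s, f, True
        if improved:
            selected.remove(drop)
    return selected


# Made-up scores for every subset of three hypothetical features.
_TOY_SCORES = {
    frozenset(): 0.0,
    frozenset("a"): 0.60, frozenset("b"): 0.55, frozenset("c"): 0.50,
    frozenset("ab"): 0.70, frozenset("ac"): 0.68, frozenset("bc"): 0.80,
    frozenset("abc"): 0.78,
}


def toy_score(features):
    return _TOY_SCORES[frozenset(features)]


forward = forward_greedy(["a", "b", "c"], toy_score)   # ['a', 'b', 'c']
pruned = backward_greedy(forward, toy_score)           # ['b', 'c']
```

In the toy data, feature `a` looks best on its own and is picked first, but becomes redundant once `b` and `c` are both present; the backward pass removes it. This is the kind of further selection a backward step can provide after a forward pass.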

In general, in another aspect, the invention provides a computer program product for producing a model for use in predicting time to an event, the computer program product residing on a computer-readable medium and comprising computer-readable, computer-executable instructions for causing a computer to: obtain multi-dimensional, non-linear vectors of information indicative of status of multiple test subjects, at least one of the vectors being right-censored, lacking an indication of a time of occurrence of the event with respect to the corresponding test subject; and perform regression using the vectors of information to produce a kernel-based model that provides an output value related to a prediction of time to the event based upon at least some of the information contained in the vectors of information, wherein for each vector comprising right-censored data, a censored-data penalty function is used to affect the regression, the censored-data penalty function being different from a non-censored-data penalty function used for each vector comprising non-censored data.

Implementations of the invention may include one or more of the following features. The regression comprises support vector machine regression. The censored-data penalty function has a larger positive slack variable than the non-censored-data penalty function. The instructions for causing the computer to perform regression include instructions for causing the computer to use penalty functions that include linear functions of a difference between a predicted value of the model and a target value for the predicted value, and a first slope of the linear function for positive differences between the predicted and target values for the censored-data penalty function is lower than a second slope of the linear function for positive differences between the predicted and target values for the non-censored-data penalty function. The first slope is substantially equal to a third slope of the linear function for negative differences between the predicted and target values for the censored-data penalty function and to a fourth slope of the linear function for negative differences between the predicted and target values for the non-censored-data penalty function, and the positive and negative slack variables of the non-censored-data penalty function are substantially equal to the negative slack variable of the censored-data penalty function.

Implementations of the invention may also include one or more of the following features. The instructions for causing the computer to perform regression cause the regression to be performed using the data from the vectors in sequence, from the category with data most likely to help the model to the category with data least likely to help the model provide the output value such that the output value helps predict time to the event. The instructions for causing the computer to perform regression cause the regression to be performed in a greedy-forward manner in accordance with features of the data to select features to be used in the model. The computer program product further comprises instructions for causing the computer to perform a greedy-backward procedure on the features of the vectors, after performing the regression, to further select features to be used in the model. The instructions for causing the computer to perform regression in the greedy-forward manner cause the computer to perform the regression in the greedy-forward manner with respect to only a portion of the features of the vectors. The vectors include categories of data of clinical/histopathological data, biomarker data, and bioimaging data, and the instructions for causing the computer to perform regression in the greedy-forward manner cause the computer to perform greedy-forward feature selection with respect to only the biomarker data and the bioimaging data of the vectors.

In general, in one aspect, the invention provides a method of producing a model for use in predicting time to an event, the method comprising: obtaining multi-dimensional, non-linear vectors of information indicative of status of multiple test subjects; and performing regression using the vectors of information to produce a kernel-based model that provides an output value related to a prediction of time to the event based upon at least some of the information contained in the vectors of information, wherein the data of the vectors are associated with categories based on at least one characteristic of the data that relates to the data's ability to help the model provide the output value such that the output value helps predict time to the event, and wherein the regression is performed using the data from the vectors in sequence, from the category with data most likely to help the model to the category with data least likely to help the model provide the output value such that the output value helps predict time to the event.

Implementations of the invention may also include one or more of the following features. The regression is performed in a greedy-forward manner in accordance with features of the data to select features to be used in the model. The method further comprises performing a greedy-backward procedure on the features of the vectors, after performing the regression, to further select features to be used in the model. The regression is performed in the greedy-forward manner with respect to only a portion of the features of the vectors. The vectors include categories of data of clinical/histopathological data, biomarker data, and bioimaging data, and the regression is performed in a non-greedy-forward manner with respect to the clinical/histopathological data and in the greedy-forward manner with respect to only the biomarker data and the bioimaging data of the vectors. At least one of the vectors is right-censored, lacking an indication of a time of occurrence of the event with respect to the corresponding test subject.

In general, in another aspect, the invention provides a computer program product for producing a model for use in predicting time to an event, the computer program product residing on a computer-readable medium and comprising computer-readable, computer-executable instructions for causing a computer to: obtain multi-dimensional, non-linear vectors of information indicative of status of multiple test subjects, at least one of the vectors being right-censored, lacking an indication of a time of occurrence of the event with respect to the corresponding test subject; and perform regression using the vectors of information to produce a kernel-based model that provides an output value related to a prediction of time to the event based upon at least some of the information contained in the vectors of information, wherein the data of the vectors are associated with categories based on at least one characteristic of the data that relates to the data's ability to help the model provide the output value such that the output value helps predict time to the event, and wherein the regression is performed using the data from the vectors in sequence, from the category with data most likely to help the model to the category with data least likely to help the model provide the output value such that the output value helps predict time to the event.

Implementations of the invention may include one or more of the following features. The regression may be performed in a greedy-forward manner in accordance with features of the data to select features to be used in the model. The computer program product further comprises instructions for causing the computer to perform a greedy-backward procedure on the features of the vectors, after performing the regression, to further select features to be used in the model. The regression is performed in the greedy-forward manner with respect to only a portion of the features of the vectors. The vectors include categories of data of clinical/histopathological data, biomarker data, and bioimaging data, and the regression is performed, in that order, in a non-greedy-forward manner with respect to the clinical/histopathological data and in the greedy-forward manner with respect to only the biomarker data and the bioimaging data of the vectors.

In general, in another aspect, the invention provides a method of determining a predictive diagnosis for a patient, the method comprising: receiving at least one of clinical and histopathological data associated with the patient; receiving biomarker data associated with the patient; receiving bioimaging data associated with the patient; and applying at least a portion of the at least one of clinical and histopathological data, at least a portion of the biomarker data, and at least a portion of the bioimaging data to a kernel-based mathematical model to calculate a value indicative of a diagnosis for the patient.

Implementations of the invention may include one or more of the following features. The at least a portion of the biomarker data comprises data for less than all biomarker features of the patient. The at least a portion of the biomarker data comprises data for less than ten percent of all biomarker features of the patient. The at least a portion of the biomarker data comprises data for less than five percent of all biomarker features of the patient. The at least a portion of the biomarker data comprises data for less than all bioimaging features of the patient. The at least a portion of the biomarker data comprises data for less than one percent of all bioimaging features of the patient. The at least a portion of the biomarker data comprises data for less than 0.2 percent of all bioimaging features of the patient. The value is indicative of at least one of a time to recurrence of a health-related condition and a probability of recurrence of the health-related condition.

In general, in another aspect, the invention provides an apparatus for determining time-to-event predictive information, the apparatus comprising: an input configured to obtain multi-dimensional, non-linear first data associated with a possible future event; and a processing device configured to use the first data in a kernel-based mathematical model, derived at least partially from a regression analysis of multi-dimensional, non-linear, right-censored second data that determines parameters of the model that affect calculations of the model, to calculate the predictive information, which is indicative of at least one of a predicted time to the possible future event and a probability of the possible future event.

Implementations of the invention may include one or more of the following features. The input and the processing device comprise portions of a computer program product residing on a computer-readable medium, the computer program product comprising computer-readable, computer-executable instructions for causing a computer to obtain the first data and to use the first data in the mathematical model to calculate the predictive information. The first data comprise at least one of clinical and histopathological data, biomarker data, and bioimaging data associated with a patient, and the processing device is configured to apply at least a portion of the at least one of clinical and histopathological data, at least a portion of the biomarker data, and at least a portion of the bioimaging data to the kernel-based mathematical model to calculate the predictive information for the patient. The at least a portion of the biomarker data comprises data for less than five percent of all biomarker features of the patient. The at least a portion of the biomarker data comprises data for less than all bioimaging features of the patient. The at least a portion of the biomarker data comprises data for less than 0.2 percent of all bioimaging features of the patient.

In general, in another aspect, the invention provides a computer program product for determining a predictive diagnosis for a patient, the computer program product residing on a computer-readable medium and comprising computer-readable, computer-executable instructions for causing a computer to: receive at least one of clinical and histopathological data associated with the patient; receive biomarker data associated with the patient; receive bioimaging data associated with the patient; and apply at least a portion of the at least one of clinical and histopathological data, at least a portion of the biomarker data, and at least a portion of the bioimaging data to a kernel-based mathematical model to calculate a value indicative of a diagnosis for the patient.

Implementations of the invention may include one or more of the following features. The at least a portion of the biomarker data comprises data for less than all biomarker features of the patient. In the computer program product of claim 50, the at least a portion of the biomarker data comprises data for less than ten percent of all biomarkers of the patient. The at least a portion of the biomarker data comprises data for less than five percent of all biomarker features of the patient.

Implementations of the invention may include one or more of the following features. The at least a portion of the biomarker data comprises data for less than all bioimaging features of the patient. The at least a portion of the biomarker data comprises data for less than one percent of all bioimaging features of the patient. The at least a portion of the biomarker data comprises data for less than 0.2 percent of all bioimaging features of the patient. The value is indicative of at least one of a time to recurrence of a health-related condition and a probability of recurrence of the health-related condition.

The invention provides novel techniques that, for example, take advantage of the high-dimensional capability of support vector regression (SVR) while adapting it for use with censored data, in particular right-censored data. Support vector regression for censored data (SVRc) may provide various advantages and capabilities. Because much of the information available to form or train a predictive model may be censored, SVRc can increase a model's predictive accuracy by using both censored and non-censored data in SVR. With SVRc, high-dimensional data with few outcome data points, including right-censored observations, can be used to produce a time-to-event predictive model. Features of the high-dimensional data can be pared down to leave a reduced set of features for use in the time-to-event predictive model, so that time-to-event prediction accuracy can be improved.

These and other capabilities of the invention, along with the invention itself, will be more fully understood after a review of the following figures, detailed description, and claims.

Embodiments of the invention provide techniques for improving the accuracy of predicting time to an event. To develop an improved model for predicting time to an event, a novel modified loss/penalty function is used within a support vector machine (SVM) for right-censored, heterogeneous data. Using this newly modified loss/penalty function, the SVM can meaningfully process right-censored data and thereby perform support vector regression on censored data (referred to here as SVRc). Data for developing the model may come from a variety of test subjects, depending on the event to be predicted. For example, test subjects may be living or previously living subjects, such as humans or other animals for medical applications. Test subjects may also be inanimate objects for medical or non-medical applications. For example, inanimate test subjects could be car parts for wear analysis, or financial reports, such as stock-market performance for financial services, and so on.

In an exemplary embodiment, SVRc can be used to produce a model for predicting recurrence of cancer. Such a model may analyze features from three different feature domains taken from a patient cohort: (i) clinical/histopathological features; (ii) biomarker features; and (iii) bioimaging features, where features are added to the model in phases, with the features selected from the different domains serving as anchors for the subsequent phases.

Clinical features refer to patient-specific data that a physician can collect during a routine office visit. These data can include information such as age, race, and sex, as well as some disease-related information, such as clinical stage or laboratory parameters, for example prostate-specific antigen (PSA).

Histopathological features refer to information pertaining to pathology that describes the essential nature of the disease, in particular the structural and functional changes in the tissues and organs of the body caused by the disease. Examples of histopathological features include the Gleason score, surgical margin status, and ploidy information.

Biomarker features refer to information relating to biochemicals in the body that have particular molecular features and that can be used to measure the progress of a disease or the effects of treatment. One example of a type of biomarker feature is information pertaining to the use of an antibody to identify a specific cell type, cell organelle, or cell component. Biomarker features can include, for example, the percentage of cells in a sample that stain positive for several biomarkers and the staining intensity of those biomarkers.

Bio-imaging features refer to information derived by using mathematical and computational sciences to study a digital image of tissue or cells. Examples of such information are the mean, maximum, minimum, and standard deviation of lumen. The clinical/histopathological features, biomarker features, and bio-imaging features are presented in the Appendix. The various features can be obtained and analyzed using software such as Cellenger, commercially available from Definiens AG (www.definiens.com), and MATLAB, commercially available from The MathWorks, Inc. (www.mathworks.com).

In this example, features from the three domains are added to the model in three stages (e.g., first stage: clinical/histopathological data; second stage: the selected clinical/histopathological features are used as an anchor and biomarker features are added; third stage: the selected clinical/histopathological features and selected biomarker features are used as an anchor and bio-imaging (IMG) features are added). The resulting model includes the selected features and the model parameters iteratively adjusted/tuned for those features. Other embodiments are within the scope of the invention.

Embodiments of the invention may be used in a wide variety of applications. In the medical field, for example, embodiments may be used to predict the time to events such as recurrence of prostate-specific antigen (PSA). Embodiments may also be used for predictive diagnostics for a variety of chronic diseases or other health-related events, including response to a drug or hormone, or to a radiation or chemotherapy regimen. Further applications include tissue-based clinical trials and clinical trials generally. Other applications in which it is desirable to predict the occurrence of an event are also possible. From the health field, examples include predicting infection in kidney dialysis patients, infection in burn patients, and weaning of newborns. In other fields, engineers, for example, could predict when a brake pad will fail.

In the medical-field embodiment shown in FIG. 1, an SVRc system 10 includes data sources of clinical/histopathological measurement/data gathering 12, biomarker data gathering 14, and bio-image data measurement/collection 16, and a data regression and analysis device 18 that provides a predictive diagnosis output 26. The data sources 12, 14, 16 may include appropriate personnel (e.g., physicians), data records (e.g., medical databases), and/or machinery (e.g., imaging devices, staining equipment, etc.). The regression and analysis device 18 includes a computer 20 with memory 22 and a processor 24 configured to execute computer-readable, computer-executable software code instructions for performing SVRc. The computer 20 is shown representatively as a personal computer, although other forms of computing devices are acceptable. The device 18 is further configured to provide, as the output 26, data that indicate, or can be processed to indicate, a predicted time to an event. For example, the output 26 may be a predictive diagnosis of the time to occurrence (including recurrence) of cancer in a patient. The output 26 may be provided on a display screen 28 of the regression and analysis device 18.

The computer 20 of the regression and analysis device 18 is configured to perform SVRc by providing an SVM that is modified to analyze both censored and non-censored data. The computer 20 can process data according to the following structure of SVRc.

SVRc structure

The data set T has N samples, T = {z_i}, i = 1, ..., N, where z_i = {x_i, y_i, s_i}, x_i ∈ R^n (with R the set of real numbers) is the sample vector, y_i ∈ R is the target value (i.e., the time of occurrence to be predicted), and s_i ∈ {0, 1} is the censorship status of the corresponding sample. The sample vector is the vector of features for the i-th of the N samples/patients. The target value y is the actual time to the detected event (e.g., recurrence) for non-censored data and the last known time of observation for censored data. If the censorship status s_i is 1, the i-th sample z_i is a censored sample; if s_i is 0, the i-th sample z_i is a non-censored sample. When s_i = 0 for all i = 1, ..., N, the data set T becomes a normal, completely non-censored data set. Alternatively, a data set in which s_i = 1 indicates a non-censored sample and s_i = 0 indicates a censored sample is also valid; in that case, SVRc is controlled to treat censorship in the opposite manner.

The SVRc formulation constructs a linear regression function

f(x) = W^T Φ(x) + b    (1)

in a feature space F, where f(x) is the predicted time to event for sample x. Here W is a vector in F, and Φ(x) maps the input x to a vector in F. The W and b in (1) are obtained by solving an optimization problem, the general form of which is:
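The general form itself is not reproduced in this text. In the standard SVR literature it is usually written as below; this is offered as an assumed reconstruction rather than a copy of the original equation, with ε denoting the tolerance discussed in the following paragraphs.

```latex
\min_{W,\,b}\ \tfrac{1}{2}\,\lVert W\rVert^{2}
\quad\text{subject to}\quad
\begin{cases}
y_i-\bigl(W^{T}\Phi(x_i)+b\bigr)\le\varepsilon,\\[2pt]
\bigl(W^{T}\Phi(x_i)+b\bigr)-y_i\le\varepsilon,
\end{cases}
\qquad i=1,\dots,N
```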

This equation, however, assumes that the convex optimization problem is always feasible, which may not be the case here. Furthermore, it is desirable to allow for small errors in the regression estimation. For these reasons, a loss function is used for SVR. The loss allows some leeway in the regression estimation. Ideally, the model built would compute every outcome exactly, which is not feasible. The loss function allows a range of error around the ideal, with this range controlled by the slack variables ξ and ξ^* and a penalty C. Errors that deviate from the ideal but fall within the range defined by ξ and ξ^* are counted, but their contribution is mitigated by C. The more erroneous the instance, the greater the penalty; the less erroneous (closer to the ideal) the instance, the smaller the penalty. This concept of a penalty that increases with error yields a slope, and C controls this slope. While various loss functions may be used, for an ε-insensitive loss function the general equation transforms into:
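The transformed equation is likewise not reproduced in this text. The familiar soft-margin, ε-insensitive SVR problem from the literature, given here as an assumed reconstruction with a single penalty C and a single tolerance ε for every sample, reads:

```latex
\min_{W,\,b,\,\xi,\,\xi^{*}}\ \tfrac{1}{2}\,\lVert W\rVert^{2}
+C\sum_{i=1}^{N}\bigl(\xi_i+\xi_i^{*}\bigr)
\quad\text{subject to}\quad
\begin{cases}
y_i-\bigl(W^{T}\Phi(x_i)+b\bigr)\le\varepsilon+\xi_i,\\[2pt]
\bigl(W^{T}\Phi(x_i)+b\bigr)-y_i\le\varepsilon+\xi_i^{*},\\[2pt]
\xi_i\ge 0,\quad \xi_i^{*}\ge 0,
\end{cases}
\qquad i=1,\dots,N
```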

For an ε-insensitive loss function in accordance with the invention (with different loss functions applied to censored and non-censored data), this equation becomes:
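Equation (2) is not reproduced in this text. One form consistent with the surrounding description, in which each sample takes its slopes and margins from the censored set (C_s, C_s^*, ε_s, ε_s^*) or the non-censored set (C_n, C_n^*, ε_n, ε_n^*) according to s_i, would be the following. The per-sample symbols C_i, C_i^*, ε_i, and ε_i^* are notation assumed here for the reconstruction.

```latex
\min_{W,\,b,\,\xi,\,\xi^{*}}\ \tfrac{1}{2}\,\lVert W\rVert^{2}
+\sum_{i=1}^{N}\bigl(C_i\,\xi_i+C_i^{*}\,\xi_i^{*}\bigr)
\quad\text{subject to}\quad
\begin{cases}
y_i-\bigl(W^{T}\Phi(x_i)+b\bigr)\le\varepsilon_i+\xi_i,\\[2pt]
\bigl(W^{T}\Phi(x_i)+b\bigr)-y_i\le\varepsilon_i^{*}+\xi_i^{*},\\[2pt]
\xi_i\ge 0,\quad \xi_i^{*}\ge 0,
\end{cases}
\qquad i=1,\dots,N
\tag{2}

% Per-sample parameters chosen by the censorship status s_i:
C_i^{(*)}=s_i\,C_s^{(*)}+(1-s_i)\,C_n^{(*)},\qquad
\varepsilon_i^{(*)}=s_i\,\varepsilon_s^{(*)}+(1-s_i)\,\varepsilon_n^{(*)}
```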

The optimization criterion penalizes data points whose y-values differ from f(x) by more than ε. The slack variables ξ and ξ^* correspond to the size of this excess deviation for positive and negative deviations, respectively. This penalty mechanism has two components, one for non-censored data (i.e., not right-censored) and one for censored data. Both components are represented here in the form of loss functions referred to as ε-insensitive loss functions. An exemplary loss function 30 for censored data is defined in (3) and illustrated in FIG. 2.
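Equation (3) is not reproduced in this text. Based on the description of FIG. 2 given further on (no penalty inside the margins, linear penalties with slopes C_s and C_s^* outside them), it can be reconstructed as the piecewise-linear function below. This is a reconstruction from that description, not a verbatim copy.

```latex
\mathrm{Loss}\bigl(f(x),\,y,\,s=1\bigr)=
\begin{cases}
C_s^{*}\,\bigl(e-\varepsilon_s^{*}\bigr), & e>\varepsilon_s^{*},\\[2pt]
0, & -\varepsilon_s\le e\le\varepsilon_s^{*},\\[2pt]
C_s\,\bigl(-\varepsilon_s-e\bigr), & e<-\varepsilon_s
\end{cases}
\tag{3}
```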

where e = f(x) - y.

Thus, e = f(x) - y represents the amount by which the predicted time to event differs from the actual time to event (detected/assumed event). The C and ε values regulate the amount of penalty incurred by various deviations between predicted and actual times to events. The C values control the slopes of the corresponding portions of the loss function 30. The positive and negative ε offset values (ε_s^* and -ε_s) control how much deviation there is before a penalty is incurred. A censored data sample is handled differently than in traditional SVR because it provides only "one-sided information." For example, in survival time prediction, where y_i in z_i represents the survival time, a censored data sample z_i indicates only that the event does not happen until y_i, and gives no indication of when it will happen after y_i. The loss function of equation (3) reflects this reality. For censored data, predicting a time to event that is before the current time (when the event has yet to happen) is worse than predicting a time that is after the current time (as that prediction may still come true). Thus, predictions for censored data are treated differently depending on whether the prediction falls before or after the actual/current time. The ε and C values are used to differentiate the penalties incurred for e > 0 versus e < 0 (and to differentiate censored from non-censored data predictions).

For predictions of time to event that are earlier than the current time, e < 0, penalties are imposed at smaller deviations (ε_s < ε_s^*) than for predictions after the current time, e > 0. Further, incrementally greater deviations for predictions earlier than the current time (beyond ε_s) incur incrementally larger penalties than similar deviations for predictions later than the current time (beyond ε_s^*), that is, C_s > C_s^*. As a result, predictions before the current time incur larger penalties than predictions after the current time.

FIG. 2 shows that:

(1) if e ∈ [-ε_s, 0], no penalty is applied; if e ∈ (-∞, -ε_s), a linearly increasing penalty with slope C_s is applied.

(2) if e ∈ [0, ε_s^*], no penalty is applied; if e ∈ (ε_s^*, ∞), a linearly increasing penalty with slope C_s^* is applied.

Because ε_s^* > ε_s and C_s^* < C_s, the case where the predicted value f(x) < y generally incurs more penalty than the case where f(x) > y. This mechanism helps the resulting SVRc regression function, as computed by the computer 20, make full use of the one-sided information provided in the censored data samples.

Further, a modified loss function for non-censored data can also be represented in an ε-insensitive form. This loss function preferably takes into account the reality that the recorded time to event may not be the actual time to event. Although the target value y_i is generally said to represent the time to event, y_i is in fact the time at which the event is detected, while the exact time at which the event happens is often some time before y_i. The computer 20 may account for this in the loss function of the non-censored data samples. An exemplary non-censored-data loss function 32 is provided in equation (4) and illustrated in FIG. 3.
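Equation (4) is not reproduced in this text. By analogy with the censored-data loss described for FIG. 2, and using the non-censored margins and slopes named in the text, it can be reconstructed as follows. This is again a reconstruction, not a verbatim copy.

```latex
\mathrm{Loss}\bigl(f(x),\,y,\,s=0\bigr)=
\begin{cases}
C_n^{*}\,\bigl(e-\varepsilon_n^{*}\bigr), & e>\varepsilon_n^{*},\\[2pt]
0, & -\varepsilon_n\le e\le\varepsilon_n^{*},\\[2pt]
C_n\,\bigl(-\varepsilon_n-e\bigr), & e<-\varepsilon_n
\end{cases}
\tag{4}
```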

where e = f(x) - y.

Note that ε_n^* ≤ ε_n and C_n^* ≥ C_n; otherwise the interpretation of FIG. 3 is generally the same as for FIG. 2.

Several simplifications and/or approximations may be made to simplify the calculations. For example, because the difference between the detected event time and the exact event time is generally small, and usually negligible, ε_n^* = ε_n and C_n^* = C_n may be set, which simplifies the loss function for non-censored data samples. To further reduce the number of free parameters in the SVRc formulation and to make it easier to use, in most cases ε_s^(*), ε_n^(*), C_s^(*), and C_n^(*) can be set as

ε_s^* > ε_s = ε_n^* = ε_n

C_s^* < C_s = C_n^* = C_n
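The two asymmetric loss functions and the simplified parameter setting can be illustrated with a short Python sketch. The class `LossParams`, the function names, and the numeric values are hypothetical and chosen only to satisfy the inequalities above; the piecewise form follows the description of FIG. 2 and FIG. 3 rather than a published implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossParams:
    """Margins and slopes for one data type (censored or non-censored)."""
    eps_neg: float  # margin for e < 0 (eps_s or eps_n in the text)
    eps_pos: float  # margin for e > 0 (eps_s^* or eps_n^* in the text)
    c_neg: float    # slope applied when e < -eps_neg (C_s or C_n)
    c_pos: float    # slope applied when e > eps_pos (C_s^* or C_n^*)


def eps_insensitive_loss(e: float, p: LossParams) -> float:
    """Asymmetric epsilon-insensitive loss for the deviation e = f(x) - y."""
    if e > p.eps_pos:
        return p.c_pos * (e - p.eps_pos)
    if e < -p.eps_neg:
        return p.c_neg * (-p.eps_neg - e)
    return 0.0  # inside the margins: no penalty


def svrc_loss(f_x: float, y: float, s: int,
              censored: LossParams, uncensored: LossParams) -> float:
    """Pick the loss by censorship status (s = 1 means censored, as in the text)."""
    return eps_insensitive_loss(f_x - y, censored if s == 1 else uncensored)


# Illustrative values only, obeying the simplified setting:
# eps_s^* > eps_s = eps_n^* = eps_n and C_s^* < C_s = C_n^* = C_n.
CENSORED = LossParams(eps_neg=1.0, eps_pos=4.0, c_neg=2.0, c_pos=0.5)
UNCENSORED = LossParams(eps_neg=1.0, eps_pos=1.0, c_neg=2.0, c_pos=2.0)
```

With these values, a censored sample predicted 3 time units too early is penalized, while the same sample predicted 3 units too late is not, which is the one-sided behavior the text describes.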

As is known in the art and noted above, standard SVR uses a loss function. The loss functions provided above are ε-insensitive loss functions and are exemplary only; other ε-insensitive loss functions (e.g., with different ε and/or C values), as well as other forms of loss functions, could be used. Exemplary loss functions are discussed in S. Gunn, Support Vector Machines for Classification and Regression, p. 29 (Technical Report, Faculty of Engineering and Applied Science, Department of Electronics and Computer Science, May 1998). In addition to ε-insensitive functions, exemplary loss functions include quadratic, Laplacian, and Huber loss functions. As with the loss functions 30, 32, the penalties imposed for predictions earlier versus later than the actual/current time may differ (e.g., different slopes/shapes for values of e below and above zero). Shapes can be used that provide little or no penalty in a range around e = 0 and different increasing penalties depending on whether e is greater or less than zero.

Implementation of SVRc structure

In operation, referring to FIG. 4, with further reference to FIGS. 1-3, a process 40 for developing a predictive model using SVRc with the system 18 includes the stages shown. The process 40, however, is exemplary only and not limiting. The process 40 may be altered, e.g., by having stages added, removed, or rearranged.

At stage 42, training of an initial model, Model 1, is performed. Clinical/histopathological data 12 for the relevant clinical/histopathological features are supplied to the system 18 to determine a set of algorithm parameters and a corresponding set of model parameters for Model 1. The algorithm parameters are the parameters that govern the regression performed by the computer 20 to determine model parameters and select features. Examples of the algorithm parameters are the kernel used for the regression, the margins -ε_s, ε_s^*, -ε_n, ε_n^*, and the loss function slopes C_n, C_n^*, C_s, C_s^*. The model parameters affect the value of the output of the model f(x) for a given input x. The algorithm parameters are set in stage 42 and are fixed at the set values for the other stages of the process 40.

Referring to FIG. 5, with further reference to FIGS. 1-4, a process 60 for implementing stage 42 of FIG. 4, to determine Model 1 using SVRc with the system 18, includes the stages shown. The process 60, however, is exemplary only and not limiting. The process 60 may be altered, e.g., by having stages added, removed, or rearranged.

At stage 62, the algorithm parameters are set. The first time stage 62 is performed, the algorithm parameters are set initially; at subsequent performances of stage 62 they are reset. Each time stage 62 is performed, a set of algorithm parameters that has not yet been used is selected for use in the model to train the model parameters.

At stage 64, the model parameters are initially set. The model parameters can be a generic set of model parameter values, but are preferably based on knowledge of SVR in order to reduce the time used by the computer 20 to train the model parameters. Although this stage is shown separately from the other stages, the actions described may be performed in conjunction with other stages, e.g., during algorithm parameter selection at stage 42 of FIG. 4 and/or at stage 66.

At stage 66, the model parameters are trained using the currently selected set of algorithm parameters. To train the model parameters, portions (and possibly all) of the data vectors in a set of data vectors are fed into the computer 20. The data vectors comprise information associated with various features. For example, patient data vectors preferably include clinical/histopathological, biomarker, and bio-imaging features together with the corresponding values of these features for each patient. For the selection of the algorithm parameters in the process 60, preferably only the clinical/histopathological features and corresponding data are used. These values are used as the input x to the model f to determine the values of f(x). The vectors also include target values y corresponding to the target value of f(x). The computer 20 determines the value of f(x) for each patient and the difference between the model's output and the target value, f(x) - y. The computer 20 uses the loss function 30 or 32, depending on whether the input vector x is censored or non-censored. The computer 20 uses the information from the loss functions 30, 32, in accordance with equation (2), to perform SVR to determine a set of model parameters corresponding to the current set of algorithm parameters. With the model parameters determined, the computer 20 calculates and stores the concordance index (CI) for this set of algorithm parameters and model parameters using 5-fold cross-validation.
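The concordance index is not defined in this section. A minimal sketch, assuming a Harrell-style pairwise definition adapted to right-censored data, is given below; the function name and the handling of ties are assumptions. A pair of samples is treated as comparable only when the sample with the shorter recorded time is non-censored, since otherwise the true ordering of the two events is unknown.

```python
from itertools import combinations
from typing import Sequence


def concordance_index(pred: Sequence[float],
                      y: Sequence[float],
                      s: Sequence[int]) -> float:
    """Fraction of comparable pairs whose predicted order matches the observed order.

    pred: model outputs f(x); y: recorded times; s: 1 = censored, 0 = non-censored.
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue  # tied recorded times are skipped in this sketch
        first, second = (i, j) if y[i] < y[j] else (j, i)
        if s[first] == 1:
            continue  # earlier sample is censored: ordering cannot be verified
        usable += 1
        if pred[first] < pred[second]:
            concordant += 1.0
        elif pred[first] == pred[second]:
            concordant += 0.5  # tied predictions count as half
    if usable == 0:
        raise ValueError("no comparable pairs")
    return concordant / usable
```

A value of 1.0 indicates that every comparable pair is ordered correctly, 0.5 corresponds to random ordering, and 0.0 to a fully reversed ordering.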

At stage 68, an inquiry is made as to whether there are more sets of algorithm parameters to try. The computer 20 determines whether each of the available sets of algorithm parameters has been used to determine a corresponding set of model parameters. If not, the process 60 returns to stage 62, where a new set of algorithm parameters is selected. If all sets of algorithm parameters have been used to determine corresponding sets of model parameters, the process 60 proceeds to stage 70.

At stage 70, the computer 20 selects a desired set of algorithm parameters to use for further training of the model. The computer 20 analyzes the stored concordance indexes for the models corresponding to the various sets of algorithm parameters and the associated model parameters determined by the computer 20. The computer 20 finds the maximum stored CI and fixes the corresponding algorithm parameters as the algorithm parameters that will be used for the model in the other stages of the process 40 shown in FIG. 4. This version of the model, with the selected algorithm parameters and corresponding model parameters, forms Model 1. Model 1 is output from stage 42 and forms the anchor for stage 44.
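The loop over stages 62 to 70 amounts to an exhaustive search over candidate algorithm-parameter sets that keeps the set with the highest cross-validated CI. A minimal sketch follows. The function `select_algorithm_parameters` and the callable `cross_validated_ci` are hypothetical names; the callable stands in for stages 64 and 66 (training the model parameters and computing the 5-fold CI), which are not implemented here.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def select_algorithm_parameters(
    grid: Dict[str, List[float]],
    cross_validated_ci: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Try every parameter set once and return the one with the maximum CI."""
    names = list(grid)
    best_params, best_ci = None, float("-inf")
    for values in product(*(grid[name] for name in names)):  # stage 62: next unused set
        params = dict(zip(names, values))
        ci = cross_validated_ci(params)                      # stages 64-66 (assumed callable)
        if ci > best_ci:                                     # stage 70: keep the maximum CI
            best_params, best_ci = params, ci
    if best_params is None:
        raise ValueError("the parameter grid is empty")
    return best_params, best_ci
```

In practice the grid would hold candidate kernels, margins, and slopes; the sketch uses numeric values only to keep the type annotations simple.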

Referring again to FIG. 4, with continued reference to FIGS. 1-3, at stage 44 a supplemental model, Model 2, is trained. Model 1 is used as an anchor for determining Model 2, with the algorithm parameters having been set at stage 42; these remain the same for the further model training. Model 1 is an anchor in that the features used in Model 1 (here, clinical/histopathological features) will be used in forming further models, in particular providing the foundation for Model 2.

To form Model 2 based on Model 1, feature selection (FS) is performed using a greedy forward (GF) algorithm, with only those features that are found to improve the predictive accuracy of the model being kept in the model. In the exemplary case of cancer prediction, biomarker data are fed into the device 18 at stage 44 to determine which biomarker features to add to Model 1 to form Model 2. Data vectors x that include values for the clinical/histopathological features and a selected biomarker feature are used in the SVRc construct described above. Five-fold cross-validation is used to determine the model parameters with the new feature included. The predictive accuracies of the revised model and the previous model are indicated by their respective CIs. If the predictive accuracy of the revised model is better than that of the previous model (for the biomarker features, the previous model is initially Model 1), the features of the revised model are kept and a new feature is added for evaluation. If the predictive accuracy does not improve, the most recently added feature is discarded and another new feature is added for evaluation. This continues until all biomarker features have been tried and either discarded or added to the model. The resulting model, with its corresponding model parameters, is output by the device 18 from stage 44 as Model 2.
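The greedy forward step can be sketched as follows. The function `greedy_forward` and the callable `score` are hypothetical; `score` stands in for training the SVRc model on a feature list and returning its cross-validated CI, which is not implemented here. Each candidate is tried once, in the order given, and is kept only if it strictly improves the score.

```python
from typing import Callable, Iterable, List, Sequence


def greedy_forward(anchor: Sequence[str],
                   candidates: Iterable[str],
                   score: Callable[[List[str]], float]) -> List[str]:
    """Add candidate features to the anchor set one at a time, keeping improvements."""
    selected = list(anchor)           # anchor features are always retained
    best = score(selected)            # accuracy of the previous model
    for feature in candidates:
        trial = selected + [feature]
        trial_score = score(trial)
        if trial_score > best:        # keep the feature only if the CI improves
            selected, best = trial, trial_score
        # otherwise the most recently added feature is discarded
    return selected
```

Calling it once with the clinical/histopathological features as the anchor and the biomarker features as candidates would correspond to stage 44; calling it again with the result as the anchor and the bio-imaging features as candidates would correspond to stage 46.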

At stage 46, a supplemental model, Model 3, is trained. Model 2 is used as an anchor for determining Model 3. Model 2 is an anchor in that the features included in Model 2 (here, clinical/histopathological features plus biomarker features, if any) will be used in forming Model 3.

To form Model 3 based on Model 2, feature selection (FS) is performed using a greedy forward (GF) algorithm, with only those features that are found to improve the predictive accuracy of the model being kept in the model. Preferably, the features evaluated with respect to Model 1 to form Model 2 are, individually and/or as a group, believed to have better reliability and/or predictive power (relatedness of the data values to the time to and/or likelihood of an event) than the features evaluated with respect to Model 2 to form Model 3. In the exemplary case of cancer prediction, bio-imaging data are fed into the device 18 at stage 46 to determine which bio-imaging features to add to Model 2 to form Model 3. Data vectors x that include values for the clinical/histopathological features, the biomarker features selected at stage 44, and a selected bio-imaging feature are used in the SVRc construct described above. Five-fold cross-validation is used to determine the model parameters with the new feature included. The predictive accuracies of the revised model and the previous model are indicated by their respective CIs. If the predictive accuracy of the revised model is better than that of the previous model (for the bio-imaging features, the previous model is initially Model 2), the feature most recently added to the model is kept and a new feature is added for evaluation. If the predictive accuracy does not improve, the most recently added feature is discarded and another new feature is added for evaluation. This continues until all bio-imaging features have been tried and either discarded or added to the model. The resulting model, with its corresponding model parameters, is output by the device 18 from stage 46 as Model 3.

At stage 48, a greedy backward (GB) procedure is performed to refine Model 3 into a Final Model. In performing a GB algorithm on Model 3 for feature selection, one feature at a time is removed from the model and the model is re-tested for its predictive accuracy. If the model's predictive accuracy increases when a feature is removed, that feature is removed from the model and the GB process is applied to the revised model. This continues until the GB process no longer yields an increase in predictive accuracy when any feature in the current feature set is removed. The Final Model parameters are then used with test data to determine the predictive accuracy of the Final Model. The resulting Final Model, with its potentially reduced feature set and determined model parameters, is the output of stage 48 and can be used by the device 18 to provide a probability of time to event when provided with data for the features used in the Final Model.
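The greedy backward refinement can be sketched in the same style. The function `greedy_backward` and the callable `score` are hypothetical; `score` again stands in for retraining and re-testing the model on a feature list. This sketch removes the first feature whose removal improves the score and then restarts on the revised model, which is one reasonable reading of the description; keeping at least one feature is an assumption of the sketch.

```python
from typing import Callable, List, Sequence


def greedy_backward(features: Sequence[str],
                    score: Callable[[List[str]], float]) -> List[str]:
    """Drop features one at a time while doing so increases the score."""
    selected = list(features)
    best = score(selected)
    improved = True
    while improved and len(selected) > 1:   # assumption: never empty the model
        improved = False
        for feature in list(selected):
            trial = [f for f in selected if f != feature]
            trial_score = score(trial)
            if trial_score > best:          # removal helps: accept it
                selected, best = trial, trial_score
                improved = True
                break                       # re-apply GB to the revised model
    return selected
```

The loop stops as soon as a full pass over the current feature set finds no removal that increases the score.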

Other embodiments are within the scope and spirit of the appended claims. For example, due to the nature of software, the functions described above can be implemented using software, hardware, firmware, or any combination of these. Features implementing functions may also be physically located at various positions, including being distributed such that portions of functions are implemented at different physical locations. Further, while the model parameters are adjusted in the process 60, the model parameters may instead be set, e.g., based on knowledge of SVR, and not altered thereafter. This may reduce the processing capacity and/or time needed to develop an SVRc model. Further still, one or more criteria may be placed on features for them to be considered for addition to a model. For example, only features with a concordance index at or above a threshold value (e.g., 0.6) may be added to the model and tested for their effect on the model's accuracy. Thus, the feature set to be tested may be reduced, which may also reduce the processing capacity and/or time needed to produce a model. Further still, models may be developed without using feature domains as anchors. Features may be added to the model and their impact on predictive accuracy considered without establishing a model as an anchor after each domain of features has been considered.

Experiments and Experimental Results

Experiment 1: Internal Validation

Modern machine learning algorithms were applied to a 540-patient cohort of post-operative prostate cancer patients treated at Baylor University Medical Center. The patients underwent radical prostatectomy at Baylor University Medical Center. Clinical and histopathological variables were provided for 539 patients, and the number of patients with missing data varied both by patient and by variable. Similarly, tissue microarray slides (containing triplicate normal and triplicate tumor cores) were provided for these patients; these were used for H&E staining for imaging, and the remaining sections were used for the biomarker studies.

For the image analysis component of the study, only cores that contained at least 80% tumor were used, in order to preserve the integrity of the signal to be measured in these tissue samples (and to heighten the signal-to-noise ratio). The signal to be measured consisted of abnormalities in tumor micro-anatomy. (By contrast, the "noise" in the image analysis is the normal tissue micro-anatomy measurements.) A cutoff of 80% was chosen to maximize the size of the cohort while preserving the integrity of the results. The effective sample size of the study was therefore ultimately based on those patients for whom information was available from the clinical data, the biomarker data, and the bio-imaging data. Thus, the total number of patients available to the integrated predictive system was 130.

SVRc was applied to this cohort of patients and their associated data. SVRc was applied separately to the clinical/histopathological data (17 features), the biomarker data (43 features from 12 markers), and the bio-imaging data (496 features) obtained from Script 4 generated by the bio-imaging software Magic (made by Aureon Biosciences Corporation of Yonkers, New York). The SVRc algorithm was applied to each of these three types of data to find out the individual predictive capability of each data type. In each case, two models were built: one using all of the original features, and the other using a set of selected features obtained by greedy backward feature selection (SVRc-GB). The SVRc algorithm was also applied to all three data types together in accordance with the process 40 described above.

Experiment 1: Results, Synthesis and Conclusions

The results are summarized in Table 1 and Figure 6.

The results show an increasing trend in predictive capability as molecular and bio-image information is successively added to the clinical/histopathological information alone. This supports the concept that a systems pathology analysis integrating patient information at different levels (i.e., clinical/histopathological, microanatomical, and molecular) can improve the predictive capability of the overall system. The analysis also demonstrates that advanced supervised multivariate modeling techniques can produce improved predictive systems when compared with traditional multivariate modeling techniques. In addition to the clinical/histopathological features, some molecular and bio-image features predictive of PSA recurrence can also be selected.

In contrast to the traditional survival-analysis approach of applying a Cox model to clinical data alone, SVRc has the advantage of being able to handle high-dimensional data sets in a small cohort of patients. On this study's data set, SVRc produced results that were more solid than, and superior to, those produced by the standard Cox model.

Experiment 2: External Validation with Domain-Expert Knowledge

To estimate overall system performance, a fairly conservative two-level validation procedure was used to simulate external validation. 140 pairs of training and test sets were generated by randomly picking 100 records as the training set and using the remaining 30 unselected records as the test set.

(1) For each pair, the training set was used to build a predictive model using process 40.

(2) The model so built was then applied to the test set to estimate the predictive accuracy of the final model.

(3) Steps (1) and (2) were repeated 40 times to obtain 40 predictive accuracies, and the final predictive performance was reported as the average predictive accuracy over the 40 distinct final models.
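Steps (1) to (3) amount to repeated random hold-out validation, which can be sketched as below. The functions `make_pairs` and `average_accuracy` and the `build_model` and `accuracy` callables are hypothetical placeholders, not part of the described system. Because the text mentions both 140 pairs and 40 repetitions, the number of pairs is left as a parameter, with 40 chosen as the default purely for illustration.

```python
import random
from statistics import mean


def make_pairs(n_records=130, n_train=100, n_pairs=40, seed=0):
    """Generate (train, test) index pairs by random selection without replacement."""
    rng = random.Random(seed)
    ids = list(range(n_records))
    pairs = []
    for _ in range(n_pairs):
        train = set(rng.sample(ids, n_train))
        test = [i for i in ids if i not in train]  # the unselected records
        pairs.append((sorted(train), test))
    return pairs


def average_accuracy(pairs, build_model, accuracy):
    """Build a model on each training set and average its test-set accuracy."""
    scores = []
    for train, test in pairs:
        model = build_model(train)            # step (1)
        scores.append(accuracy(model, test))  # step (2)
    return mean(scores)                       # step (3)


pairs = make_pairs()
# Dummy callables so the sketch runs; a real run would train and score SVRc models.
result = average_accuracy(pairs,
                          build_model=lambda train: len(train),
                          accuracy=lambda model, test: 0.5)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]), result)
```

Fixing the seed makes the pairs reproducible, which matters when several model variants are to be compared on the same splits.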

The features most frequently selected in the 40 distinct final models were then used to train three additional models for each pair of training and test sets using SVRc: one model based on clinical/histopathological features only; one based on clinical/histopathological features and biomarker features; and one based on clinical/histopathological/biomarker features and bio-image features.
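Finding the most frequently selected features is a counting exercise over the feature lists of the final models. The sketch below is illustrative only: the function name, the `min_count` threshold, and the feature names are hypothetical, and the text does not state what frequency threshold was used.

```python
from collections import Counter


def most_frequent_features(final_models, min_count=1):
    """Count in how many final models each feature appears, most frequent first."""
    counts = Counter(f for features in final_models for f in set(features))
    return [(f, c) for f, c in counts.most_common() if c >= min_count]


# Made-up feature lists standing in for three final models.
models = [
    ["clin_a", "marker_a"],
    ["clin_a", "Lumen.StdDevAreaPxl"],
    ["clin_a", "marker_a"],
]
print(most_frequent_features(models, min_count=2))
# [('clin_a', 3), ('marker_a', 2)]
```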

Experiment 2: Results, Summary and Conclusions

The experimental results are shown in Table 2. They can be summarized as follows: over the 40 runs, the average generalization accuracy (i.e., the predictive accuracy of the model when applied to a test set) was (1) 0.74 for clinical/histopathological data alone; (2) 0.76 for clinical/histopathological plus biomarker information; and (3) 0.77 for clinical/histopathological/biomarker plus bio-image data.

A complete list of the features retained in the final models, with their frequencies, is provided in the Appendix.

As before, there is an increasing trend in predictive capability as molecular and bio-image information is successively added to the clinical/histopathological information. This further supports the concept that a systems pathology analysis integrating patient information at different levels (i.e., clinical/histopathological, microanatomical, and molecular) can improve the predictive capability of the overall system. The analysis also demonstrates that advanced supervised multivariate modeling techniques can improve the predictive system when compared with traditional multivariate modeling techniques, applied to clinical data alone, in handling high-dimensional data sets in a small cohort of patients.

It can be concluded that adding a layer of domain expertise can assist in selecting features so as to improve the predictive capability of the system.

For the tissue segmentation performed by the Magic system made by Aureon Biosciences Corporation of Yonkers, New York, image objects are classified as instances of histopathological classes using spectral characteristics, shape characteristics, and special relationships between histopathological objects. For a given histopathological object, its properties are computed and output as bio-image features. The properties include spectral properties (color channel values, standard deviations, and brightness) and generic shape properties (area, length, width, compactness, density, etc.). Statistics (minimum, maximum, mean, and standard deviation) are computed for each property of a histopathological object. This is reflected in the feature names in the Appendix. For example, in the feature "Lumen.StdDevAreaPxl", "Lumen" denotes the histopathological object, "StdDev" denotes the statistic (standard deviation), and "AreaPxl" denotes a property of the object.
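The naming convention can be illustrated with a small parser that splits a feature name into object, statistic, and property. Only "StdDev" is confirmed by the example in the text. The other statistic prefixes ("Min", "Max", "Mean") and the helper name `parse_feature_name` are assumptions made for this sketch.

```python
# "StdDev" appears in the text; the other prefixes are assumed spellings.
STATISTICS = ("StdDev", "Mean", "Min", "Max")


def parse_feature_name(name):
    """Split 'Object.StatProperty' into (object, statistic, property)."""
    obj, _, rest = name.rpartition(".")  # split on the last dot
    for stat in STATISTICS:
        if rest.startswith(stat):
            return obj, stat, rest[len(stat):]
    return obj, None, rest  # no recognized statistic prefix


print(parse_feature_name("Lumen.StdDevAreaPxl"))
# ('Lumen', 'StdDev', 'AreaPxl')
```

Splitting on the last dot is a deliberate choice so that an object label that itself contains a dot would stay intact.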

Statistics and properties were calculated for the following histopathological objects. "Background" is the portion of the digital image not occupied by tissue. "Cytoplasm" is the amorphous "pink" area that surrounds an epithelial nucleus. "Epithelial nuclei" are the "round" objects surrounded by cytoplasm. "Lumen" is an enclosed white area surrounded by epithelial cells. Alternatively, the lumen may be filled with prostatic fluid (pink) or other "debris" (e.g., macrophages, dead cells, etc.). Together, the lumen and the epithelial nuclei form a gland unit. "Stroma" is a form of connective tissue of varying density that maintains the architecture of the prostatic tissue. Stroma is present between the gland units. "Stroma nuclei" are elongated cells with minimal or no cytoplasm (fibroblasts). This category may also include endothelial cells and inflammatory cells, and, if cancer is present, epithelial nuclei may also be found scattered within the stroma. "Red blood cells" are small red round objects usually located within vessels (arteries or veins), but they can also be found dispersed throughout the tissue. AK.1, AK.2, AK.3, AK.4, and AK.5 are user-defined labels with no particular meaning. "C2EN" is the relative ratio of nuclear area to cytoplasm. The more anaplastic/malignant the epithelial cell, the greater the area occupied by the nucleus. "EN2SN" is the percentage or relative amount of epithelial cells to stromal cells present in the digital tissue image. "L2Core" is the number or area of lumens present in the tissue. The higher the Gleason score, the fewer lumens are present. "C2L" is cytoplasm relative to lumen. "CEN2L" is cytoplasm endothelial cells relative to lumen.

The portions of the names following the object are exemplary only and correspond to the Cellenger Developer Studio 4.0 software made by Definiens of Munich, Germany.

End of Appendix

10‧‧‧SVRc system

12‧‧‧Clinical/histopathological data

14‧‧‧Biomarker data collection

16‧‧‧Bio-image data measurement/collection

18‧‧‧Data regression and analysis device

20‧‧‧Computer

22‧‧‧Memory

24‧‧‧Processor

26‧‧‧Output

28‧‧‧Display screen

Figure 1 is a simplified block diagram of a predictive diagnostic system for use with right-censored data.

Figure 2 is a plot of an exemplary loss function for censored data.

Figure 3 is a plot of an exemplary loss function for non-censored data.

Figure 4 is a block flow diagram of a process for developing a model for use in predicting time-to-event information.

Figure 5 is a block flow diagram of a process for producing the initial model shown in Figure 4.

Figure 6 is a three-dimensional graph of model performance, as summarized by the concordance index, determined by an embodiment of the invention and by the traditional Cox proportional hazards model using experimental data.
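Figure 6 summarizes performance with a concordance index. For readers unfamiliar with that measure, the sketch below shows one common definition for right-censored data (Harrell's form), applied to predicted times. It is offered as an illustration under that assumption and is not necessarily the exact variant used in the experiments. The function name and the sample values are hypothetical.

```python
def concordance_index(times, events, predicted):
    """Fraction of comparable pairs whose predicted order matches the observed order.

    times:     observed time (event time, or censoring time if censored)
    events:    True if the event was observed, False if right-censored
    predicted: predicted time to event (larger means later)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0
                elif predicted[i] == predicted[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up values: the third subject is right-censored at time 6.
ci = concordance_index(times=[2, 4, 6, 8],
                       events=[True, True, False, True],
                       predicted=[1, 5, 3, 9])
print(ci)  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of the comparable pairs.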

Claims

A method of producing a model for use in predicting time to an event, the method comprising: obtaining multi-dimensional, non-linear vectors of information indicative of status of multiple test subjects; and performing regression using the vectors of information to produce a kernel-based model to provide an output value related to a prediction of time to the event based upon at least some of the information contained in the vectors of information; wherein the data of the vectors are associated with categories based on at least one characteristic of the data that relates to the ability of the data to help the model provide the output value such that the output value helps predict time to the event; and wherein the regression is performed in sequence using the data from the category most likely, through the category least likely, to help the model provide the output value such that the output value helps predict time to the event.

The method of claim 1, wherein the regression is performed in a greedy-forward manner with respect to features of the data, to select features to be used in the model.

The method of claim 2, further comprising, after performing the regression, performing a greedy-backward procedure on the features of the vectors to further select features to be used in the model.

The method of claim 2, wherein the regression is performed in a greedy-forward manner with respect to only a portion of the features of the vectors.

The method of claim 4, wherein the vectors comprise the data categories of clinical/histopathological data, biomarker data, and bio-image data, and wherein the regression is performed in a non-greedy-forward manner with respect to the clinical/histopathological data and in a greedy-forward manner with respect to only the biomarker data and the bio-image data of the vectors.

The method of claim 1, wherein at least one of the vectors is right-censored, lacking an indication of a time of occurrence of the event with respect to the corresponding test subject.

A computer program product for producing a model for use in predicting time to an event, the computer program product residing on a computer-readable medium and comprising computer-readable, computer-executable instructions for causing a computer to: obtain multi-dimensional, non-linear vectors of information indicative of status of multiple test subjects, at least one of the vectors being right-censored, lacking an indication of a time of occurrence of the event with respect to the corresponding test subject; and perform regression using the vectors of information to produce a kernel-based model to provide an output value related to a prediction of time to the event based upon at least some of the information contained in the vectors of information; wherein the data of the vectors are associated with categories based on at least one characteristic of the data that relates to the ability of the data to help the model provide the output value such that the output value helps predict time to the event; and wherein the regression is performed in sequence using the data from the category most likely, through the category least likely, to help the model provide the output value such that the output value helps predict time to the event.

The computer program product of claim 7, wherein the regression is performed in a greedy-forward manner with respect to features of the data, to select features to be used in the model.

The computer program product of claim 8, further comprising instructions for causing the computer to perform, after performing the regression, a greedy-backward procedure on the features of the vectors to further select features to be used in the model.

The computer program product of claim 8, wherein the regression is performed in a greedy-forward manner with respect to only a portion of the features of the vectors.

The computer program product of claim 10, wherein the vectors comprise the data categories of clinical/histopathological data, biomarker data, and bio-image data, and wherein the regression is performed in a non-greedy-forward manner with respect to the clinical/histopathological data and in a greedy-forward manner with respect to only the biomarker data and the bio-image data of the vectors.

A method of determining a predictive diagnosis for a patient, the method comprising: receiving at least one of clinical and histopathological data associated with the patient; receiving biomarker data associated with the patient; receiving bio-image data associated with the patient; and applying at least a portion of the at least one of clinical and histopathological data, at least a portion of the biomarker data, and at least a portion of the bio-image data to a kernel-based mathematical model to calculate a value indicative of a diagnosis for the patient.

The method of claim 12, wherein the at least a portion of the biomarker data comprises less than all biomarker features of the patient.

The method of claim 13, wherein the at least a portion of the biomarker data comprises less than about ten percent of all biomarker features of the patient.

The method of claim 14, wherein the at least a portion of the biomarker data comprises less than about five percent of all biomarker features of the patient.

The method of claim 12, wherein the at least a portion of the bio-image data comprises less than all bio-image features of the patient.

The method of claim 16, wherein the at least a portion of the bio-image data comprises less than one percent of all bio-image features of the patient.

The method of claim 17, wherein the at least a portion of the bio-image data comprises less than about 0.2 percent of all bio-image features of the patient.

The method of claim 12, wherein the value is at least one of a recurrence time of a health-related condition and a recurrence probability of the health-related condition.

An apparatus for determining time-to-event predictive information, the apparatus comprising: an input configured to obtain multi-dimensional, non-linear first data associated with a possible future event; and a processing device configured to use the first data in a kernel-based mathematical model to calculate the predictive information, the mathematical model being derived at least partially from a regression analysis of multi-dimensional, non-linear, right-censored second data that determines parameters of the model that affect calculations of the model, the predictive information being indicative of at least one of a predicted time to the possible future event and a probability of the possible future event.

The apparatus of claim 20, wherein the input and the processing device comprise portions of a computer program product residing on a computer-readable medium, the computer program product comprising computer-readable, computer-executable instructions for causing a computer to obtain the first data and to use the first data in the mathematical model to calculate the predictive information.

The apparatus of claim 20, wherein the first data comprise at least one of clinical and histopathological data, biomarker data, and bio-image data associated with a patient, and wherein the processing device is configured to apply at least a portion of the at least one of clinical and histopathological data, at least a portion of the biomarker data, and at least a portion of the bio-image data to the kernel-based mathematical model to calculate the predictive information for the patient.

The apparatus of claim 22, wherein the at least a portion of the biomarker data comprises less than all biomarker features of the patient.

The apparatus of claim 23, wherein the at least a portion of the biomarker data comprises less than about five percent of all biomarker features of the patient.

The apparatus of claim 22, wherein the at least a portion of the bio-image data comprises less than all bio-image features of the patient.

The apparatus of claim 25, wherein the at least a portion of the bio-image data comprises less than about 0.2 percent of all bio-image features of the patient.

A computer program product for determining a predictive diagnosis for a patient, the computer program product residing on a computer-readable medium and comprising computer-readable, computer-executable instructions for causing a computer to: receive at least one of clinical and histopathological data associated with the patient; receive biomarker data associated with the patient; receive bio-image data associated with the patient; and apply at least a portion of the at least one of clinical and histopathological data, at least a portion of the biomarker data, and at least a portion of the bio-image data to a kernel-based mathematical model to calculate a value indicative of a diagnosis for the patient.

The computer program product of claim 27, wherein the at least a portion of the biomarker data comprises less than all biomarker features of the patient.

The computer program product of claim 28, wherein the at least a portion of the biomarker data comprises less than about ten percent of all biomarker features of the patient.

The computer program product of claim 29, wherein the at least a portion of the biomarker data comprises less than about five percent of all biomarker features of the patient.

The computer program product of claim 27, wherein the at least a portion of the bio-image data comprises less than all bio-image features of the patient.

The computer program product of claim 31, wherein the at least a portion of the bio-image data comprises less than about one percent of all bio-image features of the patient.

The computer program product of claim 32, wherein the at least a portion of the bio-image data comprises less than about 0.2 percent of all bio-image features of the patient.

The computer program product of claim 27, wherein the value is at least one of a recurrence time of a health-related condition and a recurrence probability of a health-related condition.