KR20000071096A

KR20000071096A - A system for predicting future health

Info

Publication number: KR20000071096A
Application number: KR1019997007383A
Authority: KR
Inventors: 캠벨티콜린; 헬름즈로날드더블유; 토마스코리사
Original assignee: 바이오마 인터내셔날, 인코포레이티드
Priority date: 1997-02-14
Filing date: 1998-02-10
Publication date: 2000-11-25
Also published as: CN1268033A; EP0973435A4; WO1998035609A1; EP0973435A1; AU6151498A; BR9807366A; CA2280042A1; JP2001511680A; US6059724A

Abstract

The present invention provides a computer-based system for predicting an individual's future health, comprising: (a) a computer having a processor that contains a database of longitudinally acquired biomarker values from individual members of a test population, in which a subpopulation D of members is identified as having acquired a specified biological condition within a specified time period or age interval and a complementary subpopulation of members is identified as not having acquired the specified biological condition within the specified time period or age interval; and (b) a computer program comprising the steps of: (1) selecting from the biomarkers, on the basis of the distribution of biomarker values of the individual members of the test population, a subset of biomarkers for discriminating between members of subpopulation D and members of the complementary subpopulation; and (2) using the distribution of the selected biomarkers to develop a statistical procedure that can be used to (i) classify members of the test population as belonging either to a subpopulation PD having a prescribed high probability of acquiring the specified biological condition within the specified time period or age interval, or to a subpopulation having a prescribed low probability of acquiring it; or (ii) quantitatively estimate, for each member of the test population, the probability of acquiring the specified biological condition within the specified time period or age interval.

Description

A SYSTEM FOR PREDICTING FUTURE HEALTH

It would be desirable to predict, with sufficient reliability and sufficiently far in advance, that a health problem will arise for an individual, because this would offer a greater opportunity to prevent the problem than waiting for the disease to appear and then treating its symptoms. At present, most medical research funding is spent on improving the diagnosis and treatment of disease rather than on discovering preventive measures that reduce the risk of disease long before its commonly observed symptoms become apparent. This emphasis on treatment has produced great advances in medical science by refining the techniques and methods for diagnosing existing disease and for treating it after diagnosis, but those advances have been accompanied by rising treatment costs. These rising costs have become a major financial problem for individuals and for society as a whole, and the demand for ways to reduce medical costs has grown accordingly.

Therefore, if a high risk of developing a disease could be known in advance so that effective preventive measures could be taken, the individual would benefit and society as a whole could realize a substantial reduction in total medical costs.

To date, two of the problems that arise in trying to assess or predict an individual's future health are the following. (a) First, the prediction is imprecise because it is based on data from relatively small study samples of a few hundred or a few thousand subjects. (b) Second, the prediction requires extrapolation from the averages (and other parameters) of the study sample to the individual. It is highly questionable whether such extrapolation reliably estimates risk for a specific individual, or even within a group at high risk for a specific disease (hereinafter simply an "at-risk group"). This is partly because the statistical procedures commonly used address population averages rather than the individuals in the population.

To obtain a quantitative prediction, "an individual's future health" must be framed as the occurrence of a specific event within a specific time period. Two examples are (a) the onset of a myocardial infarction within the next five years and (b) death within the next year. Predictions of such events are inherently probabilistic, that is, they must be based on probabilities.

Two types of probability are important in this framework. The prior probability of an event is its probability before the fact of its occurrence or non-occurrence. The posterior probability of an event is its probability after the fact of its occurrence or non-occurrence, that is, after the event has been realized. If the event has occurred, its posterior probability is 1; if it has not occurred, its posterior probability is 0. The distinction between prior and posterior probability is worth keeping in mind.

The prior probability of an event occurring in the following year, or in another time period, can be important information. Knowing the probability of an event can change behavior, and the actions a person takes can depend on the event's prior probability. The principle is obvious in two extreme cases: a person would almost certainly behave differently (take different actions) if told that the probability of dying in the next year was (a) 0.9999 or (b) 0.0001.

The prior probability of an event depends on the information available when the probability is evaluated. To illustrate the point, consider the following hypothetical "game."

A healthy person is selected at random from among all U.S. residents and followed for a specified period (one year). At the end of the period, the person's vital status (alive or dead) is ascertained. The "event" is "the person died during the period (one year)." Once the period has ended, the event either has occurred (death) or has not occurred (survival), with posterior probability 1 or 0, respectively. Before the person is selected, U.S. mortality statistics can be used to estimate the prior probability that the person will die within the period. This probability is computed as p = d/N, where N is the total number of individuals in the at-risk group (here, all individuals in the U.S. population alive at the beginning of the period) and d is the total number of deaths in the at-risk group. For example, the 1993 data give approximately d = 2,268,000 and N = 257,932,000, so the prior probability of the event is approximately p = 0.0088 [data from "Vital Statistics, Annual Report for the Year 1993 (Provisional Statistics), Death" in the Microsoft Bookshelf 1995 Almanac, and from "Vital Statistics of the United States," published by the National Center for Health Statistics]. In this game, the prior probability of the event is based on very little information: only that the person is a member of the at-risk group consisting of everyone who was alive and a U.S. resident at the time of selection.
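
The calculation p = d/N can be checked with a few lines of Python. This is only an illustrative sketch using the 1993 figures quoted above; the function and variable names are assumptions introduced here and do not come from the patent.

```python
# Figures quoted in the text for the 1993 U.S. population.
deaths = 2_268_000      # d: total deaths in the at-risk group
at_risk = 257_932_000   # N: individuals alive at the start of the period


def prior_probability(d: int, n: int) -> float:
    """Return the prior probability p = d / N of the event (hypothetical helper)."""
    if n <= 0:
        raise ValueError("the at-risk group must be non-empty")
    return d / n


p = prior_probability(deaths, at_risk)
print(f"p = {p:.4f}")
```

Rounded to four decimal places, the result agrees with the value of approximately 0.0088 given in the text.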

Additional information about the at-risk group from which the subject is randomly selected amounts to additional information about the subject, and it changes the prior probability of the event. For example, continuing the "game" with the 1993 data:

If the at-risk group is the U.S. male population, that is, if the subject is known before selection to be male, the prior probability of the event is approximately p = 0.0093, about 6% higher than when gender is unknown or unspecified.

If the at-risk group is the population of U.S. males aged 75 to 84, that is, if the subject is known before selection to be a male between 75 and 84 years of age, the prior probability of the event is approximately p = 0.0772, about 8.3 times higher than when age is unknown or unspecified.
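
The two comparisons in this "game" can be reproduced from the rounded probabilities quoted in the text. The sketch below is illustrative only; the variable and function names are assumptions, and the inputs are the rounded values from the text rather than the underlying mortality tables.

```python
# Rounded prior probabilities quoted in the text (1993 data).
p_all = 0.0088          # all U.S. residents
p_male = 0.0093         # U.S. males
p_male_75_84 = 0.0772   # U.S. males aged 75 to 84


def relative_increase(p_new: float, p_ref: float) -> float:
    """Fractional increase of p_new over p_ref (hypothetical helper)."""
    return p_new / p_ref - 1.0


def ratio(p_new: float, p_ref: float) -> float:
    """How many times larger p_new is than p_ref (hypothetical helper)."""
    return p_new / p_ref


# Knowing the subject is male raises the prior probability by roughly 6%.
print(f"male vs. all: {relative_increase(p_male, p_all):.1%} higher")

# Also knowing the age band multiplies it by roughly 8.3.
print(f"male 75-84 vs. male: {ratio(p_male_75_84, p_male):.1f} times")
```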

These examples illustrate the general principle that the prior probability of an event depends on the information available when the probability is evaluated. The most accurate estimate of a prior probability is normally the one based on all available information.

Even a very accurate estimate of a prior probability does not guarantee a particular outcome. That is, the prior probability for a particular individual need not be close to the posterior probability. Consider the extreme case mentioned above, in which an individual's prior probability of dying in the next year is 0.0001. Survival is very likely, but it is not guaranteed. Of all the individuals playing this "game," approximately 9,999 of every 10,000 will survive the year and have a posterior probability of 0 (which is close to the prior probability of 0.0001), and approximately 1 of every 10,000 will die and have a posterior probability of 1, which is very different from the prior probability. To make the principle clearer, consider the toss of a fair coin, for which the prior probability of "heads" is exactly 0.5. The posterior probability of "heads" is either 0 or 1, and neither value is close to 0.5. Thus the prior probability for a single individual need not be regarded as an approximation of that individual's posterior probability. However, if a very large number of individuals play the game, the average of the posterior probabilities, which is the proportion of individuals for whom the event occurs, will be very close to the prior probability.
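
The closing point, that the average posterior probability over many individuals approaches the prior probability, can be illustrated with a small simulation. This is a sketch under assumptions introduced here: outcomes are drawn independently with Python's standard pseudo-random generator, and the function name and sample sizes are hypothetical.

```python
import random


def mean_posterior(prior: float, n_people: int, seed: int = 0) -> float:
    """Simulate n_people independent plays and return the average posterior.

    Each simulated individual's posterior probability is 1 if the event
    occurred and 0 otherwise, so the average is the observed proportion.
    """
    rng = random.Random(seed)
    outcomes = [1 if rng.random() < prior else 0 for _ in range(n_people)]
    return sum(outcomes) / n_people


# Fair coin: every individual outcome is 0 or 1, never 0.5,
# yet the average moves toward 0.5 as the number of players grows.
for n in (10, 1_000, 100_000):
    print(n, mean_posterior(0.5, n))
```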

In some cases an individual can change the prior probability by "moving" to a group with a different prior probability. For example, middle-aged male U.S. residents with high total cholesterol levels and very low lipoprotein concentrations have been found to have a higher prior probability of dying from a myocardial infarction in the next five years than those with very low cholesterol levels. Clinical studies indicate that if a person with a high cholesterol level can lower it, thereby moving to the group with lower cholesterol levels, he can reduce his prior probability of dying from a myocardial infarction in the next five years.

In what follows, the term "risk" is used in place of "the prior probability of a specific event in a specific time period." This corresponds to the statistical definition of risk as expected loss, where the loss function takes the value 1 if the event occurs and 0 if it does not.
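
The equivalence between this definition of risk and the prior probability follows in one line. In the sketch below the symbols are assumptions introduced for illustration: E denotes the event, L the 0/1 loss function described above, and p the prior probability of E.

```latex
\[
  L =
  \begin{cases}
    1, & \text{if } E \text{ occurs},\\
    0, & \text{if } E \text{ does not occur},
  \end{cases}
  \qquad
  \text{risk} = \mathbb{E}[L] = 1 \cdot p + 0 \cdot (1 - p) = p .
\]
```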

The discussion above illustrates the principle that different levels of information lead to different prior probabilities. The risk for an individual about whom much is known (that is, a member of a small subpopulation with many known characteristics) can differ greatly from the risk for a large subpopulation with few known characteristics. There is, however, a further problem that limits the ability of conventional scientific studies of populations to establish disease risk for an individual. It generally arises from an oversimplified view of the causes of disease, particularly of chronic degenerative diseases such as cancer, cardiovascular disease and diabetes. Despite the multiplicity of causes, there is a tendency to believe that such diseases can be controlled by prescribing a single component or a single pharmaceutical compound, or that they are clinically indicated by a single factor. For example, it has been held that breast cancer can be controlled by a moderate reduction in fat intake, that colon cancer can be controlled by adding a particular dietary fiber, that heart disease is clinically indicated by high blood cholesterol, and that stomach cancer is clinically indicated by low blood levels of vitamin C. These oversimplified views have too often proved inadequate, particularly for identifying causes in individuals. There are many confounding variables to consider, quite apart from the great difficulty of extrapolating population data to individuals within the population. Testing and investigating a single component out of thousands, if not millions, of possible component causes can be highly uncertain, especially when the data are to be extrapolated to estimate disease risk for an individual.

These two difficulties, namely (a) extrapolating data from experimental populations to a randomly selected individual and (b) relying on a single therapeutic indication or single cause of disease, seriously compromise the estimation of future disease risk for a randomly selected individual. If an individual's risk for a specific disease could be determined more reliably, it would become possible to give individuals information with which they could make better-informed decisions about their personal behavior. More reliable methods of predicting future health would be a powerful means of enabling individuals to take ownership of their health and manage it effectively.

Moreover, even for individuals identified as being at high risk for a specific disease, such as heart disease, because they fall into several categories that are each highly correlated with that disease, currently available methods generally do not permit a quantitative prediction of when the disease may become life-threatening that is reliable enough, or that inspires enough confidence to motivate the individual, for effective measures to be taken in advance to reduce the risk. There is therefore a need for an effective general-purpose means that not only reliably predicts the occurrence of a specific health problem within a specific time period but is also useful for managing the preventive measures taken on the basis of that prediction.

The present invention relates to a computer system and method for predicting the future health of an individual. More specifically, the invention predicts an individual's future health by acquiring longitudinal data for a large number of biomarkers from a large human test population, statistically selecting predictive biomarkers, and determining and evaluating an appropriate multivariate evaluation function based on the selected biomarkers.

FIG. 1 shows the empirical distribution functions ("EDFs") of the linear discriminant function values, based on estimates, for Group D (solid line) and the complementary group (dotted line) of the Example.

FIG. 2 shows the empirical distribution functions ("EDFs") of the linear discriminant function, based on minimum random subject effect predicted values, for Group D and the complementary group of the Example.

An object of the present invention is to provide a means of assessing an individual's future risk of disease, so that greater benefit can be gained from preventing disease than from treating it.

More specifically, an object of the invention is to provide a general-purpose means of quantitatively predicting the future risk of a wide range of diseases for a selected individual, with substantially higher probability and greater reliability than is currently possible.

In particular, an object of the invention is to provide a computer-based method and apparatus that serve as a system for assessing a particular individual's future health risks and for managing the preventive measures taken to reduce those risks.

The invention identifies a selected set of biomarkers that carry information about the probability that an individual will acquire a specified biological condition within a specified time period or age interval, and uses cross-sectional and longitudinal values of those biomarkers to estimate the individual's risk.

The present invention also provides a computer-based system for predicting an individual's future health, comprising:

(a) a computer having a processor that contains a database of longitudinally acquired biomarker values from individual members of a test population, in which a subpopulation D of members is identified as having acquired a specified biological condition within a specified time period or age interval and a complementary subpopulation of members is identified as not having acquired the specified biological condition within the specified time period or age interval; and

(b) a computer program comprising the steps of: (1) selecting from the biomarkers, on the basis of the distribution of biomarker values of the individual members of the test population, a subset of biomarkers for discriminating between members of subpopulation D and members of the complementary subpopulation; and

(2) using the distribution of the selected biomarkers to develop a statistical procedure that can be used to (i) classify members of the test population as belonging either to a subpopulation PD having a prescribed high probability of acquiring the specified biological condition within the specified time period or age interval, or to a subpopulation having a prescribed low probability of acquiring it; or

(ii) quantitatively estimate, for each member of the test population, the probability of acquiring the specified biological condition within the specified time period or age interval.

The present invention further provides a computer-based system for predicting an individual's future health, comprising:

(a) a computer having a processor that contains a plurality of biomarker values from the individual; and

(b) a computer program comprising the steps of collecting from the individual a plurality of biomarker values, at least one of which is obtained by physical measurement, and applying a statistical procedure to the plurality of biomarker values to

(i) classify the individual as having either a prescribed high probability or a prescribed low probability of acquiring a specified biological condition within a specified time period or age interval; or (ii) quantitatively estimate the individual's probability of acquiring the specified biological condition within the specified time period or age interval;

wherein the statistical procedure is based on:

(1) collecting a database of longitudinally acquired biomarker values from individual members of a test population, in which a subpopulation D of members is identified as having acquired the specified biological condition within the specified time period or age interval and a complementary subpopulation of members is identified as not having acquired it;

(2) selecting from the biomarkers, on the basis of the distribution of biomarker values of the individual members of the test population, a subset of biomarkers for discriminating between members of subpopulation D and members of the complementary subpopulation; and

(3) using the distribution of the selected biomarkers to develop the statistical procedure.

Preferred embodiments of the present invention are described in detail below. These embodiments are illustrative, and the invention is not limited to them.

The invention is based on the premise that an individual's health is affected by complex interactions among a wide range of physiological and biochemical factors relating to nutritional, toxicological, genetic, hormonal, viral, infectious, anthropometric, lifestyle and other conditions that potentially indicate abnormal physiological and presumed pathological states of the individual. Based on this premise, the invention aims to provide a practical system for predicting future health using multivariate statistical analysis techniques that can yield quantitative predictions of an individual's future health from a statistical comparison of the individual's biomarker values with a longitudinally acquired database of sets of individual biomarker values for a large test population. As used herein, the term "biomarker" means a biological indicator that is associated with, or may affect, the diagnosis or prediction of an individual's health. The term "longitudinal" means that biomarker values are obtained periodically over a period of time, specifically on at least two measurement occasions.

The frequency and duration of longitudinal assessments may vary. For example, some biomarkers may be assessed annually over a period as short as two years or as long as a lifetime. In certain circumstances, such as the assessment of newborns, biomarkers may be assessed more frequently, for example daily, weekly or monthly. Longitudinal assessment occasions may be irregularly timed, that is, they may occur at unequal time intervals. The set of longitudinal assessments for an individual may be "complete," meaning that the data from all planned assessments and all planned biomarkers were actually obtained and are available. The set may also be "incomplete," meaning that some of the data are missing. An individual's biomarkers may be assessed cross-sectionally or longitudinally. The invention can carry out the required statistical analysis of data from individuals having any or all of the characteristics noted above: cross-sectional or longitudinal, regularly or irregularly timed, complete or incomplete.
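
The data characteristics described above (longitudinal or cross-sectional, regular or irregular, complete or incomplete) can be made concrete with a small record structure. The following is a minimal sketch, not the patent's data model: the class names, field names, biomarker names and numeric values are all hypothetical, and a planned assessment that was missed entirely is assumed to be stored with `None` values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class Assessment:
    """One assessment occasion: age at measurement and biomarker values."""
    age_years: float
    values: Dict[str, Optional[float]]  # None marks a missing measurement


@dataclass
class SubjectRecord:
    subject_id: str
    assessments: List[Assessment] = field(default_factory=list)

    def is_longitudinal(self) -> bool:
        # "Longitudinal" in the text means at least two measurement occasions.
        return len(self.assessments) >= 2

    def is_regular(self, tol: float = 1e-9) -> bool:
        # Regular timing means equal intervals between consecutive occasions.
        ages = sorted(a.age_years for a in self.assessments)
        gaps = [later - earlier for earlier, later in zip(ages, ages[1:])]
        return all(abs(g - gaps[0]) <= tol for g in gaps)

    def is_complete(self, planned_biomarkers: Sequence[str]) -> bool:
        # Complete means every planned biomarker has a value at every occasion.
        return all(
            a.values.get(name) is not None
            for a in self.assessments
            for name in planned_biomarkers
        )


# Made-up example: three occasions at unequal intervals, one missing value.
record = SubjectRecord("S001", [
    Assessment(40.0, {"cholesterol": 5.2, "glucose": 4.9}),
    Assessment(41.0, {"cholesterol": 5.6, "glucose": None}),
    Assessment(43.5, {"cholesterol": 5.9, "glucose": 5.1}),
])
print(record.is_longitudinal(), record.is_regular(),
      record.is_complete(["cholesterol", "glucose"]))
```

A record like this one is longitudinal, irregularly timed and incomplete, which is the kind of data the text says the analysis must be able to handle.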

The system of the invention for assessing future health provides a quantitative estimate of the probability that an individual will acquire a specified biological condition within a specified time period. This quantitative probability estimate is computed using the sequence of statistical analyses of the invention. The system will typically be used to provide quantitative predictions of a future biological condition 1, 2, 3 or 5 years, or as much as 15 to 20 years or more, into the future. Although the system will typically be used long before any signs of a particular disease are observed or detected, it may also be used to predict future health over relatively short periods of only months, weeks or even less.

There is no upper limit on the number of members in a test population, which may include millions of test members; a typical test population, however, will initially include a much smaller number. The test population may be selected from a large general population using appropriate statistical sampling techniques to improve the reliability of the collected data.

In a preferred embodiment, the present invention provides a computer system that uses a series of statistical analysis stages to produce mathematical and statistical functions that can be used to estimate an individual's risk of acquiring a specified biological condition within a specified time period or age interval and to identify the individuals at highest risk. Before Stage I of the method, the available subjects may be randomly assigned to a Training Sample or an Evaluation Sample. Stages I-III use data from the Training Sample, and Stage IV uses data from the Evaluation Sample. Stage I is a screening stage that uses correlation analysis, logistic regression, mixed model analysis, and other analyses to select a large subset of biomarkers carrying potentially useful information for risk estimation.
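The random assignment of subjects that precedes Stage I can be sketched as follows. This is only an illustration under assumptions: the function name `split_subjects`, the training fraction, and the seed are hypothetical, and the disclosure does not specify the proportion assigned to each sample.

```python
import random
from typing import List, Sequence, Tuple


def split_subjects(subject_ids: Sequence[str],
                   training_fraction: float = 0.5,
                   seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly assign subjects to a Training Sample and an Evaluation Sample."""
    rng = random.Random(seed)          # seeded so the assignment is reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * training_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical subject identifiers.
subjects = [f"S{i:03d}" for i in range(10)]
training, evaluation = split_subjects(subjects, training_fraction=0.6, seed=42)
print(len(training), len(evaluation))
```

Keeping the Evaluation Sample entirely out of Stages I-III is what allows Stage IV to give error estimates that are not biased by the model fitting.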

Stage II is a parameter estimation stage that uses mixed linear models to estimate the expected-value vectors and structured covariance matrix parameters of the candidate biomarkers, even in the presence of incomplete data and/or irregularly timed longitudinal data. Stage III is a biomarker selection and risk assessment stage that uses discriminant analysis and logistic regression to select the informative biomarkers (including the relevant longitudinal assessments) and to estimate the discriminant function coefficients, and that uses an inverse cumulative distribution function and logistic regression to estimate each individual's risk. Stage IV is an evaluation stage that uses the Evaluation Sample to provide unbiased estimates of the misclassification rates of the discriminant procedure.
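The Stage IV computation can be illustrated with a short sketch that compares the classifications produced by a fitted procedure with the outcomes observed in the Evaluation Sample. The function name and the example data are hypothetical; the disclosure does not prescribe this particular breakdown of error rates.

```python
from typing import Dict, Sequence


def misclassification_rates(actual: Sequence[bool],
                            predicted: Sequence[bool]) -> Dict[str, float]:
    """actual[i]: subject i acquired the condition.
    predicted[i]: subject i was classified as high risk."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n_pos = sum(1 for a in actual if a)
    n_neg = len(actual) - n_pos
    false_neg = sum(1 for a, p in zip(actual, predicted) if a and not p)
    false_pos = sum(1 for a, p in zip(actual, predicted) if not a and p)
    return {
        "false_negative_rate": false_neg / n_pos if n_pos else float("nan"),
        "false_positive_rate": false_pos / n_neg if n_neg else float("nan"),
        "overall": (false_neg + false_pos) / len(actual),
    }


# Illustrative Evaluation Sample outcomes and classifications.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, False, True, False, True, False, False, False]
rates = misclassification_rates(actual, predicted)
print(rates)
```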

Although the individual steps of the statistical procedure described above are disclosed in the statistical literature, these steps have not previously been combined into a single overall procedure as disclosed herein. In particular, the following conventional methods are described, for example, in the Encyclopedia of Statistical Sciences, edited by Samuel Kotz, Norman L. Johnson, and Campbell B. Read, published by John Wiley & Sons (1985), and are incorporated herein by reference: (a) correlation analysis (Vol. 2, pp. 193-204); (b) logistic regression (Vol. 5, pp. 128-133); (c) mixed model analysis (Vol. 3, pp. 137-141, under the title "Fixed-, Random-, and Mixed-Effect Model"); and (d) discriminant analysis (Vol. 2, pp. 389-397).

The present invention may use the conventional forms of these procedures, improvements to them, and newer forms of these procedures as they are developed and become known from time to time.

Correlation analysis is the term used for statistical methods for estimating the degree of linear association between two or more variables. As used herein, correlation encompasses various types of correlation, including but not limited to the Pearson product-moment correlation, Spearman's ρ, Kendall's τ, and the Fisher-Yates γ_F.
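Two of the correlation measures named above can be computed with a short, dependency-free sketch. The function names and the sample values are hypothetical. Spearman's ρ is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values: a monotone but nonlinear relationship.
cholesterol = [4.8, 5.2, 5.9, 6.4, 7.1]
outcome_score = [1.0, 1.5, 2.9, 3.1, 9.0]
print(pearson(cholesterol, outcome_score), spearman(cholesterol, outcome_score))
```

In a screening stage, a rank-based measure can flag a biomarker whose relationship with the outcome is monotone but not linear, which a product-moment correlation alone may understate.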

Logistic regression is the term used for statistical methods, including log-linear models, for analyzing the relationship between an observed dependent variable (which may be a proportion or a rate) and explanatory variables. The applications of logistic regression used herein are primarily analyses in which the dependent variable is a binary outcome representing an individual's membership in one of two complementary (non-overlapping) groups of subjects: a group that will acquire a specified disease or condition (sometimes called herein a "specified biological condition") within a specified time period or age interval, and a group that will not. The explanatory variables here are typically biomarkers or functions of biomarkers.
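As a minimal sketch of how a fitted logistic model turns biomarker values into a probability, the following applies the standard logistic function to a linear combination of explanatory variables. The intercept, coefficients, and biomarker values are hypothetical and are not estimates from any data in the disclosure.

```python
from math import exp
from typing import Sequence


def logistic_risk(intercept: float,
                  coefficients: Sequence[float],
                  biomarkers: Sequence[float]) -> float:
    """Probability of membership in the group that acquires the condition."""
    if len(coefficients) != len(biomarkers):
        raise ValueError("one coefficient is required per biomarker")
    eta = intercept + sum(b * x for b, x in zip(coefficients, biomarkers))
    return 1.0 / (1.0 + exp(-eta))


# Hypothetical coefficients for two explanatory variables
# (for example, total cholesterol and HDL cholesterol).
risk = logistic_risk(-6.0, [0.8, -1.2], [5.9, 1.1])
print(round(risk, 4))
```

The binary outcome enters when the coefficients are estimated; once they are fixed, the function returns a value strictly between 0 and 1 for any biomarker profile.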

Mixed model analysis is the term used for statistical methods for analyzing the expected-value relationships between correlated dependent variables (multivariate measurements or observations, longitudinal measurements or observations of a single variable, and/or longitudinal multivariate measurements or observations) and "independent variables", which may include covariates such as age and classification variables (representing group membership), and for analyzing the structures and parameters that represent the covariances among the correlated measurements or observations. "Mixed model" is a term that encompasses fixed-effects models, random-effects models, and mixed-effects models. Mixed models may have linear or nonlinear structure in the expected-value model and/or the covariance model. A mixed model analysis typically includes estimation of the expected-value parameters (often denoted β) and the covariance matrix parameters (often of the form Σ = ZΔZ' + V, where Δ and V are matrices of unknown parameters). A mixed model analysis may also include random subject effects (denoted d_k for the k-th subject) and the so-called "best linear unbiased predictors" (BLUPs) for individual subjects. It typically includes procedures for testing hypotheses about the expected-value and/or covariance parameters and for constructing confidence regions for the parameters.
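The covariance structure Σ = ZΔZ' + V can be made concrete with a small sketch for one subject. The assumptions here are illustrative and not taken from the disclosure: Z holds a random intercept and a random slope over time, Δ is a 2 × 2 covariance matrix of those random effects, and V is a residual variance times the identity. Because Z is built from each subject's own assessment times, unequal spacing needs no special handling.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def subject_covariance(times: List[float], delta: Matrix,
                       residual_var: float) -> Matrix:
    """Sigma = Z Delta Z' + V with Z = [1, t] per occasion and V = residual_var * I."""
    z = [[1.0, t] for t in times]
    zdz = matmul(matmul(z, delta), transpose(z))
    n = len(times)
    return [[zdz[i][j] + (residual_var if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Hypothetical parameter values; assessment times are irregular on purpose.
delta = [[0.5, 0.1],
         [0.1, 0.2]]
sigma = subject_covariance([0.0, 1.0, 3.5], delta, residual_var=0.3)
for row in sigma:
    print([round(v, 3) for v in row])
```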

Discriminant analysis refers to statistical methods for developing discriminant functions that can be used to assign a multivariate observation (for example, a vector of biomarker values from one subject), on the basis of its value, to one of two complementary (non-overlapping) groups of subjects (for example, a group that will acquire a specified disease or condition within a specified time period or age interval and a group that will not). A discriminant function may also be described as a function used as the basis for computing an estimate of the probability that a given observation belongs to a given group. For the present invention, the relevant observations typically comprise a plurality of biomarker values obtained from each member of a large test population or from an individual test subject. The discriminant functions of the present invention use the distribution of the values of each biomarker concerned. Such a distribution consists of the biomarker values themselves together with the total number of individual members of the test population having each value. Accordingly, the present invention employs statistical procedures that use distributions based on the individual biomarker values obtained for each biomarker from the individual members of a test population, as distinct from, for example, mean biomarker values obtained from different test populations for different biomarkers.
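A linear discriminant function of the kind described above can be sketched for two biomarkers as follows. The sketch assumes equal prior probabilities and a common (pooled) covariance matrix for the two groups; those assumptions, the function names, and the data are illustrative and are not stated in the disclosure. Under the same assumptions, the discriminant score also yields an estimate of the probability of membership in group D.

```python
from math import exp
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]


def mean(rows: Sequence[Vector]) -> Vector:
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def pooled_covariance(d: Sequence[Vector], nd: Sequence[Vector]) -> List[List[float]]:
    """Pooled within-group covariance matrix of two groups."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (d, nd):
        m = mean(rows)
        for r in rows:
            dev = (r[0] - m[0], r[1] - m[1])
            for i in range(2):
                for j in range(2):
                    s[i][j] += dev[i] * dev[j]
    dof = len(d) + len(nd) - 2
    return [[v / dof for v in row] for row in s]


def fit_lda(d: Sequence[Vector], nd: Sequence[Vector]) -> Tuple[Vector, float]:
    """Return (w, c) with score(x) = w.x + c; positive scores favor group D."""
    md, mn = mean(d), mean(nd)
    s = pooled_covariance(d, nd)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = (md[0] - mn[0], md[1] - mn[1])
    w = (inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1])
    mid = ((md[0] + mn[0]) / 2, (md[1] + mn[1]) / 2)
    c = -(w[0] * mid[0] + w[1] * mid[1])
    return w, c


def posterior_d(x: Vector, w: Vector, c: float) -> float:
    """Estimated probability that x belongs to group D (equal priors assumed)."""
    score = w[0] * x[0] + w[1] * x[1] + c
    return 1.0 / (1.0 + exp(-score))


# Illustrative (total cholesterol, HDL) pairs for subjects who did (D)
# and did not (not D) acquire the condition.
group_d = [(6.5, 1.0), (7.0, 0.9), (6.8, 1.2), (7.4, 1.0)]
group_not_d = [(5.0, 1.4), (5.2, 1.6), (4.8, 1.3), (5.5, 1.5)]
w, c = fit_lda(group_d, group_not_d)
print(posterior_d((7.0, 1.0), w, c) > 0.5, posterior_d((5.0, 1.5), w, c) < 0.5)
```

The fit uses each subject's own values to form the group means and the pooled covariance, which corresponds to working from the distribution of individual biomarker values and not from published group averages.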

A "discriminant function" means any one of several different types of functions or procedures for classifying an observation (scalar or vector) into two or more groups, including but not limited to linear discriminant functions, quadratic discriminant functions, nonlinear discriminant functions, and various types of so-called optimal discriminant procedures.

The computer system of the present invention includes a computer having a processor capable of running a computer program or series of computer programs (hereinafter simply "the computer program") comprising steps that perform the computations and data processing required in the various steps and stages of the invention. The processor may be a microprocessor, a personal computer, a mainframe computer, or in general any digital computer capable of running a computer program that can perform the required computations and data processing. The processor typically comprises a central processing unit, random access memory (RAM), read-only memory (ROM), one or more buses or channels for transferring data among the components, one or more display devices (for example, a monitor), one or more input/output devices (for example, floppy disk drives, fixed hard disk drives, and printers), and adapters that control the input/output and/or display devices and connect them to the buses or channels. A particular processor may include all of these components or only some of them.

The computer program may be stored in ROM, on a disk or set of disks, or on any other accessible medium that can be used to store and distribute computer programs.

The computer program can perform the computations for the various stages and steps of the analysis on cross-sectional and/or longitudinal multivariate biomarker data.

Biomarker data are preferably collected from a test population large enough that the total number of members acquiring a specified biological condition of interest within a two- to three-year period is sufficiently large for the discriminant analysis adopted to be meaningful for that condition. Because one feature of the invention is to provide a means of using the same database to make predictions about acquiring any of the major diseases and/or dying from any of the major underlying causes of death within a period as short as one to two years, the test population is preferably large enough for the subject system to be applied to any of the more common diseases and underlying causes of death that account for at least about 60%, and preferably at least about 75%, of all relevant deaths. Relevant deaths here are pathologically caused deaths, as distinct from deaths by accident, homicide, or suicide.

For example, data from the Centers for Disease Control and Prevention, Monthly Vital Statistics Report, Vol. 44, No. 7, Supplement, February 26, 1996, indicate that more than 75% of all pathologically caused deaths can be attributed to the following causes, given as crude death rates (as distinct from age-adjusted rates) and compared with an overall crude death rate of approximately 880/100,000 for pathologically caused deaths: malignant neoplasms (ICD 140-208), 205.6/100,000; major cardiovascular diseases (ICD 390-448), 367.3/100,000; chronic pulmonary diseases (ICD 490-496), 39.2/100,000; and diabetes, 20.9/100,000. These are substantially the diseases that respond to altered dietary and lifestyle conditions and that show major dietary and lifestyle influences, which are indicated by a variety of definable and measurable biomarkers.

As one of the unique features of the invention, the computer system and apparatus can be used to determine a specified individual's risk of acquiring any of these major diseases, based on comparing the individual's profile of biomarker values with biomarker values obtained from members of a large test population. Because these major diseases are known to share many common factors that may be reflected in biomarker values, the subject computer system can be used to assess the risks of acquiring these diseases simultaneously. For example, total serum cholesterol is known to be a biomarker associated with many of these diseases. By using the invention to consider each individual's profile of biomarkers that are significant predictors, in combination with the other significant biomarker predictors of a particular disease or underlying cause of death, and to compare that profile with the test population, an individual subject can be advised, with a specified quantitative reliability, which disease poses the greatest risk to that individual.

A feature of the invention is that individuals at highest risk of acquiring a specific disease can be given a quantitative probability of acquiring that disease within a specified future time period or age interval, before the usual symptoms of the disease become apparent. For the many diseases known to respond to altered dietary and lifestyle conditions, an individual given this information can make behavioral changes that may reduce the risk of the identified disease.

Moreover, as more data become available for larger numbers of subjects over longer periods, more detailed subdivisions of each of the major diseases and underlying causes of death, as well as less common diseases and causes of death, can be defined and included in the method of the invention. For example, cancer may be subdivided into types such as liver, lung, stomach, and prostate cancer. The computer system of the invention thus provides a means of covering an ever larger portion of the population in predicting each individual's quantitative risk of acquiring, or not acquiring, a specified pathologically caused disease within a specified time, with diseases defined in successively narrower terms.

The comprehensive set of biomarkers for which data are collected from the test population preferably includes as large and varied a number of biomarkers as possible that are known or believed to be related to the most common diseases and underlying causes of pathologically caused death. In addition, values of representative groups of biomarkers from each of the known and generally accepted genetic, physiological, and biochemical domains of biological function may be included. Additional biomarkers are preferably included for everything that can be measured in biological samples, for example in samples that can be stored for analysis after collection.

The biological samples preferably include blood and urine samples, but other biological samples may be included. For example, samples of saliva, hair, toenails and fingernails, feces, and expired air may also be collected. Such biological samples are typically obtained from substantially all members of the test population. In some cases, however, a particular subset of biomarkers may be obtained only from a particular subset of the population.

Biomarker data relating to nutritional and lifestyle habits are typically obtained from each member of the test population at the same time that biological samples are collected. Biomarkers relating to nutritional and lifestyle habits may include, for example, those shown in Table 1. The nutritional and lifestyle biomarkers listed in Table 1 are representative of the types of biomarkers relating to nutritional and lifestyle habits, and do not exclude other such biomarkers that fall within the scope of the invention. Biomarkers that represent important nutritional determinants, as well as clinical and epidemiological biomarkers, may be determined by other factors such as nutrient intake. Thus, the description of categories shown in Table 9 represents only an exemplary classification of the categories that may be selected for obtaining biomarker values. Nutritional and lifestyle biomarkers that may change over time are preferably collected and recorded for each member of the test population each time a biological sample is collected.

Table 1

List of biomarkers that may be used in the method of the present invention for predicting future health status.

Serum biomarkers

Total cholesterol*

HDL cholesterol*

LDL cholesterol*

Apolipoprotein B*

Apolipoprotein A₁*

Triglycerides*

Lipid peroxides (malondialdehyde equivalents: TBA)*

α-Carotene (collected with lipoprotein carrier)*

β-Carotene (collected with lipoprotein carrier)*

γ-Carotene (collected with lipoprotein carrier)*

Zeta-carotene (collected with lipoprotein carrier)*

α-Cryptoxanthin (collected with lipoprotein carrier)*

Lycopene (collected with lipoprotein carrier)*

Lutein (collected with lipoprotein carrier)*

Anhydro-lutein (collected with lipoprotein carrier)*

Neurosporene (collected with lipoprotein carrier)*

Phytofluene (collected with lipoprotein carrier)*

Phytoene (collected with lipoprotein carrier)*

α-Tocopherol (collected with lipoprotein carrier)*

γ-Tocopherol (collected with lipoprotein carrier)*

Retinol*

Retinol-binding protein*

Ascorbic acid*

Fe*

K*

Mg*

Total phosphorus*

Inorganic phosphorus*

Se*

Zn*

Ferritin*

Total iron-binding capacity*

Fasting glucose*

Urea nitrogen*

Uric acid*

Prealbumin*

Albumin*

Total protein*

Bilirubin*

Thyroid stimulating hormone T3*

Thyroid stimulating hormone T4*

Cotinine

Aflatoxin-albumin adducts

Hepatitis B anti-core antibody (HBcAb)

Hepatitis B surface antigen (HBsAg+)

Candida albicans antibodies

Epstein-Barr virus antibodies

Herpes simplex virus type 2 antibodies

Human papillomavirus antibodies

Helicobacter pylori antibodies

Estradiol (E2) (adjusted for female cycle)*

Sex hormone-binding globulin*

Prolactin (adjusted for female cycle)*

Testosterone (adjusted for female cycle)*

Hemoglobin*

Myristic acid (14:0)*

Palmitic acid (16:0)*

Stearic acid (18:0)*

Arachidic acid (20:0)*

Behenic acid (22:0)*

Tetracosanoic acid (24:0)*

Myristoleic acid (14:1)*

Palmitoleic acid (16:1)*

Oleic acid (18:1n9)*

Gadoleic acid (20:1)*

Erucic acid (22:1n9)*

Tetracosenoic acid (24:1)*

Linoleic (18:2n6)*

Linoleic acid (18:2n6)*

γ-Linolenic acid (18:3n6)*

Eicosadienoic acid (20:2n6)*

Dihomo-γ-linolenic acid (20:3n6)*

Arachidonic acid (20:4n6)*

Eicosapentaenoic acid (20:5n3)*

Docosatetraenoic acid (22:4n6)*

Docosapentaenoic acid (22:5n3)*

Docosahexaenoic acid (22:6n3)*

Total saturated fatty acids (16:0, 18:0, 20:0, 22:0, 24:0)*

Total monounsaturated fatty acids (14:1, 16:1, 18:1n9, 20:1, 24:1)*

Total n3 polyunsaturated fatty acids (18:3n3, 20:5n3, 22:5n3, 22:6n3)*

Total n6 polyunsaturated fatty acids (18:3n6, 20:2n6, 20:3n6, 20:4n6, 22:4n6)*

Total n3 polyunsaturated / total n6 polyunsaturated fatty acids (18:3n3, 20:5n3, 22:5n3, 22:6n3 / 18:3n6, 20:2n6, 20:3n6, 20:4n6, 22:4n6)*

Total polyunsaturated fatty acids (18:2n6, 18:3n3, 18:3n6, 20:2n6, 20:3n6, 20:4n6, 20:5n3, 22:4n6, 22:5n3, 22:6n3)*

Total polyunsaturated / saturated fatty acids (18:2n6, 18:3n3, 18:3n6, 20:2n6, 20:3n6, 20:4n6, 20:5n3, 22:4n6, 22:5n3, 22:6n3 / 16:0, 18:0, 20:0, 22:0, 24:0)

[Approximately 10-30 genetic markers, depending on the disease investigated]

Urine biomarkers

Orotidine

Cl*

Mg*

Na*

Creatinine

Volume

NO₃

Aflatoxin (AF) M₁

AF N⁷-guanine

AF P₁

AF Q₁

Aflatoxicol

8-Deoxyguanosine

Nutrient intake from food (by questionnaire)

Total protein*

Animal protein*

Plant protein*

Fish protein*

Lipid*

Soluble carbohydrate*

Total dietary fiber*

Total calories*

Percentage of caloric intake from lipid*

Cholesterol*

Ca*

P*

Fe*

K*

Mg*

Mn*

Na*

Se*

Zn*

Total tocopherols (collected from lipid intake)*

Total retinoids*

Total carotenoids*

Thiamine*

Riboflavin*

Niacin*

Vitamin C*

[Approximately 30 different types of food]*

[Approximately 30 different fatty acids]*

Red blood cells

RBC glutathione reductase*

RBC catalase*

RBC superoxide dismutase*

Anthropometric factors

Height*

Weight*

* denotes a biomarker that represents an important nutritional determinant.

Biological samples are analyzed to determine the biomarker value for each component of the sample for which a value is required. It will be appreciated that any component that can be found and measured in a biological sample falls within the scope of the invention. For example, genetic biomarkers that can be measured in blood samples, and biomarkers that can be measured in any other suitable biological sample, may also be included.

Because another feature of the invention is the identification of new sets of biomarkers useful for predicting disease and death, the biomarker set may include biomarkers not previously known to have statistical significance for predicting a particular disease or cause of death. Since the total number of biomarkers that may be used is therefore essentially unlimited, the number actually used will generally be limited only by practical economic and methodological considerations.

Because yet another feature of the invention is to provide a computer system for predicting a specified biological condition within a specified future time period or age interval, the set of biomarker values may be restricted to those biomarkers having statistical significance for predicting a single specified biological condition. Thus, although the subject system is envisioned as a general-purpose means of predicting and monitoring most, and eventually all, of the major types of disease and underlying causes of death, the method disclosed herein may be used for one disease or cause of death at a time.

Biological samples may be analyzed immediately after collection or stored for later analysis. Because large numbers of samples are expected to be collected over relatively short periods and under circumstances not conducive to immediate analysis, the samples are preferably stored for later analysis. Since the samples will typically be stored for substantial periods, they are typically frozen.

The samples are stored and transported under conditions that preserve sample integrity. Such techniques are described, for example, in Chen, J., Campbell, T.C., Li, J., and Peto, R., Diet, Life-style and Mortality in China: A Study of the Characteristics of 65 Chinese Counties, Oxford University Press (Oxford, UK), Cornell University Press (Ithaca, NY), and People's Medical Publishing House (Beijing, PRC), 1990.

The use of physical specimens such as biological samples is particularly preferred because they provide a practical means of building a rich source of longitudinally acquired biomarker data that can be collected, stored, and analyzed using established, cost-effective techniques. Biological samples are preferably collected from the test population over an extended period of at least 5-10 years, and most preferably 15-20 years or more, so that the quality of the resulting data continues to support increasingly reliable probability predictions.

Because the reliability of the system of the present invention is ultimately determined by the quality of the collected biomarker data, appropriate measures are taken to ensure the integrity of the data in every respect. With regard to biomarker stability, for example, appropriate measures must be taken to account for the many factors that can affect biomarker values or cause them to deteriorate over time.

In addition, although the subject invention is typically described in terms of obtaining biomarker data from physical specimens taken from members of the test population or from a test subject, and from the dietary and lifestyle survey of each, the use of biomarker data obtained from any source whatsoever falls within the scope of the present invention. For example, the methods of the present invention may use medical diagnostic data obtained from electronic biological measurement techniques, such as electroencephalographic (EEG) data, electrocardiographic (ECG) data, radiological (X-ray) data, and magnetic resonance imaging (MRI), either alone or, preferably, in combination with the biomarker data obtained longitudinally from biological samples and from dietary and lifestyle surveys.

Because the test population is preferably monitored on a year-to-year basis, deaths can be expected to be observed in a test population that is representative of the general population as a whole. For each death in the test population, the individual is identified and the underlying cause of death is recorded, preferably using a known coding system. One example of a known coding system is the International Statistical Classification of Diseases and Related Health Problems (ICD-10), Tenth Revision, World Health Organization (WHO), Geneva, 1992-1994. Other coding systems may also be used and fall within the scope of the present invention.

By using an efficient system for identifying when a member of the test population acquires a disease or a specified biological condition, morbidity data are collected in addition to the biomarker and mortality data of the test population.

The database of biomarker values preferably includes, for each individual, information recording the date and the individual's age when the biomarkers and biomarker specimens were collected and recorded, together with accurate information from examinations of the individual recording the occurrence of disease, medical conditions, medical pathology, or death, including the diagnosis and the date of onset. The database includes the values of the biomarkers assessed before, during, and after each onset, insofar as this is feasible.
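
One way to picture a record in such a database is sketched below in Python. The class and field names (`BiomarkerSample`, `OnsetEvent`, `SubjectRecord`) are illustrative assumptions and do not come from the specification; the sketch only shows how dated samples and dated onset events could be kept side by side so that values before and after an onset can be retrieved.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class BiomarkerSample:
    """One examination: when it took place and what was measured."""
    collected_on: date
    age_years: float
    values: Dict[str, Optional[float]]  # None marks a value that was not obtained


@dataclass
class OnsetEvent:
    """A recorded disease, medical condition, pathology, or death."""
    diagnosis_code: str  # hypothetical field, e.g. a code from a coding system
    onset_on: date


@dataclass
class SubjectRecord:
    subject_id: str
    samples: List[BiomarkerSample] = field(default_factory=list)
    events: List[OnsetEvent] = field(default_factory=list)

    def samples_before(self, event: OnsetEvent) -> List[BiomarkerSample]:
        """Samples collected strictly before the onset date of an event."""
        return [s for s in self.samples if s.collected_on < event.onset_on]
```

Keeping the collection date on every sample is what allows the same record to serve both pre-onset and post-onset analyses.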

Because one aspect of the present invention concerns identifying biomarkers whose statistical significance for predicting the future onset of a particular disease or underlying cause of death is not yet known, as many biomarkers as possible should be monitored. In a preferred embodiment, approximately 200 biomarker values are obtained from each member of the test population, although there is no practical limit on the number of biomarkers that can be used to develop the computer-based statistical analysis methods.

Because the present invention aims to provide a practical and reliable system for predicting a specified biological condition within a specified time period or age interval, a substantially complete set of biomarkers is collected from each member of the test population on at least two different occasions. To obtain information on trends and changes over time, a complete set is more preferably collected on at least three different occasions, and biomarker values are most preferably collected over as long a period as is practically feasible.

In another aspect, the present invention is based on the theory that the rate of change of an individual's biomarker values, or the change in that rate of change, may be more important for predicting future health than the actual level of a given biomarker. The discriminant function is typically determined using a substantially complete set of biomarker values. As a practical matter, it is not reasonable to expect a totally complete set of biomarker values from every member of the test population at every examination; the statistical analysis methods of the present invention therefore include methods for handling incomplete data in a statistically valid manner.
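
The two derived quantities mentioned here can be illustrated with simple finite differences. The following is a minimal sketch, assuming one biomarker measured at known times for one individual; the function names are hypothetical and the specification does not prescribe this particular calculation.

```python
from typing import List


def rates_of_change(times: List[float], values: List[float]) -> List[float]:
    """Change in value divided by elapsed time, for each pair of consecutive visits."""
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]


def changes_in_rate(times: List[float], values: List[float]) -> List[float]:
    """Change in the rate of change, measured between interval midpoints."""
    rates = rates_of_change(times, values)
    midpoints = [(times[i] + times[i + 1]) / 2 for i in range(len(times) - 1)]
    return rates_of_change(midpoints, rates)
```

With two visits only one rate is available; a third visit is needed before any change in rate can be computed.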

Another object of the present invention is to provide a means of quantitatively assessing the future risk of a specific disease, and a practical means of defining and identifying those biological conditions that carry the lowest risk of all future diseases. Accordingly, the term "specified biological condition" as used herein encompasses the full range of health conditions, from the healthiest state to the most serious disease state. The present invention thus provides a system for monitoring and predicting future health across that range, from the healthiest to the poorest state of health.

Although the results obtained from the test population may be used to predict the future health of the general population in a particular region (country), it is not essential to select the test population from the same general population for which the individual future health predictions are to be made. This is because populations of individuals who have the disease probabilities characteristic of their own region, and who migrate to a new region whose population has a different set of disease probabilities, are known to acquire the diseases characteristic of the region to which they move. This occurs concurrently with, and in step with, their adoption of the diet and lifestyle of the new region. In other words, all races and ethnic groups of the world tend to acquire the same general diseases, irrespective of whatever characteristics may be unique to each race or ethnic group.

One feature of the present invention is a system for predicting when a future disease problem will arise, before that problem would ordinarily be diagnosed. The future time of onset of a particular health problem for a particular individual can be predicted with a specific quantitative probability estimate, based on applying the subject discriminant analysis methods to a database collected from a large test population. Moreover, the system predicts specific future health problems with greater reliability as more data are collected, from larger test populations, over longer periods of time.

Biological samples are typically analyzed for each biomarker for which a quantitative value is needed. For reasons of cost and convenience, and given the large number of samples that can be collected, samples are initially analyzed only for individuals who have already been diagnosed with a disease or who have died during the period in which samples were collected, together with a randomly selected portion of the remainder of the test population. For example, if the annual mortality rate of the monitored test population is in the typical range of 2-3%, a test population of 300,000 members will experience 6,000-9,000 deaths per year, with a substantial number of deaths attributable to each of the major underlying causes of death.
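
The arithmetic behind the example can be checked with a few lines of Python. The function name is a hypothetical helper introduced only for this illustration.

```python
from typing import Tuple


def expected_annual_deaths(population: int, rate_low: float, rate_high: float) -> Tuple[int, int]:
    """Range of expected deaths per year for a given annual mortality-rate range."""
    return round(population * rate_low), round(population * rate_high)


# 300,000 members at 2-3% annual mortality
print(expected_annual_deaths(300_000, 0.02, 0.03))  # (6000, 9000)
```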

Another feature of the present invention involves waiting until a sufficient number of deaths have occurred in the test population before selecting the individuals whose biomarker values are to be determined initially. A group of currently living members is also selected from the remainder of the test population. Because the need for enough samples to obtain statistically significant results must be balanced against the need to control costs, the system of the present invention provides a way to confine the cost of analytical measurements to those samples that tend to yield the most information for the lowest cost. Naturally, the more deaths that occur in the test population, the more samples there are to analyze at a given time. The value of the resulting data for establishing more reliable quantitative predictions of future health must, however, be balanced against the cost of obtaining additional biomarker values. This is one of many features that distinguish the present invention from other known systems. Deferring sample analysis in this way allows the cost to be postponed until the results obtained have greater practical value.

Once the samples to be analyzed have been selected, the biomarker values can be determined using methods known in the art. Because large numbers of samples are each analyzed for many, if not most, of the biomarkers by separate measurements, these measurements are typically made with a multi-channel analyzer, for example the BMD/Hitachi Model 747-100 manufactured by Boehringer Mannheim Corp., Indianapolis, Indiana. Such analyzers can be configured to measure the values of a selected plurality of biomarkers simultaneously using a relatively small portion of the whole sample. For example, the quantity of blood collected is typically about 15 ml, whereas only about 10-30 μl may be required for the analytical measurements. Similarly, the quantity of urine collected is about 50 ml, whereas only about 100 μl is required for analysis. Correspondingly small quantities can be used for other biological samples.

In a preferred embodiment, physically preservable biological samples are used, and because only a relatively small analytical sample is needed for measurement at any arbitrarily chosen time, the methods of the present invention can be applied to any biomarker detectable in a given sample long after the sample was collected. For example, although the system may initially be used to analyze the biomarkers currently considered most significant, it can readily be extended to include other biomarkers whose significance for predicting future health has not yet been recognized. In principle, given adequate time and economic resources, every biomarker detectable in the preserved biological samples may eventually be measured.

Although it is desirable to obtain a substantially complete set of biomarker values for each member of the test population, this is usually very difficult to achieve when samples are collected longitudinally from a geographically dispersed population base. If incomplete data sets were discarded and left entirely unused, as in conventional statistical analysis methods, a substantial quantity of data, ultimately representing a large part of the initial test population, would have to be discarded. That would waste substantial resources and could seriously degrade the quality of the results produced from the remaining data. The computer-based method includes a feature that makes use of all collected data by applying statistically valid techniques for filling in "missing values". This is especially useful for the subject methods, which are based on collecting far more data than any prior art approach, for large numbers of test members from geographically dispersed test populations. Obtaining comprehensive data from large, diverse test populations is particularly desirable for obtaining biomarker values from members whose diets and lifestyles span the wide range of human experience.

The following terminology is used to describe the present invention.

A "specified biological condition" may refer, for example, to any one of the following:

a specific disease (e.g., diabetes), for example as classified in the International Statistical Classification of Diseases and Related Health Problems, supra;

a specific medical or health condition or syndrome (e.g., hypertension, defined by the deviation of a biomarker or biomarker value from the normal distribution for the general population);

a specific medical event and its sequelae (e.g., ischemic stroke, the resulting death or survival, and stroke-related partial paralysis and associated conditions; myocardial infarction (MI), the resulting death or survival, and MI-related conditions);

premature death from any cause (death at an age earlier than the average age at death, as derived from the individual's sex and age at the first assessment);

death at a specified age;

a newly defined category based on having a particular set of biomarker values for a particular set of biomarkers.

Acquisition or onset of a specified biological condition refers to the situation in which an individual does not have the condition at a given assessment time but subsequently experiences it. When an individual is said to have acquired a specified biological condition, onset is defined as occurring when the individual first has that condition.

For a specified biological condition and a population of people who do not have, and have not had, that condition, two complementary subpopulations, Group D and Group D̄, are defined as follows.

Group D: the subpopulation of individuals who will acquire the specified biological condition within a specified time window. The window may be a specified calendar period (e.g., "the next five years"), a specified age interval (e.g., "between 65 and 70 years of age"), or a similar time or age interval.

Group D̄: the subpopulation of individuals who will not acquire the specified biological condition within the specified time window.

These subpopulations of subjects are characterized in part by particular longitudinal patterns of data for many (as many as possible) biomarkers. Longitudinal patterns include the levels or concentrations of the biomarkers and changes in those levels. If it is recognized that longitudinal biomarker patterns partly characterize the subpopulations, and the necessary data are available for a particular individual, that individual can be classified into one of the two complementary groups according to whether his or her pattern resembles that of Group D or Group D̄.

Group PD: the group of people who are predicted, at the start of the specified time window, to acquire the specified biological condition within that window, that is, who appear to belong to Group D. These individuals are described as having a predetermined high probability of acquiring the specified biological condition within the specified time window.

Group PD̄: the group of people who are predicted, at the start of the specified time window, not to acquire the specified biological condition within that window, that is, who appear to belong to Group D̄. These individuals are described as having a predetermined low probability of acquiring the specified biological condition within the specified time window.

The significance of the term "predetermined high probability" varies with the specified biological condition: it may mean a probability of a few percent, perhaps as low as 1% or less, or it may mean 10%, 20%, 50%, or a substantially higher percentage. For example, the increased risk of lung cancer due to smoking is recognized as a significant and preferably avoidable risk, even though the actual several-fold increase in risk from smoking still leaves the probability of developing lung cancer in the range of 5-10% over a period as long as 15-20 years or more. In any case, a quantitative predetermined probability can be established for the particular biological condition to which the system is applied. A "predetermined low probability" may be specified simply as the probability of not belonging to the high-risk group for the specified biological condition, or it may be specified separately as a particular value.

Once a statistically adequate number of members of the test population can be identified as belonging to Group D or Group D̄, the biomarker values of the members of Group D can be compared with those of Group D̄ using the methods of the present invention. The comparison yields a statistical procedure for classifying members into Groups PD and PD̄, or for estimating, for each member of the test population, the probability of acquiring the specified biological condition within the specified time period or age interval, that is, the probability of belonging to Group PD or to Group PD̄. In a preferred embodiment of the present invention, the statistical procedure for classifying members into Groups PD and PD̄ takes the form of a discriminant analysis procedure, as described below, and may be referred to as the "discriminant procedure" or "discrimination procedure". A "statistically adequate number" may be defined as one for which the total number of biomarkers used in the analysis and the total number of test members for whom biomarker values are available are large enough for the computational procedures used in the subject methods to converge.
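
The relationship between an estimated probability and the two predicted groups can be sketched as a simple threshold rule. This is only an illustration under stated assumptions: the function name, the label strings, and the threshold values used in the example are hypothetical, and the specification leaves the predetermined probability to be chosen for each biological condition.

```python
def classify(probability: float, threshold: float) -> str:
    """Assign the high-probability group when the estimate reaches the threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # "PD" is the predicted high-probability group; "PD-bar" is its complement.
    return "PD" if probability >= threshold else "PD-bar"
```

For a condition where a 5% chance is already considered high, `classify(0.07, 0.05)` places the individual in the high-probability group, while the same estimate would fall in the complementary group under a threshold of 0.20.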

The discriminant procedure has two associated error rates:

(1) the false positive probability, that is, the probability that a future subject will be classified into Group PD but will actually belong to Group D̄;

(2) the false negative probability, that is, the probability that a future subject will be classified into Group PD̄ but will actually belong to Group D.

A preferred embodiment of the present invention includes an associated method for obtaining accurate estimates of these two error rates.
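
The two error rates can be estimated from any set of subjects whose predicted group and actual group are both known. The sketch below assumes the conventional reading in which each rate is conditioned on the subject's actual group; the text could also be read as a joint probability, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def error_rates(predicted_pd: Sequence[bool], actual_d: Sequence[bool]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate).

    predicted_pd[i] is True when subject i was classified into the predicted
    high-probability group; actual_d[i] is True when subject i actually
    acquired the condition.
    """
    if len(predicted_pd) != len(actual_d):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(predicted_pd, actual_d))
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    n_actual_d = sum(1 for _, a in pairs if a)
    n_actual_not_d = len(pairs) - n_actual_d
    fpr = false_pos / n_actual_not_d if n_actual_not_d else float("nan")
    fnr = false_neg / n_actual_d if n_actual_d else float("nan")
    return fpr, fnr
```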

A preferred embodiment of the present invention consists of three phases, each comprising a number of steps. The three phases are as follows:

Phase I. Establish the evaluation method and select biomarkers for consideration.

Phase II. Reduce the candidate biomarkers to a set of selected biomarkers with discriminatory power; perform mixed-model estimation of the covariance structure and predicted values.

Phase III. Compute the discriminant function using the estimated means and predicted values; compute a logistic predicted value for each subject; estimate the error rates of the discriminant function.

Each phase has several steps.

Within a phase, some sets of steps are iterative; that is, a particular set of steps may be repeated several times until a particular objective is achieved. The phases and preferred embodiments of their steps are described below.

Phase I. Establish the evaluation method and select biomarkers for consideration

The following steps appear in a preferred embodiment of the present invention.

Step 1: Select a method for estimating the error rates of the procedure

Any statistically sound method of estimating error rates may be specified. Two of the many methods that may be used are the training sample/validation sample method and the subsampling (or "resampling") method.

Training sample/validation sample method: In this method, the test population is randomly divided into two subsets, the "training sample" and the "validation sample". Every subject (member of the test population) is assigned to either the training sample or the validation sample. Data from subjects in the training sample are used in the statistical analyses that lead to the specification of the discriminant procedure and the probability estimation procedure. Data from subjects in the validation sample are used to estimate the error rates of the discriminant procedure and the distribution of the probability estimates.

Subsampling method: "Subsampling" refers to a class of statistical methods, including jackknifing and bootstrapping, that can be used to produce reduced-bias estimates of the error rates. In the subsampling method, data from all subjects are used in the statistical analyses that lead to the specification of the discriminant procedure and/or the distribution of the probability estimates. Using all of the data can lead to a better discriminant procedure and/or probability estimation procedure than would be obtained with the training sample/validation sample method, particularly when (1) the test population is not large, or (2) the prior probability of acquiring the biological condition is small, even in a large test population. In the present invention, the subsampling method is computationally intensive.
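
To make the resampling idea concrete, the sketch below estimates an error rate by leaving out one subject at a time, a jackknife-style scheme. It is an illustration only: the one-biomarker nearest-mean rule stands in for the discriminant procedure, which this passage does not specify, and all names are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

Subject = Tuple[float, bool]  # (biomarker value, actually acquired the condition)


def nearest_mean_label(train: List[Subject], x: float) -> bool:
    """Predict True if x lies closer to the mean of the affected group."""
    mean_d = mean(v for v, in_d in train if in_d)
    mean_not_d = mean(v for v, in_d in train if not in_d)
    return abs(x - mean_d) < abs(x - mean_not_d)


def leave_one_out_error(data: List[Subject]) -> float:
    """Refit without each subject in turn, then classify the held-out subject.

    Each group needs at least two members so that a mean can still be
    computed when one of its members is held out.
    """
    errors = 0
    for i, (x, in_d) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if nearest_mean_label(rest, x) != in_d:
            errors += 1
    return errors / len(data)
```

Every subject contributes both to fitting and to error estimation, at the cost of refitting once per subject, which is why such methods are computationally intensive.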

Step 2: Select the "training sample", the subset of the test population used in the statistical analyses leading to the discriminant procedure/probability estimation procedure, and its complementary subset, the "validation sample".

If the subsampling method is used, data from all subjects are used in the statistical analyses that lead to the specification of the discriminant procedure and/or the distribution of the probability estimates. In this case, the "training sample" is the entire test population.

If the training sample/validation sample method is used, the training sample contains approximately a specified proportion of the test population. In many cases the training sample proportion will be 50%, although other proportions may be used. The validation sample contains all subjects not included in the training sample.

The random assignment of subjects to the training sample is typically stratified by subject age. Subject ages are grouped into suitable intervals; an age stratum consists of all subjects whose age falls within a particular age interval. The intervals are chosen so that the number of subjects in each stratum is adequate for statistical analysis. Within each age stratum, subjects are randomly assigned to either the training sample or the validation sample. Randomization is used to achieve the specified proportion of subjects in the training sample. For example, if the training sample is specified to contain 75% of the test population, approximately 75% of the subjects within each age stratum are randomly assigned to the training sample. Thus, if one age stratum is specified as ages 65 and over but under 70, about 75% of the subjects in that stratum are randomly assigned to the training sample.
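
The age-stratified random assignment described above can be sketched as follows. The function name, the fixed-width age bands, and the seeded random generator are assumptions made for the illustration; the specification only requires that the intervals give each stratum enough subjects.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(
    ages: Dict[str, float],
    train_fraction: float,
    band_width: int = 5,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Randomly assign subjects to training or validation within each age band."""
    rng = random.Random(seed)
    strata: Dict[int, List[str]] = defaultdict(list)
    for subject_id, age in ages.items():
        # With band_width=5, ages 65 <= age < 70 share one stratum.
        strata[int(age // band_width)].append(subject_id)
    training: List[str] = []
    validation: List[str] = []
    for band in sorted(strata):
        members = sorted(strata[band])  # fixed order so the seed is reproducible
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        training.extend(members[:n_train])
        validation.extend(members[n_train:])
    return training, validation
```

Splitting inside each band, instead of across the whole population, keeps the age mix of the training and validation samples approximately the same.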

Step 3: Compile a list of potential biomarkers that are possible discriminators

이 단계의 목적은 잠재 바이오마커(potential biomarker)라고 하는 적당하고 잠재적으로 유용한 바이오마커를 모두 리스트 작성하는 것이다. 바람직한 실시예에 있어서, 잠재 바이오마커의 리스트는 시험 모집단에서 서브젝트의 기록된 정량적, 개인적인 모든 특성을 포함할 것이다. 이 리스트는 시간에 대해 변경되지 않는 특성(예컨대, 출생일)과 체중 또는 혈액이나 소변으로부터의 실험 평가와 같은 시간 종속적인 특성을 포함할 것이다. 비정량적 특성, 예컨대 서브젝트의 좋아하는 색깔의 이름은 배제될 것이다.The purpose of this step is to list all suitable and potentially useful biomarkers called potential biomarkers. In a preferred embodiment, the list of potential biomarkers will include all recorded quantitative and personal characteristics of the subject in the test population. This list will include characteristics that do not change over time (eg, date of birth) and time-dependent characteristics such as body weight or experimental evaluation from blood or urine. Nonquantitative properties, such as the name of the favorite color of the subject, will be excluded.

스텝 3에 기록된 일부 잠재적 바이오마커는 판별을 위해 유용하지 않을 것이다. 이 단계의 나머지 스텝들은 잠재 바이오마커의 스텝 3으로부터 "후보 바이오마커" 세트를 수집한다. 바이오마커가 잠재적으로 유용한 판별자라는 이전의 연구/지식으로부터의 정보와 또는 트레이닝 샘플 데이터로부터의 정량적 증거가 있기 때문에, 각각의 후보 바이오마커가 선택될 것이다. 각각의 스텝에서, 후보로서 선택된 바이오마커는 잠재 바이오마커의 리스트로부터 제거되어 후보 바이오마커의 세트로 옮겨진다. 잠재 바이오마커의 리스트로부터 선택된 후보 바이오마커를 제거하는 이유는 일단 바이오마커가 후보로서 선택되었으면 그것을 고려할 이유가 없다는 것이고, 이미 리스트가 만들어져 있다는 것이다. 과정의 끝에서, 선택되지 않은 모든 잠재 바이오마커는 다음의 고려사항으로부터 제거될 것이며, 후보 바이오마커만이 부가적인 분석에 사용될 것이다.Some potential biomarkers recorded in step 3 will not be useful for determination. The remaining steps of this step collect a set of "candidate biomarkers" from step 3 of the latent biomarkers. Each candidate biomarker will be selected because there is information from previous studies / knowledge that the biomarker is a potentially useful discriminator and quantitative evidence from training sample data. In each step, the biomarkers selected as candidates are removed from the list of potential biomarkers and transferred to the set of candidate biomarkers. The reason for removing a selected candidate biomarker from the list of potential biomarkers is that once the biomarker has been selected as a candidate, there is no reason to consider it, and the list is already made. At the end of the process, all unselected potential biomarkers will be removed from the following considerations, and only candidate biomarkers will be used for further analysis.

Step 4: Initialize the candidate biomarker set by including any potential biomarker that is confidently believed, on the basis of prior studies and experience, to be related to the specified biological condition.

The purpose of this step is to make use of prior information about biomarkers that are potentially important discriminators for the specified biological condition. For example, if the specified biological condition is having coronary heart disease (CHD) at a specified time, prior studies indicate that the values of serum cholesterol, systolic blood pressure, glucose intolerance, and cigarette smoking (to name only a few) are related to the onset of CHD and should be moved from the list of potential biomarkers to the list of candidate biomarkers.

Any reliable source of information, or an "educated guess", may be relied upon in selecting the subset of biomarkers known or believed to be related to the specified biological condition. The identity of the initially selected biomarkers is not critical to the identity of the subset finally selected for use in discrimination. Nevertheless, an initial selection that includes the biomarkers the system ultimately confirms as having the greatest statistical significance for predicting the specified biological condition will give faster convergence to the empirically determined subset. In other words, the better informed the initial selection, the faster the convergence.

Step 5: Add to the candidate biomarker set any potential biomarker that has a "statistically significant" correlation with a "known important" biomarker from Step 4.

Data from the training sample are used to compute a correlation coefficient between each previously identified candidate biomarker ("known important" biomarker) and each potential biomarker. Any statistically valid correlation coefficient may be used.

The purpose of identifying biomarkers is good discrimination. A correlate of a "known important" biomarker may turn out to be a better discriminator than the "known important" biomarker itself. At a minimum, the correlates of known important biomarkers should be included in the initial analyses.

If the specified biological condition is actually defined in terms of the values of one or more biomarkers (e.g., hypertension), the defining biomarkers are "known important" biomarkers and will have been moved to the list of candidate biomarkers in Step 4. The correlates of the defining biomarkers are moved to the list of candidate biomarkers in this step.

"Statistical significance" is used here only as a means of deciding between "probably important" and "probably not important" correlations. In a preferred embodiment, the usual p-value is computed for the correlation between a potential biomarker and a candidate biomarker. If p is smaller than some specified value, e.g., p < 0.05 or p < 0.01, the potential biomarker is moved to the candidate biomarker list.
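The correlation screen of Step 5 can be sketched as follows. The specification allows any statistically valid correlation coefficient; this sketch assumes the Pearson coefficient and approximates the two-sided p-value with the Fisher z-transformation (a large-sample normal approximation that needs more than three observations), which is a choice made here for illustration. The function names and the toy biomarker values are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_p_value(x, y):
    """Approximate two-sided p-value for H0: rho = 0 (Fisher z approximation)."""
    n = len(x)
    r = max(min(pearson_r(x, y), 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def screen_by_correlation(known_important, potential, alpha=0.05):
    """Names of potential biomarkers significantly correlated with any known one."""
    return [
        name
        for name, values in potential.items()
        if any(correlation_p_value(values, ref) < alpha
               for ref in known_important.values())
    ]


# Made-up training-sample values for ten subjects (illustration only).
known = {"serum_cholesterol": [180, 195, 210, 225, 240, 260, 275, 290, 305, 320]}
potential = {
    "ldl_cholesterol": [100, 112, 118, 135, 141, 160, 166, 180, 195, 201],
    "shoe_size": [42, 38, 44, 39, 41, 40, 43, 38, 42, 41],
}
moved_to_candidates = screen_by_correlation(known, potential)
```

In this toy data only the strongly correlated biomarker passes the screen; the threshold `alpha` plays the role of the specified p-value cutoff.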

Step 6: Fit a logistic regression model for each potential biomarker, using a binary indicator variable for the specified biological condition as the dependent variable (Y) and age and the potential biomarker as the independent variables (X). Add to the candidate biomarker list each potential biomarker that is "statistically significant" in its logistic regression model.

The purpose of this step is to select as candidate biomarkers those potential biomarkers that are related to the probability of having the specified biological condition after the (linear) effect of age has been taken into account. The logistic model expresses the probability of having the specified biological condition as a function of the value of the potential biomarker and the age of the subject.

A biomarker is selected (or not selected) on the basis of the marginal p-value for the slope of the biomarker in the logistic regression model. As with the correlations above, "statistical significance" is used here only as a means of deciding between "probably important" and "probably not important" discriminators. In a preferred embodiment, the usual p-value is computed for the slope of the potential biomarker. If p is smaller than some specified value, e.g., p < 0.05 or p < 0.01, the potential biomarker is moved to the candidate biomarker list.
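As a sketch of the model described in this step, the logistic regression for one potential biomarker can be written as follows. The symbols are illustrative assumptions rather than notation from the specification: $s_k$ is the binary indicator of the specified biological condition for subject $k$, $a_k$ is the subject's age, and $x_k$ is the value of the potential biomarker.

$$
\log\frac{P(s_k = 1)}{1 - P(s_k = 1)} = \beta_0 + \beta_1 a_k + \beta_2 x_k
$$

Under this formulation, the biomarker would be moved to the candidate list when the marginal test of $H_0\colon \beta_2 = 0$ gives a p-value below the chosen threshold (for example, 0.05).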

Step 7: Evaluate each longitudinally assessed potential biomarker with a general linear mixed model ("MixMod") to assess whether the longitudinal trend in the values of the biomarker is related to acquisition of the specified biological condition. Each potential biomarker with a statistically significant longitudinal trend is moved to the candidate biomarker list.

The purpose of this step is to identify biomarkers that have not already been advanced to candidate status but that have longitudinal trends related to the probability of having the specified biological condition.

In a typical embodiment of the invention, each model is specified as follows. The dependent variable (Y) in the MixMod comprises the longitudinal values of the potential biomarker. The independent variables (X) for the fixed effects are (1) a binary indicator variable for the specified biological condition; (2) age or another relevant longitudinal metameter, such as time since some relevant event or visit number; and (3) the interaction between the binary indicator variable for the specified biological condition and the longitudinal metameter. The random-effects portion of the model includes a random subject increment to the intercept of the population regression line and, in some cases, a random slope on the longitudinal metameter. When two or more random effects are included, the covariance matrix of the random effects is generally unstructured. Age or another relevant longitudinal metameter is included in the model for the same reason as in Step 6.

If the coefficient corresponding to any X variable other than age is statistically significant, the potential biomarker is moved to the candidate biomarker list. The remarks on statistical significance in Step 6 apply here as well.
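One way to write the Step 7 model for a single potential biomarker is given below. This is a sketch under assumed notation, none of which appears in the specification: $y_{kv}$ is the biomarker value of subject $k$ at assessment $v$, $s_k$ is the binary indicator of the specified biological condition, $t_{kv}$ is age or another longitudinal metameter, $d_{0k}$ is the random subject increment to the intercept, and $d_{1k}$ is the optional random slope.

$$
y_{kv} = \beta_0 + \beta_1 s_k + \beta_2 t_{kv} + \beta_3\, s_k t_{kv} + d_{0k} + d_{1k} t_{kv} + \epsilon_{kv}
$$

With age as the metameter, the selection rule then amounts to moving the biomarker to the candidate list when $\beta_1$ or $\beta_3$ is statistically significant, since $\beta_2$ is the coefficient of age itself.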

In Steps 4-7, all potential biomarkers have been examined, and each biomarker with historical or quantitative evidence of usefulness as a discriminator has been moved to the candidate biomarker list.

Phase II. Reduce the candidate biomarkers to a set of select biomarkers with discriminating power, and perform mixed-model estimation of the covariance structure and the predicted values.

Background: Conventional discriminant analysis typically requires reasonably accurate estimates of the mean vectors $\mu_i$ and the covariance matrices $\Sigma_i$ of the distributions of the biomarkers in the two groups, group $D$ ($i=1$) and group $\bar{D}$ ($i=2$). Typically $\mu_i$ is estimated as a simple sample mean (vector) and $\Sigma_i$ as a simple sample covariance matrix. These estimates do not allow the means to be adjusted for important concomitant variables (covariates) and do not easily accommodate repeated measurements from the same subject. In addition, conventional discriminant analysis is typically based on "casewise deletion": if a subject has any missing data, all of that subject's data are deleted from the analysis.

Given estimates of the mean vectors $\mu_i$ and the covariance matrices $\Sigma_i$, and given a subject's biomarkers (and related data) in a vector $Y$, the usual discriminant function (linear if $\Sigma_1 = \Sigma_2$, quadratic if $\Sigma_1 \neq \Sigma_2$) is evaluated from $Y$, $\mu_1$, $\mu_2$, $\Sigma_1$ and $\Sigma_2$. The only information specific to the particular subject is in the vector $Y$.
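For reference, the textbook forms of these two discriminant functions under multivariate normality are shown below. The specification does not write them out, so this is a sketch of the standard log-density-ratio formulation with equal prior probabilities assumed; positive values favor group $D$ ($i=1$).

$$
Q(Y) = -\tfrac{1}{2}(Y-\mu_1)'\Sigma_1^{-1}(Y-\mu_1) + \tfrac{1}{2}(Y-\mu_2)'\Sigma_2^{-1}(Y-\mu_2) - \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|}
$$

$$
L(Y) = (\mu_1-\mu_2)'\,\Sigma^{-1}\left(Y - \tfrac{1}{2}(\mu_1+\mu_2)\right) \qquad \text{when } \Sigma_1 = \Sigma_2 = \Sigma
$$

The linear form follows from the quadratic one because the terms quadratic in $Y$ and the log-determinant term cancel when the two covariance matrices are equal.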

The mixed-model procedure, which makes up most of Phase II, improves on the conventional procedure by using general linear mixed models (MixMod) to model all of $\mu_1$, $\mu_2$, $\Sigma_1$ and $\Sigma_2$ (the discriminant function uses model-based estimates of these parameters rather than the usual simple, non-model estimates). This MixMod procedure offers the following important improvements over conventional discriminant analysis:

The parameters are estimated using mixed models that:

use all available data, i.e., do not use casewise deletion;

support covariate adjustment of the estimated expected values ($\mu_i$), with corresponding adjustment of the estimated covariance matrices $\Sigma_i$; and

support the use of repeated measurements (e.g., from annual visits) from the same subject.

The MixMod procedure also uses model-based estimates of individual random effects and "BLUPs" (Best Linear Unbiased Predictors) in addition to, or in place of, the estimates of the population means $\mu_i$, which can improve the discriminating performance of the discriminant function.

Outline of the Phase II Procedure

As a result of Phase I, each candidate biomarker has historical or quantitative evidence of usefulness as a discriminator. However, there may be substantial correlation among the candidate biomarkers. Thus a biomarker may have real discriminating power by itself yet contribute little when used in combination with other biomarkers. In addition, the scales of the biomarkers can vary widely.

The objectives of Phase II of the subject procedure are to:

(1) rescale the biomarker values so that the standard deviations of all rescaled biomarkers are of the same order of magnitude (0 < standard deviation ≤ 1);

(2) reduce the possibly long list of candidate biomarkers to a smaller number of "Select Biomarkers", each of which contributes substantially to the discriminating power of the set;

(3) determine the structure of the expected value of the vector $Y$ of (rescaled) biomarker values using a linear model of the form $E[Y] = X\beta$, and estimate the vector $\beta$ of unknown parameters;

(4) determine the structure of the covariance matrix of the vector $Y$ of (rescaled) biomarker values using a model of the form $\Sigma = Z\Delta Z' + V$, and estimate the unknown covariance parameters in the matrices $\Delta$ and $V$; and

(5) estimate the random subject effect vectors $d_{ik}$ and compute the predicted value vector $Y_{ik}^{(p)}$ of the $k$-th subject as if that subject came from the $i$-th specified biological condition group ($i=1$ corresponds to group $D$ and $i=2$ to group $\bar{D}$).

In a typical embodiment of the invention, Step 1 of this phase is executed once, to rescale the biomarker data and arrange the data into one data vector (or one variable in a dataset). Steps 2 and 3 are executed iteratively until the set of select biomarkers has been chosen and the estimates listed above have been computed. Step 4 refines the mixed model and the parameter estimates to be used for discrimination by selecting an appropriate model for the covariance matrix.
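Objectives (3) to (5) can be collected into the standard general linear mixed model for the data vector of subject $k$. The specification gives only the mean and covariance forms, so the subject-level equation and the estimator for the random effects below are a sketch based on the usual mixed-model results and not formulas quoted from the source; hats denote estimates.

$$
Y_k = X_k\beta + Z_k d_k + \epsilon_k, \qquad E[Y_k] = X_k\beta, \qquad V(Y_k) = \Sigma_k = Z_k \Delta Z_k' + V_k
$$

$$
\hat{d}_k = \hat{\Delta} Z_k' \hat{\Sigma}_k^{-1}\left(Y_k - X_k\hat{\beta}\right), \qquad \hat{Y}_k^{(p)} = X_k\hat{\beta} + Z_k\hat{d}_k
$$

Evaluating $X_k$ with the Biological Condition Status indicator set to each group in turn would give the group-specific quantities $d_{ik}$ and $Y_{ik}^{(p)}$ referred to in objective (5).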

Step 1: Create a dataset in which a single variable, "RespScal", contains the scaled values (including longitudinal measurements) of all candidate biomarkers from all subjects.

Scaling is performed separately for each biomarker. Each biomarker value is divided by the sample standard deviation of that biomarker, so the standard deviation of the scaled values of each biomarker is 1.00. In a typical embodiment of the invention, the single variable holding the biomarkers is named "RespScal", an abbreviation of "Response-Scaled". The sample standard deviation of RespScal is also approximately 1.00. This scaling facilitates convergence of the iterative procedure in the subsequent mixed-model computations.
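A minimal sketch of this scaling step is shown below. It assumes the data are held in "long" form, one record per subject, biomarker and visit; the record layout, the function name `build_respscal`, and the toy values are assumptions made for illustration. Only the variable name RespScal comes from the specification.

```python
from statistics import stdev


def build_respscal(records):
    """Add a RespScal field: value divided by that biomarker's sample SD.

    records is a list of dicts with keys subject, visit, biomarker, value.
    The scaling is done separately for each biomarker.
    """
    values_by_biomarker = {}
    for rec in records:
        values_by_biomarker.setdefault(rec["biomarker"], []).append(rec["value"])
    sd = {name: stdev(vals) for name, vals in values_by_biomarker.items()}
    return [dict(rec, RespScal=rec["value"] / sd[rec["biomarker"]])
            for rec in records]


# Toy data: two subjects, two visits each, two biomarkers on different scales.
raw = {
    "cholesterol": [180.0, 210.0, 240.0, 200.0],
    "weight_kg": [60.0, 72.0, 81.0, 95.0],
}
records = [
    {"subject": k // 2 + 1, "visit": k % 2 + 1, "biomarker": name, "value": v}
    for name, vals in raw.items()
    for k, v in enumerate(vals)
]
long_data = build_respscal(records)
```

After scaling, every biomarker has unit sample standard deviation within the stacked variable, whatever its original units were.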

Step 1 is executed only once. Initially, every candidate biomarker has data in RespScal and is regarded as a member of the set of select biomarkers. Non-discriminating biomarkers are removed from the select biomarkers in Steps 2 and 3.

Step 2: Fit a general linear mixed model (MixMod) using the following specifications, and obtain estimates of the parameter matrices $\beta$, $\Delta$ and $V$. Obtain estimates of each subject's random subject effects $d_{ik}$ and predicted values $Y_{ik}^{(\min)}$ and $Y_{ik}^{(\mathrm{avg})}$ under the assumption that the subject is in each specified biological condition group $i = 1, 2$.

In a typical embodiment of the invention, the specifications of the MixMod are as follows:

Dependent (Y) variable: RespScal;

Independent (X) variables and their coefficients ($\beta$):

"Biological Condition Status": an indicator variable (classification variable) for the status of the specified biological condition; Biological Condition Status is 1 if the corresponding element of $Y$ contains information on a subject from group $D$, and 0 otherwise;

Biomarker indicator variables (classification variables);

Biological Condition Status × Biomarker indicator variables (classification variables);

Age (in years, centered approximately at the overall mean age of the subjects; continuous variable);

Random-effects variables ($Z_k$) and random coefficient effects ($d_{ik}$):

Subject × Biomarker indicator variables (part of $Z_k$) and the corresponding random effects (intercept increments; part of $d_{ik}$). The random subject effect for a particular biomarker is constant across that subject's multiple visits, which induces correlation among the repeated measurements of that biomarker for the subject.

Note that the model assumes $E[d_{ik}] = 0$ and $V[d_{ik}] = \Delta$.

Covariance matrix $V_{kb} = V(\epsilon_{kb})$ of the vector $\epsilon_{kb}$ of random error terms for the $k$-th subject's longitudinal assessments of the $b$-th scaled candidate biomarker. This covariance matrix has one row and one column for each longitudinal assessment of each biomarker for the $k$-th subject. Note that the model also assumes $E[\epsilon_{kb}] = 0$.

The primary interpretation of $\epsilon_{kbv}$ is as a "random measurement error term" that represents the deviation, from one assessment to another, of the value of the scaled candidate biomarker about subject $k$'s age-dependent mean value for that biomarker. Under this interpretation it is reasonable to assume that the $\epsilon_{kbv}$ are homoscedastic and uncorrelated, i.e., that $\mathrm{Cov}(\epsilon_{kbv}, \epsilon_{k'b'v'}) = 0$ if $(k,b,v) \neq (k',b',v')$. If the elements of $Y$ are sorted by $k$ (subject ID), $b$ (biomarker ID) and $v$ ("visit", i.e., the assessment number or age of the subject), then in many cases a reasonable model for $V_k$ is $V_k = \mathrm{BlockDiag}(V_{kb}) = \mathrm{BlockDiag}(V_{k1}, V_{k2}, \ldots)$, where $V_{kb} = \lambda_b I$ and $\lambda_b = V(\epsilon_{kbv})$ is the variance of the measurement error for the scaled values of the $b$-th candidate biomarker. The variance is assumed to be the same for all subjects ($k$) and all assessments ($v$).

Note that the scaling of RespScal implies that each variance $\lambda_b$ will be less than 1.00. How far below 1.00 a variance falls depends on the magnitude of the fixed effects (a higher $R^2$ leads to a smaller estimated variance) and on the magnitude of the random effects (the diagonal elements of $\Delta$).

Note that the above combination of $Z_k$, $d_k$, $V_k = \mathrm{BlockDiag}(V_{kb})$ and $V_{kb} = \lambda_b I$ yields a highly structured, extended compound-symmetry model for $\Sigma_{ik}$. To illustrate the point for the case in which the same variance parameters apply to both group $D$ and group $\bar{D}$, let $d_k = [d_{kb}] = [d_{k1}, d_{k2}, \ldots]$ be the vector of random effects for the $k$-th subject and the $b$-th scaled biomarker; let $V(d_k) = \Delta = [\delta_{bb'}]$, where $\delta_{bb'} = \mathrm{Cov}(d_{kb}, d_{kb'})$ and $b$ and $b'$ may index different scaled biomarkers; let $Z_k$ contain the indicator variables for the scaled biomarkers; and let $V_{kb} = \lambda_b I$. Then $\Sigma_k = Z_k \Delta Z_k' + V_k = [\Sigma_{k,b,b'}]$, where $\Sigma_{k,b,b} = \delta_{bb} J + \lambda_b I$ is the covariance matrix of the multiple measurements of scaled biomarker $b$, and $\Sigma_{k,b,b'} = \delta_{bb'} J$ ($b \neq b'$) is the covariance of scaled biomarkers $b$ and $b'$ assessed on the same or different occasions (each element of the matrix $J$ is 1).
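The block structure just described can be made concrete with a short sketch that assembles $\Sigma_k$ for one subject from $\Delta$, the variances $\lambda_b$, and the number of assessments of each biomarker. The function name, the argument layout, and the numerical values are assumptions chosen for illustration.

```python
def subject_covariance(delta, lam, n_visits):
    """Build Sigma_k = Z_k Delta Z_k' + V_k for one subject.

    delta: B x B nested list with delta[b][c] = Cov(d_kb, d_kc).
    lam: list of B measurement-error variances lambda_b.
    n_visits: list of B assessment counts for this subject.
    Rows and columns are ordered by biomarker, then by visit.
    """
    # Each element of Y belongs to one biomarker; this index plays the
    # role of the indicator matrix Z_k.
    owner = [b for b, n in enumerate(n_visits) for _ in range(n)]
    size = len(owner)
    sigma = [[delta[owner[r]][owner[c]] for c in range(size)]
             for r in range(size)]
    for r in range(size):
        # V_k = BlockDiag(lambda_b I) only contributes to the diagonal.
        sigma[r][r] += lam[owner[r]]
    return sigma


# Two scaled biomarkers: three assessments of the first, two of the second.
delta = [[0.50, 0.20],
         [0.20, 0.40]]
lam = [0.30, 0.45]
sigma_k = subject_covariance(delta, lam, n_visits=[3, 2])
```

Each diagonal block has the form $\delta_{bb} J + \lambda_b I$ and each off-diagonal block is $\delta_{bb'} J$, so the whole matrix is described by the entries of $\Delta$ plus one variance per biomarker, however many assessments a subject has.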

Fitting the mixed model produces estimates of the following:

The model parameters $\beta$ and $\Delta$ and the parameters of $V_k$. If the model assumes different covariances for the two biological condition status groups, it produces separate estimates of the covariance parameters in $\Delta_i$ and $V_{ik}$.

The expected value $\mu_{ik}$ of each subject's data vector (subject $k$ is in biological condition status group $i$).

The expected value $\mu_{i'k}$ of each subject's data vector under the assumption that the subject is in the other response group ($i'$).

Each subject's random subject effects $d_{ik}$ in the subject's actual group ($i$), and $d_{i'k}$ as if the subject were in the other response group ($i'$).

Each subject's "predicted values" $Y_{ik}^{(p)}$ in the subject's actual group ($i$), and $Y_{i'k}^{(p)}$ as if the subject were in the other response group ($i'$).

The subject's covariance matrix $\Sigma_k$. If the model assumes different covariances for the two biological condition status groups, it produces separate estimates of the covariance matrices $\Sigma_{ik}$.

Step 3: Delete the biomarker with the least apparent discriminating power and refit the mixed model.

A biomarker that will be an effective discriminator must have a large (statistically significant) Biological Condition Status × Biomarker fixed effect. In contrast, a large Biomarker main effect is not relevant here: a large Biomarker main effect, which indicates differences among the biomarker means, can arise simply because the biomarkers are different types of variables and have different means (even on the rescaled axes). A large Biological Condition Status × Biomarker effect, on the other hand, indicates that the mean of a biomarker for [Biological Condition Status = 0] (group $\bar{D}$) is substantially different from the mean of the same biomarker for [Biological Condition Status = 1] (group $D$).

If every select biomarker has a statistically significant Biological Condition Status × Biomarker fixed effect, Step 3 is complete; proceed to Step 4. If one or more of the current select biomarkers has a statistically non-significant Biological Condition Status × Biomarker fixed effect, the biomarker with the least statistically significant (largest p-value) Biological Condition Status × Biomarker fixed effect is removed from the data vector $Y$, and the MixMod is refit to the reduced data vector.
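The elimination loop of Steps 2 and 3 can be sketched as follows. Fitting the mixed model itself is outside the scope of this sketch, so it is represented by a caller-supplied function; `fit_interaction_p_values`, `toy_fit`, and all p-values shown are hypothetical stand-ins and are not results from the specification.

```python
def backward_eliminate(biomarkers, fit_interaction_p_values, alpha=0.05):
    """Drop the least significant biomarker and refit until all are significant.

    fit_interaction_p_values is a hypothetical callable that refits the
    mixed model on the given biomarkers and returns a dict mapping each
    biomarker to the p-value of its Biological Condition Status x Biomarker
    fixed effect.
    """
    selected = list(biomarkers)
    while selected:
        p_values = fit_interaction_p_values(selected)
        worst = max(selected, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # every remaining biomarker is significant: Step 3 is done
        selected.remove(worst)  # reduce the data vector, then refit
    return selected


def toy_fit(selected):
    """Stand-in for a real mixed-model fit; the p-values are made up."""
    p = {"cholesterol": 0.001, "systolic_bp": 0.03, "height": 0.60, "weight": 0.20}
    if "height" not in selected:
        # Pretend weight becomes significant once correlated height is removed.
        p["weight"] = 0.04
    return {name: p[name] for name in selected}


select_biomarkers = backward_eliminate(
    ["cholesterol", "systolic_bp", "height", "weight"], toy_fit
)
```

Refitting after each removal matters because the candidate biomarkers are correlated: in the toy example a biomarker that looks non-significant at first is retained once a correlated biomarker has been removed.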

Step 3 implements an analog of the "backward elimination" procedure from the stepwise regression context. An alternative is to implement an analog of "forward selection", in which only a few clearly effective discriminators (biomarkers) are initially included in the data vector and the model, and one biomarker is added at each subsequent step.

Step 4: Determine the structure of the covariance parameter matrices $\Delta_i$ and $V_{ik}$.

The discriminant analysis method uses both the expected values of the biomarkers and the covariance matrices of the biomarkers (some of which are assessed longitudinally), separately for each biological condition status group $D$ and $\bar{D}$. Recall that the list of select biomarkers, including possible longitudinal assessments, was finalized in Step 3. As noted above, the MixMod includes assumptions that lead to the following structure for the covariance matrix: $\Sigma_{ik} = Z_{ik} \Delta_i Z_{ik}' + V_{ik}$, where $i$ indexes the biological condition status group ($i=1$ for group $D$ and $i=2$ for group $\bar{D}$) and $k$ indexes the subject. In addition, the covariance parameter matrices $\Delta_i$ and $V_{ik}$ will have structure that can be exploited in the analysis, particularly when $\Sigma_{ik}$ is very large, i.e., when there are many biomarkers and/or many longitudinal assessments of one or more biomarkers.

The objective of Step 4 is to determine the structure of the covariance parameter matrices $\Delta_i$ and $V_{ik}$ for use in the Phase III discriminant analysis. Estimates of large, structured covariance parameter matrices will be more accurate than estimates of unstructured ones. More accurate estimates of $\Delta_i$ and $V_{ik}$ lead to more accurate estimates of $\Sigma_{ik} = Z_{ik} \Delta_i Z_{ik}' + V_{ik}$, and hence to more accurate estimates of $\beta$, $d_{ik}$ and $Y_{ik}^{(p)}$ and more accurate values of the discriminant function.

_ik의 전체적인 구조는 다음 타입의 분산/상관을 고려하여야만 한다: The overall structure of _ik must take into account the following types of variance / correlation:

ADB 타입 : 동일 시각에 평가된 상이한 바이오마커 중에서의 분산/상관;ADB type: dispersion / correlation among different biomarkers evaluated at the same time;

ALESB 타입 : 단일 바이오마커의 시계열 평가 중에서의 분산/상관;ALESB type: variance / correlation in time series evaluation of a single biomarker;

BTBEL 타입 : 시계열적으로 평가된 2개의 바이오마커간의 분산/상관, 즉 서로 다른 시간에 평가된 임의의 한쌍의 바이오마커간의 분산/상관.BTBEL type: dispersion / correlation between two biomarkers evaluated in time series, ie, dispersion / correlation between any pair of biomarkers evaluated at different times.

In a representative embodiment of the invention, the structures described in Step 2, or extensions of those structures, will be useful.

In a representative embodiment of the invention, the variance/correlation structure for multivariate longitudinal data is developed using the technique described by Tangen, Catherine M. and Helms, Ronald W. in "A case study of the analysis of multivariate longitudinal data using mixed (random effects) model," presented at the 1996 Spring Meeting of the International Biometric Society, Eastern North American Region, Richmond, Virginia, USA, March 1996. Selecting a covariance model typically involves fitting a number of MixMods that use the same expected-value model and differ in the covariance model. The models can be compared by means of log-likelihood statistics (assuming an underlying normal distribution). The covariance structures will also be compared graphically, using the techniques described by Ronald W. Helms and Grady, J. J. (University of North Carolina, USA) in "Model Selection Techniques for the Covariance Matrix for Incomplete Longitudinal Data," Statistics in Medicine, 14, 1397-1416.

Stage III: Compute the discriminant function using the estimated mean values and predicted values, and compute the logistic predicted value for each subject; estimate the error rates of the discriminant function

Background. The purpose of Stage III is to "predict" to which "population" a subject belongs, that is, group D or group $\bar{D}$.

Group D: the subpopulation of persons who will acquire the specified biological condition within the specified time period.

Group $\bar{D}$: the subpopulation of persons who will not acquire the specified biological condition within the specified time period.

A subject is classified by placing the subject into one of the following two groups.

Group PD: the group of persons who, at the beginning of the specified time period, are predicted to acquire the specified biological condition within that period, that is, persons predicted to belong to group D. These persons are described as having a specified high probability of acquiring the specified biological condition within the specified time period.

Group $P\bar{D}$: the group of persons who, at the beginning of the specified time period, are predicted not to acquire the specified biological condition within that period, that is, persons predicted to belong to group $\bar{D}$. These persons are described as having a specified low probability of acquiring the specified biological condition within the specified time period.

A second objective is to estimate the probability that a subject belongs to group D and to group $\bar{D}$.

The technique for achieving the first objective, classifying a subject into one of the two groups, uses a discrimination procedure that is a variant of conventional discriminant analysis. Estimates of the probability that a subject is in the group of subjects who will acquire the specified biological condition are obtained from variants of conventional logistic regression that use (1) the discriminant function value as the regressor and (2) the discriminant variables as the regressors.

As explained in the background to Stage II, prior-art discriminant analysis methodology typically uses simple estimates of the mean vectors $\mu_i$ and covariance matrices $\Sigma_i$ of the biomarker distributions in the two groups. Moreover, prior-art discriminant analysis is based on "casewise deletion": if a subject has any missing data, all of that subject's data are deleted from the analysis.

The mixed-model procedure described in Stage II improves on the conventional procedure by using general linear mixed models (MixMod) to model all of $\mu_1$, $\mu_2$, $\Sigma_1$, and $\Sigma_2$; the modeled estimates of these parameters are used in the discriminant function in place of the usual simple, unmodeled estimates. Through the use of the mixed model, the procedure gains the following important improvements over conventional discriminant analysis. The parameters are estimated using all available data; that is, no casewise deletion is used. The procedure supports covariate adjustment of the estimated expected values ($\mu_i$), with corresponding adjustment of the estimated covariance matrices $\Sigma_i$. And the procedure supports the use of repeated measurements from the same subject (for example, from annual visits).

More importantly, the use of the mixed model allows the procedure to use model-based estimates of the individual random effects and "BLUPs" (Best Linear Unbiased Predictors), in addition to or in place of the estimates of the population means $\mu_i$, which can increase the discriminating power of the discriminant function.

The form of the present discrimination is identical to that of conventional discrimination based on multivariate normality. Some notation is useful:

$f_i$ denotes the density function of the distribution of the vector Y of discriminant variables for subjects from group i, evaluated using the estimates of $\mu_i$ and $\Sigma_i$, where i=1 for group D or PD and i=2 for group $\bar{D}$ or $P\bar{D}$;

$p_i$ denotes the prior probability that a subject comes from group i, where i=1 for group D and i=2 for group $\bar{D}$. The values of $p_i$ may be known from historical data or other studies. If the values of $p_i$ are unknown, the proportions of subjects in the two groups will be used as estimates of $p_i$.

A subject of unknown group membership with vector Y of discriminant variable values will be classified into group 1 (group PD) if $\ln[f_1(Y)/f_2(Y)] > \ln[p_2/p_1]$, and into group 2 (group $P\bar{D}$) otherwise.

In Stage II it will have been determined whether it is reasonable to assume that the two groups have the same covariance matrix, that is, $\Sigma_1 = \Sigma_2 = \Sigma$. If so, the discrimination procedure reduces to the use of a linear discriminant function of the form:

$$D(Y) = \left[Y - \tfrac{1}{2}(\mu_1 + \mu_2)\right]' \Sigma^{-1} (\mu_1 - \mu_2) - \ln\left[p_2/p_1\right]$$

Here $\mu_i$ and $\Sigma$ are replaced by the "appropriate" estimates described below, and D(Y) is compared with 0. If Stage II determines that $\Sigma_1 \neq \Sigma_2$, the discrimination procedure reduces to the use of a quadratic discriminant function of the form:

$$Q(Y) = \tfrac{1}{2}\ln\left(|\Sigma_2|/|\Sigma_1|\right) - \tfrac{1}{2}(Y-\mu_1)' \Sigma_1^{-1} (Y-\mu_1) + \tfrac{1}{2}(Y-\mu_2)' \Sigma_2^{-1} (Y-\mu_2) - \ln\left[p_2/p_1\right]$$

Here $\mu_i$ and $\Sigma_i$ are replaced by the "appropriate" estimates described below, and Q(Y) is compared with 0.

In both cases, the "appropriate" estimates come from the Stage II mixed-model procedure and may or may not include random subject effects.
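
The linear and quadratic discriminant functions above can be sketched in Python as follows. This is a minimal illustration restricted to two discriminant variables so that the 2 x 2 inverse and determinant can be written out directly; the function names and the numeric inputs are hypothetical, and the means and covariance matrices are passed in as plain numbers rather than taken from a fitted mixed model.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    return a * d - b * c

def inv2(m):
    """Inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = det2(m)
    return [[d / det, -b / det], [-c / det, a / det]]

def bilinear(u, m, v):
    """Compute u' m v for 2-vectors u, v and a 2x2 matrix m."""
    return sum(u[r] * m[r][c] * v[c] for r in range(2) for c in range(2))

def linear_discriminant(y, mu1, mu2, sigma, p1, p2):
    """D(Y) = [Y - (mu1 + mu2)/2]' Sigma^-1 (mu1 - mu2) - ln(p2/p1)."""
    centered = [y[j] - 0.5 * (mu1[j] + mu2[j]) for j in range(2)]
    diff = [mu1[j] - mu2[j] for j in range(2)]
    return bilinear(centered, inv2(sigma), diff) - math.log(p2 / p1)

def quadratic_discriminant(y, mu1, mu2, sigma1, sigma2, p1, p2):
    """Q(Y) for unequal covariance matrices, term by term as in the text."""
    r1 = [y[j] - mu1[j] for j in range(2)]
    r2 = [y[j] - mu2[j] for j in range(2)]
    return (0.5 * math.log(det2(sigma2) / det2(sigma1))
            - 0.5 * bilinear(r1, inv2(sigma1), r1)
            + 0.5 * bilinear(r2, inv2(sigma2), r2)
            - math.log(p2 / p1))

def classify(score):
    """Group 1 (predicted D) if the score is >= 0, otherwise group 2."""
    return 1 if score >= 0 else 2

# Hypothetical inputs: equal priors, identity covariance.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_value = linear_discriminant([0.5, 0.0], [1.0, 0.0], [-1.0, 0.0],
                              identity, 0.5, 0.5)
q_value = quadratic_discriminant([0.5, 0.0], [1.0, 0.0], [-1.0, 0.0],
                                 identity, identity, 0.5, 0.5)
print(d_value, q_value, classify(d_value))
```

When the two covariance matrices passed to the quadratic function are equal, the log-determinant term vanishes and the two functions return the same value, which is a quick consistency check on the formulas.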

Stage III procedure. The steps of Stage III of the procedure are described below. Assume that data are available from one or more "new" subjects, that is, subjects whose group membership is unknown and who were not used in the Stage II mixed-model calculations. In Steps 1 and 2, one subject is considered at a time. Some additional notation is useful. Let i=1 for groups D and PD, and i=2 for groups $\bar{D}$ and $P\bar{D}$.

Y denotes the vector of discriminant variable values for one new subject. The elements of Y are scaled in the same way as RespScal was scaled in Stage II.

$X_i$ denotes the matrix of values of the independent variables used in the final Stage II mixed model, assuming that the subject is in group i (i=1, 2). The rows of $X_i$ correspond to the rows (elements) of Y.

$Z_i$ denotes the matrix of values of the random-effects variables used in the final Stage II mixed model, assuming that the subject is in group i (i=1, 2). The rows of $Z_i$ correspond to the rows of Y.

$\hat{\Delta}_i$ denotes the estimated covariance matrix of the random effects from group i (i=1, 2) in the final Stage II mixed model. In many cases the mixed model reduces to a single covariance matrix of the random effects, that is, $\hat{\Delta}_1 = \hat{\Delta}_2 = \hat{\Delta}$.

$\hat{V}_i$ denotes the estimated covariance matrix of the random residual terms, or "error terms", from group i (i=1, 2) in the final Stage II mixed model. In many cases the mixed model reduces to a single such covariance matrix, that is, $\hat{V}_1 = \hat{V}_2 = \hat{V}$.

$\hat{\Sigma}_i$ denotes the estimated covariance matrix of Y from the final Stage II mixed model, assuming that the new subject comes from group i (i=1, 2). In many cases the mixed model reduces to a single covariance matrix, that is, $\hat{\Sigma}_1 = \hat{\Sigma}_2 = \hat{\Sigma}$.

Step 1: Using the results from the Stage II mixed models, classify every subject in the validation sample and estimate the error rates of several candidate discrimination procedures: one based on "estimated values" and the others based on "predicted values" that use various combinations of the estimated random subject effects. The procedure with the lowest estimated error rate is selected and is referred to as the "most reliable procedure".

If the original study population was divided into a "training sample" and a "validation sample", use the validation sample in what follows; otherwise, use the training sample as the validation sample. For each subject in the validation sample, compute the following quantities separately under the assumption that the subject comes from each group.

$\hat{Y}_i$ -- the "estimated value" of Y, assuming that the subject comes from group i (i=1, 2).

$\hat{d}_i$ -- the estimate of the subject's random subject effect, assuming that the subject comes from group i (i=1, 2).

$\hat{d}_{\min}$ -- whichever of $\hat{d}_1$ and $\hat{d}_2$ is smaller; it is regarded as the "minimum value", or "minimum (across groups) random subject effect", estimate of $\hat{d}_1$ and $\hat{d}_2$.

$\hat{d}_{\mathrm{avg}}$ -- regarded as the "average value", or "average (across groups) random subject effect", estimate of $\hat{d}_1$ and $\hat{d}_2$.

$\hat{Y}_i^{(p),\min}$ -- the subject's "predicted value", assuming that the subject comes from group i (i=1, 2) but using the "minimum" random subject effect estimate.

$\hat{Y}_i^{(p),\mathrm{avg}}$ -- the subject's "predicted value", assuming that the subject comes from group i (i=1, 2) but using the "average" random subject effect estimate.

In the above and in what follows, i=1 for group D or group PD, and i=2 for group $\bar{D}$ or group $P\bar{D}$.

Classification based on the estimated values $\hat{Y}_i$

If the determination $\Sigma_1 = \Sigma_2 = \Sigma$ was made in Stage II, evaluate the linear discriminant function D(Y), replacing $\mu_i$ with $\hat{Y}_i$ and $\Sigma$ with $\hat{\Sigma}$. If D(Y) ≥ 0, assign the subject to group 1 (group PD); otherwise, assign the subject to group 2 (group $P\bar{D}$).

If the determination $\Sigma_1 \neq \Sigma_2$ was made in Stage II, evaluate the quadratic discriminant function Q(Y), replacing $\mu_i$ with $\hat{Y}_i$ and $\Sigma_i$ with $\hat{\Sigma}_i$. If Q(Y) ≥ 0, assign the subject to group 1 (group PD); otherwise, assign the subject to group 2 (group $P\bar{D}$).

Classification based on the "minimum" random subject effects and predicted values $\hat{Y}_i^{(p),\min}$

Classification based on the "average" random subject effects and predicted values $\hat{Y}_i^{(p),\mathrm{avg}}$

After every subject in the validation sample (as described above) has been classified, compute for each of the three procedures (based on estimated values or on predicted values) a 2×2 table similar to the following:

Number of subjects in the validation sample, tabulated by actual and classified membership in group D

Subject is actually a member of group | Subject classified as a member of group $P\bar{D}$ | Subject classified as a member of group PD
$\bar{D}$ | $n_{TN}$ = number of true negative classifications | $n_{FP}$ = number of false positive classifications
D | $n_{FN}$ = number of false negative classifications | $n_{TP}$ = number of true positive classifications

In addition, compute the following separately for the classification based on estimated values and for the classifications based on predicted values:

$r_{FP}$ = false positive error rate = proportion of false positive classifications

$r_{FN}$ = false negative error rate = proportion of false negative classifications

$r_{tot}$ = total error rate = proportion of incorrect classifications
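
The 2×2 tabulation and the three error rates can be sketched in Python as follows. The denominators are an assumption, since the text only describes the rates as proportions: here the false positive rate is taken among subjects actually outside group D, the false negative rate among subjects actually in group D, and the total rate over all subjects. The function names and the eight example subjects are hypothetical.

```python
def tabulate(actual_in_d, classified_pd):
    """Count true/false negatives and positives from paired booleans."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for actual, classified in zip(actual_in_d, classified_pd):
        if actual:
            counts["TP" if classified else "FN"] += 1
        else:
            counts["FP" if classified else "TN"] += 1
    return counts

def error_rates(counts):
    """Error rates under the assumed denominators described above."""
    total = sum(counts.values())
    return {
        "r_FP": counts["FP"] / (counts["FP"] + counts["TN"]),
        "r_FN": counts["FN"] / (counts["FN"] + counts["TP"]),
        "r_tot": (counts["FP"] + counts["FN"]) / total,
    }

# Hypothetical validation sample of eight subjects.
actual = [True, True, True, False, False, False, False, False]
classified = [True, True, False, True, False, False, False, False]

counts = tabulate(actual, classified)
rates = error_rates(counts)
print(counts)
print(rates)
```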

In a representative embodiment of the invention, the three types of classification procedure, namely the one based on the estimated values $\hat{Y}_i$, the one based on the "minimum" predicted values $\hat{Y}_i^{(p),\min}$, and the one based on the "average" predicted values $\hat{Y}_i^{(p),\mathrm{avg}}$, are compared to determine the "apparent most reliable procedure". Some considerations in the selection include the following:

If a false negative classification has more serious consequences than a false positive classification, select the procedure with the smaller false negative error rate $r_{FN}$. This situation would arise, for example, if group D were the subpopulation of persons who suffer a myocardial infarction ("MI") within a specified five-year age interval. A false negative classification, which is a failure to warn a person who has a high probability of MI, would have more serious consequences than a false positive classification, which is a warning of high MI probability given to a person whose probability is low.

Conversely, if a false positive classification has more serious consequences than a false negative classification, select the procedure with the smaller false positive error rate $r_{FP}$.

If there is no a priori reason to assign greater seriousness to either false negative or false positive classifications, select the procedure with the smaller total error rate $r_{tot}$.

The procedure selected as the most reliable procedure is used to classify subjects into the two groups PD and $P\bar{D}$.
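
The selection considerations above can be expressed as a short Python sketch. The procedure labels, the `worse_error` argument, and the error rates are hypothetical values chosen only to illustrate the rule; they are not results from the specification.

```python
def select_procedure(rates_by_procedure, worse_error=None):
    """Pick the procedure with the smallest relevant error rate.

    worse_error is "false_negative", "false_positive", or None when
    there is no a priori reason to weight one kind of error more.
    """
    key = {
        "false_negative": "r_FN",
        "false_positive": "r_FP",
        None: "r_tot",
    }[worse_error]
    return min(rates_by_procedure,
               key=lambda name: rates_by_procedure[name][key])

# Hypothetical error rates for the three candidate procedures.
candidate_rates = {
    "estimated": {"r_FP": 0.10, "r_FN": 0.30, "r_tot": 0.20},
    "minimum":   {"r_FP": 0.25, "r_FN": 0.12, "r_tot": 0.19},
    "average":   {"r_FP": 0.15, "r_FN": 0.20, "r_tot": 0.17},
}

print(select_procedure(candidate_rates, "false_negative"))
print(select_procedure(candidate_rates, "false_positive"))
print(select_procedure(candidate_rates))
```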

Step 2: Use two types of logistic regression to compute estimates of the probability that a new subject belongs to each group.

Data from the training sample are used to fit a logistic regression model in which the value of the discriminant function for each subject [D(Y) in the linear case, Q(Y) in the quadratic case] is the independent ("X") variable and biological condition status (an indicator variable for membership in group D) is the dependent ("Y") variable. The model, together with the inverse logistic transformation, is used to compute for each subject an estimate of the probability that the subject belongs to group D.
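
A minimal Python sketch of this first logistic regression follows. It fits an intercept and a slope for a single regressor (the discriminant function value) by Newton-Raphson iteration, which is one standard way to obtain the maximum likelihood estimates; the function names, the fixed iteration count, and the eight training values are illustrative assumptions.

```python
import math

def inverse_logit(eta):
    """Inverse logistic transformation: maps a linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(scores, in_group_d, iterations=25):
    """Fit P(D) = inverse_logit(a + b * score) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(scores, in_group_d):
            p = inverse_logit(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p              # score for the intercept
            g1 += (y - p) * x        # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical discriminant function values and group D indicators.
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
in_group_d = [0, 0, 0, 1, 0, 1, 1, 1]

a_hat, b_hat = fit_logistic(scores, in_group_d)
probabilities = [inverse_logit(a_hat + b_hat * x) for x in scores]
print(a_hat, b_hat)
print(probabilities)
```

The example data are deliberately not perfectly separated by the score, because the maximum likelihood estimates do not exist for perfectly separated data and the iteration would not settle.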

In a separate calculation, data from the training sample are used to fit a logistic regression model in which the biomarkers used in the discriminant function, together with the covariates of the final mixed model (the variables in X), are the independent ("X") variables and biological condition status (an indicator variable for membership in group D) is the dependent ("Y") variable. In addition to yielding the usual logistic regression model estimates, the model is used, together with the inverse logistic transformation, to compute for each subject the estimated probability that the subject belongs to group D. When longitudinal data are used, the model is used to estimate the probability that the subject belongs to group D at the end of the specified period. The generalized estimating equations method with a logistic link function can be used to account for the correlation among multiple binomial outcomes from one subject.

The predicted probabilities from these two models can provide an interpretation of the discriminant function values.

Although the subject algorithm is a preferred embodiment for determining the discriminant function to be used for a subject, the algorithm is provided to illustrate a preferred embodiment of the invention and is not intended to limit the invention to the steps or substeps of the algorithm. For example, the art of discriminant analysis includes other types of discriminant function, such as so-called "optimal discrimination", and other types of regression, such as nonlinear mixed models, and these fall within the scope of the invention.

The invention is described below with reference to specific embodiments and to materials, apparatus, and process steps that are exemplary. In particular, the invention is not limited to the particular statistical methods, materials, conditions, process parameters, apparatus, and the like that are described.

Example of a Preferred Embodiment

The accompanying tables and figures show the results of an example data analysis that uses the methods and procedures of the invention.

The data used as the basis for this example come from a database of sickle cell patients from whom data are obtained on an annual basis. Some patients have data from three consecutive annual visits. However, because patients ordinarily cannot be compelled to participate every year, the database includes many patients for whom data are available from only one or two annual visits. The database information used here includes demographic data, clinical chemistry data, and hematology data.

The specific biological condition of interest (the "disease" or "affliction") in this example is the occurrence of a painful crisis requiring hospitalization. At each annual visit, the subject is asked whether he or she has recently had a painful crisis requiring hospitalization (and the records are checked to confirm this). Each subject recorded at any visit (in any year) as having been hospitalized for a painful crisis is a member of the "diseased" group (group D), and all other subjects are members of group $\bar{D}$.

Whenever a subject has recently experienced a painful crisis requiring hospitalization, all data collected after the hospitalization, in that year or in any later year, are excluded from the analysis. This is consistent with the procedure that would be used when the outcome is death or the onset of a chronic or incurable disease. The variable that records a subject's membership in group D (for example, diseased or not, afflicted or not) is referred to as the "disease status" variable.

The following is an example of the statistical analysis procedure using sickle cell data. For reasons of confidentiality, the data used in this example are artificial and do not come from an actual study or from actual subjects. However, the data resemble data that could be obtained in a study of actual subjects.

Stage I. Determine the evaluation method and select the biomarkers to be considered

Step 1: Select a methodology for estimating the error rates of the procedure.

Step 2: Select a "training sample", a subset of the test population to be used in the statistical analyses that lead to the discrimination/probability-estimation procedure, and a "validation sample", the complementary subset.

For this example, the training sample/validation sample method is selected. Patients are randomly assigned to one of the two samples. The training sample is used to construct the discriminant function, and the validation sample is used to assess the accuracy of the discriminant function.
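
The random assignment of patients to the two samples can be sketched in Python as follows. The split is made by subject, so that all annual visits of one patient fall in the same sample. The 50% training fraction, the fixed seed, and the function name are assumptions made for illustration; the text does not state the proportion used.

```python
import random

def split_subjects(subject_ids, training_fraction=0.5, seed=0):
    """Randomly assign each subject to the training or validation sample."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * training_fraction)
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])

# Hypothetical subject identifiers.
subjects = list(range(1, 11))
training, validation = split_subjects(subjects)
print(training)
print(validation)
```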

The training sample contains information from 641 "annual" evaluations of 481 subjects, that is, about 1.3 annual evaluations per subject. However, not every biomarker is evaluated at every visit. For example, only 88 values of direct bilirubin (variable L_DBILI) are available, from only 80 subjects.

Step 3: Compile a list of potential biomarkers, that is, potential discriminators.

In this case, blood pressure and all available demographic, clinical chemistry, and hematology data are used as potential discriminators. The potential biomarkers are listed in Table 2.

Step 4: Begin the set of candidate biomarkers by including any potential biomarkers that, on the basis of prior research and development, are believed to be related to the specified biological condition.

In this example, platelet count is taken as a "known" biomarker for disease status.

Step 5: Add to the list of candidate biomarkers any potential biomarker that is "statistically significantly" correlated with the "known important" biomarkers from Step 4.

Biomarkers that are correlated with platelet count, the "known important" biomarker from Step 4, are selected. A summary of these correlations is shown in Table 3 under the heading "Correlation w/ Platelet Count". The "p" column gives the p-value for the correlation with platelet count. Biomarkers are selected on the basis of the p-value for the Pearson product-moment correlation coefficient. In this example, p<0.01 is required for selection. The "p<cv" column indicates, by the presence of the word "YES", the biomarkers that become candidate biomarkers as a result of a "significant" correlation with platelet count.
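
The correlation screen can be illustrated with a minimal Python sketch of the Pearson product-moment correlation coefficient and its usual t statistic with n-2 degrees of freedom. Converting the t statistic to a p-value requires a Student t distribution function, which is left out here; the data and function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical paired values: platelet count and another biomarker.
platelet = [1.0, 2.0, 3.0, 4.0, 5.0]
other = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(platelet, other)
t = correlation_t_statistic(r, len(platelet))
print(r, t)
```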

Step 6: For each biomarker, a logistic model is fit with disease status as the dependent (Y) variable and the combination of age and the biomarker as the independent (X) variables. In this case, for each biomarker the logistic model assessed how well the probability of hospitalization for a painful crisis was explained by that biomarker together with the subject's age. Roughly speaking, the regression coefficient, or slope, of the biomarker in the logistic regression will be approximately zero if there is no relationship between the biomarker and the probability that the subject acquires the specified biological condition; a nonzero slope indicates that there is a relationship. A summary of the logistic regression results is shown in Table 3 in the columns headed "Logistic Regression". The "p" column gives the p-value for the biomarker's regression coefficient. Biomarkers are selected on the basis of the p-value for the biomarker's slope in the logistic regression model. In this example, p<0.01 is required for selection. The "p<cv" column indicates, by the word "YES", the biomarkers that become candidate biomarkers as a result of a "significant" logistic regression coefficient. Some of these biomarkers are significantly correlated with platelet count and had already become candidate biomarkers before the logistic regressions were computed.

Step 7: Evaluate each longitudinally evaluated potential biomarker using a general linear mixed model ("MixMod") to assess whether longitudinal trends in the biomarker's values are related to acquisition of the specified biological condition.

For each biomarker, a mixed model is fit with the longitudinal values of the biomarker as the dependent (Y) variable and age, disease status, and the visit × disease status interaction as the independent (X) variables. (Visit and disease status are "classification" variables, whose coefficients are increments to the intercept; age, by contrast, is a continuous variable whose coefficient is a slope.) The random-effects part of the mixed model accounts for the correlation among longitudinal measurements from the same subject. The model allows the number of visits (longitudinal evaluations) to vary from subject to subject.

A biomarker is selected if either the disease status "main effect" or the subvector of the three visit × disease status interaction coefficients is statistically significantly different from 0 (p < 0.01). A significant disease status main effect indicates that the mean biomarker value for subjects in group D differs from the mean for subjects in group D̄. A significant subvector of the three visit × disease status interaction coefficients indicates that the trend over time in biomarker values for subjects in group D differs from the trend over time for subjects in group D̄. In either case (significant main effect or significant interaction), the result indicates that the biomarker is a potentially useful discriminator and should be moved to the list of candidate biomarkers. The mixed model results are shown under the heading "Mixed model" in Table 3. Separate results are shown for the main effect and the interaction, in a format similar to that used for the correlation and logistic regression results.

On completion of Steps 4 through 7, every potential biomarker has been examined, and each biomarker with temporal or quantitative evidence of usefulness as a discriminator has been moved to the list of candidate biomarkers. Candidate biomarkers are marked with the word "YES" in the column of Table 3 headed "Selected".
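The way the "Selected" column follows from the individual screens can be sketched as a simple rule: a biomarker becomes a candidate as soon as any one screen gives p < cv. The sketch below assumes that reading of the text; the marker names, the screen labels, and all p-values are hypothetical and are not taken from Table 3.

```python
CV = 0.01  # critical value for every screen in the example

def is_candidate(p_values, cv=CV):
    """True if at least one screening test is significant at level cv."""
    return any(p < cv for p in p_values.values())

# Hypothetical p-values per biomarker and per screen.
table3 = {
    "MARKER_A": {"correlation": 0.20, "logistic": 0.004,
                 "mixed_main": 0.03, "mixed_interaction": 0.50},
    "MARKER_B": {"correlation": 0.40, "logistic": 0.09,
                 "mixed_main": 0.02, "mixed_interaction": 0.60},
    "MARKER_C": {"correlation": 0.30, "logistic": 0.08,
                 "mixed_main": 0.40, "mixed_interaction": 0.002},
}

selected = [name for name, p in table3.items() if is_candidate(p)]
print(selected)
```

`MARKER_B` is not selected even though one of its p-values is below 0.05, because the example requires p < 0.01.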

Stage II. Reduce the candidate biomarkers to a set of selected biomarkers that have discriminatory power, and perform mixed-model estimation of the covariance structure and predicted values

Step 1: Create a data set in which the variable "RespScal" contains the scaled values (including longitudinal measurements) of all candidate biomarkers from all subjects

This step was carried out for the example, but its results are not shown. In effect, all the values of all the different biomarkers are arranged in a single vector Y, which may contain a large number of elements.
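A minimal sketch of such a stacked data set is given below. It assumes that "scaled" means standardizing each biomarker to mean 0 and standard deviation 1 across all observations; the text does not state the scaling rule, so this is an assumption. The record layout, the marker names, and the values are hypothetical; only the variable name `RespScal` comes from the text.

```python
from statistics import mean, pstdev

# Hypothetical raw data: (subject, visit) -> {biomarker: value}; None = missing.
raw = {
    (1, 1): {"MARKER_A": 4.0, "MARKER_B": 10.0},
    (1, 2): {"MARKER_A": 6.0, "MARKER_B": None},
    (2, 1): {"MARKER_A": 8.0, "MARKER_B": 30.0},
}

def stack_scaled(raw):
    """Return one row per observed (subject, visit, biomarker) value.

    Each value is standardized within its biomarker (assumed scaling rule),
    so all biomarkers can share the single response variable RespScal.
    """
    names = sorted({name for rec in raw.values() for name in rec})
    stats = {}
    for name in names:
        vals = [rec[name] for rec in raw.values() if rec.get(name) is not None]
        stats[name] = (mean(vals), pstdev(vals))
    rows = []
    for (subject, visit), rec in sorted(raw.items()):
        for name in names:
            value = rec.get(name)
            if value is None:
                continue  # missing measurements simply contribute no row
            m, s = stats[name]
            rows.append({"subject": subject, "visit": visit,
                         "biomarker": name, "RespScal": (value - m) / s})
    return rows

rows = stack_scaled(raw)
for row in rows:
    print(row)
```

The long vector Y of the text corresponds to the `RespScal` entries read down the rows.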

Step 2: Fit a general linear mixed model (MixMod) using the specifications listed below; obtain estimates of the parameter matrices β, Δ, and V; and obtain, for each subject, the random subject effects d_ik and the predicted values Y_ik^(min) and Y_ik^(avg), computed under the assumption that the subject is in each specified biological condition group i = 1, 2

Step 3: Refit the mixed model after deleting the biomarker with the least apparent discriminatory power

Steps 2 and 3 are repeated until every biomarker in the mixed model is statistically significant. To simplify the description of the example, only the final result of iterating Steps 2 and 3 is described. Steps 2 and 3 reduced the number of biomarkers to 15, with age as a fixed-effect covariate.
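The loop over Steps 2 and 3 can be sketched as follows. This is an illustrative skeleton only: `fit_p_values` is a hypothetical stand-in for refitting the mixed model and returning one p-value per biomarker × group interaction, and the marker names and p-values are invented. A stopping level of 0.05 is assumed here. Age is a covariate rather than a biomarker, so it is not part of the list being pruned.

```python
def backward_eliminate(biomarkers, fit_p_values, alpha=0.05):
    """Drop the least significant biomarker and refit until all are significant."""
    current = list(biomarkers)
    while current:
        p = fit_p_values(current)                   # Step 2: (re)fit the model
        worst = max(current, key=lambda name: p[name])
        if p[worst] <= alpha:                       # everything left is significant
            break
        current.remove(worst)                       # Step 3: delete weakest biomarker
    return current

# Hypothetical p-values. A real refit would change the remaining p-values
# each time a biomarker is removed; this stand-in keeps them fixed.
HYPOTHETICAL_P = {"MARKER_A": 0.001, "MARKER_B": 0.30,
                  "MARKER_C": 0.04, "MARKER_D": 0.12}

def fit_p_values(names):
    return {name: HYPOTHETICAL_P[name] for name in names}

selected = backward_eliminate(HYPOTHETICAL_P, fit_p_values)
print(selected)
```

With the values above, `MARKER_B` is removed first, then `MARKER_D`, and the loop stops once the largest remaining p-value is at most 0.05.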

General information about the example mixed model is given in Table 4. Data were available from 481 patients, with up to three visits per patient. Note that many observations were not used in the analysis: artificial observations with missing Y values were created so that the software would compute the required predicted values. Artificial observations with missing Y values do not affect the parameter estimates or the predictions of the random subject effects.

Table 5 gives the estimates of the fixed effects from the mixed model. The p-value for each biomarker (for example, the p-value for "L_BUN") is the p-value for testing the hypothesis that the mean value of that biomarker equals the overall mean averaged over all biomarkers. That these p-values are significant is of little interest: one would expect the mean of one biomarker's values to differ from the mean of another's.

In Table 5, the p-value for each "biomarker X GROUP IA" interaction (for example, the p-value for "ALBUMIN X GROUP IA") is the p-value for testing whether the mean value of the biomarker in group D differs from its mean value in group D̄. A significant value (for example, p < 0.05) indicates that the biomarker should be a good discriminator. All interactions in the final model shown in Table 5 are statistically significant (all p ≤ 0.05). Age is retained in the model even when its p-value is not significant.

Table 6 shows the subject, biomarker, disease status ("group"), visit, observed values, and predicted values for subject 447. This subject is in group D̄ ("Group D?" = NO; rows with "Group D?" = YES have no "RESPSCAL" value) but has predicted values for both groups. The subject has no data for the biomarkers MCH and MCHC at the second visit, but does have model-based predicted values for MCH and MCHC at visit 2.

The method used in Steps 2 and 3 is an analog of the "backward elimination" procedure in the context of stepwise regression. An alternative is an analog of "forward selection": start with only two (or very few) apparently useful discriminators (biomarkers) in the model and add one biomarker at each subsequent step.

Step 4: Determine the structure of the covariance parameter matrices Δ and V_ik

As described above, the overall structure of V_ik must account for three types of variance/correlation:

ADB type: variance/correlation among different biomarkers evaluated at the same time

ALESB type: variance/correlation among the longitudinal evaluations of a single biomarker

BTBEL type: variance/correlation between two biomarkers evaluated longitudinally, that is, between any pair of biomarkers evaluated at different times

In the example, the following structure was finally adopted:

The same random-effects covariance parameter matrix for both groups D and D̄, that is, Δ₁ = Δ₂ = Δ, and

Δ has compound symmetry structure, with δ_ii = 0.6669 and δ_ij = 0.0097 for i ≠ j,

ADB-type covariance within the matrix V, the same for both groups D and D̄ and with compound symmetry structure, that is, V_ii = 0.3267 and V_ij = 0.0151 for i ≠ j.

This covariance structure is reasonable for the sickle cell data at hand.
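The compound symmetry structure described above can be written out explicitly. The sketch below builds Δ and V for the 15 biomarkers of the final model from the reported diagonal and off-diagonal values. The last step, adding the two matrices to obtain the covariance among biomarkers measured at one visit, rests on the added assumption (not stated in the text) that the random subject effect enters each biomarker additively; the helper name `compound_symmetric` is hypothetical.

```python
def compound_symmetric(n, diag, off_diag):
    """n x n matrix with one common variance and one common covariance."""
    return [[diag if i == j else off_diag for j in range(n)] for i in range(n)]

N_BIOMARKERS = 15  # biomarkers retained in the final mixed model

delta = compound_symmetric(N_BIOMARKERS, 0.6669, 0.0097)  # random subject effects
v = compound_symmetric(N_BIOMARKERS, 0.3267, 0.0151)      # within-visit errors

# Assumed marginal covariance at a single visit: Delta + V.
total = [[d + e for d, e in zip(d_row, v_row)] for d_row, v_row in zip(delta, v)]

print(round(total[0][0], 4), round(total[0][1], 4))
```

Under that assumption the total variance of each scaled biomarker is close to 1, which is what one would expect if the biomarkers were standardized before modeling.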

The estimates of Δ and V are shown in Table 7. The estimate of the covariance matrix Δ of the random subject effects is shown in the upper part of the table. Its rows and columns correspond to the 15 biomarkers used in this model, which are identified in the column labels.

The estimate of the covariance matrix V of the within-subject, within-visit errors is shown in the lower part of the table. As with Δ, its rows and columns correspond to the 15 biomarkers used in this model. V has a compound symmetry structure, which is reasonable for the scaled data.

Stage III. Compute discriminant functions using the estimated means and predicted values, compute logistic-regression predicted values for each subject, and estimate the error rates of the discriminant functions

Step 1: Use the results from the Stage II mixed model to classify every subject in the validation sample, and estimate the error rates of a number of candidate discriminant procedures: one based on "estimated values" and others based on "predicted values" that use various combinations of the estimated random subject effects. The procedure with the lowest estimated error rate becomes the selected procedure and is referred to as the "apparently most reliable procedure".

The method is applied here using the mixed-model results for the sickle cell data. Because the covariance parameter matrices are modeled as identical for group D and group D̄, each discriminant is a linear discriminant. Each discriminant is applied to the subjects in the training sample (used here as the validation sample) to assign each subject to one of the groups PD and PD̄.
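A linear discriminant with a common covariance matrix can be sketched for two biomarkers as follows. This is a textbook form under equal prior probabilities, not necessarily the exact function used in the example. The sign convention (negative scores assigned to PD) is chosen to match the figures, where 0 separates the groups and group D tends to score lower. The mean vectors, the covariance matrix, and the function names are hypothetical.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def linear_discriminant(y, mean_d, mean_dbar, cov):
    """Score = (mean_dbar - mean_d)' cov^{-1} (y - midpoint of the two means)."""
    inv = inverse_2x2(cov)
    diff = [mb - md for md, mb in zip(mean_d, mean_dbar)]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(md + mb) / 2.0 for md, mb in zip(mean_d, mean_dbar)]
    return sum(wi * (yi - mi) for wi, yi, mi in zip(w, y, mid))

def classify(y, mean_d, mean_dbar, cov):
    """Assign to PD when the score is below 0, otherwise to PD_bar."""
    return "PD" if linear_discriminant(y, mean_d, mean_dbar, cov) < 0 else "PD_bar"

# Hypothetical estimates for two scaled biomarkers.
MEAN_D = [-0.5, 0.4]
MEAN_DBAR = [0.5, -0.4]
COV = [[1.0, 0.2], [0.2, 1.0]]   # common to both groups, hence a linear rule

print(classify([-0.3, 0.1], MEAN_D, MEAN_DBAR, COV))
```

Replacing the estimated group means with subject-specific predicted values (means plus estimated random subject effects) gives the prediction-based variant that the text compares against.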

The evaluation of the linear discriminant function based on estimated values is shown in Table 8. Of the 179 subjects in group D̄ (disease status = "NO"), 100 (56%) were correctly classified by the discriminant into group PD̄, and 79 (44%) were incorrectly classified into group PD. Of the 262 subjects in group D (disease status = "YES"), 188 (72%) were correctly classified into group PD, and 74 (28%) were incorrectly classified into group PD̄. Overall, 288 of the 441 subjects (65%) were classified correctly and 35% were classified incorrectly.
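The percentages quoted above follow directly from the four counts. The short check below recomputes them; the counts are those given in the text, and the dictionary layout and variable names are only illustrative.

```python
# (actual group, classified group) -> number of subjects, from the text
counts = {
    ("D_bar", "PD_bar"): 100,
    ("D_bar", "PD"): 79,
    ("D", "PD"): 188,
    ("D", "PD_bar"): 74,
}

total = sum(counts.values())
correct = counts[("D_bar", "PD_bar")] + counts[("D", "PD")]
misclassified = total - correct

n_dbar = counts[("D_bar", "PD_bar")] + counts[("D_bar", "PD")]
n_d = counts[("D", "PD")] + counts[("D", "PD_bar")]

print(f"group D_bar correct: {counts[('D_bar', 'PD_bar')] / n_dbar:.0%}")
print(f"group D correct:     {counts[('D', 'PD')] / n_d:.0%}")
print(f"overall error rate:  {misclassified}/{total} = {misclassified / total:.0%}")
```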

Table 9 shows the evaluation of the linear discriminant function based on predicted values computed with the minimum random subject effect. Table 9 has the same layout as Table 8. The prediction-based discriminant shows slightly better discrimination in group D̄ but slightly worse results in group D. Overall, the error rates are nearly the same.

The classification/misclassification statistics in the preceding paragraphs and in Tables 8 and 9 are biased. Because the training sample was used both to derive and to evaluate the discriminant functions, the tables give more optimistic misclassification estimates than are likely to occur in practice. Evaluating the discriminant functions on a separate validation sample would give unbiased estimates of the misclassification rates. Resampling techniques such as jackknifing or bootstrapping can produce less biased estimates while still using only the data from the training sample.
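The optimism of the resubstitution estimate, and the way a leave-one-out (jackknife-style) estimate reduces it, can be illustrated on a toy problem. The sketch assumes a deliberately simple one-dimensional rule (cut at the midpoint of the two group means) in place of the mixed-model discriminant; the scores and function names are hypothetical.

```python
def midpoint_rule(d_values, dbar_values):
    """Classify as D below the midpoint of the two group means, else D_bar."""
    mean_d = sum(d_values) / len(d_values)
    mean_dbar = sum(dbar_values) / len(dbar_values)
    cut = (mean_d + mean_dbar) / 2.0
    return lambda x: "D" if x < cut else "D_bar"

def resubstitution_error(d_values, dbar_values):
    """Error rate when the rule is evaluated on the data used to build it."""
    rule = midpoint_rule(d_values, dbar_values)
    wrong = sum(rule(x) != "D" for x in d_values)
    wrong += sum(rule(x) != "D_bar" for x in dbar_values)
    return wrong / (len(d_values) + len(dbar_values))

def leave_one_out_error(d_values, dbar_values):
    """Error rate when each subject is classified by a rule built without it."""
    wrong = 0
    for i, x in enumerate(d_values):
        rule = midpoint_rule(d_values[:i] + d_values[i + 1:], dbar_values)
        wrong += rule(x) != "D"
    for i, x in enumerate(dbar_values):
        rule = midpoint_rule(d_values, dbar_values[:i] + dbar_values[i + 1:])
        wrong += rule(x) != "D_bar"
    return wrong / (len(d_values) + len(dbar_values))

# Hypothetical scores; 4.4 is a borderline member of group D.
D = [1.0, 2.0, 3.0, 4.4]
D_BAR = [5.0, 7.0, 8.0, 6.0]

resub = resubstitution_error(D, D_BAR)
loo = leave_one_out_error(D, D_BAR)
print(resub, loo)
```

Here the borderline subject is classified correctly only when it helps define the cut point, so the resubstitution error is lower than the leave-one-out error.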

Step 2: Use two types of logistic regression to compute estimates of the probability that a new subject belongs to each group

Two types of logistic regression are fit to the training sample data. In every case the disease status indicator is the dependent ("Y") variable. The first type uses the value of a discriminant function as the independent ("X") variable, with one regression for each discriminant function: in the first logistic regression, X is the value of the discriminant function based on estimated values; in the second, X is the value of the discriminant function based on predicted values. In the second type (the third logistic regression), the biomarkers used in the discriminant functions are included as independent ("X") variables, together with the covariates used in the fixed-effects part of the mixed model. The estimates from the logistic models are used to compute, for each subject, the estimated probability that the subject belongs to the disease group (disease status "YES"). The results of the logistic regression computations are not shown in the tables.
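Once such a logistic regression has been fit, turning a discriminant score into an estimated probability is a one-line calculation. The sketch below shows that step for the first type of regression; the intercept and slope are hypothetical placeholders (the fitted values are not reported in the text), and the slope is taken to be negative because lower scores are associated with group D.

```python
import math

def disease_probability(score, intercept, slope):
    """Estimated P(disease status = YES) from a fitted logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))

# Hypothetical coefficients, for illustration only.
INTERCEPT = 0.4
SLOPE = -1.5

for score in (-1.0, 0.0, 1.0):
    p = disease_probability(score, INTERCEPT, SLOPE)
    print(f"score {score:+.1f} -> estimated probability {p:.2f}")
```

A subject far below 0 on the discriminant scale receives a high estimated probability of belonging to the disease group, while borderline subjects receive probabilities near the middle of the range.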

FIG. 1 shows the empirical distribution functions ("EDF") of the linear discriminant function values (based on estimated values) for group D (solid line) and group D̄ (dotted line). To construct the graph, the subjects' data are sorted by disease status group and, within group, by increasing value of the discriminant function D(Y). The data points are plotted in that order. Each EDF starts at 0 (before the first subject's data are plotted) and increases by 1/n for each subject, where n is the number of subjects in the group. Thus the EDF rises from 0 to 1 separately for each group. In FIG. 1, the fact that the EDF for group D is shifted to the left of the EDF for group D̄ indicates that group D tends to have lower scores than group D̄. About 72% of group D has values less than 0 (the dividing point between groups PD and PD̄), whereas group D̄ has about 44% of its subjects' values to the left of 0. The steepness of the group D̄ EDF near the vertical line at LDF = 0 indicates that many subjects are "borderline" and not easy to classify. If additional years of follow-up were available, a number of the subjects in group D̄ (in these data) would have a pain crisis in a subsequent year and would "switch" to group D.
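The construction of each EDF can be sketched directly from that description: sort the scores within a group and step up by 1/n at each subject. The scores below are hypothetical and the function names are illustrative.

```python
def empirical_distribution(values):
    """Return (score, EDF height) pairs: the EDF rises by 1/n at each subject."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

def fraction_below(values, cut=0.0):
    """Share of a group whose discriminant score lies to the left of the cut."""
    return sum(v < cut for v in values) / len(values)

# Hypothetical discriminant scores for four subjects per group.
scores_d = [-2.0, -1.2, -0.4, 0.3]
scores_dbar = [-0.5, 0.2, 0.9, 1.6]

edf_d = empirical_distribution(scores_d)
edf_dbar = empirical_distribution(scores_dbar)
print(edf_d)
print(fraction_below(scores_d), fraction_below(scores_dbar))
```

The fraction of each group to the left of 0 is the quantity read off the figure (about 72% for group D and about 44% for group D̄ in the example data).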

The empirical distribution functions ("EDF") of the minimum-random-subject-effect linear discriminant function values for group D (solid line) and group D̄ (dotted line) are shown in FIG. 2. The results and their interpretation are similar to those for FIG. 1. However, the EDF for group D̄ is steeper near LDF = 0 in FIG. 2 than in FIG. 1, indicating that many subjects are borderline.

These figures, like the statistics above, indicate that the discriminant procedures classify subjects who will require hospitalization for pain effectively but are less effective for the subgroup who will not be hospitalized, because of the limited data available for this example.

Table 2 describes the potential biomarkers for the sickle cell data.

Tables 3a and 3b show the selection of candidate biomarkers from the list of potential biomarkers using Stage I, Steps 5-7: correlation, logistic regression, and mixed models.

Table 4 gives the mixed model information and overall model characteristics.

Tables 5a and 5b give the estimates of the fixed-effect coefficients and related statistics.

Tables 6a, 6b, 6c, and 6d give the predicted values and related statistics for subject 447.

Tables 7a and 7b give the estimates of the covariance matrices.

Table 8 gives the evaluation of the discriminant procedure that uses estimated values.

Table 9 gives the evaluation of the discriminant procedure that uses predicted values.

"p < cv" means "the p-value is less than the critical value", where cv = 0.01.

A blank in the "p < cv" column means "NO".

All p-values are rounded to 2 decimal places.

Table 8. Number of subjects in the validation sample, tabulated by actual and classified membership in group D

                                    Classified as PD̄ (NO)    Classified as PD (YES)
Actually in group D̄ (NO)                    100                       79
Actually in group D (YES)                     74                      188

r_tot = 153/441 = 35%

Table 9. Number of subjects in the validation sample, tabulated by actual and classified membership in group D. Columns: subject classified as a member of group PD̄ (NO) or PD (YES). Rows: subject actually a member of group D̄ (NO) or D (YES).

r_tot = 155/441 = 35%

Claims

A computer-based system for predicting the future health of an individual, comprising:

(a) a computer having a processor containing a database of longitudinally acquired biomarker values from individual members of a test population, in which a subpopulation D of the members is identified as having acquired a specified biological condition within a specified time period or age interval, and a subpopulation D̄ of the members is identified as not having acquired the specified biological condition within the specified time period or age interval; and

(b) a computer program comprising the steps of (1) selecting, from the biomarkers, a subset of biomarkers for discriminating between members belonging to subpopulations D and D̄, based on the distributions of the biomarker values of the individual members of the test population; and

(2) using the distributions of the selected biomarkers to develop a statistical procedure that can be used to (i) classify members of the test population as belonging either to a subpopulation PD having a prescribed high probability of acquiring the specified biological condition within the specified time period or age interval, or to a subpopulation PD̄ having a prescribed low probability of acquiring the specified biological condition within the specified time period or age interval, or

(ii) quantitatively estimate, for each member of the test population, the probability of acquiring the specified biological condition within the specified time period or age interval.

The computer-based system of claim 1, wherein the statistical procedure comprises a discriminant function that uses an estimated mean vector and an estimated covariance matrix of the distribution of biomarker values in subpopulations D and D̄.

The computer-based system of claim 2, wherein the estimates of the parameters of the distributions of the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker data from the test population.

The computer-based system of claim 2, wherein (a) the estimated mean vector is modeled as a vector-valued function of expected-value parameters or covariates, or

(b) the estimated covariance matrix is modeled as a matrix-valued function of covariance parameters or covariates.

The computer-based system of claim 4, wherein the estimates of the parameters of the distributions of the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker data from the test population.

The computer-based system of claim 4, wherein the estimated mean vector or probability includes an estimate of the realized value of the random subject effect vector for the member being classified or for whom the probability is being estimated.

The computer-based system of claim 6, wherein the estimates of the parameters of the distributions of the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker data from the test population.

A computer-based system for predicting the future health of an individual, comprising:

(a) a computer having a processor containing a database of biomarker values from individual members of a test population, in which a subpopulation D of the members is identified as having acquired a specified biological condition within a specified time period or age interval, and a subpopulation D̄ of the members is identified as not having acquired the specified biological condition within the specified time period or age interval; and

a computer program comprising:

(2) using the distributions of the selected biomarkers to develop a statistical procedure that can be used to (i) classify members of the test population as belonging either to a subpopulation PD having a prescribed high probability of acquiring the specified biological condition within the specified time period or age interval, or to a subpopulation PD̄ having a prescribed low probability of acquiring the specified biological condition within the specified time period or age interval, or,

the statistical procedure comprising a discriminant function that uses an estimated mean vector and an estimated covariance matrix of the distribution of biomarker values in subpopulations D and D̄.

The computer-based system of claim 8, wherein the estimates of the parameters of the distributions of the selected biomarkers are obtained by fitting a general linear mixed model to the biomarker data from the test population.

The computer-based system of claim 9, wherein (a) the estimated mean vector is modeled as a vector-valued function of expected-value parameters or covariates, or

The computer-based system of claim 10, wherein the estimated mean vector or probability includes an estimate of the realized value of the random subject effect vector for the member being classified or for whom the probability is being estimated.

A method of predicting an individual's health, comprising:

collecting a plurality of biomarker values from the individual, at least one of the biomarker values being obtained by physically measuring the biomarker; and

applying a statistical procedure to the plurality of biomarker values to (i) classify the individual as having either a prescribed high probability of acquiring a specified biological condition within a specified time period or age interval or a prescribed low probability of acquiring the specified biological condition within the specified time period or age interval, or (ii) quantitatively estimate the individual's probability of acquiring the specified biological condition within the specified time period or age interval,

wherein the statistical procedure is based on:

(1) collecting a database of longitudinally acquired biomarker values from individual members of a test population, in which a subpopulation D of the members is identified as having acquired the specified biological condition within the specified time period or age interval, and a subpopulation D̄ of the members is identified as not having acquired the specified biological condition within the specified time period or age interval;

(2) selecting, from the biomarkers, a subset of biomarkers for discriminating between members belonging to subpopulations D and D̄, based on the distributions of the biomarker values of the individual members of the test population; and

(3) using the distributions of the selected biomarkers to develop the statistical procedure.

The method of claim 12, wherein at least one of the biomarker values is obtained from a biological sample.

The method of claim 13, wherein the biological sample is a serum sample or urine sample.

A computer-based system for predicting the future health of an individual, comprising:

(a) a computer having a processor containing a plurality of biomarker values from an individual;

(b) collecting a plurality of biomarker values from the individual, at least one of the biomarker values being obtained by physically measuring the biomarker;

a computer program comprising applying a statistical procedure to the plurality of biomarker values to (i) classify the individual as having either a prescribed high probability of acquiring a specified biological condition within a specified time period or age interval or a prescribed low probability of acquiring the specified biological condition within the specified time period or age interval, or (ii) quantitatively estimate the individual's probability of acquiring the specified biological condition within the specified time period or age interval,

The statistical procedure is,

The computer-based system of claim 15, wherein the plurality of biomarker values from the individual are longitudinally acquired biomarker values.

The computer-based system of claim 15, wherein the specified biological condition is death from specified underlying causes of death within a specified time period or age interval.

The computer-based system of claim 15, wherein the specified biological condition is a specified morbidity within a specified time period or age interval.

The computer-based system of claim 15, wherein the specified time period is at least two years.

The computer-based system of claim 15, wherein the specified time period is at least three years.

A method of determining an individual's future risk of death from specified underlying causes of death, comprising:

applying a statistical procedure to a plurality of biomarker values to determine whether the individual is classified as having a prescribed high probability of dying, within a specified time period or age interval, from any of the underlying causes of death that account for at least 60% of all deaths in a test population over the specified time period or age interval.

A method of determining evidence of an individual's health, comprising:

applying a statistical procedure to a plurality of biomarker values to determine whether the individual is classified as having a prescribed high probability of not dying, within a specified time period or age interval, from any of the underlying causes of death that account for at least 60% of all deaths in a test population over the specified time period or age interval.

A computer-based system for determining an individual's future risk of death from specified underlying causes of death, comprising:

(a) a computer having a processor containing a plurality of biomarker values from an individual; and

(b) a computer program comprising applying a statistical procedure to the plurality of biomarker values to determine whether the individual is classified as having a prescribed high probability of dying, within a specified time period or age interval, from any of the underlying causes of death that account for at least 60% of all deaths in a test population over the specified time period or age interval.

A computer-based system for determining evidence of an individual's health, comprising:

(b) a computer program comprising applying a statistical procedure to the plurality of biomarker values to determine whether the individual is classified as having a prescribed high probability of not dying, within a specified time period or age interval, from any of the underlying causes of death that account for at least 60% of all deaths in a test population over the specified time period or age interval.

An apparatus for determining an individual's risk of future health problems, comprising:

(a) a storage device for storing a plurality of biomarker values from an individual; and

(b) a processor connected to the storage device and programmed to

1) receive the plurality of biomarker values from the storage device, and

2) apply a statistical procedure to the biomarker values to (i) classify the individual as belonging either to a subpopulation PD having a prescribed high probability of acquiring a specified biological condition within a specified time period or age interval, or to a subpopulation PD̄ having a prescribed low probability of acquiring the specified biological condition within the specified time period or age interval, or (ii) quantitatively estimate the individual's probability of acquiring the specified biological condition within the specified time period or age interval,

wherein the statistical procedure is based on:

(1) collecting a database of longitudinally acquired biomarker values from individual members of a test population, in which a subpopulation D of the members is identified as having acquired the specified biological condition within the specified time period or age interval, and a subpopulation D̄ of the members is identified as not having acquired the specified biological condition within the specified time period or age interval;

(2) selecting, from the biomarkers, a subset of biomarkers for discriminating between members belonging to subpopulations D and D̄, based on the distributions of the biomarker values of the individual members of the test population; and

(3) using the distributions of the selected biomarkers to develop the statistical procedure.