WO2009105589A1 - Prediction of the therapeutic outcome of a medical treatment using statistical inference modeling - Google Patents

Prediction of the therapeutic outcome of a medical treatment using statistical inference modeling

Info

Publication number
WO2009105589A1
Authority
WO
WIPO (PCT)
Prior art keywords
variables
target
bias
estimator
data
Prior art date
Application number
PCT/US2009/034585
Other languages
English (en)
Inventor
Mark Van Der Laan
Original Assignee
The Regents Of The University Of California
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Regents Of The University Of California filed Critical The Regents Of The University Of California
Publication of WO2009105589A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders

Definitions

  • Particular embodiments generally relate to data adaptive selection of adjustment sets for assessing the impact of a variable on an outcome.
  • Statistical applications are concerned with estimating the impact of a treatment variable on an outcome of interest based on a data set that measures treatment variables, an outcome, and other variables, on many units such as a patient.
  • Some variables measured on each subject may be considered confounders, which may confound the relationship between treatment and outcome of interest.
  • Estimators of the impact of a treatment variable on an outcome involve adjustment for a set of confounders, and to estimate a causal effect of the treatment variable the estimator needs to adjust for all confounders.
  • the treatment may not be randomized so that an association between the treatment and the outcome does not imply the existence of a causal effect of treatment on the outcome.
  • For example, an observed relationship may be that a high dose of a medicine is associated with cancer.
  • estimators of parameters that are hardly identifiable from the collected data are typically very biased (referred to as sparse-data bias) and due to the large bias have relatively small variance.
  • In the presence of sparse-data bias, a nonparametric bootstrap distribution (i.e., the sampling distribution of an estimator based on resampling from the data set) of such estimators fails to discover the level of sparse-data bias.
  • current statistical practice would assume these estimators are approximately unbiased. In these cases, statistical inference based on an assessment of the variance of the estimator only will result in false (positive) conclusions.
  • a method for selection of a (general) target parameter among a family of (general) parameters based on a criterion assessing the degree of lack of identifiability, and subsequent selection of an effective parameter for the purpose of estimating this selected target parameter based on minimizing an estimated mean squared error. Therefore, particular embodiments data adaptively select target parameters for which reasonable estimators and corresponding statistical inference as measured by (e.g.) confidence intervals and p-values exist, so that the method might acknowledge (and output to the user) that the a priori wished target parameter might be unachievable given the data at hand. Subsequently, given the selection of the target parameter that is assessed to be reasonably well identifiable from the data, it is of interest to select data adaptively among the family of estimators indexed by the family of parameters for the purpose of estimation of this target parameter.
  • a method for determining a target adjustment set of variables that can be reliably adjusted for when assessing the effect of a treatment variable on an outcome.
  • the method comprises determining a data set, the data set associated with a set of variables related to a treatment variable.
  • a target adjustment set of variables is determined from the set of variables.
  • the target adjustment set being variables that are determined to be adjustable based on the data set. For example, based on an identifiability criterion (e.g., sparse data bias) for the effect of treatment on an outcome, controlling for a subset of variables, the target adjustment set may be determined from the set of all variables.
  • the target set may be chosen to be the largest set that still results in an acceptable degree of sparse data bias.
  • the target parameter identified by the target set of variables results in more reliable statistical inference such as confidence intervals and p-values that assess the signal to noise ratio.
  • An effective adjustment set of variables may then be determined from the target adjustment set. For example, minimizing (over all subsets of the target set) an estimate of the mean squared error of the subset-specific estimators with respect to the target parameter defined by the target adjustment set may be used to determine the effective adjustment set.
  • the effective adjustment set may be used to determine an estimator to estimate the effect of treatment variable on outcome adjusting for target adjustment set of confounders.
  • the estimator (of the effect of treatment on outcome adjusted for the target set of variables) that is determined may result in less bias and variance (with respect to the target parameter), and more reliable statistical inference such as confidence intervals and p-values that assess the signal to noise ratio.
  • Fig. 1 depicts a system for selecting a target set of variables and subsequently a corresponding estimate of the effect of input variable on an output variable adjusting for the set of target variables.
  • FIG. 2 depicts a more detailed example of a target set determiner according to one embodiment.
  • FIG. 3 depicts a simplified flowchart of a method for determining a target set of variables according to one embodiment.
  • FIG. 4 depicts a simplified flowchart of a method for determining an estimator according to one embodiment.
  • Fig. 5 depicts a simplified flowchart for selecting the targeted parameter of the data generating distribution among a family of candidate parameters based on sparse-data bias according to one embodiment.
  • FIG. 1 depicts a system 100 for adjusting by a set of variables to provide an estimate of an effect of an input variable on an outcome according to one embodiment.
  • a computing device 102 is configured to receive a set of variables, determine a target set of variables, and provide the estimate of the treatment effect on outcome adjusting for the target set of variables.
  • the effective set of variables is determined data adaptively from a group of candidate target sets included in the selected target set of variables, based on a data set. By subsequently selecting the effective set of variables from the target set of variables data adaptively, an estimator using the effective set may provide an optimal estimator of the treatment effect adjusting for the target set of variables.
  • the input variable may be a variable whose effect on an outcome variable is to be understood.
  • the input variable is a treatment variable, which is a variable related to the treatment of a patient (such as the treatment of a medical disease).
  • a set of variables may be covariates.
  • a covariate may be of direct interest or it may be a confounding or interacting variable on the relationship of the input variable to the outcome variable.
  • the set of variables are considered confounders that have a potential effect on the treatment variable and have an effect on the outcome.
  • a confounder correlates (positively or negatively) with the input variable. Because of the correlation, there is a need to control for these factors to avoid bias in the estimator of a causal effect of treatment on outcome. Conventionally, the complete (i.e., wished) set of confounders was adjusted for to estimate the effect of treatment. However, adjusting for all of the variables the user wishes to adjust for may not yield a reliable estimate and statistical inference.
  • bias may result in the estimate of the parameter of interest and possibly also in the estimate of the standard error, either one resulting in false claims based on biased confidence intervals or p-values.
  • a p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, given that the null hypothesis is true.
  • Particular embodiments truncate the set of variables that are used in the estimator.
  • a set of variables is determined.
  • the set of variables are related to the treatment variable and the outcome variable.
  • a target set of variables is determined from the data where these variables are ones determined to be adjustable based on the data set. For example, some variables should not be adjusted and are not included in the target set.
  • the target set of variables defines the target parameter of interest that needs to be estimated.
  • An effective set of variables is then determined from the target set where the effective set includes variables that will be adjusted in the estimation of the target parameter.
  • the effective adjustment set is determined data adaptively. For example, an estimate of a mean squared error with respect to the target parameter is obtained based on the data set, and is used to determine which set of variables should be used in the estimator.
  • Each target parameter is a feature of the distribution of the data. Once the target parameter is selected, the goal is to estimate that target feature of the distribution of the data.
  • the target parameter is identified by a target set of variables.
  • an effect of treatment on outcome controlling for target set of variables may be the target parameter.
  • This latter target parameter is now identified by target set of variables.
  • The following strategy may be used: for each set of variables contained in the target set of variables, construct an estimator of the corresponding parameter. Then, select among all these candidate estimators the one that is considered optimal for the target parameter using a data-based criterion; the corresponding choice of set of variables is referred to as the effective set of variables.
  • the effective set can be smaller than the target set since, for example, the estimator defined by the target set of variables might be too variable, while smaller sets of variables exist that have much less variance at the cost of a little bias with respect to the target parameter.
  • the goal of this estimate is not the parameter identified by effective set of variables, but the goal of this estimate is the target parameter identified by target set of variables.
  • the target set may be determined by a threshold that is set. For example, a user may set a threshold of an acceptable level of bias, variance, or mean square error.
  • the effective set may be determined by minimizing an empirical criterion assessing the performance of subset specific estimators as an estimator of the target parameter identified by target set of variables.
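  • As an illustration of this subset-selection step, the following sketch (not the patent's exact procedure) uses simple linear-regression adjustment: for every subset contained in the target set it computes the adjusted treatment coefficient, estimates the MSE with respect to the target parameter as (psi_S - psi_target)^2 plus the coefficient's sampling variance, and returns the minimizing subset. The function names, the OLS adjustment, and the variance proxy are illustrative assumptions; in the embodiments described here the variance component would instead come from the influence curve of the subset-specific estimator.

```python
# Illustrative sketch only: effective-subset selection by minimizing an
# estimated MSE with respect to the target parameter (the effect adjusted
# for the full target set).  OLS adjustment and its standard error are used
# as stand-ins for the subset-specific estimators and their variances.
import itertools
import numpy as np

def adjusted_effect(y, a, W):
    """OLS coefficient of treatment a on outcome y, adjusting for columns of W.
    Returns (coefficient, estimated variance of the coefficient)."""
    X = np.column_stack([np.ones(len(a)), a] + ([W] if W.shape[1] else []))
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], cov[1, 1]

def effective_subset(y, a, W_target):
    """Subset of the target adjustment set minimizing
    estimated MSE = (psi_S - psi_target)^2 + Var(psi_S)."""
    p = W_target.shape[1]
    psi_target, _ = adjusted_effect(y, a, W_target)
    best = None
    for r in range(p + 1):
        for S in itertools.combinations(range(p), r):
            psi_S, var_S = adjusted_effect(y, a, W_target[:, list(S)])
            mse = (psi_S - psi_target) ** 2 + var_S
            if best is None or mse < best[0]:
                best = (mse, S, psi_S)
    return best  # (estimated MSE, effective subset indices, its estimate)
```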
  • In the process, the target set is selected first.
  • the selection of the target set defines the target parameter, which provides characteristics of the relationship of interest, such as the effect of a treatment on an outcome controlling for the target set of variables.
  • Different target sets identify a different target parameter.
  • a first target parameter may be the effect of a treatment adjusted for a first set of target variables.
  • a second target parameter may result, such as the effect of a treatment adjusted for a second set of target variables.
  • a target set determiner 104 receives a full set of variables and determines the target set of variables that should be adjusted for in the estimation.
  • the target set may be determined based on which target parameter is determined to be reasonably well identifiable and is close to the full (i.e., wished) set of variables.
  • Target set determiner 104 is configured to data adaptively adjust the full set of variables to the target set of variables.
  • the target set of variables may include fewer variables than the full set of variables.
  • An estimator determiner 106 is configured to determine an estimator of effect of input variable on an outcome, adjusting for the target set of variables (i.e., target parameter).
  • a data set may be used to determine the effect of the input variable on the outcome adjusting for the target set of variables.
  • the target set of variables is adjusted for and used to determine the effect of the input variable on the outcome adjusting for the target adjustment set.
  • Effective set determiner 108 may reduce the target set to an effective set. Effective set determiner 108 may remove variables that it determines may cause too much bias or variance in the estimate with respect to the target parameter defined by the target set of variables.
  • the estimator determined may be displayed to a user. Also the selected target adjustment set is displayed to user, and information related to how much bias or mean square error the estimator has relative to the target parameter defined by target set is output.
  • Target set determiner 104 uses a criterion that determines whether the set of variables can be adjusted with a certain amount of bias. For example, a user can be told how biased the estimation may be based on the set of variables that were adjusted.
  • the criterion may be used to diagnose the bias using an inverse probability of treatment weighted (IPTW) estimator or a truncated IPTW estimator. These estimators will be described in more detail below.
  • IPTW inverse probability of treatment weighted
  • Fig. 2 depicts a more detailed example of target set determiner 104 according to one embodiment.
  • the target set of variables or full set may be input into a variable ranker 202, which can rank the variables based on a criterion. For example, different target sets of variables may be determined.
  • the bias caused by different target sets of variables may be estimated and used to rank the sets.
  • a data set is used to estimate the bias for each target set.
  • the data set may be one in which the data for the input variables and outcome variables are known or have known probability distribution.
  • the largest target set within the wished full set of variables that still has an acceptable amount of bias may be selected as the target set to use in the estimator.
  • the variables in the full set may be ranked based on how strongly they are correlated to the input variable. Other methods of ranking the full set of variables may also be appreciated. The correlation may be determined based on the relationship of the variables to the input variable. The relationship is used because if the relationship is very strong, it may be expected that the variables should be removed. For example, if a variable is too strongly correlated to the input variable, then it may not be adjusted for.
  • a subset determiner 204 determines a target set of variables. The process of determining the target set may be iterative.
  • subset determiner 204 uses different algorithms to select the variables to include in the target set. The algorithms may try to include the largest number of variables in the full set within an acceptable amount of bias. For example, the whole set may be taken and evaluated. Then, a smaller set is used and evaluated, and so on. Once all the candidate sets have been evaluated, a target set may be selected.
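  • A minimal sketch of this ranking and nested-candidate-set idea follows; `estimate_bias` is a hypothetical placeholder for whichever identifiability criterion is used (simulation-based IPTW diagnostic, truncated-influence-curve mean, and so on), and the acceptable bias level is supplied by the user.

```python
# Sketch: rank covariates by |correlation with the input variable|, build
# nested candidate target sets (dropping the most treatment-correlated
# variables first), and keep the largest set whose estimated bias is still
# acceptable.  `estimate_bias` and `max_bias` are hypothetical stand-ins.
import numpy as np

def nested_candidate_sets(a, W):
    """Nested candidate sets ordered from largest to smallest."""
    cors = np.abs([np.corrcoef(a, W[:, j])[0, 1] for j in range(W.shape[1])])
    order = np.argsort(cors)                 # weakest association first
    return [list(order[:k]) for k in range(W.shape[1], 0, -1)]

def select_target_set(a, W, estimate_bias, max_bias):
    """Largest candidate set with an acceptable estimated (sparse-data) bias."""
    for S in nested_candidate_sets(a, W):    # largest candidate first
        if abs(estimate_bias(S)) <= max_bias:
            return S
    return []                                # adjust for nothing
```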
  • an estimator 206 is used to estimate the effect of interest as identified by this target set.
  • the effect of interest may be the effect of input variable on output variable controlling for target set of variables.
  • the target set is used to form an estimator of the target parameter identified by the target set. Different estimators using different candidate target sets are formed. A criterion is then used to evaluate how biased the estimators are for the effect of interest (i.e., the target parameter) where the target set of variables is adjusted for in the estimation. For example, bias may be calculated using the different target sets of variables.
  • the bias may be determined based on a known data set that includes data on the input variable, outcome and other variables, and this known data set is chosen so that the effect of interest for that data set is known.
  • the data set is input into estimator 206 and an estimate is determined using the target set of variables.
  • the target adjustment set selection is based on an empirical criterion that measures performance of the estimator in estimating the effect of interest identified by the target set.
  • the estimates based on the known data sets (obtained by simulating from an estimated data generating distribution), in which the corresponding true effects of interest are known, are then compared to the known effects of interest to determine the bias caused by the target set of variables.
  • the bias may also be estimated based on influence curves of the different estimators for the different candidate target sets, and used to select the target set of variables, as described in detail below.
  • the different estimators for the different candidate target sets may be targeted maximum likelihood estimators.
  • the influence curve of a targeted maximum likelihood estimator evaluated at the targeted maximum likelihood estimator is sensitive to sparse data bias, so that estimates of bias and variance of this estimated influence curve provide excellent measures for sparse data bias.
  • evaluator 208 determines the optimal target set of variables per a criterion.
  • the different target sets of variables may be ranked.
  • the estimated bias may be based on the variation of an estimate from a true answer based on known data sets. Since these known data sets are chosen so that the true answers are known, the estimates are compared to the known answers. The average difference between the known answers and the estimated outcome may be the bias for the target set.
  • Evaluator 208 may use a criterion to determine which of the target sets may be optimal. For example, the target set that is largest and still has an acceptable amount of bias may be selected. However, another target set may be selected. Many other choices will be appreciated. Overall the choice of target parameter may be driven by properties such as reliable statistical inference, bias, variance, mean squared error, and being close to wished parameter that the estimator wants to learn (but might be impossible to learn).
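  • A compact sketch of the simulation-based bias diagnostic described above: data sets with a known true effect are drawn from an estimated data-generating distribution, the candidate estimator is applied to each, and the average deviation from the truth is reported. `simulate_dataset` and `estimator` are hypothetical user-supplied callables.

```python
# Simulation-based bias diagnostic (sketch): average (estimate - truth)
# over data sets simulated from a distribution with a known effect.
import numpy as np

def simulated_bias(simulate_dataset, estimator, true_effect, n_sims=200, seed=0):
    """Average deviation of the estimator from the known true effect."""
    rng = np.random.default_rng(seed)
    estimates = [estimator(*simulate_dataset(rng)) for _ in range(n_sims)]
    return float(np.mean(estimates) - true_effect)
```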
  • Fig. 3 depicts a simplified flowchart 300 of a method for determining a target set of variables according to one embodiment. The process will be described where the variables are considered confounders. However, the confounder could be any variable that is related to the relationship of interest.
  • target set determiner 104 determines the set of confounders.
  • the set of confounders may be determined based on a statistical analysis being performed and may be the full set.
  • target set determiner 104 determines different target sets of confounders.
  • the target sets may be determined by different algorithms that select different sets of confounders. For example, a cut-off point may be determined in a ranked list of confounders to determine the target set. Also, different nested sets may be determined for the target sets.
  • target set determiner 104 evaluates the different target sets against a criterion. For example, a criterion is used to determine bias for the target set using the data set that is known. The bias of each target set in estimation is used to evaluate whether the target set includes an optimal number of variables that should be adjusted in the estimation. For example, a target set that has acceptable bias or mean square error may be selected. Step 308 outputs the optimal target set as the selected target set.
  • FIG. 4 depicts a simplified flowchart 400 of a method for determining an estimator using the target set of confounders according to one embodiment.
  • estimator determiner 106 receives the target set of confounders that defines a target parameter.
  • estimator determiner 106 receives a data set and defines an estimator for many subsets of the target set.
  • the data set may include an input variable in which an estimate is desired. For example, an effect of a treatment variable on an outcome variable, controlling for the target set of variables, may be estimated for the data set.
  • the data set may include information for the treatment variable, outcome, and variables in the target set.
  • estimator determiner 106 defines a mean squared error criterion measuring performance of subset specific estimators as an estimator of target parameter.
  • estimator determiner 106 selects an effective set minimizing an estimated mean squared error.
  • an estimator of the effect of the treatment variable on the outcome variable, controlling for the effective set of confounders is determined.
  • This latter estimator is optimized to estimate the effect of treatment variable on the outcome, controlling for the target set of variables.
  • "effective" refers to "effective for estimating the effect identified by the target set of variables".
  • an estimator for the target parameter may be output.
  • the effective set of variables may be output.
  • the estimator, target set, effective set, and a bias and variance estimate may be displayed to a user.
  • the estimator may be generated by training it on the data set received.
  • the bias and variance are used to indicate to a user a confidence level to indicate a measure of reliability.
  • the level may be the amount of bias or variance that may be included if the estimator is used. This gives the user a good idea of how to evaluate the results of the estimator. For example, a user may not rely on results that include a large (sparse data) bias. This may be helpful for a researcher or other person who is basing observations or analysis on the estimator.
  • an estimate of the sparse data bias with respect to the target parameter defined by the target set of confounders may be outputted.
  • the estimate may indicate to the user the risk in relying on the estimate as an estimate of the target parameter, and corresponding confidence level, such as confidence intervals for the target parameter.
  • a confidence interval (CI) or confidence bound is an interval estimate of a population parameter. Instead of estimating the parameter by a single value, an interval likely to include the parameter is given. Thus, confidence intervals are used to indicate the reliability of an estimate.
  • the p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, given that the null hypothesis is true.
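  • For concreteness, a minimal sketch of the Wald-type confidence interval and two-sided p-value that accompany an asymptotically linear estimator whose standard error is obtained from its estimated influence curve (SE = sd(IC)/sqrt(n)); the p-value here tests the null hypothesis of no effect.

```python
# Wald-type confidence interval and p-value from influence-curve values.
import numpy as np
from scipy.stats import norm

def wald_inference(psi_hat, ic_values, alpha=0.05):
    """(lower, upper, p_value) for estimate psi_hat with influence-curve
    values ic_values; the p-value tests psi = 0."""
    n = len(ic_values)
    se = np.std(ic_values, ddof=1) / np.sqrt(n)
    z = norm.ppf(1 - alpha / 2)
    p_value = 2 * norm.sf(abs(psi_hat) / se)
    return psi_hat - z * se, psi_hat + z * se, p_value
```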
  • Particular embodiments will now be described in more detail with respect to an experiment or study. However, the methods can be applied to any other estimation. Particular embodiments provide a statistical application for estimating the impact of a treatment variable on an outcome of interest. If the treatment variable has not been randomized, it is desirable to adjust such effect estimates for a set of covariates that are thought to confound the relationship of interest (i.e., a set of confounders). Such an adjustment, however, relies on the assumption of experimental treatment assignment (ETA) according to which each experimental unit has positive probability of being observed at any of the possible levels of the treatment variable, regardless of the values the confounding factors may take on.
  • ETA experimental treatment assignment
  • Data-adaptive selection of the adjustment set of confounders represents an automated approach for avoiding such problems.
  • the selection is based on a criterion for deciding if a particular adjusted variable importance parameter suffers from too strong an ETA violation (i.e., a correlation between confounders and the input variable) to be reliably estimated from the data.
  • the adjustment set defining the parameter of interest, as selected based on a given identifiability criterion, is referred to as the targeted adjustment set; the possibly smaller, data-adaptively determined adjustment set used in estimating this parameter, on the other hand, is referred to as the effective adjustment set.
  • the effective adjustment set is thus nested in the targeted adjustment set, which in turn is nested in the full adjustment set.
  • Even if the variable importance parameter corresponding to a particular adjustment set can be estimated reliably from the data at hand, it may be advantageous to base estimation of this parameter on an adjustment set that in fact excludes additional covariates.
  • Particular embodiments therefore include a second step that is aimed at evaluating whether such additional exclusions can be used to obtain more efficient estimates of that parameter. This second step then results in the effective adjustment set.
  • One way the impact of a particular mutation on viral load could be assessed would be to compare the virologic response among patients whose virus has the mutation to that among patients whose virus does not. If it is found that patients in the first group respond much more poorly to a particular drug regimen, a clinician might be inclined not to give this regimen to a new patient entering his office who has this mutation. Patients in the first group are, however, also quite likely to differ from those in the second group in terms of the remaining mutations as well as other measured clinical covariates.
  • the mutation of interest may, for example, be very common among patients who have previously failed several similar drug regimens, making them far more likely to also fail the regimen under consideration.
  • Particular embodiments estimate the impact of a given mutation on viral load that is not due to associations of this mutation with any of the other measured covariates.
  • questions that can be considered include: What difference in virologic response would be observed if one could somehow give every patient in the study population the mutation of interest, holding the remaining covariates fixed at their current values, as opposed to the scenario in which none of the patients are given this mutation, again holding the other covariates fixed? Any observed difference could then not be due to differences of the two populations with regard to the remaining covariates and would thus be more likely to generalize to a new population in which the mutation of interest and the other covariates may be related to each other differently.
  • the set of adjustment variables may contain covariates that are not perfectly predictive of the mutation of interest, but still determine the presence or absence of that mutation in a nearly deterministic fashion.
  • a second mutation may, for example, be so strongly correlated with the mutation of interest that 99% of patients with this second mutation also exhibit the mutation of interest. In such instances, a substantial amount of data would be required before the adjusted variable importance of the mutation of interest could be estimated in any reliable way. In smaller samples, it could easily occur by chance that no patients are observed that are discordant for these two mutations, again precluding obtaining an adjusted variable importance estimate. To distinguish this scenario from the one described in the previous paragraph, it is referred to as a practical rather than a theoretical violation of the ETA assumption.
  • Particular embodiments are based on the idea of developing a criterion that can give the user a sense of the extent to which the variable importance parameter corresponding to a proposed adjustment set is identifiable from the data at hand. If this criterion suggested that the parameter corresponding to the full adjustment set was not well identified, it could then also be used to identify a smaller, more workable adjustment set. In one embodiment, two criteria may be used. The first criterion makes use of a simulation-based approach for diagnosing the bias that a so-called Inverse-Probability-of-Treatment-Weighted (IPTW) estimator is subject to if the ETA assumption is violated.
  • IPTW Inverse-Probability-of-Treatment-Weighted
  • the second criterion makes use of closed-form estimates derived for the asymptotic bias of a truncated IPTW estimator.
  • the greater computational burden of the first criterion may make the second approach a more appealing option.
  • the choice of IPTW estimator can be replaced by other estimators that are sensitive to ETA violations such as the nonparametric targeted maximum likelihood estimator.
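  • A simpler heuristic than either of the two criteria above, but useful for intuition, is to inspect the estimated treatment (or mutation) probabilities directly: a practical ETA violation shows up as probabilities of the observed treatment level that are very close to zero. The sketch below assumes a binary treatment and an arbitrary illustrative cutoff of 0.025.

```python
# Sketch of a practical-ETA-violation check: how often is the estimated
# probability of the observed treatment level nearly zero?
import numpy as np
from sklearn.linear_model import LogisticRegression

def eta_diagnostic(a, W, cutoff=0.025):
    """(fraction of observations whose probability of the observed treatment
    is below `cutoff`, smallest such probability)."""
    g1 = LogisticRegression(max_iter=1000).fit(W, a).predict_proba(W)[:, 1]
    g_obs = np.where(a == 1, g1, 1 - g1)
    return float(np.mean(g_obs < cutoff)), float(g_obs.min())
```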
  • particular embodiments involve an approach for defining a sequence of nested candidate target adjustment sets that, in combination with a given identifiability criterion, can be used to select an appropriate effective adjustment set data-adaptively.
  • the adjustment set defining the parameter of interest may, for example, contain a covariate that is a good predictor of the mutation under consideration, but only a weak predictor of viral load.
  • a covariate may be only a weak confounder of the relationship between the mutation and viral load, but can still lead to a mild practical violation of the ETA assumption that would cause the variable importance estimator to become more variable.
  • Not adjusting for this covariate could thus, at the price of a slight increase in bias, offer a considerable reduction in variability, thus leading to an overall reduction in mean squared error.
  • Particular embodiments therefore also involve an approach that, given an adjustment set defining the parameter of interest, can be used to evaluate whether such additional exclusions from the adjustment set can be expected to lead to a more efficient estimator with smaller mean squared error.
  • the closed-form mean squared error estimates used for truncated IPTW estimators have an additional application in selecting an appropriate truncation level for IPTW estimators.
  • By weighting subjects by the inverse of the conditional probability of having selected their observed treatment, given available confounders, these estimators create a new sample in which treatment assignment is independent of the measured confounders. If the ETA assumption is practically violated, observations with very small treatment probabilities and corresponding large weights can dominate the remainder of the sample so that the estimator tends to become highly variable. In such instances, the use of truncated weights can often, at the price of a slight increase in bias, lead to a dramatic reduction in variability and thus typically also to a reduction in mean squared error.
  • the closed-form estimates for the asymptotic bias of a truncated IPTW estimator can be used to select this truncation data-adaptively based on the goal of minimizing the mean squared error of the estimator.
  • One criterion for deciding whether the parameter can be reliably estimated from the data at hand is based on closed-form estimates developed for the mean squared error of a truncated Inverse-Probability-of-Treatment-Weighted (IPTW) estimator. If the ETA assumption is practically violated, the performance of this estimator can often be improved by truncating the inverse-probability-of-treatment weights.
  • the mean-squared-error estimates can thus in particular be used to select an appropriate truncation level based on the goal of minimizing the mean squared error of the estimator.
  • the data adaptive selection of a truncation level is an important ingredient in itself, since many estimators require selection of a truncation level in order to make the estimators most robust, and the method for selection of the truncation level by minimizing an estimated mean squared error applies to each of these cases.
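  • The sketch below illustrates a truncated IPTW estimator for a binary treatment and a simple data-adaptive choice of the truncation level by minimizing an estimated MSE. The bias proxy used here (difference from the untruncated estimate) and the variance proxy (empirical variance of the weighted terms) are illustrative assumptions and are not the closed-form expressions referred to above.

```python
# Truncated IPTW for a binary treatment, with truncation level chosen by
# minimizing an estimated MSE (illustrative proxies, see lead-in).
import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_effect(y, a, W, trunc=np.inf):
    """E[Y1] - E[Y0] by inverse-probability-of-treatment weighting with
    weights capped at `trunc`.  Returns (estimate, estimated variance)."""
    g1 = LogisticRegression(max_iter=1000).fit(W, a).predict_proba(W)[:, 1]
    g_obs = np.where(a == 1, g1, 1 - g1)          # P(observed treatment | W)
    w = np.minimum(1.0 / g_obs, trunc)            # truncated weights
    terms = w * y * np.where(a == 1, 1.0, -1.0)
    return terms.mean(), terms.var(ddof=1) / len(y)

def select_truncation(y, a, W, levels=(5, 10, 20, 50, np.inf)):
    """Truncation level minimizing the estimated MSE."""
    psi_untrunc, _ = iptw_effect(y, a, W, trunc=np.inf)
    mse = {}
    for M in levels:
        psi_M, var_M = iptw_effect(y, a, W, trunc=M)
        mse[M] = (psi_M - psi_untrunc) ** 2 + var_M
    return min(mse, key=mse.get)
```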
  • the method for selection of the adjustment set in a variable importance analysis generalizes to selection of the targeted parameter among a family of parameters based on a criterion that assesses the amount of bias in the target parameter of interest due to lack of identifiability.
  • This criterion is referred to as sparse-data bias, which is a bias due to lack of data.
  • This unifying method is based on using the influence curve of each estimator in the family of estimators corresponding with the family of parameters. Since an influence curve can be derived for any asymptotically linear estimator of any pathwise differentiable parameter, this generalizes the methodology to all problems of interest, and thereby provides a unifying methodology.
  • the typical model may then be the model implied by a model for the full data distribution F (e.g. nonparametric) and some model for G.
  • Fig. 5 depicts a simplified flowchart 500 for selecting the targeted parameter based on sparse-data bias according to one embodiment.
  • Step 502 defines a family of path-wise differentiable parameters.
  • the path-wise differentiable parameters may be the effect of an input variable on an outcome adjusting for different target sets of confounders.
  • psi(delta) could be defined as a causal-effect parameter based on a reduction of O indexed by delta (e.g., the reduction is obtained by removing certain baseline or time-dependent covariates/confounders).
  • the choice of delta can also index a particular algorithm applied to the data generating distribution defining the parameter psi(delta).
  • Step 504 defines data-based estimators, that is, estimators that only rely on assumptions not affecting the information bound of the pathwise differentiable parameters.
  • estimators psi_n*(delta) whose consistency does not, or only minimally, rely on unknown modeling assumptions (i.e., beyond the assumptions specified by the actual model for P), but these estimators are allowed to rely on assumptions that do not affect the asymptotic information bound for psi(delta).
  • these estimators are heavily data-based estimators minimally or not relying on model-based extrapolation improving the information bound for psi(delta) relative to the information bound for psi(delta) in the actual model for the data generating distribution P.
  • these estimators need to be chosen such that, if the parameter psi(delta) is hardly identifiable from the data given the actual model (i.e., the model known to be true), then the influence curve of this estimator may be a random variable with very large variance (i.e., allowing for extreme outliers), or the estimator will simply be asymptotically biased.
  • psi_n*(delta) may be a targeted maximum likelihood estimator since the influence curves of targeted maximum likelihood estimators are very sensitive to sparse data bias.
  • psi_n*(delta) may also be an IPTW-estimator, possibly with known treatment/censoring mechanism.
  • an IPTW-estimator may be selected in the model in which the treatment mechanism is known, since a model with known CAR-treatment/censoring mechanism has the same information bound for the variable importance parameter as a model with unknown CAR-treatment/censoring mechanism.
  • Step 506 derives influence curves of family of data-based estimators.
  • Let IC*(delta,P) represent the influence curve of these data-based estimators psi_n*(delta) of psi(delta) for all delta.
  • Step 508 generates analytic influence-curve-based measures of sparse-data bias:
  • a user-supplied, practically reasonable truncation level M for the influence of a single observation may be defined; let IC*(delta,M) be the corresponding truncated influence curve.
  • a user might decide that a single observation should never represent an influence curve contribution exceeding more than 2% of the sample variance of the influence curve. In essence, the user needs to decide what level of robustness is minimally required for any estimator to be admissible.
  • The bias (e.g., the empirical mean of the truncated influence curve) is computed after estimating the unknowns in the influence curve in such a way that the empirical mean of the untruncated influence curve equals zero.
  • a targeted maximum likelihood estimator is used to solve the empirical mean of the untruncated influence curve, i.e., to make this empirical mean equal zero.
  • the sparse-data bias is defined as the empirical mean of the influence curve truncated at M.
  • the empirical mean of the truncated influence curve is replaced by the cross-validated empirical mean of the truncated influence curve.
  • the proposed measures of sparse-data bias are numbers on the same scale as the parameters psi(delta), thereby allowing natural quantification, and their theoretical underpinning is based on the fact that the expectation of an influence curve at a candidate P1 equals, to first order, the deviation of the parameter of interest at P1 from the true parameter.
  • Other measures of sparse data bias derived from the influence curve of the delta-specific (e.g. targeted maximum likelihood) estimators may be appreciated.
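  • The sparse-data bias measure of step 508 can be sketched as follows: truncate each estimated influence-curve contribution at a level M and report the empirical mean of the truncated values (the untruncated mean is approximately zero after targeting). Translating the "no single observation may contribute more than a given fraction of the sample variance" rule into the cutoff used below is one possible reading and is an assumption of this sketch.

```python
# Influence-curve-based sparse-data bias (sketch): empirical mean of the
# influence curve truncated at M, with M derived from a per-observation
# variance-contribution limit (an assumed reading of the text above).
import numpy as np

def sparse_data_bias(ic_values, frac=0.02):
    """Empirical mean of the influence curve truncated at level M."""
    ic = np.asarray(ic_values, dtype=float)
    n = len(ic)
    # one observation contributes roughly IC_i^2 / n to the sample variance,
    # so cap |IC_i| at sqrt(frac * n * Var(IC))
    M = np.sqrt(frac * n * ic.var(ddof=1))
    return float(np.clip(ic, -M, M).mean())
```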
  • Step 510 selects the target parameter.
  • An example of a target parameter is an effect of an input variable on an outcome adjusting for a target set of confounders.
  • the target parameter is now defined by setting an acceptable level of sparse-data bias, possibly relative to a particular parameter that is easy to estimate. In this way, the user is able to evaluate how far a given parameter is from the edge (in a space/set of candidate parameters) at which parameters become, for all practical purposes, impossible to identify so that statistical estimation of this parameter as well as inference (i.e., standard errors) will become completely unreliable. If there are multiple parameters within the specified identifiable range, then the user may have to make a choice.
  • the parameters psi(delta) can be ordered with respect to their distance to the wished (but possibly impossible to estimate) parameter so that the first parameter is selected in this ordered list which satisfies the acceptable identifiable bias.
  • Let delta* denote the selected choice of target parameter.
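  • A minimal sketch of this step 510 rule, assuming the candidate deltas are already ordered from closest to the wished parameter to farthest:

```python
# Step 510 (sketch): pick the first candidate parameter, in order of
# closeness to the wished parameter, whose sparse-data bias is acceptable.
def select_target_delta(candidates, sparse_bias, max_bias):
    """sparse_bias[d] is the estimated sparse-data bias of psi(d)."""
    for d in candidates:                  # closest to the wished parameter first
        if abs(sparse_bias[d]) <= max_bias:
            return d                      # this is delta*
    return None                           # nothing acceptably identifiable
```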
  • Step 512 data adaptively selects the effective parameter in order to estimate the target parameter based on mean squared error: given the selection of the targeted parameter psi(delta*) and given user-supplied wished estimators (e.g., targeted MLE) psi_n(delta) of psi(delta) with influence curves IC(delta) for all delta, an effective parameter choice is selected by minimizing over delta an estimate of the Mean Squared Error (MSE). (Note that the wished estimators psi_n(delta) need not be equal to the estimators psi_n*(delta) used to assess the identifiability bias to select the target parameter.)
  • MSE Mean Squared Error
  • the bias component of the MSE at delta can be estimated as the difference between the estimator psi_n(delta) and psi_n(delta*), and the variance component is estimated with the influence curve of psi_n(delta). Due to the fact that the target parameter is chosen away from the edge of becoming non-identifiable, this method for selecting delta may have a superior practical performance relative to its performance applied to a target parameter that is close to non-identifiable.
  • the effective parameter could be an effect of input variable on outcome adjusting for an effective set of confounders.
  • the estimate of the bias term in the MSE criterion for psi_n(delta) can be penalized with an additional finite sample bias term, which can be estimated with the cross-validated mean of the influence curve of psi_n(delta).
  • the purpose of this finite sample bias term is that it picks up a contribution due to the second order term in a Taylor expansion of the estimator psi_n(delta) relative to psi(delta), while the variance of the influence curve is only based on the first order linear approximation of the estimator.
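  • A sketch of the step 512 selection rule follows: the bias term is psi_n(delta) - psi_n(delta*), optionally penalized by the cross-validated mean of the influence curve, and the variance term is the influence-curve variance divided by the sample size. The dictionary-based interface is illustrative.

```python
# Step 512 (sketch): effective parameter = argmin over delta of
# (bias term)^2 + influence-curve variance / n.
import numpy as np

def select_effective_delta(estimates, ic_values, delta_star, cv_ic_means=None):
    """estimates[d]: psi_n(d); ic_values[d]: estimated influence-curve values
    of psi_n(d); delta_star: selected target parameter; cv_ic_means[d]
    (optional): cross-validated mean of the influence curve, used as a
    finite-sample bias penalty."""
    mse = {}
    for d, psi_d in estimates.items():
        bias = psi_d - estimates[delta_star]
        if cv_ic_means is not None:
            bias += cv_ic_means[d]            # finite-sample bias penalty
        ic = np.asarray(ic_values[d], dtype=float)
        mse[d] = bias ** 2 + ic.var(ddof=1) / len(ic)
    return min(mse, key=mse.get)
```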
  • a particular application of the selection methodology above concerns the selection among candidate targeted maximum likelihood estimators of a variable importance or a causal effect indexed by different algorithms for the nuisance parameters such as the initial regression estimator.
  • the use of data adaptive algorithms for these nuisance parameters can potentially cause non-negligible finite sample bias in the targeted maximum likelihood estimators which is not captured by variance estimates based on the influence curve.
  • a particular case of interest is the case that all targeted maximum likelihood estimators (i.e., for each choice of delta) are actually known to asymptotically target the right parameter psi, such as targeted maximum likelihood estimators of the causal effect of a treatment in a randomized trial.
  • these targeted maximum likelihood estimators may be indexed by different initial regression estimators, but each targeted maximum likelihood estimator is known to be asymptotically consistent for the wished causal effect psi. That is, a class of asymptotically linear estimators psi_n(delta) of psi is available, and the selection methodology above can be used to choose among them.
  • Any suitable programming language can be used to implement the routines of particular embodiments, including C, C++, Java, assembly language, etc.
  • Different programming techniques can be employed such as procedural or object oriented.
  • the routines can execute on a single processing device or multiple processors.
  • steps, operations, or computations may be presented in a specific order, this order may be changed in different particular embodiments. In some particular embodiments, multiple steps shown as sequential in this specification can be performed at the same time.
  • Particular embodiments may be implemented in a computer-readable storage medium or tangible medium for use by or in connection with the instruction execution system, apparatus, system, or device.
  • Particular embodiments can be implemented in the form of control logic in software or hardware or a combination of both.
  • the control logic when executed by one or more processors, may be operable to perform that which is described in particular embodiments.
  • Particular embodiments may be implemented by using a programmed general purpose digital computer, application specific integrated circuits, programmable logic devices, field programmable gate arrays, or optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms.
  • the functions of particular embodiments can be achieved by any means as is known in the art.
  • Communication, or transfer, of data may be wired, wireless, or by any other means.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In one embodiment, the invention provides a method for determining a target set of variables that is used to select an estimator of an effect of an input variable on an outcome from among a family of estimators of an effect of an input variable on an outcome, each estimator and its corresponding target effect being indexed by a different candidate target adjustment set of variables. The method comprises determining a data set, the data set being associated with a set of variables related to the input variable and the outcome. A target adjustment set of variables is determined from the set of variables. The target adjustment set of variables may be determined as the largest set that still yields a reasonably unbiased estimator and reliable statistical inference for its target effect.
PCT/US2009/034585 2008-02-22 2009-02-19 Prediction of the therapeutic outcome of a medical treatment using statistical inference modeling WO2009105589A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US3091908P 2008-02-22 2008-02-22
US61/030,919 2008-02-22

Publications (1)

Publication Number Publication Date
WO2009105589A1 (fr) 2009-08-27

Family

ID=40985909

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/034585 WO2009105589A1 (fr) 2008-02-22 2009-02-19 Prediction of the therapeutic outcome of a medical treatment using statistical inference modeling

Country Status (1)

Country Link
WO (1) WO2009105589A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050234763A1 (en) * 2004-04-16 2005-10-20 Pinto Stephen K Predictive model augmentation by variable transformation
US20050266420A1 (en) * 2004-05-28 2005-12-01 Board Of Regents, The University Of Texas System Multigene predictors of response to chemotherapy
US20070172844A1 (en) * 2005-09-28 2007-07-26 University Of South Florida Individualized cancer treatments

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050234763A1 (en) * 2004-04-16 2005-10-20 Pinto Stephen K Predictive model augmentation by variable transformation
US20050266420A1 (en) * 2004-05-28 2005-12-01 Board Of Regents, The University Of Texas System Multigene predictors of response to chemotherapy
US20070172844A1 (en) * 2005-09-28 2007-07-26 University Of South Florida Individualized cancer treatments

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GREENLAND, S.: "Variable Selection versus Shrinkage in the Control of Multiple Confounders", AMERICAN JOURNAL OF EPIDEMIOLOGY ADVANCE ACCESS, 27 January 2008 (2008-01-27), Retrieved from the Internet <URL:http://aje.oxfordjoumals.org/cgi/reprint/kwm355v1> [retrieved on 20090323] *

Similar Documents

Publication Publication Date Title
Allocco et al. Quantifying the relationship between co-expression, co-regulation and gene function
Qu et al. Linear score tests for variance components in linear mixed models and applications to genetic association studies
Imoto et al. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network
Fischer et al. CAFASP2: the second critical assessment of fully automated structure prediction methods
Andrinopoulou et al. Bayesian shrinkage approach for a joint model of longitudinal and survival outcomes assuming different association structures
Wang et al. Improved protein structure selection using decoy-dependent discriminatory functions
Hu et al. Proper use of allele-specific expression improves statistical power for cis-eQTL mapping with RNA-seq data
Robertson et al. An all‐atom, distance‐dependent scoring function for the prediction of protein–DNA interactions from structure
Samudrala et al. A comprehensive analysis of 40 blind protein structure predictions
Hong et al. Semi‐supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping
Tang et al. PASTA: splice junction identification from RNA-Sequencing data
Ghosh et al. A semiparametric Bayesian approach to multivariate longitudinal data
Mishra et al. D2N: Distance to the native
Zabad et al. Fast and accurate Bayesian polygenic risk modeling with variational inference
Yan et al. DescFold: a web server for protein fold recognition
Kundu et al. A framework for understanding selection bias in real-world healthcare data
Gecili et al. Bayesian regularization for a nonstationary Gaussian linear mixed effects model
Zhu et al. How well can we predict native contacts in proteins based on decoy structures and their energies?
JP2024536911A (ja) Computer-implemented method and apparatus for analyzing genetic data
Ng Recent developments in expectation‐maximization methods for analyzing complex data
Chen et al. Using propensity scores to predict the kinases of unannotated phosphopeptides
WO2009105589A1 (fr) Prédiction du résultat thérapeutique d’un traitement médical à l’aide d’une modélisation d’inférence statistique
Kwee et al. Simple methods for assessing haplotype‐environment interactions in case‐only and case‐control studies
Shen et al. Docking with PIPER and refinement with SDU in rounds 6–11 of CAPRI
Wolfsheimer et al. Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09711564

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09711564

Country of ref document: EP

Kind code of ref document: A1