WO2009135076A1 - Estimation based on case-control designs - Google Patents

Estimation based on case-control designs

Info

Publication number
WO2009135076A1
WO2009135076A1 PCT/US2009/042429 US2009042429W WO2009135076A1 WO 2009135076 A1 WO2009135076 A1 WO 2009135076A1 US 2009042429 W US2009042429 W US 2009042429W WO 2009135076 A1 WO2009135076 A1 WO 2009135076A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
type
observation
samples
weights
Prior art date
Application number
PCT/US2009/042429
Other languages
French (fr)
Inventor
Mark Van Der Laan
Original Assignee
The Regents Of The University Of California
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Regents Of The University Of California
Publication of WO2009135076A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires

Definitions

  • Particular embodiments generally relate to estimation.
  • A randomized experiment may be considered the optimal way of establishing cause-and-effect relationships. However, for some interventions, a randomized experiment may not be possible, or it may take a long time to observe the events being studied. Case-control studies use subjects who may have the condition and look back to see whether these patients have characteristics that differ from those of subjects who do not have the condition.
  • Case-control sampling is used to generate data to estimate effects of exposures or treatments on a binary outcome of interest when the proportion of cases (i.e., binary outcome equal to 1) in the population of interest is low.
  • Case-control sampling produces a biased sample of a target population of interest by sampling a disproportionate number of cases. Case-control studies are also commonly employed to estimate the effects of genetic markers or biomarkers on phenotypes.
  • the typical approach used in practice is to fit (conditional) logistic regression models, ignoring the case-control sampling, in order to estimate the conditional odds ratios of being a case, given baseline covariates and exposure. These methods do not rely on knowing the true incidence probability (i.e., the probability of being a case).
  • an estimator is determined for an unbiased sample of a probability distribution.
  • the estimator maps a set of data points to a target feature.
  • the unbiased sample includes a first type of observation of the probability distribution and a second type of observation of the probability distribution.
  • the case control sampling design may be based on the first type and the second type of observations.
  • a biased sample is determined that may include samples of the first type of observation and second type of observation.
  • Clusters of the samples are determined from the biased sample where a cluster includes one or more samples of the first type of observation and one or more samples of the second type of observation.
  • Weights are assigned to each sample within a cluster based on characteristics of the target population. For example, weights may be determined by assigning a weight to every sample in the cluster such that the expectation of the weighted average of a scoring function of the samples in the cluster equals the expectation of this scoring function under unbiased sampling from the target population.
  • the estimator for the unbiased sample may be used to map data points of the biased sample to the target feature by inputting the biased sample, with the corresponding weight for each sample, into the estimator. The estimator then maps the weighted data points to the target feature.
  • the target feature may be a causal effect of a characteristic, a prediction function, a probability distribution, or any other target feature.
  • FIG. 1 depicts a simplified computing system according to one embodiment.
  • FIG. 2 depicts a simplified flowchart of a method for determining a target feature according to one embodiment.
  • FIG. 3 depicts a simplified flowchart of a method for determining clusters for the biased sample.
  • FIG. 4 depicts a simplified flowchart of a method for determining the weights according to one embodiment.
  • a method to estimate a target feature based on a biased sample is provided.
  • a probability distribution may generate a first type of observation and a second type of observation of the probability distribution.
  • An observation is either the first type of observation or second type of observation, but not both.
  • a case control study may be used where for a case, a number of controls are used.
  • a case may be a subject that has the type of observation while a control does not have the type of observation. For example, a case may have cancer but the control does not.
  • the estimator for an unbiased sample of a probability distribution is determined, where the unbiased sample includes the first type and second type of observations.
  • the estimator is configured to map a set of data points with corresponding set of weights to a target feature of the probability distribution of the unbiased sample.
  • the estimator is configured so that setting the weights all equal to 1 and inputting a non-biased sample of the target probability distribution into the estimator yields a valid unbiased estimator of the target feature of the probability distribution.
  • the estimator for the unbiased sample is then used to map weighted data points of the biased sample to the target feature.
  • the weights are determined by determining clusters of samples and assigning weights to samples in the cluster.
  • a cluster includes one or more samples of the first type of observation and one or more samples of the second type of observation.
  • the cluster may include a case and one or more controls.
  • Weights are assigned to each sample within each cluster of samples based on characteristics of the samples included in each cluster.
  • the controls and cases can be generalized to any splitting of the population into K disjoint sub-populations spanning the whole population.
  • the biased sample is split up (typically using the methodology of the case control design) into clusters of observations, where each cluster contains one or more observations of the first type and one or more observations of the second type.
  • the biased sample may be split up by design because the case-control study was performed in a way that facilitates clustering. For example, in a case control design, a case is matched with one or more controls based on the characteristics of the case and controls (e.g., controls of the same age (the matching variable) as the case are sampled). Thus, for a case that is sampled, controls of the same age are sampled. This sampling method of clustering based on the matching variable used to sample forms the basis for the clusters.
  • Each first type and second type observation in the cluster gets its own weight (e.g., in matched case control designs, the weight for the second type of observation associated with a first type of observation may depend on the matching variable of the first type of observation).
  • This combined assignment of weights within each cluster of observations in the biased sample is done so that it corrects for the biased sampling of this cluster of coupled first type and second type observations.
  • the estimator developed for the unbiased sample, including two types of observations, can now be used with the assigned weights for the clusters.
  • FIG. 1 depicts a simplified computing system 100 according to one embodiment.
  • Computing system 100 may include one or more computing devices. It will be understood that functions described may be distributed among multiple computing devices or may be performed by the same computing device.
  • An unbiased estimator determiner 102 is configured to determine an estimator for an unbiased sample of a target probability distribution.
  • the probability distribution identifies the probability of each value of a random variable, where a random variable is defined by a set of possible outcomes of an experiment, and a probability distribution on these possible outcomes.
  • the experiment defining the random variable could be sampling a subject from a target population and measuring a number of characteristics on the subject.
  • the probability distribution describes the range of possible values that a random variable can attain and the probability that the value of the random variable is within any (measurable) subset of that range.
  • the estimator may be configured to receive data and output a target feature of the probability distribution.
  • the data received may be information for a biased sample and the estimator used may be a targeted maximum likelihood estimator or any other estimator. This estimator will be used to map data points of a biased sample with a corresponding set of weights to the target feature.
  • a cluster determiner 104 is configured to receive a biased sample.
  • a biased sample may be a statistical sample of a target population in which some members of the population are less likely to be included than others.
  • a population may be a set of entities from which the observations may be drawn. The population may also refer to a set of potential measurements or values, including not only cases actually observed but those that are potentially observable.
  • the biased sample is a subset of the population in which measurements have been made.
  • the biased sample represents an (unbiased) sample from a probability distribution that may be determined by the target probability distribution.
  • the probability distribution of the biased sample may be a conditional probability distribution of the target probability distribution.
  • an estimator is configured for the target feature that assumes a random sample from this target population, and also allows the inputting of weights assigned to each observation inputted.
  • the biased sample is obtained by sampling from one or more probability distributions that are determined by this target probability distribution but are not equal to it, such as different conditional probability distributions, conditioning on samples having a certain type. Sampling from a conditional probability distribution involves (unbiased) sampling from the target probability distribution but only accepting the sample if it is of the type conditioned upon.
  • Cluster determiner 104 is configured to determine clusters of samples.
  • the clusters may be automatically determined by the sampling design, or a user may determine the clusters and input them into cluster determiner 104.
  • a cluster has a plurality of different types of observations, where each observation can be represented by a sequence of numbers.
  • An observation may include a binary outcome of a characteristic, such as whether or not a subject had a heart attack. This binary outcome can be used to label an observation as a case (e.g., binary outcome is heart attack) or a control (e.g., binary outcome is no heart attack).
  • a cluster may include a case and one or more controls.
  • a case may be a subject that has a particular event as indicated by the binary outcome included in the observation (e.g., the case may be a person who has had a heart attack).
  • a control may be a subject that did not have the condition indicated by the binary outcome, but may be similar, such as the control may include characteristics that are similar to the case. For example, the control may be of the same age as the subject for the case (that had a heart attack) but may not have had a heart attack.
  • the clusters may be determined by the nature of how the case control study was performed. For example, a natural cluster may be determined by including a case that was sampled in addition to a control that was matched to the case. A matching variable may have been used to sample cases and controls in the case control design. The matching variable may be a characteristic of the samples, such as age. Multiple controls that have been matched to the same case may be included in the cluster.
  • a cluster may be determined by including a case and one or more controls, possibly based on characteristics of the population that was sampled. For example, the cluster may include a case and controls that have similar characteristics. Although one case is described as being included in the cluster, it will be understood that a cluster may include multiple cases. In one embodiment, the cluster may be the whole biased sample.
  • weight determiner 106 determines the weights.
  • weights are determined for each cluster individually without taking into account samples found in other clusters.
  • a weight may be determined for every sample in the cluster.
  • the weights may be determined based on data that is known for the target population outside of the data from the sampled population. For example, characteristics from data that is known for the target population may be used to determine the weights.
  • in a regular case control sampling design that does not use matching, the proportion of cases in the target population can be used to weight the samples in the cluster. For example, if 1 out of 10,000 in the unbiased target population is a case, then a weight of 1/10,000 may be assigned to the case in the cluster. If the cluster contains one control, then the weight of 9,999/10,000 is assigned to the control, and if the cluster contains k controls, then the weight (1 - 1/10,000)/k may be assigned to each of the k controls in the cluster.
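  • The arithmetic in the example above can be written as a small helper. This is a minimal sketch, not the patent's implementation: the function name `unmatched_cluster_weights` and the choice of exact fractions are assumptions made for illustration, and the cluster is assumed to hold exactly one case.

```python
from fractions import Fraction

def unmatched_cluster_weights(q0, n_controls):
    """Return (case_weight, control_weight) for one case and n_controls controls.

    q0 is the known proportion of cases in the target population.
    """
    if not 0 < q0 < 1:
        raise ValueError("q0 must lie strictly between 0 and 1")
    if n_controls < 1:
        raise ValueError("a cluster needs at least one control")
    case_weight = q0
    control_weight = (1 - q0) / n_controls
    return case_weight, control_weight

# The example from the text: 1 case per 10,000 subjects, here with k = 4 controls.
q0 = Fraction(1, 10000)
case_w, control_w = unmatched_cluster_weights(q0, 4)
total = case_w + 4 * control_w  # the weights within one cluster add up to 1
```

  • With these weights the case contributes its population share and the controls split the remaining share equally, which is why the weights in a cluster sum to 1.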
  • the weights may be determined by taking the weighted average of the samples in the cluster and selecting weights such that the expectation of the weighted average of a scoring function of the samples in the cluster equals the expectation of a scoring function of a random/unbiased sample from the target population. For example, given a function that maps an observation (e.g., sequence of numbers) of a subject into a score, the sum of the weighted average of the scores of the samples in the cluster has the same expectation as the expectation of the function applied to an observation on a randomly sampled subject. Particular embodiments may require this equality for many scoring functions thereby uniquely identifying the necessary weights.
  • An expectation of a score under unbiased or random sampling of a target population would be the average of all the scores across all subjects in the population where all subjects are weighted equally.
  • An expectation of a score under biased sampling of a target population would be a weighted average of all the scores across all subjects in the population, where the subjects are now weighted non-equally due to certain subjects having larger probability of being sampled than others.
  • an expectation of a random variable requires knowing the probability distribution on all the possible outcomes of that random variable, and it is defined as the weighted average of the possible outcomes where each possible outcome is weighted by its probability. This process of determining weights for the observations in the biased sample will be described in more detail below.
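  • The equality of expectations described above can be checked exactly on a toy example. The sketch below assumes an unmatched design with one case and k controls per cluster, reads the "weighted average" as the weighted sum of scores over the cluster, and uses a made-up joint distribution; all names and numbers are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical target distribution P(x, y) over a binary exposure x and outcome y.
joint = {
    (0, 1): F(1, 100), (1, 1): F(3, 100),
    (0, 0): F(66, 100), (1, 0): F(30, 100),
}
q0 = sum(p for (x, y), p in joint.items() if y == 1)  # proportion of cases

def conditional(y):
    """Distribution of x among subjects with outcome y (the biased sampling law)."""
    py = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: p / py for (x, yy), p in joint.items() if yy == y}

def target_expectation(score):
    """Expectation of the score under unbiased sampling from the target population."""
    return sum(p * score(x, y) for (x, y), p in joint.items())

def weighted_cluster_expectation(score, k):
    """Expectation of the weighted sum of scores over one case and k controls."""
    case_w, control_w = q0, (1 - q0) / k
    e_case = sum(p * score(x, 1) for x, p in conditional(1).items())
    e_control = sum(p * score(x, 0) for x, p in conditional(0).items())
    return case_w * e_case + k * control_w * e_control

scores = [lambda x, y: x, lambda x, y: y, lambda x, y: x * y + 2]
checks = [target_expectation(s) == weighted_cluster_expectation(s, k=3) for s in scores]
```

  • Because the equality holds for every scoring function tried, not just one, the example is consistent with the statement that requiring it for many scoring functions pins down the weights.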
  • Estimator 108 is configured to output the target feature based on the weighted biased sample, i.e., based on a biased sample of observations with corresponding weights.
  • the estimator is configured to be a valid (i.e., approximately unbiased) estimator of the target feature if provided an unbiased sample of the target probability distribution, and if all weights are equal to 1.
  • the estimator receives the biased sample as input. Samples in the biased sample are assigned weights according to the weights determined. The samples may include data that has been measured. This data and its corresponding weight are inputted into the estimator.
  • the estimator is configured to take this data on each sample and its corresponding weight, for all samples in the biased sample, which includes many samples, and map them to the target feature.
  • the estimator then outputs the target feature.
  • the target feature to be estimated may be a causal effect of a characteristic such as a treatment or exposure, a prediction function, a probability distribution, or any other function of the target probability distribution.
  • the weights allow the estimator to output the target feature based on the biased sample.
  • the weights adjust for the biased sample such that the biased sample has the effect of being an unbiased sample.
  • the weights correct for the bias that occurred in the biased sample or non-randomized sample.
  • Fig. 2 depicts a simplified flowchart 200 of a method for determining a target feature according to one embodiment.
  • the method may be performed by computing system 100.
  • Step 202 determines an estimator for an unbiased sample.
  • the estimator may be based on a random sampling of a population.
  • Step 204 determines a biased sample.
  • the biased sample may include any number of samples.
  • the information for the sampling may be received at computing system 100.
  • Step 206 determines a cluster of observations.
  • the cluster of observations includes samples of different types.
  • Fig. 3 depicts a simplified flowchart 300 of a method for determining clusters for the biased sample.
  • Step 302 determines a population of observations where the observations include different types.
  • Step 304 determines characteristics of the population. Clusters of observations within the biased sample may already be defined by the design or may be constructed to include observations of multiple types (e.g., case and controls).
  • Step 306 clusters a portion of the observations in the biased sample based on the characteristics of the samples and the characteristics of the population.
  • a cluster may include multiple different types of observations. For example, in a matched case control design, a case and the number of controls that were matched to the case may be determined as a cluster. In other embodiments, the case and number of controls may be determined randomly, may be matched based on characteristics of the cases and controls, or a cluster might represent the total biased sample.
  • Step 208 assigns weights to the observations in the cluster.
  • the weights may be assigned based on characteristics of the unbiased/target population and the actual sample to which the weight is assigned.
  • Step 210 determines if additional clusters need to be processed. If so, the process returns to step 204 where a different cluster of observations is determined. Weights for this cluster are then determined without regard to the weights determined previously.
  • Once all the clusters have been processed, data may be mapped to the target feature. Step 212 receives the biased sample and the corresponding weights for the samples along with the estimator for the unbiased sample. Step 212 then outputs the target feature. For example, the estimator uses the weighted biased sample as input to output the target feature. In one embodiment, the target feature may be displayed on a display, sent to another device or user, or stored.
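  • The flow of FIG. 2 can be sketched end to end as follows. This is an illustration under stated assumptions, not the claimed method: the target feature is taken to be a population mean, the estimator is a plain weighted mean, the design is unmatched with known case proportion q0, and the names `weighted_mean`, `assign_weights`, and `estimate_target_feature` are hypothetical.

```python
def weighted_mean(values, weights):
    """Estimator for an unbiased sample; with all weights equal to 1 it is the sample mean."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def assign_weights(cluster, q0):
    """Weight one cluster on its own: q0 shared by its cases, 1 - q0 shared by its controls."""
    n_cases = sum(1 for obs in cluster if obs["case"])
    n_controls = len(cluster) - n_cases
    return [q0 / n_cases if obs["case"] else (1 - q0) / n_controls for obs in cluster]

def estimate_target_feature(clusters, q0, value_of):
    values, weights = [], []
    for cluster in clusters:                     # loop over clusters (steps 206 to 210)
        cluster_weights = assign_weights(cluster, q0)   # step 208, one cluster at a time
        values.extend(value_of(obs) for obs in cluster)
        weights.extend(cluster_weights)
    return weighted_mean(values, weights)        # step 212: estimator on weighted biased sample

# Made-up biased sample: two clusters, each with one case and two controls.
clusters = [
    [{"case": True, "exposed": 1}, {"case": False, "exposed": 0}, {"case": False, "exposed": 1}],
    [{"case": True, "exposed": 1}, {"case": False, "exposed": 0}, {"case": False, "exposed": 0}],
]
estimate = estimate_target_feature(clusters, q0=0.01, value_of=lambda obs: obs["exposed"])
```

  • In this sketch the unweighted sample would suggest that half of the subjects are exposed, while the weighted estimate is close to the exposure share among controls, as expected when cases are rare in the target population.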
  • Fig. 4 depicts a simplified flowchart 400 of a method for determining the weights according to one embodiment. The method may be performed by computing system 100. In step 402, a cluster is determined.
  • the cluster may be of samples of J types.
  • the cluster has a total of n(1) + n(2) + . . . + n(K) observations. If the observations of a first type were drawn from a marginal (biased sampling) probability distribution p_1(.), the probability that an observation of type 1 equals a value x (e.g., a vector of values represented by a row in a file) is given by p_1(x), for all possible configurations of x.
  • the probability distribution p_1(.) might be the probability distribution of the data x on a randomly sampled subject from the subpopulation that may have cancer (the first type).
  • p_1 is a conditional probability distribution of the target probability distribution, conditioning on the sample being a case, i.e., a sample that has cancer.
  • the probability distribution p_J is the probability distribution of the data on a randomly sampled subject from the subpopulation of the target population defined by having cancer type J. This definition of p_J is provided in case control designs.
  • An example of a marginal sampling distribution of controls in a case-control sampling design is the probability distribution p_2 (where p_1 corresponds to a sampling of subjects of type 1 and p_2 to a sampling of subjects of type 2), where p_2 is the probability distribution of the data on a type 2 (i.e., control) sample obtained as follows: particular embodiments first sample an observation from the type 1 probability distribution p_1 and obtain a value for a matching variable (e.g., age), then particular embodiments sample the control by sampling an observation from the subpopulation defined by having type 2 and having a matching variable equal to the value of the matching variable of the first sampled case observation.
  • the biased sampling probability distributions p_J may correspond with a conditional probability distribution given that certain variables may need to have set values, which can be determined from the target probability distribution.
  • the target probability distribution may be the probability distribution of a random sample from a target population.
  • the certain variables that may be set include age, disease, death, cancer type, or other characteristics of the samples.
  • Step 404 determines one or more scoring functions.
  • the scoring function may be a function that takes an observation x and maps it into a number.
  • weights are set such that the desired equality in expectation is achieved for all scoring functions that are used.
  • Step 406 determines a weight for each sample in the cluster by requiring for each scoring function, the weighted average of the scores across all members of the cluster has the same expectation as the expectation of the scoring function applied to the data x drawn from the target probability distribution (of the unbiased sample), where the drawing from a target probability distribution may correspond with randomly sampling a subject from a target population and measuring data x on this subject.
  • the weighted average of the scores across all members of the cluster may be viewed as a random variable.
  • the weighted average of the scoring function across a cluster of members is now representative of the score of a randomly drawn (i.e., unbiased sampled) observation.
  • the weights may be applied to the members of the cluster to in effect provide a randomly-drawn sample from the target population even though the samples in the cluster were obtained with biased sampling from the target population.
  • the weight for a type J observation x in the cluster is set equal to the ratio of 1) the probability that a random (unbiased) X equals x when randomly sampled and 2) n(J) times the biased sampling probability distribution p_J(x). n(J) is the number of observations of type J in the cluster.
  • Step 408 then outputs the weights.
  • the weights may be input into the estimator for the unbiased sample as described above.
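  • As a worked instance of the ratio described in step 406, consider the unmatched design. The notation here is an assumption introduced for illustration: p_0 denotes the target probability distribution, Y is the case indicator contained in x, q_0 = P_0(Y = 1) is the proportion of cases, type 1 observations (cases) are drawn from p_1 = P_0(. | Y = 1), and type 2 observations (controls) from p_2 = P_0(. | Y = 0). Under these assumptions the conditional densities cancel:

```latex
\begin{align*}
w_J(x) &= \frac{p_0(x)}{n(J)\, p_J(x)}, \\
w_1(x) &= \frac{q_0 \, P_0(x \mid Y = 1)}{n(1)\, P_0(x \mid Y = 1)} = \frac{q_0}{n(1)}, \\
w_2(x) &= \frac{(1 - q_0) \, P_0(x \mid Y = 0)}{n(2)\, P_0(x \mid Y = 0)} = \frac{1 - q_0}{n(2)}.
\end{align*}
```

  • With one case and k controls in the cluster, this gives q_0 for the case and (1 - q_0)/k for each control, which agrees with the 1/10,000 example given earlier.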
  • the above weights may be determined using available knowledge about the target population, such as the proportion of cases q in the target population. The knowledge required to determine the weights may be obtained from available data or from sources other than the data sampled. For example, the target probability distribution may not be known, but the characteristics of the target probability distribution required to determine the weights may be known from sources other than the data sampled. Thus, values from sources other than the data sampled may be used to determine the weights. For example, certain features may be known, such as the proportion of people of type 1 or the proportion of people of type 2. The weights can be worked out by assuming the values for certain features in the target probability distribution. For example, the weights may be determined based on the proportion of people that have type 1 or type 2 (i.e., the incidence probability).
  • the probability that you are type 1 given your age may be used as a feature of the target probability distribution.
  • the weights may depend on a small number of features of the probability distribution. Thus, a portion of the target probability distribution is used to determine the weights.
  • a matching variable is used, such as age.
  • a case may be sampled, and then a control of the same age as the case is sampled. This is repeated until there are enough samples or all cases of the target population of interest have been sampled.
  • the weights determined may be a function of the proportion of cases in each age category. This information cannot be estimated from such a sampling design, but may be available from external databases (e.g., the proportion of liver cancer patients in the USA is a statistic that is known, similarly the proportion of breast cancer patients in the USA is known, etc). This information along with information about the biased sample is used to determine the weights.
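  • For the matched design, the text says only that the weights depend on the proportion of cases within each category of the matching variable. One way to obtain an explicit expression is to apply the ratio of unbiased to biased sampling probabilities to a control that is drawn from the controls sharing the case's matching value m; under that assumption the control weight works out to q0 * (1 - q0(m)) / q0(m), divided by the number of matched controls, where q0(m) is the proportion of cases in stratum m. The sketch below encodes that derived expression; the function name and the incidence figures are hypothetical, and the formula should be read as an illustration rather than as the patent's stated weights.

```python
from fractions import Fraction

def matched_cluster_weights(q0, q0_given_m, n_controls):
    """Weights for one case and its n_controls matched controls.

    q0          marginal proportion of cases in the target population
    q0_given_m  proportion of cases within the case's matching stratum
    """
    if not (0 < q0 < 1 and 0 < q0_given_m < 1):
        raise ValueError("proportions must lie strictly between 0 and 1")
    if n_controls < 1:
        raise ValueError("a cluster needs at least one control")
    case_weight = q0
    control_weight = q0 * (1 - q0_given_m) / q0_given_m / n_controls
    return case_weight, control_weight

# Hypothetical external statistics, e.g. taken from a registry (not from the source text).
q0 = Fraction(1, 50)
incidence_by_age = {
    "40-49": Fraction(1, 100),
    "50-59": Fraction(1, 50),
    "60-69": Fraction(1, 20),
}

case_w, control_w = matched_cluster_weights(q0, incidence_by_age["60-69"], n_controls=2)
# When the stratum incidence equals q0, the control weight reduces to (1 - q0) / n_controls.
_, same_as_unmatched = matched_cluster_weights(q0, incidence_by_age["50-59"], n_controls=2)
```

  • Controls matched to a case from a high-incidence stratum receive a smaller weight, since such strata are over-represented among controls when the matching values are driven by the cases.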
  • the formula above for the weight assigned to a subject in the cluster of type I and having observed values x may be a function of x, including the type.
  • the formula is defined as the ratio of 1) the marginal probability of sampling a data vector with values x on a subject of type I when randomly sampling from the target population, and 2) n(I) times the marginal probability of sampling a data vector with the same values x of type I using the biased sampling actually employed. This ratio of random and biased sampling probabilities is the weight assigned to a subject with observed values x of type I in the cluster.
  • the formula can be applied for each value of x and each type of observation.
  • Particular embodiments will now be described in more detail. Particular embodiments provide new locally efficient methods for causal inference and variable importance analysis (or any other analysis) in semiparametric models (i.e., models relying on relatively few assumptions about the data generating distribution) for matched and unmatched case control studies, relying on knowing the incidence probability, conditional on the matching variable if matching is used. If this incidence probability is unknown, then these methods can still be used as a sensitivity analysis.
  • a method that deterministically maps a logistic regression fit (possibly weighted for matched case-control designs) into a valid model based fit of the actual conditional probability on being a case, given the covariates, where this deterministic mapping involves adding a known intercept to the fit.
  • the resulting estimate of the conditional probability of being a case now has the important property that its standard error is proportional to the incidence probability (divided by the square root of the sample size), so that the obtained precision for the conditional probability is good enough for accurately estimating marginal causal relative risks or causal odds ratios even when the incidence probability is very low.
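  • The text does not state the value of the known intercept. As a minimal sketch for the unmatched design, the example below assumes the standard offset for case-control logistic regression, log(q0 / (1 - q0)) - log(n_cases / n_controls), added on the logit scale to a fit that ignored the sampling. The function names are hypothetical, and the fitted probability is computed in closed form for a toy binary exposure instead of by running a regression.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(z):
    return 1 / (1 + math.exp(-z))

def intercept_correction(q0, n_cases, n_controls):
    """Offset that moves a case-control logistic fit to the population scale."""
    return logit(q0) - math.log(n_cases / n_controls)

def corrected_probability(fitted_probability, q0, n_cases, n_controls):
    """Map the biased fitted P(case | x) to a fit of the actual P(case | x)."""
    return expit(logit(fitted_probability) + intercept_correction(q0, n_cases, n_controls))

# Toy setting (made-up numbers): exposure is present in 60% of cases and 20% of controls.
q0, n_cases, n_controls = 0.01, 100, 200
p_x_given_case, p_x_given_control = 0.6, 0.2

# Probability of being a case given exposure within the case-control sample.
biased_fit = (n_cases * p_x_given_case) / (
    n_cases * p_x_given_case + n_controls * p_x_given_control
)
# Probability of being a case given exposure in the target population (Bayes' rule).
true_probability = (q0 * p_x_given_case) / (
    q0 * p_x_given_case + (1 - q0) * p_x_given_control
)
corrected = corrected_probability(biased_fit, q0, n_cases, n_controls)
```

  • In this toy setting the uncorrected fit reports a probability of 0.6 for exposed subjects, while the corrected value matches the population probability of roughly 0.029.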
  • a weighting scheme of cases and controls is used that maps any estimation method for a parameter developed for prospective sampling (i.e., unbiased sample) from the population of interest into an estimation method based on case control sampling (i.e. biased sample) from this population.
  • the weighting only relies on knowing the true population proportion of cases or, equivalently, the true probability of being a case, and for matched case-control sampling it also relies on knowing this proportion of cases within each population stratum of the matching variable.
  • the case-control weighting when applied to an efficient estimator for a prospective sample from the population of interest maps into an efficient estimator for matched and unmatched case-control sampling.
  • Particular embodiments involve the application of this methodology to obtain double robust locally efficient targeted maximum likelihood estimators of causal parameters such as the causal relative risk and causal odds ratio for regular case-control sampling and matched case-control sampling.
  • the double robust locally efficient targeted maximum likelihood estimator is also provided in marginal structural models and semiparametric logistic regression models.
  • a first component presents a method that deterministically maps the commonly employed logistic regression fit that ignores the case-control sampling (and thereby results in a biased fit for the actual conditional probability it models) into a valid model-based fit of the actual conditional probability of being a case, given the covariates.
  • When this methodology is applied to matched case-control designs, the initial logistic regression fit is based on weighted control observations. For both case-control designs, this mapping adds an intercept determined by the known incidence probability to the standard or control-weighted logistic regression fit.
  • This component may play a role as an ingredient in order to construct targeted maximum likelihood estimators or other locally efficient estimators according to the general new methodology representing the main part of particular embodiments.
  • Particular embodiments provide estimation based on case-control sampling for unmatched and matched case-control designs. This involves weighting the cases and controls with q_0 and (1-q_0)/J, respectively, for some J (for matched designs the latter is replaced by a quantity depending on the incidence probability, conditional on the matching variable), and then applying a method of choice developed for prospective sampling to estimate the parameter of interest (e.g., targeted maximum likelihood estimators or estimating equations for the causal effect or variable importance parameter of interest), as if the data were drawn from the target population distribution of interest.
  • estimation procedures developed for prospective sampling are mapped into highly or fully efficient estimation procedures for case-control sampling.
  • the method is now able to fully exploit software developed for prospective sampling. That is, an estimator for an unbiased sample can be used to estimate the target feature using a weighted biased sample.
  • Particular embodiments can be generalized to any kind of variation of biased sampling as case-control sampling by calculating the appropriate weights.
  • Particular embodiments show the appropriate weights for paired-matching, stratified case-control sampling, and counter-matching case-control sampling, among others.
  • particular embodiments allow estimation of the effect of the matching variable on the outcome.
  • Particular embodiments may rely on knowing the incidence probability, allow double robust locally efficient estimation in semiparametric models, thereby allowing the use of methods which minimize the reliance of the inference on unknown model assumptions, thereby also requiring less need for matching (which can easily cause over-matching), since confounding can be adjusted for in more flexible double robust manner.
  • Particular embodiments allow targeting any causal effect or any other parameter of interest.
  • Any suitable programming language can be used to implement the routines of particular embodiments, including C, C++, Java, assembly language, etc.
  • Different programming techniques can be employed such as procedural or object oriented.
  • the routines can execute on a single processing device or multiple processors.
  • steps, operations, or computations may be presented in a specific order, this order may be changed in different particular embodiments. In some particular embodiments, multiple steps shown as sequential in this specification can be performed at the same time.
  • Particular embodiments may be implemented in a computer-readable storage medium for use by or in connection with the instruction execution system, apparatus, system, or device.
  • Particular embodiments can be implemented in the form of control logic in software or hardware or a combination of both.
  • the control logic when executed by one or more processors, may be operable to perform that which is described in particular embodiments.
  • Particular embodiments may be implemented by using a programmed general purpose digital computer, application specific integrated circuits, programmable logic devices, or field programmable gate arrays; optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms may also be used.
  • the functions of particular embodiments can be achieved by any means as is known in the art.
  • Distributed, networked systems, components, and/or circuits can be used.
  • Communication, or transfer, of data may be wired, wireless, or by any other means.

Abstract

In one embodiment, an estimator is determined for an unbiased sample of a probability distribution. The estimator is configured to accept a weight for each data point. The unbiased sample includes a first type of observation of the probability distribution and a second type of observation of the probability distribution. Clusters of the samples are determined from the biased sample, where a cluster includes one or more samples of the first type of observation and one or more samples of the second type of observation. Weights are assigned to each sample within a cluster based on characteristics of the sample and the target population. After the weights are determined, the estimator for an unbiased sample may be used to map data points of the biased sample to the target feature by inputting the biased sample with the corresponding weights for each sample into the estimator.

Description

PATENT APPLICATION
ESTIMATION BASED ON CASE-CONTROL DESIGNS
Cross References to Related Applications
This application claims priority from U.S. Provisional Patent Application Serial No. 6/050,063, entitled ESTIMATION BASED ON CASE-CONTROL DESIGNS WITH KNOWN INCIDENCE PROBABILITY, filed on May 2, 2009, which is hereby incorporated by reference as if set forth in full in this application for all purposes.
Background
[01] Particular embodiments generally relate to estimation.
[02] A randomized experiment may be considered the optimal way of defining cause and effect relationships. However, for some interventions, a randomized experiment may not be possible or may take a long amount of time to observe the events being studied. Case control studies use subjects who may have the condition and look back to see if there are characteristics of these patients that differ from those who do not have the condition.
[03] Case-control sampling is used to generate data to estimate effects of exposures or treatments on a binary outcome of interest when the proportion of cases (i.e., binary outcome equal to 1) in the population of interest is low. Case-control sampling represents a biased sample of a target population of interest by sampling a disproportionate number of cases. Case-control studies are also commonly employed to estimate the effects of genetic markers or biomarkers on phenotypes.
[04] The typical approach used in practice is to fit (conditional) logistic regression models, ignoring the case-control sampling, in order to estimate the conditional odds ratios of being a case, given baseline covariates and exposure. Although these methods do not rely on knowing the true incidence probability (i.e., probability of being a case) and provide logistic regression model based estimates of the conditional odds ratios, relying on correct specification of such logistic regression models, they do not provide an estimate of a marginal causal odds ratio or causal relative risk, which are causal parameters representing the typical parameters of interest in randomized trials comparing different treatment or exposure levels. By the same argument, these methods do not provide measures of marginal variable importance. In fact, without knowing the incidence probability, these causal parameters are simply non-identifiable.
[05] Another less commonly used method for unmatched case-control studies is inverse probability of treatment weighted estimation of the causal odds ratio based on a marginal structural logistic regression model, relying on a fit of the treatment mechanism based on control observations only and relying on the incidence probability being small enough relative to sample size. Since this estimator targets a non-identifiable parameter, this estimator will be very sensitive towards violation of these assumptions.
Summary
[01] Particular embodiments generally relate to estimation. In one embodiment, an estimator is determined for an unbiased sample of a probability distribution. The estimator maps a set of data points to a target feature. The unbiased sample includes a first type of observation of the probability distribution and a second type of observation of the probability distribution. The case control sampling design may be based on the first type and the second type of observations.
[06] A biased sample is determined that may include samples of the first type of observation and second type of observation. Clusters of the samples are determined from the biased sample, where a cluster includes one or more samples of the first type of observation and one or more samples of the second type of observation. Weights are assigned to each sample within a cluster based on characteristics of the target population. For example, weights may be determined by assigning a weight to every sample in the cluster such that the expectation of the weighted average of a scoring function of the samples in the cluster equals the expectation of this scoring function under unbiased sampling from the target population. After the weights are determined, the estimator for the unbiased sample may be used to map data points of the biased sample to the target feature by inputting the biased sample with the corresponding weights for each sample into the estimator. The estimator then maps the weighted data points to the target feature. For example, the target feature may be a causal effect of a characteristic, a prediction function, a probability distribution, or any other target feature.
[07] A further understanding of the nature and the advantages of particular embodiments disclosed herein may be realized by reference to the remaining portions of the specification and the attached drawings.
Brief Description of the Drawings
[08] Fig. 1 depicts a simplified computing system according to one embodiment.
[09] Fig. 2 depicts a simplified flowchart of a method for determining a target feature according to one embodiment.
[10] Fig. 3 depicts a simplified flowchart of a method for determining clusters for the biased sample.
[11] Fig. 4 depicts a simplified flowchart of a method for determining the weights according to one embodiment.
Detailed Description of Embodiments
[12] In one embodiment, a method to estimate a target feature based on a biased sample is provided. A probability distribution may generate a first type of observation and a second type of observation of the probability distribution. An observation is either the first type of observation or second type of observation, but not both. In one example, a case control study may be used where for a case, a number of controls are used. A case may be a subject that has the type of observation while a control does not have the type of observation. For example, a case may have cancer but the control does not. The estimator for an unbiased sample of a probability distribution is determined, where the unbiased sample includes the first type and second type of observations. The estimator is configured to map a set of data points with corresponding set of weights to a target feature of the probability distribution of the unbiased sample. The estimator is configured so that setting the weights all equal to 1 and inputting a non-biased sample of the target probability distribution into the estimator yields a valid unbiased estimator of the target feature of the probability distribution.
[13] The estimator for the unbiased sample is then used to map weighted data points of the biased sample to the target feature. The weights are determined by determining clusters of samples and assigning weights to samples in the cluster. A cluster includes one or more samples of the first type of observation and one or more samples of the second type of observation. For example, the cluster may include a case and one or more controls.
[14] Weights are assigned to each sample within each cluster of samples based on characteristics of the samples included in each cluster. The controls and cases can be generalized to any splitting up of population in K disjoint sub-populations spanning the whole population. The biased sample is split up (typically using the methodology of the case control design) in clusters of observations, where each cluster contains one or more observations of the first type and one or more of second type observations. The biased sample may be split up by design because the case-control study was performed in a way that facilitates clustering. For example, in a case control design, a case is matched with one or more controls based on the characteristics of the case and controls (e.g., controls of the same age (the matching variable) as the case are sampled). Thus, for a case that is sampled, controls of the same age are sampled. This sampling method of clustering based on the matching variable used to sample forms the basis for the clusters.
[15] Each first type and second type observation in the cluster gets its own weight (e.g., in matched case control designs, the weight for the second type of observation associated with a first type of observation may depend on the matching variable of the first type of observation). This combined assignment of weights within each cluster of observations in the biased sample is done so that it corrects for the biased sampling of this cluster of coupled first type and second type observations. The estimator developed for the unbiased sample, including two types of observations, can now be used with the assigned weights for the clusters.
[16] Fig. 1 depicts a simplified computing system 100 according to one embodiment. Computing system 100 may include one or more computing devices. It will be understood that functions described may be distributed among multiple computing devices or may be performed by the same computing device.
[17] An unbiased estimator determiner 102 is configured to determine an estimator for an unbiased sample of a target probability distribution. The probability distribution identifies the probability of each value of a random variable, where a random variable is defined by a set of possible outcomes of an experiment, and a probability distribution on these possible outcomes. The experiment defining the random variable could be sampling a subject from a target population and measuring a number of characteristics on the subject. The probability distribution describes the range of possible values that a random variable can attain and the probability that the value of the random variable is within any (measurable) subset of that range. The estimator may be configured to receive data and output a target feature of the probability distribution. The data received may be information for a biased sample and the estimator used may be a targeted maximum likelihood estimator or any other estimator. This estimator will be used to map data points of a biased sample with a corresponding set of weights to the target feature.
[18] A cluster determiner 104 is configured to receive a biased sample. A biased sample may be a statistical sample of a target population in which some members of the population are less likely to be included than others. A population may be a set of entities from which the observations may be drawn. The population may also refer to a set of potential measurements or values, including not only cases actually observed but those that are potentially observable. The biased sample is a subset of the population in which measurements have been made.
[19] The biased sample represents an (unbiased) sample from a probability distribution that may be determined by the target probability distribution. The probability distribution of the biased sample may be a conditional probability distribution of the target probability distribution. For a biased sample from the target population, an estimator is configured for the target feature that assumes a random sample from this target population, and also allows the inputting of weights assigned to each observation inputted. However, the biased sample is obtained by sampling from one or more probability distributions that are determined by this target probability distribution but are not equal to it, such as different conditional probability distributions, conditioning on samples having a certain type. Sampling from a conditional probability distribution involves (unbiased) sampling from the target probability distribution but only accepting the sample if it is of the type conditioned upon.
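The accept-only-if-of-type description above can be sketched as a simple rejection sampler. This is a minimal illustration only: the toy population, the 2% case rate, and the names `sample_conditional` and `draw_subject` are assumptions made for the example and do not appear in the specification.

```python
import random


def sample_conditional(draw_from_target, is_wanted_type, rng):
    """Draw from the target distribution until a sample of the wanted type appears."""
    while True:
        x = draw_from_target(rng)
        if is_wanted_type(x):
            return x


def draw_subject(rng):
    """Hypothetical unbiased draw from a target population with 2% cases."""
    return {"age": rng.randint(30, 80), "is_case": rng.random() < 0.02}


rng = random.Random(0)
# Conditioning on the type yields the two biased sampling distributions.
cases = [sample_conditional(draw_subject, lambda s: s["is_case"], rng)
         for _ in range(100)]
controls = [sample_conditional(draw_subject, lambda s: not s["is_case"], rng)
            for _ in range(100)]
```

The resulting lists contain equal numbers of cases and controls even though cases are rare in the toy population, which is the sense in which the sample is biased.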
[20] Cluster determiner 104 is configured to determine clusters of samples. The clusters may be automatically determined by the sampling design, or a user may determine the clusters and input them into cluster determiner 104. A cluster has a plurality of different types of observations, where each observation can be represented by a sequence of numbers. An observation may include a binary outcome of a characteristic, such as whether or not a subject had a heart attack. This binary outcome can be used to label an observation as a case (e.g., binary outcome is heart attack) or a control (e.g., binary outcome is no heart attack). In a case control design, a cluster may include a case and one or more controls. A case may be a subject that has a particular event as indicated by the binary outcome included in the observation (e.g., the case may be a person who has had a heart attack). A control may be a subject that did not have the condition indicated by the binary outcome but may otherwise be similar; for example, the control may have characteristics that are similar to those of the case. For example, the control may be of the same age as the subject for the case (that had a heart attack) but may not have had a heart attack.
[21] The clusters may be determined by the nature of how the case control study was performed. For example, a natural cluster may be determined by including a case that was sampled in addition to a control that was matched to the case. A matching variable may have been used to sample cases and controls in the case control design. The matching variable may be a characteristic of the samples, such as age. Multiple controls may have been matched to the same case and may be included in the cluster.
[22] If matching was not used, a cluster may be determined based on including a case and one or more controls, possibly based on characteristics of the population that was sampled. For example, the cluster may include a case and controls that may include similar characteristics. Although one case is being described as being included in the cluster, it will be understood that a cluster may include multiple cases. In one embodiment, the cluster may be the whole biased sample.
[23] After the clusters are determined, weight determiner 106 determines the weights. In one embodiment, weights are determined for each cluster individually without taking into account samples found in other clusters. A weight may be determined for every sample in the cluster. In one example, the weights may be determined based on data that is known for the target population outside of the data from the sampled population. For example, characteristics from data that is known for the target population may be used to determine the weights. In one example of a regular case control sampling design that does not use matching, the proportion of cases in the target population can be used to weight the samples in the cluster. For example, if 1 out of 10,000 in the unbiased target population is a case, then a weight of 1/10,000 may be assigned to the case in the cluster. If the cluster contains one control, then the weight of 9,999/10,000 is assigned to the control, and if the cluster contains k controls, then the weight (1 - 1/10,000)/k may be assigned to each of the k controls in the cluster.
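The 1-in-10,000 example can be reproduced with exact fractions. The following is a minimal sketch, assuming a cluster of one case and k controls; the function name `unmatched_cluster_weights` is a hypothetical label introduced for illustration.

```python
from fractions import Fraction


def unmatched_cluster_weights(q, k):
    """Return (case_weight, control_weight) for a cluster of one case and k controls.

    q is the proportion of cases in the target population.
    """
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    if k < 1:
        raise ValueError("a cluster needs at least one control")
    return q, (1 - q) / k


q = Fraction(1, 10_000)
case_w, control_w = unmatched_cluster_weights(q, k=1)    # one control per case
case_w3, control_w3 = unmatched_cluster_weights(q, k=3)  # three controls per case
```

With these weights the total weight of each cluster is 1, regardless of how many controls the cluster contains.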
[24] In more detail, the weights may be determined by taking the weighted average of the samples in the cluster and selecting weights such that the expectation of the weighted average of a scoring function of the samples in the cluster equals the expectation of a scoring function of a random/unbiased sample from the target population. For example, given a function that maps an observation (e.g., sequence of numbers) of a subject into a score, the sum of the weighted average of the scores of the samples in the cluster has the same expectation as the expectation of the function applied to an observation on a randomly sampled subject. Particular embodiments may require this equality for many scoring functions thereby uniquely identifying the necessary weights.
[25] An expectation of a score under unbiased or random sampling of a target population would be the average of all the scores across all subjects in the population where all subjects are weighted equally. An expectation of a score under biased sampling of a target population would be a weighted average of all the scores across all subjects in the population, where the subjects are now weighted non-equally due to certain subjects having larger probability of being sampled than others. In general, an expectation of a random variable requires knowing the probability distribution on all the possible outcomes of that random variable, and it is defined as the weighted average of the possible outcomes where each possible outcome is weighted by its probability. This process of determining weights for the observations in the biased sample will be described in more detail below.
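The equality of expectations can be checked exactly on a small finite population. The sketch below is illustrative only: the ten-subject population, the scoring function, and the choice k = 3 are assumptions made for the example. It compares the expectation of a score under unbiased sampling with the expectation of the weighted cluster average (one case and k controls, weighted by q and (1 - q)/k), and with an unweighted cluster average.

```python
from fractions import Fraction

# Hypothetical target population: (exposure, is_case) for ten subjects.
population = [(1, True), (1, False), (0, False), (0, False), (1, False),
              (0, False), (0, False), (1, True), (0, False), (0, False)]


def mean(values):
    """Equal-weight average, i.e. the expectation under unbiased sampling."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def score(obs):
    """Hypothetical scoring function mapping an observation to a number."""
    exposure, is_case = obs
    return Fraction(exposure + 2 * int(is_case))


cases = [o for o in population if o[1]]
controls = [o for o in population if not o[1]]
q = Fraction(len(cases), len(population))  # proportion of cases in the population
k = 3                                      # controls per cluster

unbiased_expectation = mean(score(o) for o in population)

# By linearity of expectation, the weighted cluster average has expectation
# q * E[score | case] + k * ((1 - q) / k) * E[score | control].
exp_case = mean(score(o) for o in cases)
exp_control = mean(score(o) for o in controls)
weighted_cluster_expectation = q * exp_case + k * ((1 - q) / k) * exp_control

# Ignoring the weights (all observations in the cluster weighted equally).
naive_cluster_expectation = (exp_case + k * exp_control) / (1 + k)
```

In this toy population the weighted expectation agrees with the unbiased one, while the unweighted cluster average does not.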
[26] Estimator 108 is configured to output the target feature based on the weighted biased sample, i.e., based on a biased sample of observations with corresponding weights. The estimator is configured to be a valid (i.e., approximately unbiased) estimator of the target feature if provided an unbiased sample of the target probability distribution, and if all weights are equal to 1. For example, the estimator receives the biased sample as input. Samples in the biased sample are assigned weights according to the weights determined. The samples may include data that has been measured. This data and its corresponding weight are inputted into the estimator. The estimator is configured to take this data on each sample and its corresponding weight, for all samples in the biased sample, which includes many samples, and map them to the target feature. The estimator then outputs the target feature. The target feature to be estimated may be a causal effect of a characteristic such as a treatment or exposure, a prediction function, a probability distribution, or any other function of the target probability distribution. The weights allow the estimator to output the target feature based on the biased sample. The weights adjust for the biased sample such that the biased sample has the effect of being an unbiased sample. The weights correct for the bias that occurred in the biased sample or non-randomized sample.
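A minimal sketch of an estimator that accepts weights is a weighted mean, which reduces to the ordinary mean when all weights equal 1. The data below (exposure indicators in four clusters of one case and two controls, with q = 0.01) are hypothetical values chosen for illustration, and `weighted_mean` merely stands in for whatever estimator an embodiment actually uses.

```python
def weighted_mean(values, weights=None):
    """Weighted mean; with all weights equal to 1 this is the ordinary sample mean."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


q, k = 0.01, 2  # assumed proportion of cases, and controls per cluster
clusters = [    # hypothetical exposure indicators (1 = exposed)
    {"case": 1, "controls": [0, 0]},
    {"case": 1, "controls": [1, 0]},
    {"case": 0, "controls": [0, 0]},
    {"case": 1, "controls": [0, 1]},
]

values, weights = [], []
for cluster in clusters:
    values.append(cluster["case"])
    weights.append(q)                  # weight for the case
    for v in cluster["controls"]:
        values.append(v)
        weights.append((1 - q) / k)    # weight for each control

naive_estimate = weighted_mean(values)              # treats the biased sample as unbiased
weighted_estimate = weighted_mean(values, weights)  # corrects for the case-control design
```

Here the unweighted estimate of the exposure prevalence is pulled toward the heavily exposed cases, whereas the weighted estimate is dominated by the controls, as it should be when cases are rare in the target population.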
[27] The processing of the biased sample will now be described in more detail. Fig. 2 depicts a simplified flowchart 200 of a method for determining a target feature according to one embodiment. The method may be performed by computing system 100. Step 202 determines an estimator for an unbiased sample. The estimator may be based on a random sampling of a population.
[28] Step 204 determines a biased sample. The biased sample may include any number of samples. The information for the sampling may be received at computing system 100.
[29] Step 206 determines a cluster of observations. The cluster of observations includes samples of different types. Fig. 3 depicts a simplified flowchart 300 of a method for determining clusters for the biased sample. Step 302 determines a population of observations where the observations include different types.
[30] Step 304 determines characteristics of the population. Clusters of observations within the biased sample may already be defined by the design or may be constructed to include observations of multiple types (e.g., case and controls).
[31] Step 306 clusters a portion of the observations in the biased sample based on the characteristics of the samples and the characteristics of the population. A cluster may include multiple different types of observations. For example, in a matched case control design, a case and the number of controls that were matched to the case may be determined as a cluster. In other embodiments, the case and number of controls may be determined randomly, may be matched based on characteristics of the cases and controls, or a cluster might represent the total biased sample.
[32] Referring back to Fig. 2, step 208 assigns weights to the observations in the cluster. For example, the weights may be assigned based on characteristics of the unbiased/target population and the actual sample to which the weight is assigned.
[33] Step 210 determines if additional clusters need to be processed. If so, the process returns to step 206, where a different cluster of observations is determined. Weights for this cluster are then determined without regard to the weights determined previously.
[34] Once all the clusters have been processed, data may be mapped to the target feature. Step 212 receives the biased sample and the corresponding weights for the samples along with the estimator for the unbiased sample. Step 212 then outputs the target feature. For example, the estimator uses the weighted biased sample as input to output the target feature. In one embodiment, the target feature may be displayed on a display, sent to another device or user, or stored.
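The loop over clusters followed by a single call to the estimator can be sketched as follows. This is a hedged illustration rather than the claimed implementation: `run_flowchart_200`, `unmatched_weights`, `weighted_case_proportion`, the value q = 0.02, and the three toy clusters are all assumptions introduced for the example.

```python
def run_flowchart_200(clusters, assign_weights, estimator):
    """Weight each cluster independently, then run the weighted estimator once."""
    data, weights = [], []
    for cluster in clusters:                       # steps 206 and 210: next cluster
        cluster_weights = assign_weights(cluster)  # step 208: weights for this cluster only
        if len(cluster_weights) != len(cluster):
            raise ValueError("one weight per observation is required")
        data.extend(cluster)
        weights.extend(cluster_weights)
    return estimator(data, weights)                # step 212: output the target feature


def unmatched_weights(cluster, q=0.02):
    """Hypothetical step 208 for an unmatched design: q for cases, (1 - q)/k for controls."""
    k = sum(1 for obs in cluster if obs["type"] == "control")
    return [q if obs["type"] == "case" else (1 - q) / k for obs in cluster]


def weighted_case_proportion(data, weights):
    """Hypothetical estimator: weighted proportion of cases in the sample."""
    case_mass = sum(w for obs, w in zip(data, weights) if obs["type"] == "case")
    return case_mass / sum(weights)


clusters = [
    [{"type": "case"}, {"type": "control"}, {"type": "control"}],
    [{"type": "case"}, {"type": "control"}],
    [{"type": "case"}, {"type": "control"}, {"type": "control"}, {"type": "control"}],
]
estimate = run_flowchart_200(clusters, unmatched_weights, weighted_case_proportion)
naive = run_flowchart_200(clusters, lambda c: [1.0] * len(c), weighted_case_proportion)
```

Because each cluster is weighted without reference to the others, clusters with different numbers of controls can be mixed freely; in this toy run the weighted sample reproduces the assumed case proportion, while the unweighted sample reports the proportion of cases in the biased sample.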
[35] The method of determining the weights will now be described in more detail. Fig. 4 depicts a simplified flowchart 400 of a method for determining the weights according to one embodiment. The method may be performed by computing system 100. In step 402, a cluster is determined.
[36] The cluster may include samples of K types. The cluster may have n(J) observations of type J, J = 1, . . . , K. The cluster has a total of n(1) + n(2) + . . . + n(K) observations. If the observations of a first type were drawn from a marginal (biased sampling) probability distribution p_1(.), the probability that an observation of type 1 equals a value x (e.g., a vector of values represented by a row in a file) is given by p_1(x), for all possible configurations of x. For example, the probability distribution p_1(.) might be the probability distribution of the data x on a randomly-sampled subject from the subpopulation that may have cancer (the first type). In that case p_1 is a conditional probability distribution of the target probability distribution, conditioning on the sample being a case, i.e., a sample that has cancer. Also, for each type J among K possible types the observations of type J have a marginal (biased sampling) probability distribution p_J(.), J = 1, . . . , K. So, for example, if K=5, there are observations of types labeled by 1, 2, 3, 4, and 5. For example, the probability distribution p_J is the probability distribution of the data on a randomly-sampled subject from the subpopulation of the target population defined by having cancer type J. This definition of p_J is provided in case control designs.
[37] An example of a marginal sampling distribution of controls in a case-control sampling design may be a probability distribution p_2 (where p_1 corresponds with a sampling of subjects of type 1 and p_2 is a sampling of subjects of type 2), where p_2 is the probability distribution of the data on a type 2 (i.e., control) sample obtained as follows: particular embodiments first sample an observation from the type 1 probability distribution p_1 and obtain a value for a matching variable (e.g., age), then particular embodiments sample the control sample by sampling an observation from a subpopulation defined by having type 2 and having a matching variable equal to the value of the matching variable of the first sampled case observation.
[38] The biased sampling probability distributions p_J may correspond with a conditional probability distribution given that certain variables may need to have set values, which can be determined from the target probability distribution. The target probability distribution may be the probability distribution of a random sample from a target population. The certain variables that may be set include age, disease, death, cancer type, or other characteristics of the samples.
[39] Step 404 then determines one or more scoring functions. The scoring function may be a function that takes an observation x and maps it into a number. In one embodiment, weights are set such that the wished equality in expectation is achieved for all scoring functions that are used.
[40] Step 406 determines a weight for each sample in the cluster by requiring for each scoring function, the weighted average of the scores across all members of the cluster has the same expectation as the expectation of the scoring function applied to the data x drawn from the target probability distribution (of the unbiased sample), where the drawing from a target probability distribution may correspond with randomly sampling a subject from a target population and measuring data x on this subject. The weighted average of the scores across all members of the cluster may be viewed as a random variable. By making the expectation of the weighted average of the scores across all members of the cluster equal to the expectation of the scoring function applied to the data x drawn from the target probability distribution, the weighted average of the scoring function across a cluster of members is now representative of the score of a randomly- drawn (i.e., unbiased sampled) observation. Thus, the weights may be applied to the members of the cluster to in effect provide a randomly-drawn sample from the target population even though the samples in the cluster were obtained with biased sampling from the target population.
[41] Making the expectation of the weighted average equal to the target expectation for many scoring functions corresponds to solving the equations obtained by setting the two expectations equal to each other for each scoring function. In the setting described above, these equations are solved by setting the weight of an observation x of type J equal to w_J(x) = P(X=x, Type=J) / (n(J) p_J(x)), where the numerator denotes the probability that the random (unbiasedly sampled) data X equals x for a realization x of type J, where X denotes a random/unbiased draw from the target population. In other words, the weight for a type J observation x in the cluster is set equal to the ratio of 1) the probability that a random (unbiased) X equals x when randomly sampled and 2) n(J) times the biased sampling probability p_J(x), where n(J) is the number of observations of type J in the cluster.
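The weight formula can be evaluated directly when the distributions are discrete. The sketch below assumes a hypothetical target distribution over (exposure, type) pairs and an unmatched design in which each p_J is the target distribution conditioned on the type; the function name `cluster_weight` and all numbers are illustrative assumptions.

```python
from fractions import Fraction


def cluster_weight(x, obs_type, joint_target, biased_marginals, counts):
    """w_J(x) = P(X = x, Type = J) / (n(J) * p_J(x))."""
    return joint_target[(x, obs_type)] / (
        counts[obs_type] * biased_marginals[obs_type][x])


# Hypothetical target distribution P(X = x, Type = J).
joint_target = {
    ("exposed", "case"): Fraction(3, 100),
    ("unexposed", "case"): Fraction(1, 100),
    ("exposed", "control"): Fraction(16, 100),
    ("unexposed", "control"): Fraction(80, 100),
}

# Unmatched design: p_J is the target distribution conditioned on the type.
type_prob = {}
for (x, t), p in joint_target.items():
    type_prob[t] = type_prob.get(t, Fraction(0)) + p
biased_marginals = {t: {} for t in type_prob}
for (x, t), p in joint_target.items():
    biased_marginals[t][x] = p / type_prob[t]

counts = {"case": 1, "control": 2}  # n(J): one case and k = 2 controls per cluster
weights = {key: cluster_weight(key[0], key[1], joint_target, biased_marginals, counts)
           for key in joint_target}
```

In this unmatched example the weights do not depend on x within a type: every case receives the proportion of cases q = 1/25 and every control receives (1 - q)/2.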
[42] Step 408 then outputs the weights. The weights may be input into the estimator for the unbiased sample as described above.
[43] This formula can now be applied to different types of biased sampling designs. For example, suppose there are only two types, cases and controls, and each cluster includes one case and k controls. Thus, an observation is sampled from the subpopulation of the target population including cases, and k controls for each case are sampled from the subpopulation of the target population including controls. In this case the biased sampling distribution for the cases is P(X=x | type=case) and for the controls the biased sampling distribution is P(X=x | type=control), and the formula shows that the weight for each case is q and the weight for each control is (1-q)/k, where q is the proportion of cases in the target population.
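As a worked check of the weights quoted above, the weight formula of paragraph [41] can be evaluated with n(case) = 1 and n(control) = k. This sketch assumes, as in the text, that the type is part of the observation x, so that the joint probability factors as P(X=x, Type=J) = P(X=x | Type=J) P(Type=J):

```latex
\begin{align*}
w_{\mathrm{case}}(x)
  &= \frac{P(X = x,\ \mathrm{Type} = \mathrm{case})}{n(\mathrm{case})\, p_{\mathrm{case}}(x)}
   = \frac{q \, P(X = x \mid \mathrm{Type} = \mathrm{case})}{1 \cdot P(X = x \mid \mathrm{Type} = \mathrm{case})}
   = q, \\
w_{\mathrm{control}}(x)
  &= \frac{P(X = x,\ \mathrm{Type} = \mathrm{control})}{n(\mathrm{control})\, p_{\mathrm{control}}(x)}
   = \frac{(1 - q) \, P(X = x \mid \mathrm{Type} = \mathrm{control})}{k \cdot P(X = x \mid \mathrm{Type} = \mathrm{control})}
   = \frac{1 - q}{k}.
\end{align*}
```

The weights within each cluster therefore sum to q + k(1 - q)/k = 1.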
[44] The above weights may be determined using available knowledge about the target population such as this proportion of cases q in the target population. This knowledge required to determine the weights may be obtained by available data or other sources than the data sampled. For example, the target probability distribution may not be known, but the required characteristics of the target probability distribution to determine the weights may be known from sources other than the data sampled. Thus, values from sources other than the data sampled may be used to determine the weights. For example, certain features may be known, such as the proportion of people of type 1 or the proportion of people of type 2. The weights can be worked out by assuming the values for certain features in the target probability distribution. For example, the weights may be determined based on the proportion of people that have type 1 or type 2 (i.e., the incidence probability). Also, the probability that you are type 1 given your age (e.g., age is the matching variable in a matched case control design) may be used as a feature of the target probability distribution. The weights may depend on a small number of features of the probability distribution. Thus, a portion of the target probability distribution is used to determine the weights.
[45] In another example, in a matched case control sampling design, a matching variable is used, such as age. A case may be sampled, and then a control of the same age as the case is sampled. This is repeated until there are enough samples or all cases of the target population of interest have been sampled. In this type of biased sampling (called matched case control sampling), the weights determined may be a function of the proportion of cases in each age category. This information cannot be estimated from such a sampling design, but may be available from external databases (e.g., the proportion of liver cancer patients in the USA is a known statistic, as is the proportion of breast cancer patients in the USA). This information along with information about the biased sample is used to determine the weights.
[46] The formula above for the weight assigned to a subject in the cluster of type I and having observed values x may be a function of x, including the type. The formula is defined as the ratio of 1) the marginal probability of sampling a data vector with values x on a subject of type I when randomly sampling from the target population, and 2) n(I) times the marginal probability of sampling a data vector with the same values x of type I using the biased sampling actually employed. This ratio of random and biased sampling probabilities is the weight assigned to a subject with observed values x of type I in the cluster. This ratio simplifies considerably, both in its dependence on characteristics of the target population and as a function of x, since the biased and random marginal probability distributions have a lot in common (the biased sampling probability distribution is often a conditional distribution of the target population, conditioning on some features). The ratio therefore often depends on only a few simple features of the target population, and on only a small component of x, such as the type and the matching variable used. The formula can be applied for each value of x and each type of observation.
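As an illustration of how the ratio simplifies in a matched design, the Python sketch below works through one possible reading. It assumes each control shares the matching variable M of its case and is otherwise drawn from the controls in the target population. Under that assumption the ratio formula reduces to a case weight of q and a control weight of q(1 - q(m)) / (q(m) k), where q(m) is the proportion of cases among subjects with M = m. This closed form is derived here from the ratio formula and is not quoted from the specification; the toy population and all names are hypothetical.

```python
from fractions import Fraction

def matched_weights(q, q_m, k):
    """Case weight and per-control weight for a cluster matched on M = m.

    q   : proportion of cases in the target population
    q_m : proportion of cases among subjects with M = m
    k   : number of matched controls per case
    (Closed form derived under the assumptions stated in the lead-in.)
    """
    return q, q * (1 - q_m) / (q_m * k)

# Hypothetical target population with a binary matching variable M.
p_m = {0: Fraction(1, 2), 1: Fraction(1, 2)}          # P(M = m)
q_given_m = {0: Fraction(1, 20), 1: Fraction(3, 20)}  # P(case | M = m)
q = sum(p_m[m] * q_given_m[m] for m in p_m)           # P(case)
k = 3

def score(m, y):
    # Hypothetical scoring function of the observed data (m, y).
    return m + 2 * y

# Expectation of the score under random sampling from the target population.
target = sum(
    p_m[m] * (q_given_m[m] * score(m, 1) + (1 - q_given_m[m]) * score(m, 0))
    for m in p_m
)

# Expected weighted score of one matched cluster: the case's M follows
# P(M | case), and the k controls share that value of M.
expected_cluster = Fraction(0)
for m in p_m:
    p_m_given_case = p_m[m] * q_given_m[m] / q
    w_case, w_control = matched_weights(q, q_given_m[m], k)
    expected_cluster += p_m_given_case * (
        w_case * score(m, 1) + k * w_control * score(m, 0)
    )
print(target, expected_cluster)
```

The control weight depends on the target population only through q and q(m), which matches the observation that only the type and the matching variable enter the weight.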
[47] Particular embodiments will now be described in more detail. Particular embodiments provide new locally efficient methods for causal inference and variable importance analysis (or any other analysis) in semiparametric models (i.e., models relying on relatively few assumptions about the data generating distribution) for matched and unmatched case control studies, relying on knowing the incidence probability, conditional on the matching variable if matching is used. If this incidence probability is unknown, then these methods can still be used as a sensitivity analysis.
[48] For both case-control designs, a method is provided that deterministically maps a logistic regression fit (possibly weighted for matched case-control designs) into a valid model-based fit of the actual conditional probability of being a case, given the covariates, where this deterministic mapping involves adding a known intercept to the fit. The resulting estimate of the conditional probability of being a case now has the important property that its standard error is proportional to the incidence probability (divided by the square root of the sample size) so that the obtained precision for the conditional probability is good enough for accurately estimating marginal causal relative risks or causal odds-ratios even when the incidence probability is very low.
[49] A weighting scheme of cases and controls is used that maps any estimation method for a parameter developed for prospective sampling (i.e., unbiased sample) from the population of interest into an estimation method based on case control sampling (i.e. biased sample) from this population. For regular case-control designs the weighting only relies on knowing the true population proportion of cases or, equivalently, the true probability of being a case, and for matched case-control sampling it also relies on knowing this proportion of cases within each population strata of the matching variable.
[50] For a large class of semiparametric models, the case-control weighting when applied to an efficient estimator for a prospective sample from the population of interest maps into an efficient estimator for matched and unmatched case-control sampling. Particular embodiments involve the application of this methodology to obtain double robust locally efficient targeted maximum likelihood estimators of causal parameters such as the causal relative risk and causal odds ratio for regular case-control sampling and matched case-control sampling. The double robust locally efficient targeted maximum likelihood estimator is also provided in marginal structural models and semi-parametric logistic regression models.
Extending Logistic Regression fits ignoring biased sampling into valid accurate fits based on known incidence probability.
[51] A first component presents a method that deterministically maps the commonly employed logistic regression fit that ignores the case-control sampling (and thereby results in a biased fit for the actual conditional probability it models) into a valid model-based fit of the actual conditional probability of being a case, given the covariates. When this methodology is applied to matched case-control designs, the initial logistic regression fit is based on weighted control observations. For both case-control designs this mapping adds an intercept determined by the known incidence probability to the standard or control-weighted logistic regression fit.
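For the unmatched design, one well-known form of such an intercept adjustment adds logit(q) minus the log of the sample case-to-control ratio to the fitted linear predictor. The Python sketch below uses that standard correction as an assumption; the specification does not state the exact intercept, and the weighted variant for matched designs is not covered. The function names and the toy population are hypothetical. The check constructs the conditional probability that a logistic fit ignoring the sampling would target, then recovers the true probability.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def intercept_offset(q, n_cases, n_controls):
    """Assumed known intercept: log odds of being a case in the target
    population minus log odds of being a case in the pooled sample."""
    return logit(q) - math.log(n_cases / n_controls)

def corrected_probability(biased_logit, q, n_cases, n_controls):
    """Map a fitted log odds that ignores the case-control sampling to the
    conditional probability of being a case in the target population."""
    return expit(biased_logit + intercept_offset(q, n_cases, n_controls))

# Hypothetical population: binary covariate x and true P(case | x).
p_x1 = 0.3
true_p = {0: 0.01, 1: 0.05}
q = (1 - p_x1) * true_p[0] + p_x1 * true_p[1]
n_cases, n_controls = 100, 200
pi = n_cases / (n_cases + n_controls)  # share of cases in the pooled sample

def biased_logit(x):
    """Log odds of being a case given x in the pooled case-control sample,
    i.e. what a logistic fit that ignores the sampling design targets."""
    p_x = p_x1 if x == 1 else 1 - p_x1
    p_x_case = p_x * true_p[x] / q
    p_x_control = p_x * (1 - true_p[x]) / (1 - q)
    p_star = pi * p_x_case / (pi * p_x_case + (1 - pi) * p_x_control)
    return logit(p_star)

recovered = {x: corrected_probability(biased_logit(x), q, n_cases, n_controls)
             for x in (0, 1)}
print(recovered)
```

Only the intercept changes; the coefficient on x (the log odds ratio) is the same before and after the adjustment.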
[52] The resulting estimate of the conditional probability of being a case, given the exposure/treatment and other risk factors, now has the important property that its standard error is proportional to the marginal probability of being a case (divided by the square root of the sample size n) so that the obtained precision is good enough for accurately estimating marginal causal relative risks or causal odds-ratios, even when the probability of being a case is extremely low.
[53] This component may play a role as an ingredient in order to construct targeted maximum likelihood estimators or other locally efficient estimators according to the general new methodology representing the main part of particular embodiments.
(Locally efficient) Estimation of any parameter based on case-control samples with known incidence probability.
[54] Particular embodiments provide estimation based on case-control sampling for unmatched and matched case-control designs that involves weighting the cases and controls with q_0 and (1-q_0)/J, respectively, for some J (for matched designs the latter is replaced by a quantity depending on the incidence probability, conditional on the matching variable), and then applying a method of choice developed for prospective sampling to estimate the parameter of interest (e.g., targeted maximum likelihood estimators or estimating equations for the causal effect or variable importance parameter of interest), as if the data were drawn from the target population distribution of interest.
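The weight-then-estimate recipe can be illustrated with a deliberately simple prospective estimator. The Python sketch below is a toy under stated assumptions: the data, the value q_0 = 1/10, J = 2 controls per case, and the helper `weighted_risk` are all hypothetical, and a plain weighted proportion stands in for the targeted maximum likelihood or estimating-equation methods named in the text.

```python
from fractions import Fraction

def weighted_risk(rows, exposure):
    """Weighted proportion of cases among rows with the given exposure.
    Each row is (exposure a, case indicator y, weight w)."""
    num = sum(w * y for a, y, w in rows if a == exposure)
    den = sum(w for a, y, w in rows if a == exposure)
    return num / den

q0 = Fraction(1, 10)  # assumed known incidence probability
J = 2                 # controls per case

# Hypothetical unmatched case-control data: exposure of 4 cases, 8 controls.
case_exposures = [1, 1, 1, 0]
control_exposures = [0, 0, 1, 0, 0, 1, 0, 0]

# Step 1: attach weight q0 to each case and (1 - q0) / J to each control.
rows = [(a, 1, q0) for a in case_exposures]
rows += [(a, 0, (1 - q0) / J) for a in control_exposures]

# Step 2: apply the estimator written for prospective data to weighted rows.
risk_exposed = weighted_risk(rows, 1)
risk_unexposed = weighted_risk(rows, 0)
relative_risk = risk_exposed / risk_unexposed
print(risk_exposed, risk_unexposed, relative_risk)
```

Without the weights the same data would give risks of 3/5 and 1/7, which reflect the sampling ratio rather than the target population.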
[55] Particular embodiments show the important and convenient result that case-control weighting of the efficient procedure for the parameter of interest (as formalized by the efficient influence curve) for the prospective sampling actually maps into the efficient procedure for the case-control sampling. This implies, in particular, that case-control weighting of the locally efficient targeted maximum likelihood estimator developed for prospective (i.e., unbiased) sampling model results in a locally efficient targeted maximum likelihood estimation procedure for case-control (i.e., biased) sampling in many semiparametric models including non-parametric models.
[56] In general, the estimation procedures developed for prospective sampling are mapped into highly or fully efficient estimation procedures for case-control sampling. In particular, the method is now able to fully exploit software developed for prospective sampling. That is, an estimator for an unbiased sample can be used to estimate the target feature using a weighted biased sample.
Demonstration of method in various classes of examples:
[57] In the case-control weighted double robust targeted maximum likelihood and case-control weighted double robust estimating function methodology, particular embodiments estimate causal or variable importance parameters in nonparametric models, and based on assuming a semi-parametric logistic regression model, thereby avoiding the need for inverse weighting by a fit of the treatment mechanism. Particular embodiments also work out in detail the case-control weighted targeted maximum likelihood estimator to fit a marginal structural model.
[58] Particular embodiments can be generalized to any variation of biased sampling, such as case-control sampling, by calculating the appropriate weights. Particular embodiments show the appropriate weights for paired-matching, stratified case-control sampling, and counter-matching case-control sampling, among others.
[59] Contrary to the conditional logistic regression methods for matched case-control designs, particular embodiments allow estimation of the effect of the matching variable on the outcome. Particular embodiments may rely on knowing the incidence probability and allow double robust locally efficient estimation in semiparametric models, thereby allowing the use of methods that minimize the reliance of the inference on unknown model assumptions. This also reduces the need for matching (which can easily cause over-matching), since confounding can be adjusted for in a more flexible, double robust manner. Particular embodiments allow targeting any causal effect or any other parameter of interest.
[60] Although conventional IPTW methods target a causal odds ratio of interest, these methods do not acknowledge that in the common observational settings in which the treatment mechanism is not known the latter methods target a non-identifiable parameter and thereby, by necessity, heavily rely on non-testable assumptions. If the incidence probability is known, then particular embodiments may be significantly more efficient and robust than the IPTW method. If the incidence probability is unknown, then particular embodiments allow a sensitivity analysis involving the application of the locally efficient method for each set value of the incidence probability value among a set of plausible values, thereby obtaining honest assessment of what can be learned from the data at hand.
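The sensitivity analysis described above amounts to repeating the weighted estimation over a set of plausible incidence values. The Python sketch below illustrates this with hypothetical data and a simple weighted relative risk in place of the locally efficient estimators; the grid of values and the function name `weighted_relative_risk` are assumptions made for the example.

```python
from fractions import Fraction

def weighted_relative_risk(cases, controls, q, k):
    """Case-control weighted relative risk for a binary exposure.

    cases, controls : lists of exposure values (0 or 1)
    q               : assumed incidence probability
    k               : number of controls per case
    """
    w_case, w_control = q, (1 - q) / k

    def risk(a):
        num = w_case * sum(1 for x in cases if x == a)
        den = num + w_control * sum(1 for x in controls if x == a)
        return num / den

    return risk(1) / risk(0)

# Hypothetical data: exposure of 4 cases and 8 controls (2 controls per case).
cases = [1, 1, 1, 0]
controls = [0, 0, 1, 0, 0, 1, 0, 0]

# Repeat the estimation for each plausible value of the incidence probability.
plausible_q = [Fraction(1, 20), Fraction(1, 10), Fraction(1, 5)]
sensitivity = {q: weighted_relative_risk(cases, controls, q, 2)
               for q in plausible_q}
for q, rr in sensitivity.items():
    print(f"q = {q}: relative risk = {float(rr):.3f}")
```

Reporting the whole range of estimates, rather than one value, shows how much the conclusion depends on the assumed incidence probability.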
[61] Although the description has been described with respect to particular embodiments thereof, these particular embodiments are merely illustrative, and not restrictive. Any suitable programming language can be used to implement the routines of particular embodiments including C, C++, Java, assembly language, etc. Different programming techniques can be employed such as procedural or object oriented. The routines can execute on a single processing device or multiple processors. Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different particular embodiments. In some particular embodiments, multiple steps shown as sequential in this specification can be performed at the same time.
[62] Particular embodiments may be implemented in a computer-readable storage medium for use by or in connection with the instruction execution system, apparatus, system, or device. Particular embodiments can be implemented in the form of control logic in software or hardware or a combination of both. The control logic, when executed by one or more processors, may be operable to perform that which is described in particular embodiments.
[63] Particular embodiments may be implemented by using a programmed general purpose digital computer, or by using application specific integrated circuits, programmable logic devices, or field programmable gate arrays; optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms may also be used. In general, the functions of particular embodiments can be achieved by any means as is known in the art. Distributed, networked systems, components, and/or circuits can be used. Communication, or transfer, of data may be wired, wireless, or by any other means.
[64] It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application. It is also within the spirit and scope to implement a program or code that can be stored in a machine-readable medium to permit a computer to perform any of the methods described above.
[65] As used in the description herein and throughout the claims that follow, "a", "an", and "the" includes plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.
[66] Thus, while particular embodiments have been described herein, latitudes of modification, various changes, and substitutions are intended in the foregoing disclosures, and it will be appreciated that in some instances some features of particular embodiments will be employed without a corresponding use of other features without departing from the scope and spirit as set forth. Therefore, many modifications may be made to adapt a particular situation or material to the essential scope and spirit.

Claims

We claim:
1. A method comprising: determining a biased sample from a target population, the biased sample characterized by a first type of observation and a second type of observation; determining an estimator for an unbiased sample of a probability distribution of the target population, the estimator configured to map a set of data points to a target feature of the probability distribution; determining clusters of samples from the biased sample, wherein a cluster includes one or more samples of the first type of observation and one or more samples of the second type of observation; assigning weights to each sample within each cluster of samples based on characteristics of the target population and the sample itself; inputting data points from samples from each cluster with their weights into the estimator to map the data points with their weights to the target feature; and outputting the target feature.
2. The method of claim 1, wherein the first type of observation comprises a positive value of a binary outcome and the second type of observation comprises a negative value of a binary outcome.
3. The method of claim 1, wherein assigning the weightings comprises selecting weights such that, for one or more scoring functions, an expectation of a weighted average of scores of samples in a cluster equals the expectation for a score of a random sample drawn from the probability distribution of the unbiased sample.
4. The method of claim 3, wherein selecting weights comprises: determining one or more scoring functions that map data on a sample into a score; and determining the weights by setting, for each scoring function, the expectation of the weighted average of the scores evaluated by the scoring function of the samples in the cluster is equal to the expectation of the scoring function applied to a random sample drawn from the probability distribution.
5. The method of claim 4, wherein the weights are determined by determining a value for a feature of the probability distribution of the unbiased sample and measuring characteristics of the sample to which the weight is assigned.
6. The method of claim 5, wherein the value comprises the incidence probability of the first type of observation and second type of observation occurring under unbiased sampling.
7. The method of claim 6, wherein the value is conditional on the characteristics of the sample.
8. The method of claim 1, wherein inputting comprises: assigning weights to data points for each cluster; and inputting the data points and their corresponding weights for the clusters to the estimator to estimate the target feature.
9. The method of claim 1, wherein three or more types of observations are used to determine the clusters.
10. The method of claim 1, wherein determining the clusters comprises determining a sample of the first type of observation and determining one or more samples of the second type of observation, wherein the one or more samples include similar characteristics as the sample of the first type of observation.
11. The method of claim 9, wherein the similar characteristics are determined based on a matching variable used to determine the sample of first type of observation and one or more samples of second type of observation.
12. The method of claim 1, wherein the estimator configured for the unbiased sample comprises a targeted maximum likelihood estimator.
13. The method of claim 1, wherein the estimator configured for the unbiased sample comprises an estimator of a prediction function that can be used to predict an outcome based on an input variable.
14. The method of claim 1, wherein the weights of the samples correct for the bias that occurred in the biased sample in mapping the data points to the target feature.
15. The method of claim 1, wherein the characteristics of the target population comprises characteristics determined from information outside of the samples of the biased sample.
16. The method of claim 1, wherein the biased sample is drawn from one or more probability distributions that are determined by the probability distribution of the unbiased sample.
17. A computer-readable storage medium comprising encoded logic for execution by the one or more processors, the logic when executed is operable to: determine a biased sample from a target population, the biased sample characterized by a first type of observation and a second type of observation; determine an estimator for an unbiased sample of a probability distribution of the target population, the estimator configured to map a set of data points to a target feature of the probability distribution; determine clusters of samples from the biased sample, wherein a cluster includes one or more samples of the first type of observation and one or more samples of the second type of observation; assign weights to each sample within each cluster of samples based on characteristics of the target population and the sample itself; input data points from samples from each cluster with their weights into the estimator to map the data points with their weights to the target feature; and output the target feature.
18. The computer-readable storage medium of claim 17, wherein assigning the weightings comprises selecting weights such that, for one or more scoring functions, an expectation of a weighted average of scores of samples in a cluster equals the expectation for a score of a random sample drawn from the probability distribution of the unbiased sample.
19. The computer-readable storage medium of claim 17, wherein determining the clusters comprises determining a sample of the first type of observation and determining one or more samples of the second type of observation, wherein the one or more samples include similar characteristics as the sample of the first type of observation.
20. An apparatus comprising: one or more processors; and logic encoded in one or more tangible media for execution by the one or more processors and when executed operable to: determine a biased sample from a target population, the biased sample characterized by a first type of observation and a second type of observation; determine an estimator for an unbiased sample of a probability distribution of the target population, the estimator configured to map a set of data points to a target feature of the probability distribution; determine clusters of samples from the biased sample, wherein a cluster includes one or more samples of the first type of observation and one or more samples of the second type of observation; assign weights to each sample within each cluster of samples based on characteristics of the target population and the sample itself; input data points from samples from each cluster with their weights into the estimator to map the data points with their weights to the target feature; and output the target feature.
PCT/US2009/042429 2008-05-02 2009-04-30 Estimation based on case-control designs WO2009135076A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US5006308P 2008-05-02 2008-05-02
US61/050,063 2008-05-02

Publications (1)

Publication Number Publication Date
WO2009135076A1 true WO2009135076A1 (en) 2009-11-05

Family

ID=41255439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/042429 WO2009135076A1 (en) 2008-05-02 2009-04-30 Estimation based on case-control designs

Country Status (1)

Country Link
WO (1) WO2009135076A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598578A (en) * 2019-08-23 2019-12-20 腾讯云计算(北京)有限责任公司 Identity recognition method, and training method, device and equipment of identity recognition system
WO2020055580A1 (en) * 2018-09-10 2020-03-19 Google Llc Rejecting biased data using a machine learning model
WO2020055581A1 (en) * 2018-09-10 2020-03-19 Google Llc Rejecting biased data using a machine learning model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7058638B2 (en) * 2002-09-03 2006-06-06 Research Triangle Institute Method for statistical disclosure limitation
US20070258898A1 (en) * 2006-03-01 2007-11-08 Perlegen Sciences, Inc. Markers for addiction

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020055580A1 (en) * 2018-09-10 2020-03-19 Google Llc Rejecting biased data using a machine learning model
WO2020055581A1 (en) * 2018-09-10 2020-03-19 Google Llc Rejecting biased data using a machine learning model
KR20210025108A (en) * 2018-09-10 2021-03-08 구글 엘엘씨 Reject biased data using machine learning models
KR20210028724A (en) * 2018-09-10 2021-03-12 구글 엘엘씨 Biased data removal using machine learning models
JP2022500747A (en) * 2018-09-10 2022-01-04 グーグル エルエルシーGoogle LLC Biased data rejection using machine learning models
US11392852B2 (en) 2018-09-10 2022-07-19 Google Llc Rejecting biased data using a machine learning model
JP7241862B2 (en) 2018-09-10 2023-03-17 グーグル エルエルシー Rejecting Biased Data Using Machine Learning Models
KR102556497B1 (en) * 2018-09-10 2023-07-17 구글 엘엘씨 Unbiased data using machine learning models
KR102556896B1 (en) * 2018-09-10 2023-07-18 구글 엘엘씨 Reject biased data using machine learning models
KR20230110830A (en) * 2018-09-10 2023-07-25 구글 엘엘씨 Rejecting biased data using a machine learning model
KR102629553B1 (en) 2018-09-10 2024-01-25 구글 엘엘씨 Rejecting biased data using a machine learning model
CN110598578A (en) * 2019-08-23 2019-12-20 腾讯云计算(北京)有限责任公司 Identity recognition method, and training method, device and equipment of identity recognition system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09739878

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09739878

Country of ref document: EP

Kind code of ref document: A1