WO1992017853A2 - Method for diagnosis and forecasting based on direct analysis of a database - Google Patents

Method for diagnosis and forecasting based on direct analysis of a database

Info

Publication number
WO1992017853A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
record
derived
values
derived data
Prior art date
Application number
PCT/US1992/002757
Other languages
English (en)
Other versions
WO1992017853A3 (fr
Inventor
Peter W. Frey
Original Assignee
Pattern Recognition, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pattern Recognition, L.P.
Publication of WO1992017853A2
Publication of WO1992017853A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/22: Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to databases and more particularly to methods for forecasting and diagnosis based on direct analysis of data in such data bases.
  • One technique for evaluating data in order to produce useful information is to produce a model of the data and attempt to derive parameters for the model from the data. Desirably, useful information such as forecasts, predictions, and diagnoses can then be derived from the model.
  • Models, by their very nature, are based on assumptions about the general relationships among the variables. These assumptions are often inconsistent with the properties of the data being examined. In cases where the assumptions are violated, the predictive accuracy is reduced. Furthermore, the imposition of these assumptions tends to place a limit on the asymptotic degree of accuracy of the predictions, independent of the number of observations in the existing data base. Increasing the sample size initially improves predictions because a larger number of observations permits more precise specification of the model parameters. At some point, however, additional data provide no further enhancement. For example, with linear regression or discriminant analyses, it is generally believed that further improvements are not observed when the sample size is larger than about ten thousand.
  • Model building is currently the standard approach for working with large sets of data, in spite of these limitations. Historically, data bases tended to be much smaller than they are today. Model building is an effective technique for deriving useful information within the context of small and medium size data bases. Furthermore, for many years, the available computational resources were inadequate for the direct analysis of large amounts of data in a reasonable time frame. Direct methods, although discussed from time to time in the academic literature, were simply not cost effective. The rapid evolution of low cost, high-speed, large volume computers may remove these limitations. Many computational methods which were impractical less than a decade ago may become feasible.
  • Modeling techniques are often not adequate to provide the desired capabilities.
  • Many data bases are organized as two-dimensional flat files.
  • The data is typically organized in rows and columns, in which the rows represent individual records in the data base, e.g., persons, households, accounts, or events, and the columns represent fields describing attributes, e.g., age, weight, symptom, or outcomes, e.g., diagnoses, credit risk, which make up each record.
  • Often information is acquired which can be used to establish a partial record in which the attribute information is present but the outcome information is missing. In these cases, it may be desirable or necessary to forecast, predict, or diagnose the unknown outcome information for a new record by making use of the other information in the data base.
  • Traditional methods for predicting outcomes have been based on linear regression or discriminant analyses. More recently, other approaches have been employed, such as rule-based expert systems.
  • the fields in a data base commonly represent information of three different types. Some of the fields represent variables that are boolean, such as, e.g., true-false, yes-no, agree-disagree, like-dislike. Other fields represent variables that are categorical, such as, e.g., marital status — single, married, separated, divorced, widowed; and employment status — full-time, part-time, retired, student, unemployed. Still other fields represent variables that are numerical, such as, e.g., annual income, age, months at current job. Determination of the similarity of two records is greatly complicated by these multiple data types.
  • the similarity between any two records might be calculated by counting the number of fields for which the two records have identical values.
  • The nearest neighbor would be the record which has the most fields with values in common with the target record (see, e.g., Stanfill & Waltz, Toward memory-based reasoning. Communications of the ACM, 1986, 29, 1213-1228).
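  • The field-counting similarity described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the field names and record values are hypothetical.

```python
# Hypothetical records; field names and values are illustrative only.
def overlap_similarity(a: dict, b: dict) -> int:
    """Count the fields on which two records hold identical values."""
    return sum(1 for field in a if field in b and a[field] == b[field])


def nearest_by_overlap(target: dict, records: list) -> dict:
    """Return the stored record sharing the most field values with the target."""
    return max(records, key=lambda r: overlap_similarity(target, r))


records = [
    {"marital": "single", "owns_home": False, "employment": "student"},
    {"marital": "married", "owns_home": True, "employment": "full-time"},
]
target = {"marital": "married", "owns_home": False, "employment": "full-time"}

# The second record matches on two fields, the first on only one.
best = nearest_by_overlap(target, records)
```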
  • Each field can be considered as one dimension of a multi-dimensional hyperspace, and the distance between any two records can be computed by assuming that the hyperspace has Euclidean properties. By computing the distance between the test case and every other case in the data base, the nearest neighbor is identified as the record with the smallest Euclidean distance from the test record.
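  • A brute-force version of this Euclidean search might look like the sketch below. It assumes every field is already numerical, which is exactly the limitation the later bullets address; the coordinates are made up.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(test, cases):
    """Index of the stored case with the smallest distance to the test case."""
    return min(range(len(cases)), key=lambda i: euclidean(test, cases[i]))


# Illustrative two-field records.
cases = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
idx = nearest_neighbor((2.0, 3.0), cases)
```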
  • a method of organizing large sets of data in a data base to facilitate evaluation of a test record for producing information about the test record extracts information from large sets of data which is useful for prediction, forecasting, and diagnosis.
  • the method incorporating the present invention involves the prediction of an expected outcome based on an analysis of the similarity of a test case or test record in a data base to each of the prior cases, i.e., existing records, which have been stored in the data base.
  • the method incorporating the present invention is an outgrowth and refinement of the classical nearest neighbor method. Methods incorporating the present invention provide techniques for transforming a raw data base into a numerical representation which permits utilization of the power which is inherent in the nearest neighbor approach. Additionally, methods incorporating the present invention, when compared to the classical technique, greatly enhance the effectiveness of the process.
  • the various data fields of the records for most large data bases include a mixture of numerical, boolean, and categorical information.
  • the appropriate method for determining similarity becomes problematic. There is no obvious method which is clearly best for measuring inter-record similarity. This is a serious impediment to the application and use of the nearest neighbor approach to real-world data.
  • the effectiveness of a procedure for determining the similarity between records would be enhanced by the ability to take into account and reflect the relevance, value or weight of the data of the various fields of the data base in correctly forecasting the desired outcome.
  • Yet another important aspect of making effective use of the nearest neighbor approach is to determine when and how to aggregate individual fields to form one or more composite traits which can then be employed to determine similarity between records of the data base.
  • fields which may provide little useful information when considered by themselves can be useful when evaluated in combination with, or on a relational basis with, other fields.
  • Yet another difficulty in measuring the similarity between records is that numerical fields are often not scaled in a way which faithfully reflects the relationship between the values of a given trait and the values of the outcome which is being predicted or diagnosed.
  • credit risk may vary with age
  • the difference between records due to differences in values of this data field may differ for different values. The risk may be great for certain values, e.g., during certain age ranges such as the younger ages of 18 to 30.
  • the difference may be small for other age ranges, e.g., the senior years of 55 to 70.
  • the similarity on the basis of the age attribute (the difference in age) for two individuals whose ages differ by eight years is different for different age values.
  • analysis of the data would show that two individuals who are 56 and 64 years of age are considered to be highly similar.
  • two individuals who are 20 and 28 years of age would be considered to be somewhat different.
  • a simple mathematical treatment of these data would conclude that both cases are equally similar, namely eight years difference in age.
  • a method in accordance with the present invention attempts to optimize the predictive validity of a nearest neighbor system by scaling each predictor value such that the numerical values reflect an accurate relationship to the outcome measure.
  • the method incorporating the current invention addresses these problems and thereby is an advance over classic nearest neighbor technology.
  • The solution is based on the use of a recursive binary classification process (cf. Breiman, Friedman, Olshen, & Stone, Classification and Regression Trees. Monterey, CA: Wadsworth, 1984; Quinlan, J. R. Induction of decision trees. Machine Learning, 1986, 1, 81-106). This process has been used by itself to classify or categorize data and is a powerful tool for these purposes.
  • a method in accordance with the present invention utilizes numerical scaling of similarities created by use of non-linear mapping functions.
  • Numerical scaling as used in accordance with the present invention combines a number of original variables to produce a derived variable.
  • a binary classification tree is used in a novel way as a preparatory step to translate (i.e., map) diverse information in raw data base form into a new, derived representation having equal or equivalent units of measurement which are appropriate for applying the nearest neighbor approach.
  • the binary classification tree provides a means for mapping the fields in the original records of the data base into a new set of derived fields which can be employed for making determinations of the similarity between records. This mapping process generally involves a reduction in the number of fields contained in each derived record. This results from the elimination of fields which provide no relevant information, and from the combination of two or more original fields into a single new derived field.
  • this mapping process involves the following steps:
  • This outcome measure is to be available as one of the fields in the original record layout and is to be numerical or boolean in type (i.e., not categorical).
  • each record in the data base can be associated with one, and only one, terminal node within the tree.
  • the outcome value associated with the terminal node i.e., the average value for the records within that terminal node, is the numerical value assigned to that record for the trait, i.e., the new field, derived from the original fields used to create the classification tree.
  • a uniform numerical unit of measurement is derived which is independent of the original field type (boolean, categorical, or numerical).
  • the relative importance of each of the original fields in the original data base is determined and weighted properly by the creation of the binary classification tree.
  • the combination or elimination of fields is also a natural consequence of tree creation.
  • the scaling of the new derived field values in respect to the target outcome is an integral aspect of the binary classification process.
  • the method incorporating the current invention provides an effective procedure for preparing a raw data base for nearest neighbor forecasting.
  • the mapping procedure creates a new, derived representation in which the fields are all numerical and are in equal or equivalent units of measurement, in which the values are scaled properly, and in which each field provides meaningful predictive information.
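  • The mapping from original fields to one derived field can be sketched as below. The sketch assumes the tree has already been built and is represented by a hypothetical function `leaf_of` that returns a terminal node for any record; the records and revenue figures are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def derive_field(records, leaf_of, outcome_key):
    """Replace each record's original fields by the mean outcome of its terminal node."""
    by_leaf = defaultdict(list)
    for r in records:
        by_leaf[leaf_of(r)].append(r[outcome_key])
    leaf_mean = {leaf: mean(values) for leaf, values in by_leaf.items()}
    return [leaf_mean[leaf_of(r)] for r in records], leaf_mean


# Hypothetical stand-in for a fitted tree: each record lands in exactly one leaf.
def leaf_of(record):
    return (record["bank_card"], record["housing"] == "owns")


records = [
    {"bank_card": True, "housing": "owns", "revenue": 8.0},
    {"bank_card": True, "housing": "owns", "revenue": 4.0},
    {"bank_card": False, "housing": "rents", "revenue": -10.0},
    {"bank_card": False, "housing": "rents", "revenue": -6.0},
    {"bank_card": True, "housing": "rents", "revenue": 1.0},
]
derived, leaf_mean = derive_field(records, leaf_of, "revenue")
```

  • The boolean and categorical inputs disappear from the derived representation; every record now carries a single number in revenue units.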
  • This new representation permits the use of the powerful Minkowski distance metric to determine inter-record similarity.
  • selected variables are grouped or aggregated for combined processing and analysis. A plurality of such groupings may be utilized, each of which contains information or data derived from fields different from the fields used in other groupings in so far as the desired output is concerned.
  • a grouping of data for analysis to provide an output predictive of one selected outcome could very well be different from a grouping or aggregation of data fields with respect to another outcome. Values of the data in a selected group are categorized with respect to the selected outcome to produce multiple discrete combinations each having a predictive value for the desired outcome.
  • The current invention incorporates procedures which provide refinements of the classical nearest neighbor procedure (cf. Duda & Hart, 1973). This is to be distinguished from existing procedures, such as those for determining the outcome for the test record by observing the outcome associated with its nearest neighbor in the data base (i.e., the record which is most similar to the test record).
  • a common variation of the procedure is to determine the k nearest neighbors (where k typically ranges between 1 and 20) and designate the outcome for the test case as the statistical average of the outcomes observed for the k nearest neighbors when the outcome is numerical or boolean or as the most frequent value when the outcome is categorical. In both of these circumstances, each record within the subset of the k nearest neighbors has equal weight in determining the outcome for the test record.
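  • For reference, the classical equal-weight k-nearest-neighbor rule just described can be sketched as follows. The (distance, outcome) pairs are illustrative, and distances are assumed to have been computed already.

```python
from collections import Counter
from statistics import mean


def knn_predict(neighbors, k, categorical=False):
    """neighbors: list of (distance, outcome) pairs for the stored records."""
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    outcomes = [outcome for _, outcome in nearest]
    if categorical:
        # Most frequent value among the k nearest neighbors.
        return Counter(outcomes).most_common(1)[0][0]
    # Plain average: every neighbor carries equal weight.
    return mean(outcomes)


numeric = [(0.5, 10.0), (1.0, 20.0), (2.0, 30.0), (9.0, 500.0)]
labels = [(0.5, "good"), (1.0, "bad"), (2.0, "good"), (9.0, "bad")]
```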
  • the method incorporating the current invention refines the k-nearest neighbor approach in four specific ways:
  • the number of neighboring records i.e., the population of voters
  • the number of voters or participating records typically ranges from about 50 to 800.
  • The number of participating records varies, depending jointly on two independent criteria which are established to determine the eligibility of each potential participating record. Participating records typically include, or are selected from, those records (a) within a pre-specified numerical distance of the test record, and (b) having a nearness rank (e.g., the 400th closest record) which is less than a pre-specified value. All records which fit these criteria may be included in the population of participating records. Records which do not meet both of the criteria above are not used as participating records, i.e., do not vote.
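  • The two eligibility criteria can be sketched as a simple filter. The parameter names `max_distance` and `max_rank` are hypothetical, and treating "within" as less than or equal is an assumption.

```python
def participating_records(distances, max_distance, max_rank):
    """Indices of records that satisfy both eligibility criteria.

    distances: distance from the test record to each stored record.
    """
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    return [
        i
        for rank, i in enumerate(ranked, start=1)  # rank 1 = closest record
        if rank < max_rank and distances[i] <= max_distance
    ]


distances = [0.9, 0.1, 0.4, 2.5, 0.2]
by_rank = participating_records(distances, max_distance=1.0, max_rank=4)
by_distance = participating_records(distances, max_distance=0.3, max_rank=10)
```

  • Because either criterion can be the binding one, the number of voters varies from one test record to the next.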
  • Each of the participating records has a differential amount of influence on the prediction or diagnosis.
  • the influence of each of the records used is proportional to the similarity of that record to the test record. The more similar records have greater influence. This contrasts with the one record, one vote method of the classical k-nearest neighbor approach.
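  • Similarity-weighted voting can be sketched as a weighted mean. The text does not give the formula relating distance to similarity, so the weight 1 / (1 + distance) used here is purely an assumption for illustration.

```python
def weighted_forecast(voters):
    """voters: list of (distance, outcome) pairs for the participating records.

    Each vote is weighted by an assumed similarity of 1 / (1 + distance).
    """
    weights = [1.0 / (1.0 + distance) for distance, _ in voters]
    weighted_sum = sum(w * outcome for w, (_, outcome) in zip(weights, voters))
    return weighted_sum / sum(weights)


voters = [(0.0, 10.0), (1.0, 20.0), (3.0, 40.0)]
forecast = weighted_forecast(voters)
unweighted = sum(outcome for _, outcome in voters) / len(voters)
```

  • With these numbers the closest record pulls the forecast toward its own outcome, so the weighted result lies below the one record, one vote average.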
  • the voting process can consider the number of eligible participating records and both the central tendency and the variance of the distribution of outcome values for the eligible participating records, i.e., the nearest neighbors, in determining the appropriate outcome for the test record.
  • Prior technology has focused on the central tendency, e.g., the mean, of this distribution.
  • an applicant for credit might be analyzed on the basis of one or more criteria. For example, credit might be approved or denied on the basis of revenue potential and absence of risk.
  • the mean of the participating records represents the best forecast of revenue potential.
  • the standard deviation of this group represents a forecast of the risk associated with the forecasted revenue potential.
  • a dual-criteria decision rule can be utilized which considers both measures in deciding to approve or deny credit to an applicant.
  • the system may respond with the mean outcome value for the entire data base. If the outcome is categorical, the response may be the most common category or alternatively, that there is not enough information to make a choice. When the number of participating records exceeds the pre-specified number, the system reports the mean and standard deviation for numerical or boolean outcome measures and reports the proportion of each category for categorical outcome measures.
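  • The reporting logic and the dual-criteria rule for a numerical outcome might be sketched as below. The thresholds `min_voters`, `min_revenue`, and `max_risk` are hypothetical parameters, the use of the population standard deviation is an assumption, and denying credit when too few records vote is likewise an assumption rather than something the text states.

```python
from statistics import mean, pstdev


def report(voter_outcomes, database_mean, min_voters):
    """Summarize the voters, falling back to the data base mean when too few vote."""
    if len(voter_outcomes) <= min_voters:
        return {"mean": database_mean, "std": None}
    return {"mean": mean(voter_outcomes), "std": pstdev(voter_outcomes)}


def approve(summary, min_revenue, max_risk):
    """Dual-criteria rule: enough forecast revenue and low enough spread."""
    if summary["std"] is None:
        return False  # assumption: no risk estimate means no approval
    return summary["mean"] >= min_revenue and summary["std"] <= max_risk


summary = report([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], database_mean=-1.28, min_voters=5)
fallback = report([1.0, 2.0], database_mean=-1.28, min_voters=5)
```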
  • FIGURE 1 is a logical flow chart illustrating the method of the present invention.
  • FIGURES 2, 3 and 4 are diagrams of decision trees produced by analyses of records of a database in accordance with the method incorporating the present invention.
  • a flat file data base is a set of records in which each record contains information about, i.e., represents, a single subject of the data base, e.g., a person, household, account, or event.
  • Each of the records typically contains a plurality of fields.
  • The fields in each record represent attributes (i.e., characteristics) of the subject of the record. These attributes can be boolean (e.g., yes, no), categorical (e.g., single, married, separated, divorced, widowed), or numerical variables. In some cases, the value of the attribute will not be known (missing values).
  • Each record also contains a field that represents a target event which is the object of the forecast, i.e., the desired outcome.
  • the target event or outcome field can also be a boolean, categorical or numerical variable.
  • Some of the records, i.e., the existing set, contain known values for the target event.
  • One or more other records, i.e., the prediction or test record or set, have unknown values for the target event.
  • a method incorporating the present invention uses the derived information relating the attributes to the outcome in the training set to "predict" the best value for the outcome for each record in the prediction set.
  • a data base of existing credit records can be analyzed.
  • such records may represent information about individuals applying to receive a line of credit, often in the form of a credit card or loan, e.g., to banks, retail stores, or oil companies.
  • the credit evaluation of an applicant can be predicted in accordance with the method incorporating the present invention by comparing the attribute information from an individual's application to the corresponding information contained in the records of prior applicants whose credit record is currently known.
  • The similarity between a new applicant's attributes and those of prior applicants provides a basis for approving or denying credit. For example, a similarity between the new applicant's attributes and those of prior applicants who have abused their credit privileges would provide a basis for denying credit.
  • the predictor attributes commonly consist of several numerical variables, several boolean variables, and several categorical variables.
  • Examples of numerical variables include monthly income, monthly credit payments, monthly housing payment, months at current job, months at current residence, number of 60-day delinquencies, number of balances past due, and number of recent inquiries to credit bureau.
  • Examples of boolean variables include the existence or absence of a savings account, checking account, loan account, bank credit card, and oil company card, and whether certain information has been provided, e.g., a job description.
  • Examples of categorical variables include type of housing, source of application, educational background. There are typically twenty to forty fields which provide useful information.
  • an appropriate outcome measure or target event may be the net revenue (positive or negative) to be derived from the applicant if credit is issued.
  • a subset of the available fields relating to the existence and non-existence of certain financial parameters such as, e.g., bank card, department store card, oil company card, savings account, checking account, as well as housing type, e.g., owns, rents, with parents, can be grouped together.
  • the original information or data in this subset or group of attribute fields is evaluated to produce derived predictive values relative to the target event.
  • the grouping of original data fields is subjected to a binary classification procedure to produce a binary classification tree structure shown in Fig. 2 which segments the database in a useful way.
  • the top box in Fig. 2 indicates that the entire data base represents 191,293 records of credit applicants which produce an average annual net revenue of -1.28.
  • the first split of the data base is based on the existence or absence of a bank card.
  • These two groups are represented by the two boxes in the second row of Fig. 2. As shown in Fig. 2, there are 106,221 records with no bank card when they applied for credit, and there are 85,072 with one or more bank cards when they applied for credit.
  • The group without bank credit cards produced an average net revenue of -6.91, while the group with one or more bank cards produced an average net revenue of 5.76.
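  • As a consistency check on the figures quoted from Fig. 2, the record-weighted average of the two child groups should reproduce the root node. The short sketch below uses only the numbers given in the text.

```python
# (number of records, average net revenue) for the two groups after the first split.
groups = {
    "no bank card": (106_221, -6.91),
    "one or more bank cards": (85_072, 5.76),
}

total_records = sum(n for n, _ in groups.values())
overall_mean = sum(n * m for n, m in groups.values()) / total_records

# Rounded to two decimals for comparison with the root node of the tree.
rounded_mean = round(overall_mean, 2)
```

  • The two groups sum to the 191,293 records of the full data base, and their weighted mean rounds to the stated -1.28.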
  • a binary classification tree represents a recursive process which continues to split the subset into groups as long as meaningful splits are possible.
  • meaningful refers to the creation of two new groups which are significantly different from each other in a statistical sense.
  • each group represents a specific segment of the original data base.
  • the number of records representing individual applicants in each group varies from group to group as does the observed average net revenue produced.
  • Each of the thirty terminal groups can be characterized as follows: the terminal group in the lower left portion of Fig. 2 represents 1956 applicants with no bank card or checking account (including no answer), who rent housing, with no department store card, and without an answer about a savings account.
  • This first group produced an average net revenue of -25.70.
  • the next terminal group is similar, except that the applicants in this group have savings accounts.
  • the second group produces an average net revenue of -20.45.
  • the purpose of the binary classification tree is to produce a mapping relationship such that any set of responses for the six predictor items automatically places the applicant into one and only one of the 28 terminal bins.
  • An applicant inherits the value of the derivative data assigned to the bin defined by the applicant's responses.
  • the bin values represent average net revenue produced by the individuals in the bin.
  • The binary classification tree maps each of the 972 response patterns into one of the 28 terminal bins and thereby assigns a numerical value for each of the 972 response patterns. As is apparent from Fig. 2, however, some of the bins encompass more than one of the possible response patterns.
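  • The text does not break down how the 972 response patterns arise. One breakdown consistent with that count, offered here only as an assumption, is five yes / no / no-answer items plus a housing item with four possible responses, since 3^5 x 4 = 972. The sketch enumerates the patterns under that assumption; the item names are illustrative.

```python
from itertools import product

# Assumed response sets; the source gives only the total of 972 patterns.
three_way = ("yes", "no", "no answer")
items = {
    "bank_card": three_way,
    "department_store_card": three_way,
    "oil_company_card": three_way,
    "savings_account": three_way,
    "checking_account": three_way,
    "housing": ("owns", "rents", "with parents", "no answer"),
}

# Every combination of one response per item is one response pattern.
patterns = list(product(*items.values()))
pattern_count = len(patterns)
```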
  • the analysis along any one branch of such a decision tree may be terminated when certain criteria are no longer met, e.g., the size of the group at the end of the branch falls below a selected value, or the quality of a proposed split does not meet certain criteria. Alternatively, the analysis can be forced beyond such limits if meaningful information can be extracted.
  • This value provides a derived numerical index reflecting the relative profitability of each person based on the information contained in the six predictor variables.
  • The value represents a transformation from a heterogeneous set of categorical characteristics in the reference data base to a single dimension in which the unit of measurement (revenue) has ratio scale properties. This transformation is significant since it converts heterogeneous data into homogeneous data in equal or equivalent units of measurement which are suitable for the nearest neighbor algorithm.
  • Fig. 3 presents a second binary classification tree.
  • The outcome measure is also net revenue, but the splitting variables are orthogonal to, i.e., have no overlap with, the variables of the first tree.
  • The variables used in Fig. 3 include the number of good trades (transactions), time at current residence, income, and months at current job. This tree produces a second mapping relationship which results in the assignment of a ratio scale or derived value to each record based on the responses to the second set of predictor variables.
  • The variables have been classified in a number of different ways. Thus, even though only four attributes are used, several are used more than once, with splits at different values of a given attribute.
  • The terminal bin in the lower right corner of Fig. 3 is based on attributes of more than one good trade, more than two good trades, long time at current residence, income greater than 8, not long time at current job, and income greater than 22. Applicants so classified produce average revenue of 11.19.
  • Another group that can be evaluated on a scalar basis for similar information is age.
  • the revenue produced by age can be evaluated, and segregated by age ranges.
  • the benefit of scaling in this regard is, as indicated above, that changes as a function of age may differ for different values of age, i.e., similar age changes may or may not result in similar changes in revenue performance.
  • the average revenue produced by applicants aged 19 and under is -7.97.
  • the average revenue produced by applicants aged 20 - 28 (nine year range) is -5.31, while applicants aged 29 - 31 (three year range) produce an average income of -2.09.
  • applicants aged 32 - 37 produced an average income of 1.37 while applicants in the 38 - 48 age bracket
  • Such an analysis separates ages, not by some predetermined age model, but from the actual data that indicates the segregation as a result of actual attribute data.
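  • The age scaling can be sketched as a lookup from age to the average outcome of its range. The sketch uses only the four ranges whose values are stated in the text; the value for the 38 - 48 bracket is not given here, so ages above 37 are deliberately left unmapped.

```python
import bisect

# Inclusive upper age bound of each range and its stated average outcome.
UPPER_BOUNDS = [19, 28, 31, 37]
DERIVED = [-7.97, -5.31, -2.09, 1.37]


def derived_age_value(age):
    """Map a raw age to the average outcome of the age range containing it."""
    i = bisect.bisect_left(UPPER_BOUNDS, age)
    if i == len(DERIVED):
        raise ValueError("no value stated in the text for ages above 37")
    return DERIVED[i]


# Two pairs of applicants, each eight years apart in raw age.
gap_young = abs(derived_age_value(28) - derived_age_value(20))
gap_older = abs(derived_age_value(37) - derived_age_value(29))
```

  • Both pairs differ by eight years of raw age, yet the first pair is identical on the derived scale while the second differs by about 3.46, which is the point of scaling a field against the outcome.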
  • the 20 to 40 attributes associated with each application which are predictive of the expected outcome or desired result have been converted into four to six derived attributes based on the mapping sets.
  • the new derived attributes have all of the properties that a Euclidean distance metric requires.
  • Each value is based on the same measuring unit (e.g., in the above example, revenue).
  • Each value has true ratio scale properties.
  • Problematic issues such as variable selection, variable weighting, and variable scaling have been dealt with. Missing values are included since the binary classification tree treats missing values like other values, i.e., groups missing values with other values when they produce similar outcomes and splits them into a separate group when they are associated with unique outcomes.
  • the result of grouping variables in accordance with the present invention is to enable the use of a small number of dimensions, each of which is substantially orthogonal to the others.
  • groups of information are selected which do not share characteristics with each other and therefore lend themselves to analysis separate from the other groups.
  • a predictive response can be produced based on the similarities of the values of the test record, the record for which a prediction is sought, and the values produced by the analysis of the data base. Since the comparison is based on the analysis of the actual data, the reliability and accuracy of the response, as compared to existing techniques, can be improved.
  • a subset of records in the data base which are most similar to the test record are identified. This subset is usually about 1/2 of 1% of the records in the entire database. For example, if the database consisted of 100,000 records, the 500 records which are most similar to the target item are identified.
  • the forecast for the new applicant would be based on a statistical analysis of this subset.
  • the mean value would be the best estimate of the expected value for the new applicant.
  • the standard deviation of this subset would provide an estimate of the stability of the expected value.
  • the standard deviation of the subset provides a direct measure of risk. In performing analyses on a large collection of data, it is first appropriate to determine the nature of the information that is desired. The data is then organized or categorized in a plurality of groupings of data, each of which has the capability of providing information with respect to the desired analysis.
  • the various information provided by credit applicants can be categorized or grouped into a plurality of groups, each of which consists of categories of data capable of providing information with respect to revenue.
  • Data respecting certain like information can be aggregated and processed as a unit to provide output in the form of a set of values of various combinations of the data which are predictive and are related to the ultimate question being investigated.
  • Various combinations of data for each aggregation can be evaluated and values produced respecting the value of the combinations as a function of the outcome being processed. Initially there may be a large number of predictive variables which have a relationship to each other. These are aggregated, and processed together.
  • The predictive variables of credit cards, e.g., types of credit cards and numbers of credit cards, can be aggregated and processed together to produce a set of values for various combinations of credit cards, each being a scalar value having a number which relates to the answer being sought.
  • each combination of predictors will have a value corresponding to credit worthiness or similar function.
  • the various predictors are combined in a way that will be the most helpful to producing information with respect to the desired outcome.
  • the data may be categorized in an effort to split the records substantially equally on either side of the split.
  • Each successive categorization is selected in a way to render the successive level using the predictive values that give the highest quality split.
  • This general rule can be modified so that a split is made to achieve the effectiveness of the variable.
  • the number of records might be small.
  • a split based on that information might occur earlier in the tree so that the records affected by that split still have an effect.
  • a split based on such a small number of records might leave the particular variable with so little influence on the decision tree as to have no effect at all.
  • the split can be further modified in an effort to maintain the size of the groups.
  • Each of the splits is kept roughly equal, once again to avoid a grouping that is so off-center or small that it is ultimately ignored.
  • the initial split can be taken utilizing a particular type of credit card for which the number of card holders is sufficiently small that, if the split were made later on or lower down in the decision tree, its effect would be dissipated.
  • A final decision tree is produced in which the various combinations of credit cards have a series of values representative of revenue, as indicated above.
  • the result of each decision tree analysis is a set of data representing a trait plotted with respect to the outcome to be predicted.
  • each characteristic or trait is established as a function of a particular set or aggregation of data evaluated and analyzed; the units of the traits, as a function of the data analyzed, are the same.
  • the similarity between the test case and the individual cases making up the database can be determined based on the similarities between the values of the trait for each set of data analyzed.
  • the plot of each such trait is often referred to as a dimension of the data base.
  • data in a number of such dimensions is used to identify or characterize each of the records in the database.
  • the location of each known record is thus determined based on the correspondence of each trait of each record to the plotted values of that trait based on the corresponding decision tree analysis.
  • the location of the test case is similarly determined.
  • the test case is compared to a selected number N of most similar cases from the data base, i.e., to the N nearest neighbors.
  • the degree of similarity or the distance between the test case and the known cases is determined by known computational techniques, such as by computing the "Minkowski distance" between the test case and each of the other cases in the data base.
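  • the Minkowski distance is calculated in accordance with the following formula, given here in its standard form: $D(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^{r} \right)^{1/r}$, where $x_i$ and $y_i$ are the values of trait $i$ for the test case and for a known case, $k$ is the number of dimensions, and $r \geq 1$ is the order of the distance.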
  • the prediction for the test case is determined by taking the average of the values of the information being predicted, e.g., credit risk, for the "N" closest neighbors.
  • the accuracy can be improved, as discussed above, by using a weighted average in which the values are weighted as a function of the distance of each known case from the test case.
  • data can be evaluated with the purpose of predicting a selected outcome for a test case by aggregating similar data, manipulating the values of traits for combinations of each group of data to produce for each group scalar values in common units, determining the similarity between a test case and prior cases, and providing a prediction based on the central tendency, e.g., the mean value or the mode, of the target outcome for the most similar cases.
  • the reliability of the predictive values is improved as compared to predictions based on existing techniques.
  • the predictive values in accordance with the present invention are determined by comparison with the most similar records, i.e., those in the "local neighborhood", rather than on global estimates, as occurs when other techniques, such as modeling, are used.
  • the predictive values produced in accordance with the present invention are further enhanced since the test case is compared to the actual data making up the data base, and since data bases are typically updated to incorporate recent records.
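The nearest-neighbor step described in the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `minkowski_distance` and `predict`, the order parameter `r`, the inverse-distance weighting, and the toy records are all hypothetical. Each known record is a vector of trait values (one per dimension, in common units) paired with its known outcome. The sketch also returns the standard deviation of the neighbors' outcomes, which the text treats as a measure of risk.

```python
import math

def minkowski_distance(a, b, r=2.0):
    """Minkowski distance of order r between two equal-length trait vectors."""
    if len(a) != len(b):
        raise ValueError("trait vectors must have the same number of dimensions")
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)

def predict(test_traits, known, n=3, r=2.0, weighted=False):
    """Predict an outcome from the n known records nearest to the test case.

    known is a list of (trait_vector, outcome) pairs. Returns a pair
    (prediction, spread), where spread is the population standard
    deviation of the outcomes of the n nearest neighbors.
    """
    ranked = sorted(
        ((minkowski_distance(test_traits, traits, r), outcome)
         for traits, outcome in known),
        key=lambda pair: pair[0],
    )
    nearest = ranked[:n]
    outcomes = [outcome for _, outcome in nearest]
    if weighted:
        # Assumed weighting scheme: inverse distance, so closer cases count more.
        weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    else:
        weights = [1.0] * len(nearest)
    prediction = sum(w * o for w, o in zip(weights, outcomes)) / sum(weights)
    mean = sum(outcomes) / len(outcomes)
    spread = math.sqrt(sum((o - mean) ** 2 for o in outcomes) / len(outcomes))
    return prediction, spread

# Toy data (hypothetical): two trait dimensions, outcome in arbitrary units.
known = [
    ((1.0, 1.0), 10.0),
    ((1.0, 2.0), 20.0),
    ((2.0, 1.0), 30.0),
    ((9.0, 9.0), 100.0),
]
prediction, spread = predict((1.0, 1.0), known, n=3)
print(prediction, spread)
```

With `weighted=False` the prediction is the plain mean of the three nearest outcomes (20.0 for the toy data). With `weighted=True` the identical record at distance zero dominates, which shows why the choice of weighting function is a design decision rather than a fixed part of the method.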

Abstract

A method of analyzing the records of a database, comprising selecting a target measure associated with a selected outcome, identifying data contained in known records of the database to be used as predictor variables, grouping certain selected predictor variables, producing derived values of the target measure for different combinations of the predictor variables for each group, identifying the derived values for a test record, identifying a selected number of known records that are most similar to the test record with respect to the derived values, identifying the value of the selected outcome for the selected most similar known records, and using that value to predict a selected outcome for the test record.
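The step of producing derived values for combinations of grouped predictor variables can be illustrated with a short Python sketch. It is a simplification under stated assumptions: instead of the decision tree analysis the description refers to, it uses a plain lookup table holding the mean target value for each observed combination. The names `derive_values` and `derived_value`, the field names, and the sample records are hypothetical.

```python
from collections import defaultdict

def derive_values(records, group, target):
    """Map each observed combination of the grouped predictors to the mean target."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        combo = tuple(rec[name] for name in group)
        sums[combo] += rec[target]
        counts[combo] += 1
    return {combo: sums[combo] / counts[combo] for combo in sums}

def derived_value(record, group, table, default):
    """Look up a record's derived value; use default for unseen combinations."""
    return table.get(tuple(record[name] for name in group), default)

# Hypothetical known records: one predictor group about credit cards.
records = [
    {"card_type": "gold", "num_cards": 2, "revenue": 120.0},
    {"card_type": "gold", "num_cards": 2, "revenue": 80.0},
    {"card_type": "basic", "num_cards": 1, "revenue": 30.0},
]
group = ("card_type", "num_cards")
table = derive_values(records, group, "revenue")

# Fallback for combinations never seen in the known records (an assumption).
overall = sum(r["revenue"] for r in records) / len(records)

test_record = {"card_type": "gold", "num_cards": 2}
print(derived_value(test_record, group, table, overall))  # prints 100.0
```

Because every group is mapped to the same target measure, the derived values of different groups share common units, which is what allows them to be compared as dimensions when measuring similarity between records.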
PCT/US1992/002757 1991-04-05 1992-04-06 Diagnostic and forecasting method based on direct analysis of a database WO1992017853A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US68111591A 1991-04-05 1991-04-05
US681,115 1991-04-05

Publications (2)

Publication Number Publication Date
WO1992017853A2 true WO1992017853A2 (fr) 1992-10-15
WO1992017853A3 WO1992017853A3 (fr) 1992-11-26

Family

ID=24733895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1992/002757 WO1992017853A2 (fr) 1991-04-05 1992-04-06 Diagnostic and forecasting method based on direct analysis of a database

Country Status (2)

Country Link
AU (1) AU1791192A (fr)
WO (1) WO1992017853A2 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4905163A (en) * 1988-10-03 1990-02-27 Minnesota Mining & Manufacturing Company Intelligent optical navigator dynamic information presentation and navigation system
US4969114A (en) * 1988-11-14 1990-11-06 Intergraph Corporation Method for determining an intuitively defined spatial relationship among physical entities
US5041972A (en) * 1988-04-15 1991-08-20 Frost W Alan Method of measuring and evaluating consumer response for the development of consumer products
US5077807A (en) * 1985-10-10 1991-12-31 Palantir Corp. Preprocessing means for use in a pattern classification system

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CLASSIFICATION AND REGRESSION TREES, 1984, BREIMAN, L. et al., WADSWORTH & BROOKS/COLE ADVANCE BOOKS AND SOFTWARE. *
COMMUNICATIONS OF THE ACM, Vol. 29, No. 12, (1986), STANFILL, C. et al., "Toward Memory-Based Reasoning", pp. 1213-1228. *
COMPUTER SYSTEMS THAT LEARN, 1991, WEISS, S.M. et al. *
MACHINE LEARNING, Vol. 1, No. 1, (1986), QUINLAN, J.R., "Induction of Decision Trees", pp. 81-106. *
MACHINE LEARNING, Vol. 3, No. 4, (1989), MINGERS, J., "An Empirical Comparison of Selection Measures for Decision-Tree Induction", pp. 319-342. *
MACHINE LEARNING, Vol. 4, No. 2, (1989), MINGERS, J., "An Empirical Comparison of Pruning Methods for Decision Tree Induction", pp. 227-243. *
NEAREST NEIGHBOR (NN) NORMS: NN PATTERN CLASSIFICATION TECHNIQUES, 1991, DASARATHY, B.V., IEEE Computer Society Press. *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997014028A3 (fr) * 1995-10-11 1997-09-25 Luminex Corp Methods and apparatus for multiplexed analysis of clinical specimens
US6524793B1 (en) 1995-10-11 2003-02-25 Luminex Corporation Multiplexed analysis of clinical specimens apparatus and method
US6939720B2 (en) 1995-10-11 2005-09-06 Luminex Corporation Multiplexed analysis of clinical specimens apparatus and method
WO1997014028A2 (fr) * 1995-10-11 1997-04-17 Luminex Corporation Methods and apparatus for multiplexed analysis of clinical specimens
US8676539B2 (en) 2001-07-06 2014-03-18 Ca, Inc. System and method for rapidly locating historical performance data
EP1411763A1 (fr) * 2001-07-06 2004-04-28 Computer Associates Think, Inc. System and method for rapidly locating historical performance data
EP1411763A4 (fr) * 2001-07-06 2006-11-02 Computer Ass Think Inc System and method for rapidly locating historical performance data
US8148171B2 (en) 2001-10-09 2012-04-03 Luminex Corporation Multiplexed analysis of clinical specimens apparatus and methods
WO2011144531A1 (fr) * 2010-05-16 2011-11-24 International Business Machines Corporation Visual enhancement of a data record
US11221333B2 (en) 2011-03-17 2022-01-11 Cernostics, Inc. Systems and compositions for diagnosing Barrett's esophagus and methods of using the same
US10962544B2 (en) 2015-11-25 2021-03-30 Cernostics, Inc. Methods of predicting progression of Barrett's esophagus
CN109344055A (zh) * 2018-09-07 2019-02-15 武汉达梦数据库有限公司 A testing method and testing device
CN109344055B (zh) * 2018-09-07 2020-05-19 武汉达梦数据库有限公司 A testing method and testing device

Also Published As

Publication number Publication date
AU1791192A (en) 1992-11-02
WO1992017853A3 (fr) 1992-11-26

Similar Documents

Publication Publication Date Title
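CN112070125A (zh) A prediction method for imbalanced data sets based on isolation forest learning
CN110956273A (zh) Credit scoring method and system fusing multiple machine learning models
WO2009099448A1 (fr) Methods and systems for score consistency
CN109739844B (zh) Data classification method based on decaying weights
CN113011973B (zh) Method and device for a financial transaction supervision model based on a smart contract data lake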
Chaudhuri Modified fuzzy support vector machine for credit approval classification
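CN112001788B (zh) A credit card default fraud identification method based on the RF-DBSCAN algorithm
CN112700324A (zh) User loan default prediction method based on combining CatBoost with a restricted Boltzmann machine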
Abdou et al. Prediction of financial strength ratings using machine learning and conventional techniques
Marques et al. Using clustering ensemble to identify banking business models
Zhang et al. Analysis and research on library user behavior based on apriori algorithm
WO1992017853A2 (fr) Diagnostic and forecasting method based on direct analysis of a database
Kowalczyk et al. Modelling customer retention with rough data models
Liu The evaluation of classification models for credit scoring
CN116523301A (zh) System for risk rating prediction based on e-commerce big data
Kirkos et al. Audit‐firm group appointment: an artificial intelligence approach
CN115936841A (zh) A method and device for constructing a credit risk assessment model
CN112506930B (zh) A data insight system based on machine learning technology
Clemente et al. Assessing classification methods for churn prediction by composite indicators
Mahalle et al. Data Acquisition and Preparation
Díaz et al. Some experiences applying fuzzy logic to economics
Huang et al. A clustering-based method for business hall efficiency analysis
Setnes et al. Fuzzy target selection in direct marketing
CN114281994B (zh) A text clustering ensemble method and system based on a three-layer weighted model
Piesio et al. Applying machine learning to anomaly detection in car insurance sales

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU MC NL SE

AK Designated states

Kind code of ref document: A3

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU MC NL SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
COP Corrected version of pamphlet

Free format text: PAGES 14 AND 15,DESCRIPTION,REPLACED BY NEW PAGES 14 AND 15 AND PAGES 1/4-4/4,DRAWINGS,REPLACED BY NEW PAGES 1/6-6/6;AFTER THE RECTIFICATION OF OBVIOUS ERRORS AS AUTHORIZED BY THE UNITED STATES PATENT AND TRADEMARK OFFICE IN ITS CAPACITY AS INTERNATIONAL SEARCHING AUTHORITY

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: CA