WO1999062007A1 - A scalable system for clustering of large databases having mixed data attributes - Google Patents

A scalable system for clustering of large databases having mixed data attributes

Info

Publication number
WO1999062007A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
cluster
database
records
model
Prior art date
Application number
PCT/US1999/006717
Other languages
English (en)
French (fr)
Inventor
Usama Fayyad
Paul S. Bradley
Cory Reina
Original Assignee
Microsoft Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corporation filed Critical Microsoft Corporation
Priority to US09/700,606 priority Critical patent/US6581058B1/en
Priority to EP99914207A priority patent/EP1090362A4/en
Publication of WO1999062007A1 publication Critical patent/WO1999062007A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99934 Query formulation, input preparation, or translation
    • Y10S707/99936 Pattern matching access

Definitions

  • the present invention concerns database analysis and more particularly concerns an apparatus and method for clustering of data into groups that capture important regularities and characteristics of the data.
  • DBMS: database management systems
  • clustering: also known as segmentation
  • clusters: groupings of similar data records
  • Data clustering has been used in statistics, pattern recognition, machine learning, and many other fields of science and engineering.
  • Clustering implementations and applications have historically been limited to small data sets with a small number of dimensions or fields.
  • Each data cluster includes records that are more similar to members of the same cluster than they are to the rest of the data. For example, in a marketing application, a company may want to decide whom to target for an ad campaign based on historical data about a set of customers and how they responded to previous campaigns. Employing analysts (statisticians) to build cluster models is expensive, and often not effective for large problems (large data sets with large numbers of fields). Even trained scientists can fail in the quest for reliable clusters when the problem is high-dimensional (i.e. the data has many fields, say more than 20).
  • Clustering is a necessary step in the mining of large databases as it represents a means for finding segments of the data that need to be modeled separately. This is an especially important consideration for large databases where a global model of the entire data typically makes no sense as data represents multiple populations that need to be modeled separately. Random sampling cannot help in deciding what the clusters are. Finally, clustering is an essential step if one needs to perform density estimation over the database (i.e. model the probability distribution governing the data source).
  • Applications of clustering are numerous and include the following broad areas: data mining, data analysis in general, data visualization, sampling, indexing, prediction, and compression. Specific applications in data mining include marketing, fraud detection (in credit cards, banking, and telecommunications), customer retention and churn minimization (in all sorts of services including airlines, telecommunication services, internet services, and web information services in general), direct marketing on the web, and live marketing in Electronic Commerce.
  • Clustering has been formulated in various ways. The fundamental clustering problem is that of grouping together (clustering) data items that are similar to each other. The most general approach to clustering is to view it as a density estimation problem.
  • In one formulation, the number of clusters K is known and the problem is to find the best parameterization of each cluster model.
  • A popular technique for estimating the model parameters is the EM algorithm (see P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results", in Advances in Knowledge Discovery and Data Mining, Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), pp. 153-180, MIT Press, 1996; and A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977).
  • In K-Means, a data item belongs to a single cluster, while in EM each data item is assumed to belong to every cluster, but with a different probability. This of course affects the update step (3) of the algorithm.
  • In K-Means, each cluster is updated based strictly on its membership.
  • In EM, each cluster is updated by contributions from the entire data set, according to the relative probability of membership of each data record in the various clusters.
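  • To make the contrast concrete, the following is a minimal sketch of the two membership rules (NumPy-based, assuming spherical Gaussians; the function names are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def kmeans_memberships(points, centers):
    """Hard assignment: each point contributes to exactly one cluster."""
    # squared Euclidean distance from every point to every center: shape (n, K)
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    w = np.zeros_like(d)
    w[np.arange(len(points)), d.argmin(axis=1)] = 1.0  # one-hot rows
    return w

def em_memberships(points, centers, variances, mix_weights):
    """Soft assignment: each point belongs to every cluster with a probability."""
    n_dims = points.shape[1]
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # height of a spherical Gaussian for every point/cluster pair
    g = np.exp(-0.5 * d / variances) / (2 * np.pi * variances) ** (n_dims / 2)
    p = mix_weights * g                      # scale by cluster fractions
    return p / p.sum(axis=1, keepdims=True)  # each row sums to 1
```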
  • the present invention concerns automated analysis of large databases to extract useful information such as models or predictors from data stored in the database.
  • Clustering is also known as database segmentation.
  • EM: Expectation-Maximization.
  • Discrete data refers to instances wherein the values of a particular field in the database are finite and not ordered. For instance, color is a discrete feature having possible values ⁇ green, blue, red, white, black ⁇ and it makes no sense to impose an ordering on these values (i.e. green > blue?).
  • the E-M process models the discrete fields of the database with Multinomial distributions and the continuous fields of the database with a Gaussian distribution.
  • the Multinomial distribution is associated with each attribute and is characterized by a set of probabilities, one probability for each possible value of the corresponding attribute in the database.
  • the Gaussian distribution for continuous data is characterized by a mean and a covariance matrix.
  • The EM process estimates the parameters of these distributions over the database, as well as the mixture weights defining the probability model for the database.
  • These statistics provide essential summary statistics of the database and allow for a probabilistic interpretation regarding the membership of a given record in a particular cluster.
  • Given a desired number of clusters K, each cluster is represented by a Gaussian distribution over the continuous database variables and a Multinomial distribution for each discrete attribute (characterized by a probability of observing each value of this discrete attribute).
  • the parameters associated with each of these distributions are estimated by the EM algorithm.
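  • As an illustration only (the patent's actual structures are the tables and vectors of Figures 8A-8D), one cluster of such a mixture model could be represented as:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MixedCluster:
    # Multinomial part: one table per discrete attribute, mapping each
    # attribute value to its probability in this cluster (entries sum to 1.0).
    discrete_tables: list          # list of dict[str, float]
    # Gaussian part over the n continuous attributes.
    mean: np.ndarray               # length-n mean vector
    covariance: np.ndarray         # n x n covariance matrix
    # Mixture weight: fraction of the database this cluster represents.
    weight: float
```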
  • One exemplary embodiment of a scalable clustering algorithm accesses a database of records having attributes or data fields of both enumerated discrete and ordered (continuous) values and brings a portion of the data records into a rapid access memory.
  • Each cluster of database records is represented by a table of probabilities summarizing the enumerated, discrete data fields of the data records in this cluster, and a mean and covariance matrix summarizing the continuous attributes of the records in the cluster.
  • Each entry in the probability table for the discrete attributes represents the probability of observing a specific value of a given discrete attribute in the considered cluster.
  • the mean vector and covariance matrix summarize the distribution of the values of the continuous attributes in the considered cluster.
  • the clusters are updated from the database records brought into the rapid access memory.
  • Sufficient statistics for at least some of the database records in the rapid access memory are summarized.
  • The sufficient statistics are made up of a data structure similar to the clusters, i.e. they include a Gaussian distribution over the continuous record attributes and Multinomial distributions for the discrete attributes.
  • the sufficient statistics are stored within the rapid access memory and the database records that are used to derive these sufficient statistics are removed from rapid access memory.
  • A criterion is then evaluated to determine if further data should be accessed from the database to further cluster data records in the database. Based on this evaluation, additional database records are accessed and brought into the rapid access memory for further updating of the cluster model.
  • the invention can be used in data mining to: visualize, summarize, navigate, and predict properties of the data/clusters in a database.
  • the parameters allow one to assign data points (database records) to a cluster in a probabilistic fashion (i.e. a point or database record belongs to all K clusters with a computable, and interpretable probability).
  • Probabilistic clustering also plays an important role in operations such as sampling, indexing, and increasing the efficiency of data access in a database.
  • the invention consists of a new methodology and implementation for scaling the EM algorithm to work with large databases consisting of both discrete and continuous attributes, including ones that cannot be loaded into the main memory of the computer. Without this method, clustering would require significantly more memory, or many unacceptably expensive scans of the data in the database.
  • This invention enables effective and accurate clustering in one database scan or less. Furthermore, known previous computational work on clustering with the EM algorithm addressed datasets that are either discrete or continuous. Often, if the database contained both discrete and continuous fields, the continuous fields were discretized prior to applying the clustering technique. The present invention avoids removing this natural order from certain fields of the database and explicitly addresses the issue of probabilistically clustering a database with both discrete and continuous attributes.
  • Figure 1 is a schematic depiction of a computer system used in practicing an exemplary embodiment of the present invention
  • FIGS. 2 and 3 are schematic depictions of software components for performing data clustering in accordance with an exemplary embodiment of the present invention
  • Figure 4 is a flowchart of the processing steps performed by the computer system in clustering data
  • Figure 5 is a depiction of three clusters over a single, continuous attribute showing their relative positions on a one dimensional scale
  • Figures 6A-6D are data structures described in the parent application that are used in computing a model summary for data clusters based on data having only continuous attributes;
  • Figures 7A and 7B are flow charts of a preferred clustering procedure for use with data having mixed continuous and discrete attributes;
  • Figures 8A-8D are data structures used in computing a clustering model from a database having both continuous and discrete attributes; and Figures 9A and 9B are probability tables depicting sufficient statistics for discrete attributes of a database having both discrete and continuous attribute records.
  • Detailed Description of an Exemplary Embodiment of the Invention
  • the exemplary embodiment of the invention is implemented by software executing on a general purpose computer 20 a schematic of which is shown in Figure 1.
  • Figures 2 and 3 depict software components that define a data mining engine 12 constructed in accordance with the present invention.
  • the data mining engine 12 clusters data records stored on a database 10.
  • the data records have multiple attributes or fields that contain both discrete and continuous data.
  • The database 10 is stored on a single fixed disk storage device or, alternately, can be stored on multiple distributed storage devices accessible to the computer's processing unit 21 over a network.
  • the data mining engine 12 brings data from the database 10 into a memory 22 ( Figure 1) and outputs a clustering model ( Figure 8D).
  • The invention has particular utility in clustering data from a database 10 that contains many more records than can be stored in the computer's main memory 22.
  • Data clustering is particularly important in data mining of large databases as it represents a means for finding segments or subpopulations of the data that are similar.
  • a global model of the entire database makes little sense since the data represents multiple populations that need to be modeled separately.
  • the present invention concerns method and apparatus for determining a model for each cluster that includes a set of attribute/value probabilities for the enumerated discrete data fields and a mean and covariance matrix for the ordered data fields.
  • an application program 14 acts as the client and the data mining engine 12 acts as the server.
  • the application program 14 receives an output model (Figure 8D) and makes use of that model in one of many possible ways mentioned above such as marketing studies and fraud detection etc.
  • For example, consider the five database records of Table 1, each having three enumerated data fields. These records could be used, for example, by people making marketing evaluations of past trends for use in predicting future behavior.
  • the data records describe purchases of motor vehicles. Only five records are depicted for purposes of explanation but a typical database will have many thousands if not millions of such records.
  • The data records are read from the database 10 into a memory of a computer; the number of records stored in memory at a given time is therefore dependent on the amount of computer memory allocated for the clustering process.
  • The number of clusters, three in this example, can be arbitrarily assigned or can be chosen based upon an initial evaluation of the data.
  • the initial values of the attribute/value probability tables for the discrete attributes in each cluster are initialized by some other process (possibly random initialization).
  • Each record from Table 1 is assigned to each of the three clusters with a different probability or membership. This probability of membership of a data record in one of the clusters is computed based upon the attribute values of the record and the cluster attribute/value probability tables (discrete attributes).
  • Cluster 1 represents 3.0 of 10.0 total data records
  • Cluster 2 represents 4.5 of the 10.0 data records
  • Cluster 3 represents the remaining 2.5 of the 10.0 data records
  • Cluster 1 Number of records: 3.0
  • Cluster 2 Number of records: 4.5
  • Cluster 3 Number of records: 2.5, with attribute/value probability table:

    color: R .3, B .25, G .25, W .2
    style: sedan .35, sport .3, truck .35
    sex: male .45, female .55
  • Note that the attribute/value probability tables have a row for each discrete attribute and a column for each value of the corresponding attribute, and that the probabilities for the values of a given attribute of a given cluster sum to 1.0.
  • The cluster attribute/value probabilities for the three clusters are updated based upon the cluster membership probabilities for each of the recently gathered data records, as well as records evaluated during previous steps of gathering data from the database 10.
  • The process of updating these probabilities takes into account the number of data records that have been previously gathered from the database as well as the number of new records that were most recently extracted from the database 10. As a simplified example, assume a total of ten records have been classified and have been used to determine the cluster models shown above.
  • Cluster No. 1 represents 3.0 of the 10 data points already processed. To update the attribute/value probabilities for the discrete attribute Color in cluster No. 1 based on the addition of RecordID #2, the existing table entries are combined with the record's membership weight, as sketched below.
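  • The update formula itself is not spelled out in this excerpt; a plausible weighted-average sketch (the function name and the membership weight of 0.4 are assumptions) is:

```python
def update_table(table, M, observed_value, weight):
    """Fold one record's membership weight into a cluster's probability table.

    table:          dict mapping attribute values to probabilities (sums to 1.0)
    M:              number of records the cluster currently represents
    observed_value: the record's value for this attribute
    weight:         membership weight of the record in this cluster
    """
    return {value: (M * p + weight * (1.0 if value == observed_value else 0.0))
                   / (M + weight)
            for value, p in table.items()}

# A color table of the form shown above, updated with RecordID #2 (color = blue),
# using a hypothetical membership weight of 0.4 in cluster No. 1 (M = 3.0):
color = {"R": 0.3, "B": 0.25, "G": 0.25, "W": 0.2}
print(update_table(color, M=3.0, observed_value="B", weight=0.4))
```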
  • In computing the probability of membership of a given data record in each cluster, we also contemplate not taking into account the fraction of the database represented by a given cluster.
  • The probability of membership of RecordID #2 in each of the 3 clusters would then be proportional to values that do not account for the fraction of data records in each cluster.
  • The records read from the database 10 have discrete data like color as well as ordered (continuous) attributes such as a salary field and an age field.
  • These additional fields are continuous and it makes sense to take the mean and covariance, etc. of the values for these additional fields.
  • a Gaussian having a mean and covariance matrix
  • each record has the additional attributes of 'income' and 'age'.
  • These mixed attribute records are listed below in Table 3. Note that the female who purchased the blue sedan (RecordID #2) is now further classified with the information that she has an income of 46K and an age of 47 years.
  • For each of the records of Table 3, the data mining engine 12 must compute the probability of membership of each data record in each of the three clusters.
  • The discrete attributes are labeled "DiscAtt#1", "DiscAtt#2", ..., "DiscAtt#d", and the remaining continuous attributes make up a numerical vector x.
  • The notation for determining this probability is:

    Prob(record | cluster #) = p(DiscAtt#1 | cluster #) · p(DiscAtt#2 | cluster #) · … · p(DiscAtt#d | cluster #) · p(x | μ, Σ of cluster #)

  • p(DiscAtt#j | cluster #) is computed by looking up the stored probability of DiscAtt#j in the given cluster (i.e. reading the current probability from the attribute/value probability table associated with this cluster).
  • p(x | μ, Σ of cluster #) is calculated by computing the value of x under a normal distribution with mean μ and covariance matrix Σ:

    p(x | μ, Σ) = (2π)^(−n/2) · |Σ|^(−1/2) · exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
  • The EM (expectation maximization) process computes a Gaussian distribution of data about the centroid of each of the K clusters for the ordered dimensions. For each of the data records (having mixed discrete and ordered attributes), a weighting factor is similarly determined, indicating the degree of membership of this data record in a given cluster. In our example with 3 clusters, the weightings are determined by:

    Weight in cluster i = [P(record | cluster i) · P(cluster i)] / [Σⱼ₌₁³ P(record | cluster j) · P(cluster j)]

    where P(cluster j) is the fraction of the data represented by cluster j and P(record | cluster #) is given as above.
  • Figure 5 depicts three Gaussian distribution curves. One dimension is plotted for simplicity, but note that the height of a given Gaussian curve is p([income, age] | cluster #).
  • The Gaussian data distributions G1, G2, G3 summarize data clusters having centroids or means x̄₁, x̄₂, x̄₃ and represent the distributions over the continuous attributes in the 3 clusters in our example.
  • the compactness of the data is generally indicated by the shape of the Gaussian and quantified by the corresponding covariance value and the average value of the cluster is given by the mean.
  • For RecordID #2, the probability that the record is in cluster 1 is computed as (fraction of data points represented by cluster 1) · P(RecordID#2 | cluster 1) = (3.0/10.0) · P(DiscAtt#1 = blue | cluster 1) · P(DiscAtt#2 = sedan | cluster 1) · P(DiscAtt#3 = female | cluster 1) · p([income, age] | μ, Σ of cluster 1) = (3.0/10.0)·(0.2)·(0.5)·(0.7)·h1, where h1 is the height of the cluster 1 Gaussian evaluated at the record's continuous attribute values.
  • The probability that this record is in cluster 2 is (fraction of data points represented by cluster 2) · P(RecordID#2 | cluster 2) = (4.5/10.0)·(0.1)·(0.1)·(0.35)·h2.
  • Likewise, Weight3 is proportional to (2.5/10.0) · P(RecordID#2 | cluster 3).
  • The weights Weight1, Weight2 and Weight3 indicate the "degree of membership" RecordID #2 has in each of the 3 clusters. Knowing these weights, the probability tables are updated as described above and the values of μ and Σ are updated in the cluster model.
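  • As an illustration, the following sketch computes these membership weights for one mixed record (SciPy-based; the MixedCluster fields follow the earlier sketch, and the function names are assumptions rather than the patent's):

```python
import numpy as np
from scipy.stats import multivariate_normal

def membership_weights(discrete_values, x, clusters):
    """Normalized membership weights of one mixed record in K clusters.

    discrete_values: one value per discrete attribute, e.g. ["B", "sedan", "female"]
    x:               continuous attribute vector, e.g. [income, age]
    clusters:        list of MixedCluster (weight, discrete_tables, mean, covariance)
    """
    scores = []
    for c in clusters:
        p = c.weight                                  # fraction of data in cluster
        for table, value in zip(c.discrete_tables, discrete_values):
            p *= table[value]                         # multinomial table lookup
        p *= multivariate_normal.pdf(x, mean=c.mean, cov=c.covariance)  # Gaussian height
        scores.append(p)
    scores = np.asarray(scores)
    return scores / scores.sum()                      # Weight1..WeightK, summing to 1
```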
  • FIG 4 is a flow chart of the process steps performed during a scalable clustering analysis of data in accordance with the present invention.
  • The data structures shown in Figures 6A-6D are initialized.
  • the Figure 6 data structures are augmented with probability tables P (one for each cluster) as seen in Figures 8A - 8D.
  • Clustering is initiated by obtaining 110 a sample data portion from the database 10 and bringing that data portion into a random access memory (into RAM for example, although other forms of random access memory are contemplated) of the computer 20 shown in Figure 1.
  • The gathering of data can be performed using either a sequential scan that uses only a forward pointer to sequentially traverse the data, or an indexed scan that provides a random sampling of data from the database.
  • For an index scan, it is a requirement that data not be accessed multiple times. This can be accomplished by marking data tuples to avoid duplicates, or by a random index generator that does not repeat.
  • It is preferred that a first iteration of sampling data be done randomly. If it is known the data is random within the database, then sequential scanning is acceptable. If it is not known that the data is randomly distributed, then random sampling is needed to avoid an unrepresentative sample of the database.
  • a processor unit 21 of the computer 20 next performs a clustering procedure 120 using the data brought into memory in the step 110 as well as compressed data in two data structures CS, DS ( Figures 8A, 8B).
  • the processor unit 21 assigns data contained within the portion of data brought into memory to a cluster for purposes of recalculating the cluster probabilities for the discrete data attributes and the Gaussian mean and covariance matrix for the continuous data attributes.
  • a data structure for the results or output model of the analysis for the ordered attributes is depicted in Figure 8D.
  • This model includes K data structures for each cluster.
  • the parameters represented in the data structures enable the data mining engine to assign a probability of cluster membership for every data record read from memory.
  • the scalable clustering process needs this probability to determine data record membership in the DS, CS, and RS data sets (discussed below), as part of a data compression step 130.
  • a data compression step 130 in the Figure 4 flowchart summarizes at least some of the data gathered in the present iteration. This summarization is contained in the data structures DS, CS of Figures 8A and 8B.
  • The processor 21 determines 140 whether a stopping criterion has been reached.
  • One stopping criterion that is used is whether the analysis has produced a sufficient model (Figure 8D) by a standard that is described below.
  • a second stopping criterion has been reached if all the data in the database 10 has been used in the analysis.
  • One feature of the invention is the fact that instead of stopping the analysis, the analysis can be suspended.
  • Data in the data structures of Figure 8A - 8D can be saved (either in memory or to disk) and the scalable clustering analysis can then be resumed later.
  • This allows the database 10 to be updated and the analysis resumed to update the clustering statistics without starting from the beginning. It also allows another process to take control of the processor 21 without losing the state of the clustering analysis.
  • the suspension could also be initiated in response to a user request that the analysis be suspended by means of a user actuated control on an interface presented to the user on a monitor 47 while the Clustering analysis is being performed.
  • the present data clustering process is particularly useful for clustering large databases.
  • the process frees up memory so that more data from the database can be accessed. This is accomplished by compressing data and storing sufficient statistics for compressed data in the memory thereby freeing up memory for the gathering of more data from the database.
  • a confidence interval on the Gaussian mean is defined for each of the continuous attributes and a confidence interval is defined for each value in the attribute/value probability table for the discrete attributes.
  • Appendix A describes one process for setting up a confidence interval on the multidimensional Gaussian means associated with the continuous attributes of the K clusters.
  • The model includes probabilities for each attribute value (see Figure 9A) in the range between 0.0 and 1.0.
  • the data mining engine 12 sets up a confidence interval that brackets these probabilities.
  • Confidence intervals are also set up for the continuous attributes for each of the clusters. Assume that for cluster #1 the mean income attribute is $40,000 and the confidence interval is $1500 above and below this value. The age attribute confidence interval for cluster #1 is 45 yrs +/- 2. Now consider the second data record. As calculated above, this data record was assigned to cluster #1 with highest probability of membership.
  • The perturbation technique determines whether to compress a record into the DS data structure (Figure 8A) by temporarily adjusting the cluster parameters. The cluster to which the record is assigned has its probability of membership decreased: the attribute/value probabilities are lowered within the confidence interval for the discrete attributes, and the cluster mean is shifted away from the data record for the continuous attributes. The clusters to which the data record is not assigned have their probability of membership increased, by raising the attribute/value probabilities and shifting the mean toward the data record for the continuous attributes.
  • This process maximizes the possibility that the RecordID #2 will be assigned to a different cluster with highest probability of membership. With these temporary adjustments, the calculations for the data record membership are again performed.
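  • A heavily simplified sketch of this perturbation test, reusing membership_weights from the sketch above (the confidence-interval handling, parameter names, and the omitted table renormalization are all assumptions):

```python
import copy
import numpy as np

def survives_perturbation(discrete_values, x, clusters, assigned, mean_ci, table_ci):
    """Worst-case the current assignment within the confidence intervals.

    mean_ci:  per-cluster vectors of half-widths for the Gaussian mean CIs
    table_ci: half-width applied to attribute/value probability entries
    """
    worst = copy.deepcopy(clusters)
    for k, c in enumerate(worst):
        sign = -1.0 if k == assigned else 1.0
        # assigned cluster: shift its mean away from the record; others: toward it
        c.mean = c.mean + sign * np.sign(x - c.mean) * mean_ci[k]
        for table, value in zip(c.discrete_tables, discrete_values):
            # lower (assigned) or raise (others) the observed value's probability;
            # renormalization of the table is omitted for brevity
            table[value] = float(np.clip(table[value] + sign * table_ci, 0.0, 1.0))
    w = membership_weights(discrete_values, x, worst)
    return int(np.argmax(w)) == assigned  # True: safe to compress into DS
```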
  • the processing step 130 visits each record, attempts to compress that record and if the record can be compressed the vectors of SUM, SUMSQ, and M and the attribute/value probability tables P are all updated.
  • the tables P associated with the DS and CS data structures now contain sufficient statistics of discrete attributes for compressed records that are removed from memory.
  • a second data compression process is called thresholding.
  • An additional alternate threshold process would be to take all the data points assigned to a cluster and compress into DS all the data points where the product of the probabilities is greater than a threshold value.
  • Subclustering is done after all possible data records have been compressed into the DS data structure.
  • The remaining candidates for summarization into the CS data structures (Figure 8B) are first filtered to see if they are sufficiently "close" to an existing CS subcluster or, equivalently, whether their probability of membership in an existing CS subcluster is sufficiently high. If not, a clustering is performed using random starting conditions. Subclusters lacking a requisite number of data points are put back in RS and the remaining subclusters are merged.
  • Hard assignments in classic EM can be accomplished by assigning a data record with weight 1.0 to the subcluster with highest probability of membership and not assigning it to any other subcluster. This procedure will determine k′ candidate subclusters.
  • Set up a new data structure CS_New to contain the set of sufficient statistics, including attribute/value probability tables for the discrete attributes, of records associated with the k′ candidate subclusters determined in this manner. For each set of sufficient statistics in CS_New, if the number of data points represented by these sufficient statistics is below a given threshold, remove the set of sufficient statistics from CS_New and leave the data points generating these sufficient statistics in RS.
  • Set CS_Temp = CS_New ∪ CS, augmenting the set of previously computed sufficient statistics CS with the new ones surviving the filtering in steps 6 and 7. For each set of sufficient statistics s (corresponding to a subcluster) in CS_Temp, determine s′, the set of sufficient statistics in CS_Temp with highest probability of membership in the subcluster represented by s.
  • If the subcluster formed by merging s and s′, denoted merge(s, s′), is such that the maximum standard deviation along any continuous dimension is less than β or the maximum standard deviation of an entry in the attribute/value probability table is greater than β/2 (β in the range [0, 1]), then add merge(s, s′) to CS_Temp and remove s and s′ from CS_Temp.
  • An output or result of the clustering analysis is a data structure designated MODEL which includes an array 152 of pointers to a first vector 154 of n elements (floats) designated 'SUM', a second vector 156 of n elements (floats) designated 'SUMSQ', and a single floating point number 158 designated 'M' and an attribute/value probability table P (entries are floats) such as the table of Figure 9A.
  • the number M represents the number of database records represented by a given cluster.
  • the model includes K entries, one for each cluster.
  • the vector 'SUM' represents the sum of the weighted contribution of each of the n continuous database record attributes that have been read in from the database.
  • A typical record will have a value in the i-th dimension which contributes to each of the K clusters. Therefore the i-th dimension of that record contributes a weighted component to each of the K SUM vectors.
  • a second vector 'SUMSQ' is the sum of the squared components of each record which allows straightforward computation of the diagonal elements of the covariance matrix.
  • the SUMSQ could be a full n x n matrix, allowing the computation of a full n x n covariance matrix. It is assumed for the disclosed exemplary embodiment that the off diagonal elements are zero.
  • a third component of the model is a floating point number 'M'.
  • the number 'M' is determined by totaling the probability of membership for a given cluster over all data points.
  • An additional data structure designated DS in Figure 8 A includes an array of pointers 160 that point to a group of K vectors (the cluster number) of the n continuous attribute elements 162 designated 'SUM', a second group of K vectors 164 designated 'SUMSQ', a group 166 of k floats designated M, and an attribute/value probability table P such as the table shown in Figure 9A.
  • This data structure is similar to the data structure of Figure 8D that describes the MODEL. It contains sufficient statistics of the continuous and discrete attribute values for a number of data records that have been compressed into the Figure 8A data structure shown, rather than maintained as individual records (Figure 8C) in memory.
  • A further data structure designated CS in Figure 8B is an array of c pointers, where each pointer points to an element which consists of a vector of n elements and the associated sufficient statistics.
  • An attribute/value probability table summarizes the discrete attributes of the points compressed into each CS element.
  • The data structure CS also summarizes multiple data points into structures similar to the MODEL data structure; each element represents a subcluster of data records.
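  • MODEL, DS and the CS elements thus share one layout; schematically (an illustrative sketch assuming the diagonal-only SUMSQ of the disclosed embodiment):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SuffStats:
    SUM: np.ndarray    # length-n weighted sum of continuous attribute values
    SUMSQ: np.ndarray  # length-n weighted sum of squares (diagonal covariance only)
    M: float           # total membership weight (records represented)
    P: list            # attribute/value probability table per discrete attribute

# MODEL and DS each hold K SuffStats entries (one per cluster);
# CS holds c entries, one per subcluster of compressed records.
```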
  • Each database record has D = d + n dimensions, including both continuous attributes (n continuous attributes) and discrete attributes (d discrete attributes).
  • the extended clustering procedure 120 ( Figures 7A and 7B) takes the contents of the three data structures RS, DS, CS, stored in the data structures of figures 8A, 8B, 8C and produces a new model.
  • the new model (including the updated cluster attribute/value probability tables P) is then stored in place of the old model ( Figure 8D).
  • the data structures of Figure 8A-8D are initialized 100 ( Figure 4) before any data is read from the database 10.
  • the MODEL data structure of Figure 8D that is copied into the Old_Model data structure is therefore not null.
  • Arbitrary values are chosen as diagonal members (SUMSQ) of the starting model's K covariance matrices.
  • The diagonal values of the starting matrices are chosen to range in size from .8 to 1.2 for data in which the continuous attributes have been normalized into a range [-5, 5].
  • An initial attribute/value probability table for each cluster is also arbitrarily assigned on the first iteration. One approach may be to set all attribute values equally likely.
  • the clustering procedure 120 starts by copying the existing model to create 202 an Old_Model in a data structure like that of Figure 8D.
  • The process next determines 204 the length of the pointer arrays of Figures 8A-8C, computes the total number of data records summarized by the Old_Model, and computes 206 means and covariance matrices from the Old_Model SUM, SUMSQ and M data for the continuous attributes.
  • the set of Old Model means and covariance matrices that are derived from this calculation are stored as a list of length K where each element of the list includes two parts:
  • 1) a vector of length n (called the "mean") which stores the mean of the corresponding Gaussian or cluster; and 2) a matrix of size n × n (called the "CVMatrix") which stores the values of a covariance matrix of the corresponding Gaussian or cluster.
  • the means and covariance matrices are referred to below as "Old SuffStats".
  • The clustering procedure uses an outer product defined for 2 vectors, OUTERPROD(vector1, vector2).
  • The OUTERPROD operation takes 2 vectors of length n and returns their outer product, or the n × n matrix with the entry in row h and column j being vector1(h)*vector2(j).
  • DETERMINANT function computes the determinant of a matrix.
  • The step 206 also uses a function INVERSE that computes the inverse of a matrix.
  • TRANSPOSE returns the transpose of a vector (i.e. changes a column vector to a row vector).
  • A function EXP(z) computes the exponential e^z.
  • a function 'ConvertSuffStats' calculates 206 the mean and covariance matrix from the sufficient statistics stored in a cluster model (SUM,SUMSQ,M) for the continuous attributes.
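  • The conversion is the standard moment identity; a minimal sketch follows (the patent does not state which variance estimator is used, e.g. division by M versus M − 1):

```python
import numpy as np

def convert_suff_stats(SUM, SUMSQ, M):
    """Recover a cluster's mean and diagonal covariance from (SUM, SUMSQ, M)."""
    mean = SUM / M
    var = SUMSQ / M - mean ** 2   # E[x^2] - E[x]^2 along each dimension
    return mean, np.diag(var)     # off-diagonal entries assumed zero
```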
  • Each point of the RS data structure of data records is accessed 210 and used to update the values of SUM, SUMSQ, M (summarizing continuous data attributes) and the attribute/value probability tables derived from the discrete attributes that make up the data structures of the New_Model.
  • a contribution to each of the K clusters is determined for each of the data points in RS by determining the weight (equivalent to the probability of membership in each cluster) of each point under the old model.
  • a weight vector has K elements weight(l), weight(2), ... weight(K) where each element indicates the normalized or fractional assignment of the data point to the corresponding cluster.
  • each data record contributes to all of the K clusters that were set up during initialization.
  • The attribute/value probability tables provide the contribution to the weight (probability of membership) calculation over the discrete attributes. The Gaussian height contribution over the continuous attributes and the contribution from the attribute/value probability table are then scaled to form a weight contribution that takes into account the fraction of data points assigned to cluster l: M_l / M_Total, where M_Total is the total number of data records read thus far from the database.
  • Normalizing the weight factor is performed at a step 214 of the procedure.
  • an outer product is calculated for the relevant vector data point in RS.
  • An update step 218 loops through all K clusters to update the new model data structure by adding the contribution for each data point:
  • New_Model(j).SUM = New_Model(j).SUM + weight(j)*center;
  • New_Model(j).M = New_Model(j).M + weight(j);
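  • One pass over RS might look like the following sketch, reusing membership_weights and the SuffStats layout from the earlier sketches; the weighted-count table update is an assumption:

```python
def update_from_rs(rs_records, old_clusters, new_stats):
    """Add each retained record's weighted contribution to the new model."""
    for discrete_values, x in rs_records:
        w = membership_weights(discrete_values, x, old_clusters)  # K weights
        for j, stats in enumerate(new_stats):
            stats.SUM += w[j] * x
            stats.SUMSQ += w[j] * x * x  # diagonal-only outer product
            stats.M += w[j]
            for table, value in zip(stats.P, discrete_values):
                table[value] = table.get(value, 0.0) + w[j]  # weighted counts
    # once the pass completes, each table's weighted counts are
    # normalized to probabilities
```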
  • a branch 220 is taken to begin the process of updating the New_Model data structure for all the subclusters in CS.
  • a contribution is determined for each subcluster CS (denoted as CSJElem) by determining 230 the weight of each subcluster under the old model (determining probability of membership of the CS sub-cluster in a given model cluster).
  • First, a center vector for the subcluster over the continuous attributes is determined 230 from the relation center = (1/CS_Elem.M)*CS_Elem.SUM.
  • a weight vector has K elements weight(l), weight(2), ... weight(K) where each element indicates the normalized or fractional assignment of a given subcluster to a cluster. This weight is determined 232 for each cluster and the weight factor is normalized at a step 234 of the procedure. An update step 238 for the subcluster of CS loops through all K clusters:
  • New_Model(j).M = New_Model(j).M + weight(j)*CS_Elem.M;
  • The subcluster's table CS_Elem.Attribute_Value_Table is combined with the updated probability table for the new model, New_Model(j).Attribute_Value_Table, based on the number of records summarized by the subcluster and the values in the subcluster attribute/value probability table.
  • a branch 240 is taken to update the New Model using the contents of the data structure DS.
  • a weight of this DS structure is then determined under the Old Model in exactly the same fashion as the weights were determined for the CS structures above and the weight is normalized 254.
  • New_Model(j).SUM = New_Model(j).SUM + weight(j)*DS_Elem.SUM;
  • New_Model(j).SUMSQ = New_Model(j).SUMSQ + weight(j)*DS_Elem.SUMSQ;
  • New_Model(j).Attribute_Value_Table(row i, column h) is updated analogously from the corresponding entry of the DS element's attribute/value probability table;
  • New_Model(j).M = New_Model(j).M + weight(j)*DS_Elem.M;
  • CV_dist = CV_dist + distance(Old_SuffStats(j).CVMatrix, New_SuffStats(j).CVMatrix);
  • Ptable_dist = Ptable_dist + distance(Old_Model(j).Attribute_Value_Table, New_Model(j).Attribute_Value_Table);
  • The distance between attribute/value probability tables may be computed by summing the absolute values of the differences of the table entries and dividing by the total number of entries in the attribute/value table.
  • The stopping criterion determines whether the sum of these two distances and the difference in probability tables summarizing the discrete attributes is less than a stopping tolerance: [(1/(3*K))*(mean_dist + CV_dist + Ptable_dist)] < stop_tol.
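  • A sketch of this table distance and the combined test (function and variable names are assumptions):

```python
def table_distance(old_tables, new_tables):
    """Mean absolute difference between two attribute/value probability tables."""
    diffs = [abs(old_t[v] - new_t[v])
             for old_t, new_t in zip(old_tables, new_tables)
             for v in old_t]
    return sum(diffs) / len(diffs)

def converged(mean_dist, CV_dist, Ptable_dist, K, stop_tol):
    # the three distances are accumulated over all K clusters, then averaged
    return (1.0 / (3 * K)) * (mean_dist + CV_dist + Ptable_dist) < stop_tol
```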
  • If the stopping criterion is satisfied, the New_Model becomes the Model and the procedure returns 268. Otherwise, the New_Model becomes the Old_Model and the procedure branches 270 back to recalculate another New_Model from the then-existing sufficient statistics in RS, DS, and CS.
  • the scalable Expectation Maximization analysis is stopped (rather than suspended) and a resultant model output produced when the test 140 of Figure 4 indicates the Model is good enough.
  • Two alternate stopping criteria (other than a scan of the entire database) are used.
  • A first stopping criterion defines a probability function p(x) to be the quantity

    p(x) = Σₗ₌₁ᴷ (M(l)/N) · g(x | l)

    where x is a data point or vector sampled from the database,
  • M(l) is the scalar weight for the l-th cluster,
  • N is the total number of data points or vectors sampled thus far, and
  • g(x | l) is the probability function for the data point under the l-th cluster: the product of the height of the Gaussian distribution for cluster l evaluated over the continuous attribute values times the product of the values of the attribute/value table associated with cluster l, taking the values of the attributes appearing in x.
  • The criterion tracks f(iter) = Σᵢ₌₁ᴹ log p(xᵢ), evaluated over a sample of M data points.
  • the probability function of a subcluster is determined by calculating the weighting factor in a manner similar to the calculation at step 232.
  • The weighting factors for the k elements of DS are calculated in a manner similar to the step 252 in Figure 8B.
  • A difference parameter is then defined: d_z = f_z − f_{z−1}.
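  • A sketch of this criterion over a held-aside sample (cluster fields follow the earlier MixedCluster sketch, with c.weight standing in for M(l)/N; names are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(sample, clusters):
    """f(iter) = sum over sampled records of log p(x)."""
    f = 0.0
    for discrete_values, x in sample:
        p = sum(
            c.weight  # M(l) / N, the fraction of data in cluster l
            * np.prod([t[v] for t, v in zip(c.discrete_tables, discrete_values)])
            * multivariate_normal.pdf(x, mean=c.mean, cov=c.covariance)
            for c in clusters
        )
        f += np.log(p)
    return f

# monitor d_z = f_z - f_{z-1}; clustering stops once the gain becomes negligible
```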
  • a second stopping criteria is the same as the stopping criteria outlined earlier.
  • K cluster means and covariance matrices are determined and the attribute/value probability tables for the cluster are updated.
  • The variables CV_dist, mean_dist and Ptable_dist are initialized.
  • For each cluster k the newly determined covariance matrix, mean, and attribute/value probability table are compared with a previous iteration for these parameters.
  • a distance between the old mean and the new mean as well as a distance between the new and old covariance matrices and distance between old and new attribute/value probability tables are determined.
  • CV_dist = CV_dist + distance(Old_SuffStats(j).CVMatrix, New_SuffStats(j).CVMatrix);
  • Ptable_dist = Ptable_dist + distance(Old_Model(j).Attribute_Value_Table, New_Model(j).Attribute_Value_Table);
  • The stopping criterion determines whether the sum of these numbers is less than a stopping criterion value: [(1/(3*K))*(mean_dist + CV_dist + Ptable_dist)] < stop_tol.
  • an exemplary data processing system for practicing the disclosed data mining engine invention includes a general purpose computing device in the form of a conventional computer 20, including one or more processing units 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21.
  • the system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • the system memory includes read only memory (ROM) 24 and random access memory (RAM) 25.
  • A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24.
  • the computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media.
  • the hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively.
  • the drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computer 20.
  • Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs), and the like, may also be used in the exemplary operating environment.
  • a number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38.
  • a user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and pointing device 42.
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
  • These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB).
  • a monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48.
  • personal computers typically include other peripheral output devices (not shown), such as speakers and printers.
  • the computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49.
  • the remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in Figure 1.
  • the logical connections depicted in Figure 1 include a local area network (LAN) 51 and a wide area network (WAN) 52.
  • the computer 20 When used in a LAN networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet.
  • the modem 54 which may be internal or external, is connected to the system bus 23 via the serial port interface 46.
  • program modules depicted relative to the computer 20, or portions thereof may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • Let E_lj be the event that the j-th element of the l-th current mean lies between the values L_j^l (lower bound) and U_j^l (upper bound); specifically, L_j^l ≤ x̄_j^l ≤ U_j^l.
  • x̄_j^l is the j-th element of the l-th current mean.
  • The values of L_j^l and U_j^l define the 100(1 − α_l)% confidence interval on x̄_j^l, which is computed from N and the variance estimate S_j^l.
  • N is the number of singleton data points represented by cluster l, including those that have already been compressed in earlier iterations and uncompressed data points.
  • S_j^l is an estimate of the variance of the l-th cluster along dimension j.
  • Let L^l, U^l ∈ Rⁿ be the vectors of lower and upper bounds on the mean of cluster l.
  • The invention assigns data points (the continuous attributes of data records) to Gaussians in a probabilistic fashion. Two different techniques are proposed for determining the integer N, the number of singleton data points over which the Gaussian mean is computed.
  • The first is motivated by the EM Gaussian center update formula, which is computed over all of the data processed so far (whether it has been compressed or not); hence, in the first variant of the Bonferroni CI computation, N is taken to be the number of data elements processed by the Scalable EM algorithm so far.
  • The second variant is motivated by the fact that although the EM Gaussian center update is over all data points, each data point is assigned probabilistically to a given Gaussian in the mixture model; hence, in the second variant of the Bonferroni computations, N is taken to be the rounded integer of the sum of the probabilistic assignments over all data points processed so far.
  • The Bonferroni CI formulation assumes that the Gaussian centers are computed over probabilistic assignments of points to clusters, where the probability that a point x is assigned to cluster l is proportional to

    exp(−(1/2)(x̄^l − x)ᵀ (S^l)⁻¹ (x̄^l − x)).
  • In the EM case, the perturbation becomes a more general optimization problem, and the procedure used in the K-Means case is a special case of the solution of this problem when 0/1 assignments are made between points and clusters.
PCT/US1999/006717 1998-05-22 1999-03-29 A scalable system for clustering of large databases having mixed data attributes WO1999062007A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US09/700,606 US6581058B1 (en) 1998-05-22 1999-03-29 Scalable system for clustering of large databases having mixed data attributes
EP99914207A EP1090362A4 (en) 1998-05-22 1999-03-29 VARIABLE SCALE SYSTEM FOR GROUPING LARGE MIXED DATA ATTRIBUTES DATABASES

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US60/086,410 1998-05-22
US09/083,906 US6263337B1 (en) 1998-03-17 1998-05-22 Scalable system for expectation maximization clustering of large databases
US09/083,906 1998-05-22

Publications (1)

Publication Number Publication Date
WO1999062007A1 true WO1999062007A1 (en) 1999-12-02

Family

ID=22181416

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/006717 WO1999062007A1 (en) 1998-05-22 1999-03-29 A scalable system for clustering of large databases having mixed data attributes

Country Status (2)

Country Link
US (1) US6263337B1
WO (1) WO1999062007A1

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1172740A2 (en) * 2000-06-12 2002-01-16 Ncr International Inc. SQL-based analytic algorithm for cluster analysis
WO2002101581A2 (de) * 2001-06-08 2002-12-19 Siemens Aktiengesellschaft Statistical models for increasing the performance of database operations
US7050932B2 (en) 2002-08-23 2006-05-23 International Business Machines Corporation Method, system, and computer program product for outlier detection
CN100371900C (zh) * 2006-01-19 2008-02-27 华为技术有限公司 Method and system for data synchronization
CN103077253A (zh) * 2013-01-25 2013-05-01 西安电子科技大学 GMM clustering method for high-dimensional massive data under the Hadoop framework
CN104156463A (zh) * 2014-08-21 2014-11-19 南京信息工程大学 A MapReduce-based big data clustering ensemble method
CN106991436A (zh) * 2017-03-09 2017-07-28 东软集团股份有限公司 Noise point detection method and apparatus
WO2018017439A1 (en) * 2016-07-22 2018-01-25 Microsoft Technology Licensing, Llc Clustering applications data for query processing

Families Citing this family (85)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1115897A (ja) * 1997-06-20 1999-01-22 Fujitsu Ltd Interactive data analysis support apparatus and medium storing an interactive data analysis support program
US6449612B1 (en) * 1998-03-17 2002-09-10 Microsoft Corporation Varying cluster number in a scalable clustering system for use with large databases
US6397166B1 (en) * 1998-11-06 2002-05-28 International Business Machines Corporation Method and system for model-based clustering and signal-bearing medium for storing program of same
US7035855B1 (en) * 2000-07-06 2006-04-25 Experian Marketing Solutions, Inc. Process and system for integrating information from disparate databases for purposes of predicting consumer behavior
US6513065B1 (en) * 1999-03-04 2003-01-28 Bmc Software, Inc. Enterprise management system and method which includes summarization having a plurality of levels of varying granularity
US6549907B1 (en) * 1999-04-22 2003-04-15 Microsoft Corporation Multi-dimensional database and data cube compression for aggregate query support on numeric dimensions
WO2001004801A1 (en) * 1999-07-09 2001-01-18 Wild File, Inc. Optimized disk storage defragmentation with swapping capabilities
US6389418B1 (en) * 1999-10-01 2002-05-14 Sandia Corporation Patent data mining method and apparatus
US6931403B1 (en) * 2000-01-19 2005-08-16 International Business Machines Corporation System and architecture for privacy-preserving data mining
US6847924B1 (en) * 2000-06-19 2005-01-25 Ncr Corporation Method and system for aggregating data distribution models
US20050021499A1 (en) * 2000-03-31 2005-01-27 Microsoft Corporation Cluster-and descriptor-based recommendations
US6920458B1 (en) 2000-09-22 2005-07-19 Sas Institute Inc. Model repository
US6665661B1 (en) * 2000-09-29 2003-12-16 Battelle Memorial Institute System and method for use in text analysis of documents and records
US6922660B2 (en) 2000-12-01 2005-07-26 Microsoft Corporation Determining near-optimal block size for incremental-type expectation maximization (EM) algorithms
US6947878B2 (en) * 2000-12-18 2005-09-20 Ncr Corporation Analysis of retail transactions using gaussian mixture models in a data mining system
US20020078064A1 (en) * 2000-12-18 2002-06-20 Ncr Corporation Data model for analysis of retail transactions using gaussian mixture models in a data mining system
US20020099702A1 (en) * 2001-01-19 2002-07-25 Oddo Anthony Scott Method and apparatus for data clustering
US7039638B2 (en) * 2001-04-27 2006-05-02 Hewlett-Packard Development Company, L.P. Distributed data clustering system and method
US20030028504A1 (en) * 2001-05-08 2003-02-06 Burgoon David A. Method and system for isolating features of defined clusters
US7469246B1 (en) 2001-05-18 2008-12-23 Stratify, Inc. Method and system for classifying or clustering one item into multiple categories
US7945600B1 (en) 2001-05-18 2011-05-17 Stratify, Inc. Techniques for organizing data to support efficient review and analysis
US7308451B1 (en) 2001-09-04 2007-12-11 Stratify, Inc. Method and system for guided cluster based processing on prototypes
US7246125B2 (en) * 2001-06-21 2007-07-17 Microsoft Corporation Clustering of databases having mixed data attributes
US7478103B2 (en) * 2001-08-24 2009-01-13 Rightnow Technologies, Inc. Method for clustering automation and classification techniques
US7039622B2 (en) * 2001-09-12 2006-05-02 Sas Institute Inc. Computer-implemented knowledge repository interface system and method
KR100451940B1 (ko) * 2001-09-26 2004-10-08 (주)프리즘엠아이텍 Data analysis system having a customer management function, and method therefor
KR20030060521A (ko) * 2002-01-09 2003-07-16 콤텔시스템(주) Customer data analysis system
US7031969B2 (en) * 2002-02-20 2006-04-18 Lawrence Technologies, Llc System and method for identifying relationships between database records
US6963870B2 (en) * 2002-05-14 2005-11-08 Microsoft Corporation System and method for processing a large data set using a prediction model having a feature selection capability
US7158983B2 (en) 2002-09-23 2007-01-02 Battelle Memorial Institute Text analysis technique
US7133811B2 (en) * 2002-10-15 2006-11-07 Microsoft Corporation Staged mixture modeling
US8321420B1 (en) * 2003-12-10 2012-11-27 Teradata Us, Inc. Partition elimination on indexed row IDs
US20060047655A1 (en) * 2004-08-24 2006-03-02 William Peter Fast unsupervised clustering algorithm
US7415487B2 (en) * 2004-12-17 2008-08-19 Amazon Technologies, Inc. Apparatus and method for data warehousing
EP1831804A1 (de) * 2004-12-24 2007-09-12 Panoratio Database Images GmbH Relational compressed database images (for accelerated querying of databases)
US8175889B1 (en) 2005-04-06 2012-05-08 Experian Information Solutions, Inc. Systems and methods for tracking changes of address based on service disconnect/connect data
US7908242B1 (en) 2005-04-11 2011-03-15 Experian Information Solutions, Inc. Systems and methods for optimizing database queries
US8185547B1 (en) * 2005-07-29 2012-05-22 Teradata Us, Inc. Data analysis based on manipulation of large matrices on a persistent storage medium
US7975044B1 (en) * 2005-12-27 2011-07-05 At&T Intellectual Property I, L.P. Automated disambiguation of fixed-server-port-based applications from ephemeral applications
US8032675B2 (en) * 2005-12-28 2011-10-04 Intel Corporation Dynamic memory buffer allocation method and system
WO2008022289A2 (en) 2006-08-17 2008-02-21 Experian Information Services, Inc. System and method for providing a score for a used vehicle
US8788701B1 (en) * 2006-08-25 2014-07-22 Fair Isaac Corporation Systems and methods for real-time determination of the semantics of a data stream
US7912865B2 (en) 2006-09-26 2011-03-22 Experian Marketing Solutions, Inc. System and method for linking multiple entities in a business database
US8036979B1 (en) 2006-10-05 2011-10-11 Experian Information Solutions, Inc. System and method for generating a finance attribute from tradeline data
US8606666B1 (en) 2007-01-31 2013-12-10 Experian Information Solutions, Inc. System and method for providing an aggregation tool
US8606626B1 (en) * 2007-01-31 2013-12-10 Experian Information Solutions, Inc. Systems and methods for providing a direct marketing campaign planning environment
US8285656B1 (en) 2007-03-30 2012-10-09 Consumerinfo.Com, Inc. Systems and methods for data verification
US7742982B2 (en) 2007-04-12 2010-06-22 Experian Marketing Solutions, Inc. Systems and methods for determining thin-file records and determining thin-file risk levels
US20080294540A1 (en) 2007-05-25 2008-11-27 Celka Christopher J System and method for automated detection of never-pay data sets
US8301574B2 (en) 2007-09-17 2012-10-30 Experian Marketing Solutions, Inc. Multimedia engagement study
US9690820B1 (en) 2007-09-27 2017-06-27 Experian Information Solutions, Inc. Database system for triggering event notifications based on updates to database records
US8312033B1 (en) 2008-06-26 2012-11-13 Experian Marketing Solutions, Inc. Systems and methods for providing an integrated identifier
US7991689B1 (en) 2008-07-23 2011-08-02 Experian Information Solutions, Inc. Systems and methods for detecting bust out fraud using credit data
US20100332292A1 (en) 2009-06-30 2010-12-30 Experian Information Solutions, Inc. System and method for evaluating vehicle purchase loyalty
US8364518B1 (en) 2009-07-08 2013-01-29 Experian Ltd. Systems and methods for forecasting household economics
US8788499B2 (en) * 2009-08-27 2014-07-22 Yahoo! Inc. System and method for finding top N pairs in a map-reduce setup
US8402027B1 (en) * 2010-02-11 2013-03-19 Disney Enterprises, Inc. System and method for hybrid hierarchical segmentation
US9652802B1 (en) 2010-03-24 2017-05-16 Consumerinfo.Com, Inc. Indirect monitoring and reporting of a user's credit data
US8725613B1 (en) 2010-04-27 2014-05-13 Experian Information Solutions, Inc. Systems and methods for early account score and notification
US9152727B1 (en) 2010-08-23 2015-10-06 Experian Marketing Solutions, Inc. Systems and methods for processing consumer information for targeted marketing applications
US8639616B1 (en) 2010-10-01 2014-01-28 Experian Information Solutions, Inc. Business to contact linkage system
US9147042B1 (en) 2010-11-22 2015-09-29 Experian Information Solutions, Inc. Systems and methods for data verification
US8484212B2 (en) * 2011-01-21 2013-07-09 Cisco Technology, Inc. Providing reconstructed data based on stored aggregate data in response to queries for unavailable data
US20120209880A1 (en) * 2011-02-15 2012-08-16 General Electric Company Method of constructing a mixture model
US9026591B2 (en) 2011-02-28 2015-05-05 Avaya Inc. System and method for advanced communication thread analysis
US8990047B2 (en) 2011-03-21 2015-03-24 Becton, Dickinson And Company Neighborhood thresholding in mixed model density gating
US9483484B1 (en) * 2011-05-05 2016-11-01 Veritas Technologies Llc Techniques for deduplicated data access statistics management
US9483606B1 (en) 2011-07-08 2016-11-01 Consumerinfo.Com, Inc. Lifescore
AU2012281182B2 (en) 2011-07-12 2015-07-09 Experian Information Solutions, Inc. Systems and methods for a large-scale credit data processing architecture
US9853959B1 (en) 2012-05-07 2017-12-26 Consumerinfo.Com, Inc. Storage and maintenance of personal data
US9697263B1 (en) 2013-03-04 2017-07-04 Experian Information Solutions, Inc. Consumer data request fulfillment system
US10102536B1 (en) 2013-11-15 2018-10-16 Experian Information Solutions, Inc. Micro-geographic aggregation system
US9529851B1 (en) 2013-12-02 2016-12-27 Experian Information Solutions, Inc. Server architecture for electronic data quality processing
US10262362B1 (en) 2014-02-14 2019-04-16 Experian Information Solutions, Inc. Automatic generation of code for attributes
US9576030B1 (en) 2014-05-07 2017-02-21 Consumerinfo.Com, Inc. Keeping up with the joneses
US10445152B1 (en) 2014-12-19 2019-10-15 Experian Information Solutions, Inc. Systems and methods for dynamic report generation based on automatic modeling of complex data structures
US10789547B2 (en) * 2016-03-14 2020-09-29 Business Objects Software Ltd. Predictive modeling optimization
WO2018039377A1 (en) 2016-08-24 2018-03-01 Experian Information Solutions, Inc. Disambiguation and authentication of device users
US20180096018A1 (en) * 2016-09-30 2018-04-05 Microsoft Technology Licensing, Llc Reducing processing for comparing large metadata sets
WO2018144612A1 (en) 2017-01-31 2018-08-09 Experian Information Solutions, Inc. Massive scale heterogeneous data ingestion and user resolution
US10963434B1 (en) 2018-09-07 2021-03-30 Experian Information Solutions, Inc. Data architecture for supporting multiple search models
US11941065B1 (en) 2019-09-13 2024-03-26 Experian Information Solutions, Inc. Single identifier platform for storing entity data
US11775408B2 (en) * 2020-08-03 2023-10-03 Adp, Inc. Sparse intent clustering through deep context encoders
US11880377B1 (en) 2021-03-26 2024-01-23 Experian Information Solutions, Inc. Systems and methods for entity resolution
WO2022269370A1 (en) * 2021-06-25 2022-12-29 L&T Technology Services Limited Method and system for clustering data samples

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5832482A (en) * 1997-02-20 1998-11-03 International Business Machines Corporation Method for mining causality rules with applications to electronic commerce

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5329596A (en) * 1991-09-11 1994-07-12 Hitachi, Ltd. Automatic clustering method
US5619709A (en) 1993-09-20 1997-04-08 Hnc, Inc. System and method of context vector generation and retrieval
US5832182A (en) * 1996-04-24 1998-11-03 Wisconsin Alumni Research Foundation Method and system for data clustering for very large databases
US5875285A (en) * 1996-11-22 1999-02-23 Chang; Hou-Mei Henry Object-oriented data mining and decision making system
US5884305A (en) * 1997-06-13 1999-03-16 International Business Machines Corporation System and method for data mining from relational data by sieving through iterated relational reinforcement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
G. A. F. Seber: "Multivariate Observations", 1984, John Wiley & Sons
See also references of EP1090362A4

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1172740A3 (en) * 2000-06-12 2005-05-11 Ncr International Inc. SQL-based analytic algorithm for cluster analysis
JP2002092009A (ja) * 2000-06-12 2002-03-29 Ncr Internatl Inc Data retrieval method and apparatus based on SQL-based analytic algorithms
EP1172740A2 (en) * 2000-06-12 2002-01-16 Ncr International Inc. SQL-based analytic algorithm for cluster analysis
US7149649B2 (en) 2001-06-08 2006-12-12 Panoratio Database Images Gmbh Statistical models for improving the performance of database operations
WO2002101581A3 (de) * 2001-06-08 2003-09-12 Siemens AG Statistical models for improving the performance of database operations
WO2002101581A2 (de) * 2001-06-08 2002-12-19 Siemens Aktiengesellschaft Statistical models for improving the performance of database operations
US7050932B2 (en) 2002-08-23 2006-05-23 International Business Machines Corporation Method, system, and computer program product for outlier detection
CN100371900C (zh) * 2006-01-19 2008-02-27 华为技术有限公司 Method and system for data synchronization
CN103077253A (zh) * 2013-01-25 2013-05-01 西安电子科技大学 GMM clustering method for high-dimensional massive data under the Hadoop framework
CN103077253B (zh) * 2013-01-25 2015-09-30 西安电子科技大学 GMM clustering method for high-dimensional massive data under the Hadoop framework
CN104156463A (zh) * 2014-08-21 2014-11-19 南京信息工程大学 MapReduce-based big data clustering ensemble method
WO2018017439A1 (en) * 2016-07-22 2018-01-25 Microsoft Technology Licensing, Llc Clustering applications data for query processing
CN106991436A (zh) * 2017-03-09 2017-07-28 东软集团股份有限公司 Noise point detection method and apparatus

Also Published As

Publication number Publication date
US6263337B1 (en) 2001-07-17

Similar Documents

Publication Publication Date Title
US6581058B1 (en) Scalable system for clustering of large databases having mixed data attributes
WO1999062007A1 (en) A scalable system for clustering of large databases having mixed data attributes
US6449612B1 (en) Varying cluster number in a scalable clustering system for use with large databases
US6374251B1 (en) Scalable system for clustering of large databases
US6012058A (en) Scalable system for K-means clustering of large databases
US6633882B1 (en) Multi-dimensional database record compression utilizing optimized cluster models
US7246125B2 (en) Clustering of databases having mixed data attributes
US6115708A (en) Method for refining the initial conditions for clustering with applications to small and large database clustering
Bradley et al. Scaling EM (expectation-maximization) clustering to large databases
US7113958B1 (en) Three-dimensional display of document set
US6263334B1 (en) Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases
Halkidi et al. Quality scheme assessment in the clustering process
Harikumar et al. K-medoid clustering for heterogeneous datasets
Zhang et al. BIRCH: A new data clustering algorithm and its applications
Govaert et al. An EM algorithm for the block mixture model
US6584220B2 (en) Three-dimensional display of document set
WO2000065479A1 (en) Multi-dimensional database and data cube compression for aggregate query support on numeric dimensions
CN109886334B (zh) Privacy-preserving shared-nearest-neighbor density peak clustering method
EP1145184A2 (en) Method and apparatus for scalable probabilistic clustering using decision trees
CN113298230A (zh) Prediction method for imbalanced data sets based on generative adversarial networks
Chhikara The state of the art in credit evaluation
Palpanas et al. Using datacube aggregates for approximate querying and deviation detection
EP1090362A1 (en) A scalable system for clustering of large databases having mixed data attributes
Ye et al. Fast search in large-scale image database using vector quantization
Christen et al. Scalable parallel algorithms for surface fitting and data mining

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 EP: The EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
WWE WIPO information: entry into national phase

Ref document number: 1999914207

Country of ref document: EP

WWE WIPO information: entry into national phase

Ref document number: 09700606

Country of ref document: US

WWP WIPO information: published in national office

Ref document number: 1999914207

Country of ref document: EP