US8452770B2 - Constrained nonnegative tensor factorization for clustering - Google Patents
- Publication number
- US8452770B2 (application US12/837,021)
- Authority
- US
- United States
- Prior art keywords
- nonnegative
- objective function
- determining
- matrix
- tensor factorization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
- G06F17/30598
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2133—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on naturality criteria, e.g. with non-negative factorisation or negative correlation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Definitions
- Clustering is a well-known data mining construct used to identify groups of similar objects. Unsupervised clustering splits data purely based on the distance between data points, which can make clustering results unreliable.
- Prior knowledge or domain knowledge can be used to constrain or guide a clustering process in order to produce more acceptable data partitions.
- Semi-supervised clustering uses limited prior knowledge together with unlabeled data to achieve better clustering performance.
- Semi-supervised clustering typically employs two types of prior knowledge, class labels and pairwise constraints, to improve upon results obtained from unsupervised clustering.
- Semi-supervised clustering based on Nonnegative Matrix Factorization (NMF) penalizes the NMF objective function using constraints. NMF factorizes an input nonnegative matrix into a product of two new matrices of lower rank. Semi-supervised clustering through matrix factorization has been shown to substantially improve clustering accuracy by incorporating prior knowledge into the factorization process.
- a method of clustering a plurality of information items using nonnegative tensor factorization may include receiving, by a processing device, one or more class labels, where each class label corresponds to an information item, receiving, by the processing device, a selection for a nonnegative tensor factorization model having an associated objective function, receiving, by the processing device, one or more parameter values, where each parameter value corresponds to one of one or more penalty constraints, determining, by the processing device, a constrained objective function including the one or more penalty constraints, where the constrained objective function is based on the objective function associated with the selected nonnegative tensor factorization model, the one or more parameter values and the one or more class labels, and determining, by the processing device, clusters for the plurality of information items by evaluating the constrained objective function.
- a method of clustering a plurality of information items using nonnegative tensor factorization may include receiving, by a processing device, one or more pairwise constraints, where each pairwise constraint corresponds to a plurality of information items, receiving, by the processing device, a selection for a nonnegative tensor factorization model having an associated objective function, receiving, by the processing device, one or more parameter values, where at least one parameter value corresponds to each of one or more penalty constraints, determining, by the processing device, a constrained objective function including the one or more penalty constraints, where the constrained objective function is based on the objective function associated with the selected nonnegative tensor factorization model, the one or more parameter values and the one or more pairwise constraints, and determining, by the processing device, clusters for the plurality of information items by evaluating the constrained objective function.
- a system for clustering information using nonnegative tensor factorization may include a processor, and a processor-readable storage medium in communication with the processor.
- the processor-readable storage medium may contain one or more programming instructions for performing the following when executed by the processor: receiving one or more class labels, where each class label corresponds to an information item, receiving a selection for a nonnegative tensor factorization model having an associated objective function, receiving one or more parameter values, where each parameter value corresponds to one of one or more penalty constraints, determining a constrained objective function including the one or more penalty constraints, where the constrained objective function is based on the objective function associated with the selected nonnegative tensor factorization model, the one or more parameter values and the one or more class labels, and determining clusters for the plurality of nodes by evaluating the constrained objective function.
- a system for clustering information using nonnegative tensor factorization may include a processor, and a processor-readable storage medium in communication with the processor.
- the processor-readable storage medium may contain one or more programming instructions for performing the following when executed by the processor: receiving one or more pairwise constraints, where each pairwise constraint corresponds to a plurality of information items, receiving a selection for a nonnegative tensor factorization model having an associated objective function, receiving one or more parameter values, where at least one parameter value corresponds to each of one or more penalty constraints, determining a constrained objective function including the one or more penalty constraints, where the constrained objective function is based on the objective function associated with the selected nonnegative tensor factorization model, the one or more parameter values and the one or more pairwise constraints, and determining clusters for the plurality of information items by evaluating the constrained objective function.
- FIG. 1 depicts a flow chart of an exemplary method of performing nonnegative tensor factorization with partial class label constraints according to an embodiment.
- FIG. 2 depicts a flow chart of an exemplary method of performing nonnegative tensor factorization with pairwise constraints according to an embodiment.
- FIG. 3 depicts a block diagram of exemplary internal hardware that may be used to contain or implement program instructions according to an embodiment.
- FIGS. 4A-4D depict the clustering performance of nonnegative tensor factorization with pairwise constraints of authors according to an example with respect to two datasets.
- FIGS. 5A and 5B depict the clustering performance of nonnegative tensor factorization with pairwise constraints of words according to an example.
- An “information item” is a data element corresponding to an object or event.
- a document may be described by an information item having author information, term information and publication date (i.e., time) information.
- an email communication may have an information item having sender, receiver and time information.
- Each information item may have a plurality of associated pieces of information.
- a “cluster” is a group of information items that are similar in some way.
- a “matrix” is an array of values having two dimensions.
- a “tensor” is an array of values having three or more dimensions.
- a “factorization model” is a mathematical model used to cluster items.
- a “nonnegative tensor factorization model” is a factorization model for which the input tensor entries and the output component matrices or tensors are nonnegative.
- An “objective function” is a mathematical function to be maximized or minimized in optimization theory.
- a “constrained” objective function is an objective function having one or more constraints.
- a “class label” represents prior knowledge with respect to whether particular information is associated with a particular class. For example, prior knowledge that an email was sent within a particular time range may cause the email to be classified within a particular class associated with the time range based on such prior knowledge. Initial classification of the email based on the time information class label may be updated based on other information associated with the email.
- a “pairwise constraint” is a constraint between two or more elements to be clustered.
- Types of pairwise constraints may include, for example and without limitation, a “must-link” constraint or a “cannot link” constraint.
- A must-link constraint imposes a penalty when the link between the linked elements is broken, i.e., when the elements are not placed in the same class.
- A cannot-link constraint imposes a similar penalty when the constraint between the linked elements is broken, i.e., when the linked elements are placed in the same class.
- a “penalty constraint” is a mathematical constraint used to weight prior knowledge associated with an information type.
- Matrix factorization is limited because it cannot account for multi-way data factorization.
- Publications over different time periods can be represented as a three-way dataset of authors × terms × time.
- In email communications, the emails can be represented as sender × receiver × time.
- Other clustering environments may include web page personalization (user × query word × webpage), high-order web link analysis (web page × web page × anchor text) and/or the like.
- multi-way data analysis methods e.g., tensor factorization
- Parafac is a multi-linear form of decomposition for an objective tensor. Each entry of, for example, a three-way tensor is approximated by a linear combination of three vectors.
- the Tucker model is a multi-way component analysis that attempts to provide an optimal low rank approximation of a tensor in given dimensions. Many multi-way models are extensions or modifications of the general models.
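The Parafac approximation described above can be sketched in NumPy. This is a minimal illustration rather than code from the disclosure: the factor names `U`, `V` and `S` follow the notation used later in this description, and the sizes and random values are arbitrary assumptions.

```python
import numpy as np

def parafac_reconstruct(U, V, S):
    """Rebuild a three-way array from Parafac factors.

    Entry (i, j, l) is sum_r U[i, r] * V[j, r] * S[l, r].
    """
    return np.einsum("ir,jr,lr->ijl", U, V, S)

rng = np.random.default_rng(0)
n, m, p, k = 4, 3, 2, 2   # hypothetical sizes: rows, columns, occasions, components
U, V, S = rng.random((n, k)), rng.random((m, k)), rng.random((p, k))
X_hat = parafac_reconstruct(U, V, S)   # shape (n, m, p)
```

Each of the `k` components contributes one rank-one term, so the number of components plays the role of the number of clusters in the methods described here.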
- NTF constrained nonnegative tensor factorization
- Scalar values are represented by lowercase letters (e.g., $x$), and vectors are represented using boldface lowercase letters (e.g., $\mathbf{x}$).
- Matrices are represented by boldface uppercase letters (e.g., $\mathbf{X}$), where the $i$-th column of $\mathbf{X}$ is $\mathbf{x}_i$ and the $(i, j)$-th entry is $x_{ij}$.
- Tensors are represented by boldface underlined capital letters (e.g., $\underline{\mathbf{X}}$), which can be unfolded in the $n$-th mode to form a matrix $\mathbf{X}_{(n)}$.
- The $c$-th frontal slice of $\underline{\mathbf{X}}$ is formed by holding the last mode of the multi-way array fixed at $c$.
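Mode-n unfolding and frontal slices can be illustrated with a short NumPy sketch. The column ordering of an unfolding is a convention. The ordering below is an assumption, chosen so that the mode-1 unfolding of an exact Parafac model equals $U (S * V)^T$, which matches the update rules given later in this description.

```python
import numpy as np

def unfold(X, mode):
    """Unfold a three-way array X (n x m x p) along `mode` (0, 1 or 2).

    Assumed column ordering for mode 0: the occasion index varies slowest
    and the column index fastest, i.e. unfold(X, 0)[i, l * m + j] == X[i, j, l].
    """
    order = {0: (0, 2, 1), 1: (1, 2, 0), 2: (2, 1, 0)}[mode]
    return X.transpose(order).reshape(X.shape[mode], -1)

def frontal_slice(X, c):
    """Return the c-th frontal slice: the last mode is held fixed at c."""
    return X[:, :, c]

X = np.arange(24).reshape(2, 3, 4)   # toy 2 x 3 x 4 array
X1 = unfold(X, 0)                    # shape (2, 12)
```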
- The symbol $\otimes$ represents the Kronecker product.
- The Kronecker product of the matrix $A \in \mathbb{R}^{a \times b}$ and the matrix $B \in \mathbb{R}^{c \times d}$ is a matrix $C \in \mathbb{R}^{ac \times bd}$, where each entry in $C$ is the product of entries from $A$ and $B$, respectively.
- The symbol $*$ denotes the Khatri-Rao product. This product assumes the partitions of the matrices are their columns.
- If $A$ is an $m$-by-$n$ matrix and $B$ is a $p$-by-$n$ matrix, then $A * B$ is an $mp$-by-$n$ matrix of which each column is the Kronecker product of the corresponding columns of $A$ and $B$.
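The two products can be compared directly in NumPy. `np.kron` is NumPy's Kronecker product. The helper `khatri_rao` is a hypothetical name introduced here for the column-wise variant, and the matrices are arbitrary toy values.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (m x n) and B (p x n) -> (m*p x n)."""
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")
    # Row a * p + b of the result holds A[a, r] * B[b, r] for each column r.
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

A = np.array([[1.0, 2.0], [3.0, 4.0]])               # 2 x 2
B = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])   # 3 x 2

C = np.kron(A, B)      # Kronecker product, shape (2*3, 2*2)
K = khatri_rao(A, B)   # Khatri-Rao product, shape (2*3, 2)
```

The Khatri-Rao product keeps the number of columns fixed, which is why it pairs naturally with factor matrices that share the same number of components.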
- Nonnegative tensor factorization: one characteristic of a nonnegative tensor factorization is that the entries of the input tensor and the output component tensors are nonnegative.
- the objective function may be represented as follows:
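As a sketch only, a nonnegative Parafac objective is commonly written in the slice-wise form below. This form is an assumption based on the standard model and on the symbols used in this description ($S_l$ as a diagonal matrix built from the $l$-th row of $S$, and the Khatri-Rao product $*$). $X_l$ denotes the $l$-th frontal slice of the input tensor.

```latex
\min_{U \ge 0,\; V \ge 0,\; S \ge 0} \;
  \sum_{l=1}^{p} \left\| X_l - U S_l V^{T} \right\|_F^{2}
  \;=\;
\min_{U \ge 0,\; V \ge 0,\; S \ge 0} \;
  \left\| X_{(1)} - U (S * V)^{T} \right\|_F^{2}
```

The two sides agree when the mode-1 unfolding orders its columns consistently with the Khatri-Rao product $S * V$.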
- The nonnegative Parafac model is modified to incorporate penalty constraint terms for breaking constraints. It will be apparent to one of ordinary skill in the art that the penalty constraint terms disclosed below may be applied to the Tucker3 nonnegative tensor factorization model or any other nonnegative tensor factorization model in a corresponding manner.
- FIG. 1 depicts a flow chart of an exemplary method of clustering a plurality of information items using nonnegative tensor factorization with partial class label constraints according to an embodiment.
- one or more class labels may be identified 105 .
- Each class label may correspond to an information item to be clustered.
- Class labels for a first information type may be represented as a matrix U 0 , where the rows represent information of the first information type, and the columns represent classes.
- Each entry of matrix U 0 may represent the probability of the information corresponding to the row of the entry belonging to a class corresponding to the column of the entry based on prior knowledge.
- a row of U 0 having all zeros may represent information for which no knowledge regarding a class label is present.
- class labels for second and third information types may be represented as matrices V 0 and S 0 , respectively.
- each entry of U 0 may represent the probability that an author (represented by the row) is associated with a particular class
- each entry of V 0 may represent the probability that a term is associated with a particular class
- each entry of S 0 may represent the probability that a time is associated with a particular class.
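The class-label matrices described above can be built as in the following sketch. It assumes hard labels (probability 1 for a single known class), although the text allows general probabilities. The function name and the example indices are hypothetical. The diagonal indicator matrix `E_u`, which appears in the penalty terms of this description, is derived from the nonzero rows of `U0`.

```python
import numpy as np

def label_matrices(n_items, n_classes, known_labels):
    """Build U0 (n_items x n_classes) and the diagonal indicator E_u.

    known_labels maps an item index to a class index. Items that are
    absent get an all-zero row in U0 and a 0 on the diagonal of E_u.
    """
    U0 = np.zeros((n_items, n_classes))
    for item, cls in known_labels.items():
        U0[item, cls] = 1.0   # assumed hard label: probability 1
    E_u = np.diag((U0.sum(axis=1) > 0).astype(float))
    return U0, E_u

# Hypothetical example: 5 authors, 3 classes, labels known for authors 0 and 3.
U0, E_u = label_matrices(5, 3, {0: 2, 3: 0})
```

The same helper would apply to terms ($V_0$) and times ($S_0$) by changing the item and class counts.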
- identifying 105 one or more class labels may be performed by receiving one or more class labels by a processing device, such as the one described in reference to FIG. 3 below.
- a nonnegative tensor factorization model may then be determined 110 .
- the nonnegative tensor factorization model is associated with an objective function.
- the objective function may be the nonnegative Parafac tensor factorization model.
- the objective function may be the nonnegative Tucker3 tensor factorization model. Alternate models may also be used within the scope of this disclosure.
- determining 110 the nonnegative tensor factorization model may be performed by receiving a selection for the nonnegative tensor factorization model by a processing device, such as the one described in reference to FIG. 3 .
- One or more parameter values, each corresponding to a penalty constraint, may be determined 115 with respect to the class labels.
- the penalty constraints may be used to incorporate class label information into a tensor factorization model.
- a penalty constraint may be determined with respect to each information type.
- determining 115 one or more penalty constraints may be performed by receiving one or more parameter values by a processing device, such as the one described in reference to FIG. 3 .
- each parameter value may correspond to one of one or more penalty constraints.
- a constrained objective function including the one or more penalty constraints may be determined 120 .
- the constrained objective function may be based on the objective function for the selected nonnegative tensor factorization model, the one or more parameter values and the one or more class labels.
- a constrained objective function based on the nonnegative Parafac tensor factorization model including one or more penalty constraints may be represented as follows:
- Each of the terms $\alpha_1 \|E_u U - U_0\|_F^2$, $\alpha_2 \|E_v V - V_0\|_F^2$ and $\alpha_3 \|E_s S - S_0\|_F^2$ is a penalty constraint, where $U_0 \in \mathbb{R}^{n \times k}$, $V_0 \in \mathbb{R}^{m \times k}$ and $S_0 \in \mathbb{R}^{p \times k}$ represent partial prior knowledge of class labels of information on rows (mode 1), columns (mode 2) and occasions (mode 3), respectively.
- $E_u$, $E_v$ and $E_s$ are diagonal matrices in which a value of 1 represents that prior knowledge exists for the corresponding information. Such matrices can be derived from $U_0$, $V_0$ and $S_0$.
- $\alpha_1 \ge 0$, $\alpha_2 \ge 0$ and $\alpha_3 \ge 0$ are parameter values used to weight the influence of the penalty constraints. It is noted that the constrained objective function above can be rewritten as follows:
- determining 120 a constrained objective function may be performed by a processing device, such as the one described in reference to FIG. 3 .
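A constrained objective of this kind can be evaluated numerically as sketched below. The fit term assumes the least-squares Parafac error on the mode-1 unfolding, and the three penalty terms follow the forms $\alpha \|E F - F_0\|_F^2$ named in the text. The function and variable names are hypothetical.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def constrained_objective(X, U, V, S, priors, alphas):
    """Parafac fit error plus the three class-label penalties.

    priors = ((U0, Eu), (V0, Ev), (S0, Es)); alphas = (a1, a2, a3).
    """
    n = X.shape[0]
    X1 = X.transpose(0, 2, 1).reshape(n, -1)   # mode-1 unfolding (assumed ordering)
    value = np.linalg.norm(X1 - U @ khatri_rao(S, V).T) ** 2
    for F, (F0, E), a in zip((U, V, S), priors, alphas):
        value += a * np.linalg.norm(E @ F - F0) ** 2
    return value

rng = np.random.default_rng(0)
U, V, S = rng.random((4, 2)), rng.random((3, 2)), rng.random((2, 2))
X = np.einsum("ir,jr,lr->ijl", U, V, S)   # exact rank-2 toy tensor
Eu = np.diag([1.0, 0.0, 0.0, 1.0])        # prior knowledge for rows 0 and 3 only
priors = ((Eu @ U, Eu),
          (np.zeros((3, 2)), np.zeros((3, 3))),
          (np.zeros((2, 2)), np.zeros((2, 2))))
obj = constrained_objective(X, U, V, S, priors, (1.0, 1.0, 1.0))
```

In this toy setup the factors fit the tensor exactly and agree with the priors, so the objective is zero up to rounding. Any disagreement with the priors raises it in proportion to the corresponding parameter value.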
- Clusters may be determined 125 for the plurality of modes by resolving the constrained objective function.
- determining 125 clusters may be performed by applying a nonnegative multiplicative least square algorithm to the constrained objective function.
- the nonnegative multiplicative least square algorithm may update cluster information for one information type at a time while the other information types remain fixed.
- the following computations may be performed iteratively until convergence or for an identified number of iterations in order to determine 125 clusters:
- $u_{ij} \leftarrow u_{ij} \, \dfrac{\left( X_{(1)} (S * V) + \alpha_1 U_0 \right)_{ij}}{\left( U (S * V)^T (S * V) + \alpha_1 E_u U \right)_{ij}}$
- $v_{ij} \leftarrow v_{ij} \, \dfrac{\left( X_{(2)} (S * U) + \alpha_2 V_0 \right)_{ij}}{\left( V (S * U)^T (S * U) + \alpha_2 E_v V \right)_{ij}}$
- $s_{ij} \leftarrow s_{ij} \, \dfrac{\left( X_{(3)} (V * U) + \alpha_3 S_0 \right)_{ij}}{\left( S (V * U)^T (V * U) + \alpha_3 E_s S \right)_{ij}}$
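The three update rules can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the disclosure's own code. The unfoldings use a column ordering chosen to match the Khatri-Rao products in the rules, a small `eps` is added to the denominators to avoid division by zero, and the factors are initialized randomly. The function name `ntf_pcl` and the toy data are hypothetical.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def ntf_pcl(X, k, priors, alphas, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for Parafac with partial class-label penalties.

    priors = ((U0, Eu), (V0, Ev), (S0, Es)); alphas = (a1, a2, a3).
    One factor is updated at a time while the other two stay fixed.
    """
    n, m, p = X.shape
    rng = np.random.default_rng(seed)
    U, V, S = rng.random((n, k)), rng.random((m, k)), rng.random((p, k))
    X1 = X.transpose(0, 2, 1).reshape(n, -1)   # matches khatri_rao(S, V)
    X2 = X.transpose(1, 2, 0).reshape(m, -1)   # matches khatri_rao(S, U)
    X3 = X.transpose(2, 1, 0).reshape(p, -1)   # matches khatri_rao(V, U)
    (U0, Eu), (V0, Ev), (S0, Es) = priors
    a1, a2, a3 = alphas
    for _ in range(n_iter):
        Z = khatri_rao(S, V)
        U *= (X1 @ Z + a1 * U0) / (U @ (Z.T @ Z) + a1 * Eu @ U + eps)
        Z = khatri_rao(S, U)
        V *= (X2 @ Z + a2 * V0) / (V @ (Z.T @ Z) + a2 * Ev @ V + eps)
        Z = khatri_rao(V, U)
        S *= (X3 @ Z + a3 * S0) / (S @ (Z.T @ Z) + a3 * Es @ S + eps)
    return U, V, S

# Toy data: two groups of rows, each using its own pair of columns.
X = np.zeros((6, 4, 3))
X[:3, :2, :] = 1.0
X[3:, 2:, :] = 1.0
U0 = np.zeros((6, 2)); U0[0, 0] = 1.0; U0[3, 1] = 1.0   # two known labels
Eu = np.diag((U0.sum(axis=1) > 0).astype(float))
priors = ((U0, Eu),
          (np.zeros((4, 2)), np.zeros((4, 4))),
          (np.zeros((3, 2)), np.zeros((3, 3))))
alphas = (1.0, 0.0, 0.0)
U, V, S = ntf_pcl(X, 2, priors, alphas)
labels = U.argmax(axis=1)   # assumed convention: largest entry in each row of U
```

Reading cluster membership from the largest entry in each row of a factor matrix is a common convention and is an assumption here. Because the updates only multiply by nonnegative ratios, the factors stay nonnegative throughout.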
- FIG. 2 depicts a flow chart of an exemplary method of clustering a plurality of information items using nonnegative tensor factorization with pairwise constraints according to an embodiment.
- one or more pairwise constraints may be identified 205 for a plurality of information items to be clustered.
- the pairwise constraints may include one or more of “must-link” constraints and “cannot-link” constraints.
- identifying 205 one or more pairwise constraints may be performed by receiving one or more pairwise constraints by a processing device, such as the one described in reference to FIG. 3 below.
- Must-link pairwise constraints pertain to a plurality of information items intended to be in the same cluster according to prior knowledge.
- Must-link constraints can be represented using a pairwise constraint matrix M′, where entries of M′ with values of 1 indicate that the corresponding row information item and column information item tend to belong to the same cluster and entries of M′ with values of 0 indicate that no defined relationship is known between the corresponding information items.
- Cannot-link constraints pertain to pairs of information items intended to be in different clusters according to prior knowledge.
- Cannot-link constraints can be represented by a pairwise constraint matrix N′, where entries of N′ with values of 1 indicate that the corresponding row information item and column information item tend to belong to different clusters and entries of N′ with values of 0 indicate that no defined relationship is known between the corresponding information items.
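The pairwise constraint matrices $M'$ and $N'$ can be assembled from lists of index pairs, as in this sketch. Storing each constraint symmetrically is an assumption, and the function name and example pairs are hypothetical.

```python
import numpy as np

def pairwise_matrices(n_items, must_link, cannot_link):
    """Build 0/1 matrices M' (must-link) and N' (cannot-link).

    Each pair (i, j) is stored symmetrically. A 0 entry means that no
    relationship is known between the two items.
    """
    M = np.zeros((n_items, n_items))
    N = np.zeros((n_items, n_items))
    for i, j in must_link:
        M[i, j] = M[j, i] = 1.0
    for i, j in cannot_link:
        N[i, j] = N[j, i] = 1.0
    return M, N

M_prime, N_prime = pairwise_matrices(
    4, must_link=[(0, 1)], cannot_link=[(0, 3), (1, 2)]
)
```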
- a nonnegative tensor factorization model may then be determined 210 .
- the nonnegative tensor factorization model is associated with an objective function.
- the objective function may be the nonnegative Parafac tensor factorization model.
- the objective function may be the nonnegative Tucker3 tensor factorization model. Alternate models may also be used within the scope of this disclosure.
- determining 210 the nonnegative tensor factorization model may be performed by receiving a selection for the nonnegative tensor factorization model by a processing device, such as the one described in reference to FIG. 3 .
- determining 215 one or more penalty constraint functions may be performed by receiving one or more parameter values by a processing device, such as the one described in reference to FIG. 3 .
- one or more parameter values may correspond to each of one or more penalty constraint functions.
- a constrained objective function including the one or more penalty constraint functions may be determined 220 .
- the constrained objective function may be based on the objective function for the selected nonnegative tensor factorization model, the one or more parameter values and the one or more pairwise constraints.
- a constrained objective function based on the nonnegative Parafac tensor factorization model including the pairwise constraint penalty functions may be represented as follows:
- Clusters may be determined 225 for the plurality of modes by resolving the constrained objective function.
- determining 225 clusters may be performed by applying a nonnegative multiplicative least square algorithm to the constrained objective function.
- the nonnegative multiplicative least square algorithm may update cluster information for one information type at a time while the other information types remain fixed.
- the following computations may be performed iteratively until convergence or for an identified number of iterations in order to determine 225 clusters:
- FIG. 3 depicts a block diagram of exemplary internal hardware that may be used to contain or implement program instructions according to an embodiment.
- a bus 300 serves as the main information highway interconnecting the other illustrated components of the hardware.
- CPU 305 is the central processing unit of the system, performing calculations and logic operations required to execute a program.
- CPU 305 is an exemplary processing device, computing device or processor as such terms are used within this disclosure.
- Read only memory (ROM) 310 and random access memory (RAM) 315 constitute exemplary memory devices.
- A controller 320 interfaces one or more optional memory devices 325 to the system bus 300.
- These memory devices 325 may include, for example, an external or internal DVD drive, a CD ROM drive, a hard drive, flash memory, a USB drive or the like. As indicated previously, these various drives and controllers are optional devices.
- Program instructions may be stored in the ROM 310 and/or the RAM 315 .
- Program instructions may be stored on a tangible computer readable storage medium such as a compact disk, a digital disk, flash memory, a memory card, a USB drive, an optical disc storage medium, such as a Blu-ray™ disc, and/or other recording medium.
- An optional display interface 330 may permit information from the bus 300 to be displayed on the display 335 in audio, visual, graphic or alphanumeric format. Communication with external devices may occur using various communication ports 340 .
- An exemplary communication port 340 may be attached to a communications network, such as the Internet or an intranet.
- the hardware may also include an interface 345 which allows for receipt of data from input devices such as a keyboard 350 or other input device 355 such as a mouse, a joystick, a touch screen, a remote control, a pointing device, a video input device and/or an audio input device.
- An embedded system, such as a sub-system within a printing device or xerographic device, may optionally be used to perform one, some or all of the operations described herein.
- a multiprocessor system may optionally be used to perform one, some or all of the operations described herein.
- Real-world data sets from the DBLP computer science bibliography were used to conduct simulations of the aforementioned nonnegative tensor factorization methods with respect to nonnegative factorization methods.
- Author names, publication titles and publication years were extracted from the bibliography for each of a plurality of publications. 1000 active researchers with their publication titles for the years from 1988 through 2007 were selected for the simulations. The researchers and their publications were divided into 9 different research areas based on the authors' activities. These research areas served as class labels.
- the data were preprocessed using standard text preprocessing techniques. For each year, a binary matrix with each entry denoting a co-occurrence of the corresponding author and the term in that year was constructed. As such, the data was organized as a three-way array with the author, term and year modes.
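The construction of the three-way array can be sketched as follows. The record format `(author, term, year)`, the function name and the toy records are assumptions for illustration and are not taken from the DBLP data.

```python
import numpy as np

def build_tensor(records, authors, terms, years):
    """Binary authors x terms x years array from (author, term, year) records."""
    a_idx = {a: i for i, a in enumerate(authors)}
    t_idx = {t: i for i, t in enumerate(terms)}
    y_idx = {y: i for i, y in enumerate(years)}
    X = np.zeros((len(authors), len(terms), len(years)))
    for author, term, year in records:
        # 1 marks that the author used the term in that year; repeats stay at 1.
        X[a_idx[author], t_idx[term], y_idx[year]] = 1.0
    return X

# Hypothetical toy records; the names are placeholders.
records = [
    ("author_a", "tensor", 1988),
    ("author_a", "cluster", 1989),
    ("author_b", "tensor", 1988),
    ("author_a", "tensor", 1988),   # duplicate co-occurrence
]
X = build_tensor(records, ["author_a", "author_b"], ["tensor", "cluster"], [1988, 1989])
```

Each frontal slice `X[:, :, y]` is then the binary author-by-term matrix for one year.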
- A first set of simulations was conducted with all 9 research areas (database, data mining, software engineering, theory, computer vision, operating system, machine learning, networking, and natural language processing) for 20 years of publication titles of all 1000 authors, utilizing the 1000 key terms with the highest frequency of occurrence.
- A second set of simulations was conducted with 20 years of publication titles and 250 authors randomly selected from the 1000 authors in 4 research areas (data mining, software engineering, theory and computer vision) and 200 key terms.
- NTF-PCL: the Nonnegative Tensor Factorization with Partial Class Label method disclosed herein
- Five tensor factorization methods: Parafac, Tucker3, Nonnegative Parafac (NParafac), Nonnegative Tucker3 (NTucker3), and NParafac with V initialized with partial word class labels (NTF-Ini)
- KMeans(sum): K-Means on the sum-up matrix (authors × terms)
- KMeans(ext): K-Means on the unfolded matrix of the three-way array
- KMeans(pca): PCA on the unfolded matrix followed by K-Means
- InfoCo: information theoretic co-clustering algorithm on the sum-up matrix
- EuclCo: Euclidean co-clustering algorithm on the sum-up matrix
- MinSqCo: minimum squared residue co-clustering algorithm on the sum-up matrix
- Accuracy identifies one-to-one relationships between clusters and classes and measures the maximal extent to which each cluster contains data points from the corresponding class. Accuracy sums up the whole matching degree between all class-cluster pairs. Generally, a larger accuracy value indicates better clustering performance. Accuracy can be represented by the following:
- $T(C_k, L_m)$ is the number of entities that belong to class $m$ and are assigned to cluster $k$.
- Accuracy determines the maximum sum of $T(C_k, L_m)$ over all pairs of clusters and classes, where the pairs do not overlap. Accuracy takes a value in [0, 1].
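The accuracy measure can be sketched in plain Python. This version assumes that the maximum matched count is divided by the number of items so that the value lies in [0, 1]. It searches all one-to-one matchings by brute force, which is only practical for a small number of clusters. The function name is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(clusters, classes):
    """Best one-to-one match between cluster ids and class ids, divided by N.

    Brute force over permutations; assumes the number of clusters does not
    exceed the number of classes and that both are small.
    """
    c_ids = sorted(set(clusters))
    l_ids = sorted(set(classes))
    # T[(k, m)] = number of items in cluster k that belong to class m
    T = {}
    for k, m in zip(clusters, classes):
        T[(k, m)] = T.get((k, m), 0) + 1
    best = max(
        sum(T.get((k, m), 0) for k, m in zip(c_ids, perm))
        for perm in permutations(l_ids, len(c_ids))
    )
    return best / len(clusters)

acc = clustering_accuracy([0, 0, 1, 1, 1], ["x", "x", "y", "y", "x"])
```

In the example, matching cluster 0 to class "x" and cluster 1 to class "y" covers four of the five items. For larger problems an assignment solver would replace the permutation search.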
- Normalized Mutual Information is the mutual information between clustering and class knowledge divided by the maximum value of clustering entropy and class entropy.
- NMI takes a value in [0, 1]. In general, a larger NMI value indicates better clustering quality.
- the NMI of the entire clustering solution is represented by the following:
- $\mathrm{NMI} = \dfrac{\sum_{i,j} P(i,j) \log_2 \dfrac{P(i,j)}{P(i)\,P(j)}}{\max\left( \sum_i -P(i) \log_2 P(i),\; \sum_j -P(j) \log_2 P(j) \right)}$, where $P(i)$ is the probability that an arbitrary data point belongs to cluster $i$, $P(j)$ is the probability that an arbitrary data point belongs to class $j$, and $P(i, j)$ is the joint probability that an arbitrary data point belongs to both cluster $i$ and class $j$.
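The NMI formula can be computed from empirical frequencies as sketched below. Returning 0 when both entropies are zero is an assumed convention for the degenerate case, and the function name is hypothetical.

```python
from collections import Counter
from math import log2

def nmi(clusters, classes):
    """Mutual information divided by the larger of the two entropies."""
    n = len(clusters)
    p_i = {i: c / n for i, c in Counter(clusters).items()}
    p_j = {j: c / n for j, c in Counter(classes).items()}
    p_ij = {ij: c / n for ij, c in Counter(zip(clusters, classes)).items()}
    mutual = sum(p * log2(p / (p_i[i] * p_j[j])) for (i, j), p in p_ij.items())
    h_cluster = -sum(p * log2(p) for p in p_i.values())
    h_class = -sum(p * log2(p) for p in p_j.values())
    denom = max(h_cluster, h_class)
    return mutual / denom if denom > 0 else 0.0   # assumed convention

perfect = nmi([0, 0, 1, 1], ["a", "a", "b", "b"])
independent = nmi([0, 1, 0, 1], ["a", "a", "b", "b"])
```

A clustering that reproduces the classes exactly scores 1, and one that is statistically independent of the classes scores 0.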
- FIGS. 5A and 5B depict the author clustering performance with respect to the number of word pairwise constraints. As shown in FIGS. 5A and 5B , the author clustering performance generally improved as the number of word constraints increased. Clustering performance substantially stabilized for 100 or more pairwise word constraints.
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Mathematical Optimization (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Pure & Applied Mathematics (AREA)
- Algebra (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
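Finally, $\|X\|_F = \sqrt{\sum_{ij} x_{ij}^2}$ is the Frobenius norm of the matrix $X$.
where $\underline{X} \in \mathbb{R}^{n \times m \times p}$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{m \times k}$, $S \in \mathbb{R}^{p \times k}$, $S_l$ is a diagonal matrix with the $l$-th row of $S$ on the diagonal, and $u_{ip} \ge 0$, $v_{ip} \ge 0$ and $s_{lp} \ge 0$. It is noted that
$X_{(1)} = U G_{(1)} (S \otimes V)^T$, where $U \in \mathbb{R}^{n \times k}$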
where $\alpha_1 E_u$, $\alpha_2 E_v$ and $\alpha_3 E_s$ may be regarded as weights of an overall influence of prior knowledge.
In an embodiment, clusters may be determined 125 by a processing device, such as the one described in reference to
$\mathrm{Tr}(-\alpha U^T M' U + \beta U^T N' U) = \mathrm{Tr}(-\alpha U^T M' U) + \mathrm{Tr}(\beta U^T N' U) = -\alpha \sum_{ij} m'_{ij} (U^T U)_{ij} + \beta \sum_{ij} n'_{ij} (U^T U)_{ij}$,
where $\alpha$ and $\beta$ are parameter values used to adjust the influence of the penalty terms. Weighting matrices $M$ and $N$ may be used to simplify the penalty constraint functions, where $M = \alpha M'$ and $N = \beta N'$.
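The trace form of the pairwise penalty can be evaluated as in the sketch below. It computes $\mathrm{Tr}(-\alpha U^T M' U + \beta U^T N' U)$ directly using the weighting matrices $M = \alpha M'$ and $N = \beta N'$. The sketch assumes that rows of `U` index the items to be clustered, so each constrained pair contributes the inner product of the two corresponding rows of `U`. The function name and toy values are hypothetical.

```python
import numpy as np

def pairwise_penalty(U, M_prime, N_prime, alpha, beta):
    """Evaluate Tr(-alpha * U^T M' U + beta * U^T N' U)."""
    M, N = alpha * M_prime, beta * N_prime   # weighting matrices M and N
    return np.trace(U.T @ (N - M) @ U)

# Four items, two clusters; items 0 and 1 must link, items 0 and 3 cannot link.
M_prime = np.zeros((4, 4)); M_prime[0, 1] = M_prime[1, 0] = 1.0
N_prime = np.zeros((4, 4)); N_prime[0, 3] = N_prime[3, 0] = 1.0

U_good = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # respects both
U_bad = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])   # breaks both

good = pairwise_penalty(U_good, M_prime, N_prime, alpha=1.0, beta=1.0)
bad = pairwise_penalty(U_bad, M_prime, N_prime, alpha=1.0, beta=1.0)
```

A satisfied must-link pair lowers the penalty and a violated cannot-link pair raises it, so the assignment that respects both constraints scores lower than the one that breaks both.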
where $M_u, N_u \in \mathbb{R}^{n \times n}$, $M_v, N_v \in \mathbb{R}^{m \times m}$, and $M_s, N_s \in \mathbb{R}^{p \times p}$ are must-link and cannot-link weighting matrices of information on rows (mode 1), columns (mode 2) and occasions (mode 3), respectively.
where $A^{+}$ and $A^{-}$ are matrices containing the positive values and the negative values of $A$, respectively, such that $A = A^{+} - A^{-}$. In an embodiment, clusters may be determined by a processing device, such as the one described in reference to
TABLE 1: Clustering Performance Comparisons on DBLP
Methods | DBLP4 ACC | DBLP4 NMI | DBLP9 ACC | DBLP9 NMI |
---|---|---|---|---|
KMeans(sum) | 0.68 | 0.48 | 0.39 | 0.32 |
KMeans(ext) | 0.40 | 0.23 | 0.25 | 0.01 |
KMeans(pca) | 0.55 | 0.33 | 0.42 | 0.34 |
Parafac | 0.80 | 0.61 | 0.39 | 0.19 |
NParafac | 0.83 | 0.61 | 0.52 | 0.47 |
Tucker3 | 0.70 | 0.54 | 0.41 | 0.20 |
NTucker3 | 0.71 | 0.53 | 0.49 | 0.25 |
ClusterAgg | 0.82 | 0.65 | 0.31 | 0.19 |
InfoCo | 0.78 | 0.51 | 0.42 | 0.25 |
EuclCo | 0.68 | 0.55 | 0.36 | 0.21 |
MinSqCo | 0.60 | 0.41 | 0.41 | 0.32 |
NTF-Ini | 0.85 | 0.63 | 0.53 | 0.47 |
NTF-PCL | 0.88 | 0.68 | 0.55 | 0.49 |
where Ck denotes the k-th cluster, and Lm denotes the m-th class. T(Ck, Lm) is the number of entities that belong to class m and are assigned to cluster k. Accuracy determines the maximum sum of T(Ck, Lm) for all pairs of clusters and classes, and these pairs have no overlaps. Accuracy has a value between [0, 1].
where P(i) is the probability that an arbitrary data point belongs to cluster i, P(j) is the probability that an arbitrary data point belongs to class j, and P(i, j) is the joint probability that an arbitrary data point belongs to both cluster i and class j.
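A minimal sketch of a normalized mutual information computation built from these probabilities follows. The probabilities are estimated as relative frequencies. The normalization used here, dividing the mutual information by the square root of the product of the cluster and class entropies, is an assumption: the formula itself is not reproduced in this text, and other normalizations are in use. The function name is hypothetical.

```python
from collections import Counter
from math import log, sqrt


def normalized_mutual_information(cluster_ids, class_ids):
    """NMI between a clustering and class labels (sqrt-entropy normalization assumed)."""
    if len(cluster_ids) != len(class_ids) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(cluster_ids)
    n_i = Counter(cluster_ids)                   # counts behind P(i)
    n_j = Counter(class_ids)                     # counts behind P(j)
    n_ij = Counter(zip(cluster_ids, class_ids))  # counts behind P(i, j)

    mutual_info = 0.0
    for (i, j), c in n_ij.items():
        p_ij = c / n
        mutual_info += p_ij * log(p_ij / ((n_i[i] / n) * (n_j[j] / n)))

    h_i = -sum((c / n) * log(c / n) for c in n_i.values())
    h_j = -sum((c / n) * log(c / n) for c in n_j.values())
    denom = sqrt(h_i * h_j)
    # With a single cluster or a single class the entropy is zero; return 0.0 by convention.
    return mutual_info / denom if denom > 0 else 0.0


nmi_same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_indep = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two labelings that differ only in label names, and the second compares two labelings that are statistically independent.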
Claims (17)
$$-\alpha \sum_{ij} m'_{ij} (U^{T} U)_{ij} + \beta \sum_{ij} n'_{ij} (U^{T} U)_{ij}$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/837,021 US8452770B2 (en) | 2010-07-15 | 2010-07-15 | Constrained nonnegative tensor factorization for clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
US20120016878A1 (en) | 2012-01-19 |
US8452770B2 (en) | 2013-05-28 |
Family
ID=45467736
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/837,021 (US8452770B2, Expired - Fee Related) | 2010-07-15 | 2010-07-15 | Constrained nonnegative tensor factorization for clustering |
Country Status (1)
Country | Link |
---|---|
US (1) | US8452770B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160203209A1 (en) * | 2015-01-12 | 2016-07-14 | Xerox Corporation | Joint approach to feature and document labeling |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8706782B2 (en) * | 2011-06-12 | 2014-04-22 | International Business Machines Corporation | Self-contained placement of data objects in a data storage system |
CN104216920B (en) * | 2013-06-05 | 2017-11-21 | 北京齐尔布莱特科技有限公司 | Data classification method based on cluster and Hungary Algorithm |
JP6050212B2 (en) * | 2013-11-01 | 2016-12-21 | 日本電信電話株式会社 | Data analysis apparatus, method, and program |
JP6058065B2 (en) * | 2015-01-23 | 2017-01-11 | 日本電信電話株式会社 | Tensor data calculation device, tensor data calculation method, and program |
CN104951518B (en) * | 2015-06-04 | 2018-06-05 | 中国人民大学 | One kind recommends method based on the newer context of dynamic increment |
JP6635418B2 (en) * | 2016-06-07 | 2020-01-22 | 日本電信電話株式会社 | Flow rate prediction device, pattern estimation device, flow rate prediction method, pattern estimation method, and program |
US10606952B2 (en) | 2016-06-24 | 2020-03-31 | Elemental Cognition Llc | Architecture and processes for computer learning and understanding |
US11574204B2 (en) * | 2017-12-06 | 2023-02-07 | Accenture Global Solutions Limited | Integrity evaluation of unstructured processes using artificial intelligence (AI) techniques |
JP7091930B2 (en) * | 2018-08-16 | 2022-06-28 | 日本電信電話株式会社 | Tensor data calculator, tensor data calculation method and program |
CN109614581B (en) * | 2018-10-19 | 2023-09-22 | 江苏理工学院 | Non-negative matrix factorization clustering method based on dual local learning |
CN109918615B (en) * | 2018-12-25 | 2022-12-27 | 华中科技大学鄂州工业技术研究院 | Multi-mode recommendation method and device |
CN110941793B (en) * | 2019-11-21 | 2023-10-27 | 湖南大学 | Network traffic data filling method, device, equipment and storage medium |
US11568153B2 (en) * | 2020-03-05 | 2023-01-31 | Bank Of America Corporation | Narrative evaluator |
CN114461961B (en) * | 2021-12-30 | 2024-08-20 | 大连理工大学 | Incomplete multi-mode media data clustering method based on NMF and low-rank tensor |
CN114491293B (en) * | 2022-01-28 | 2024-12-20 | 南通大学 | A unified semi-supervised community detection method integrating content information |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020145425A1 (en) * | 2000-12-22 | 2002-10-10 | Ebbels Timothy Mark David | Methods for spectral analysis and their applications: spectral replacement |
US20090055139A1 (en) * | 2007-08-20 | 2009-02-26 | Yahoo! Inc. | Predictive discrete latent factor models for large scale dyadic data |
US20090290802A1 (en) * | 2008-05-22 | 2009-11-26 | Microsoft Corporation | Concurrent multiple-instance learning for image categorization |
US20090299705A1 (en) * | 2008-05-28 | 2009-12-03 | Nec Laboratories America, Inc. | Systems and Methods for Processing High-Dimensional Data |
US20090306932A1 (en) * | 2008-06-10 | 2009-12-10 | National University Of Ireland, Galway | Similarity index: a rapid classification method for multivariate data arrays |
US20100185578A1 (en) * | 2009-01-22 | 2010-07-22 | Nec Laboratories America, Inc. | Social network analysis with prior knowledge and non-negative tensor factorization |
US20110055379A1 (en) * | 2009-09-02 | 2011-03-03 | International Business Machines Corporation | Content-based and time-evolving social network analysis |
US20110295903A1 (en) * | 2010-05-28 | 2011-12-01 | Drexel University | System and method for automatically generating systematic reviews of a scientific field |
US8090665B2 (en) * | 2008-09-24 | 2012-01-03 | Nec Laboratories America, Inc. | Finding communities and their evolutions in dynamic social network |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020145425A1 (en) * | 2000-12-22 | 2002-10-10 | Ebbels Timothy Mark David | Methods for spectral analysis and their applications: spectral replacement |
US20090055139A1 (en) * | 2007-08-20 | 2009-02-26 | Yahoo! Inc. | Predictive discrete latent factor models for large scale dyadic data |
US20090290802A1 (en) * | 2008-05-22 | 2009-11-26 | Microsoft Corporation | Concurrent multiple-instance learning for image categorization |
US20090299705A1 (en) * | 2008-05-28 | 2009-12-03 | Nec Laboratories America, Inc. | Systems and Methods for Processing High-Dimensional Data |
US8099381B2 (en) * | 2008-05-28 | 2012-01-17 | Nec Laboratories America, Inc. | Processing high-dimensional data via EM-style iterative algorithm |
US20090306932A1 (en) * | 2008-06-10 | 2009-12-10 | National University Of Ireland, Galway | Similarity index: a rapid classification method for multivariate data arrays |
US8090665B2 (en) * | 2008-09-24 | 2012-01-03 | Nec Laboratories America, Inc. | Finding communities and their evolutions in dynamic social network |
US20100185578A1 (en) * | 2009-01-22 | 2010-07-22 | Nec Laboratories America, Inc. | Social network analysis with prior knowledge and non-negative tensor factorization |
US20110055379A1 (en) * | 2009-09-02 | 2011-03-03 | International Business Machines Corporation | Content-based and time-evolving social network analysis |
US8204988B2 (en) * | 2009-09-02 | 2012-06-19 | International Business Machines Corporation | Content-based and time-evolving social network analysis |
US20110295903A1 (en) * | 2010-05-28 | 2011-12-01 | Drexel University | System and method for automatically generating systematic reviews of a scientific field |
Non-Patent Citations (15)
Title |
---|
Cho et al., "Minimum Sum-Squared Residue Co-clustering of Gene Expression Data", SIAM 2002, Department of Computer Sciences, University of Texas, Austin, TX, pp. 1-12, 2004. |
Christos Faloutsos et al., "Mining Large Time-Evolving Data Using Matrix and Tensor Tools," International Conference on Data Mining, 2007, slides. * |
Dhillon et al., "Information-Theoretic Co-clustering", In "Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining ", Washington, DC, Aug. 24-27, 2003, pp. 89-98. |
Ding et al., "Convex and Semi-Nonnegative Matrix Factorizations", Department of Computer Science and Engineering, University of Texas, Arlington, TX, Oct. 24, 2008, pp. 1-26. |
Harshman, "Foundations of the Parafac Procedures: Models and Conditions for an "Explanatory" Multimodal Factor Analysis", UCLA, Working Papers in Phonetics, Dec. 16, 1970, pp. 1-84, University Microfilms, Ann Arbor, MI. |
Harshman, Richard, PARAFAC: Parallel factor analysis, Computational Statistics and Data Analysis 18, 1994, pp. 39-72. * |
Jos M.F. Ten Berge, Simplicity and typical rank of three way arrays, with applications to Tucker-3 analysis with simple cores, Journal of Chemometrics, 2004, pp. 17-21. * |
Kim et al., "Nonnegative Tucker Decomposition", Department of Computer Science , POSTECH, Korea, pp. 1-8. |
Kolda, Tamara, Tensor Decompositions and Applications, SIAM Review, Jun. 10, 2008, pp. 1-47. * |
Lee et al., "Algorithms for Non-negative Matrix Factorization", In "Advances in Neural Information Processing Systems", vol. 13, MIT Press, 2001. |
Li et al., "Solving Consensus and Semi-supervised Clustering Problems using Nonnegative Matrix Factorization", in ICDM, pp. 577-582, 2007. |
Li et al.,"Knowledge Transformation from Word Space to Document Space", SIGIR 08, Jul. 20-24, 2008, Singapore, pp. 187-194. |
Shashua et al., "Non-Negative Tensor Factorization with Applications to Statistics and Computer Vision", Proceedings of the 22nd International Conference of Machine Learning, Bonn, Germany 2005, pp. 1-8. |
Strehl et al., "Cluster Ensembles-A Knowledge Reuse Framework for Combining Multiple Partitions", Journal of Machine Learning Research, 3, 2002, pp. 583-617. |
Wang et al., "Semi-Supervised Clustering via Matrix Factorization", In "Proceedings of 2008 Siam International Conference on Data Mining", 2008, pp. 1-12. |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160203209A1 (en) * | 2015-01-12 | 2016-07-14 | Xerox Corporation | Joint approach to feature and document labeling |
US10055479B2 (en) * | 2015-01-12 | 2018-08-21 | Xerox Corporation | Joint approach to feature and document labeling |
Also Published As
Publication number | Publication date |
---|---|
US20120016878A1 (en) | 2012-01-19 |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: XEROX CORPORATION, CONNECTICUT. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: PENG, WEI; REEL/FRAME: 024693/0110. Effective date: 20100628 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
FPAY | Fee payment | Year of fee payment: 4 |
FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 20210528 |