US20190370651A1 - Deep Co-Clustering - Google Patents

Deep Co-Clustering

Info

Publication number
US20190370651A1
US20190370651A1 (U.S. application Ser. No. 16/429,425)
Authority
US
United States
Prior art keywords
features
instances
loss
cross
mutual information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/429,425
Inventor
Wei Cheng
Haifeng Chen
Jingchao Ni
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Laboratories America Inc
Original Assignee
NEC Laboratories America Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Laboratories America Inc filed Critical NEC Laboratories America Inc
Priority to US16/429,425 priority Critical patent/US20190370651A1/en
Assigned to NEC LABORATORIES AMERICA, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NI, JINGCHAO; CHEN, HAIFENG; CHENG, WEI
Publication of US20190370651A1 publication Critical patent/US20190370651A1/en
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • G06F18/21342Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis using statistical independence, i.e. minimising mutual information or maximising non-gaussianity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods and systems for co-clustering data include reducing dimensionality for instances and features of an input dataset independently of one another. A mutual information loss is determined for the instances and the features independently of one another. The instances and the features are cross-correlated, based on the mutual information loss, to determine a cross-correlation loss. Co-clusters in the input data are determined based on the cross-correlation loss.

Description

    RELATED APPLICATION INFORMATION
  • This application claims priority to U.S. Provisional Patent Application No. 62/679,749, filed on Jun. 1, 2018, incorporated herein by reference in its entirety.
  • BACKGROUND Technical Field
  • The present invention relates to co-clustering data and, more particularly, to co-clustering that uses neural networks.
  • Description of the Related Art
  • Co-clustering clusters both instances and features simultaneously. For example, when rating movies, people and their rating values can be considered as instances and features, respectively. Seen another way, data expressed in the rows and columns of a matrix can represent respective instances and features. The duality between instances and features indicates that instances can be grouped based on features, and that features can be grouped based on instances.
  • SUMMARY
  • A method for co-clustering data includes reducing dimensionality for instances and features of an input dataset independently of one another. A mutual information loss is determined for the instances and the features independently of one another. The instances and the features are cross-correlated, based on the mutual information loss, to determine a cross-correlation loss. Co-clusters in the input data are determined based on the cross-correlation loss.
  • A data co-clustering system includes an instance autoencoder configured to reduce a dimensionality for instances of an input dataset. A feature autoencoder is configured to reduce a dimensionality for features of the input dataset. An instance mutual information loss branch is configured to determine a mutual information loss for the instances. A feature mutual information loss branch is configured to determine a mutual information loss for the features. A processor is configured to cross-correlate the instances and the features, based on the mutual information loss, to determine a cross-correlation loss and to determine co-clusters in the input data based on the cross-correlation loss.
  • These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
  • BRIEF DESCRIPTION OF DRAWINGS
  • The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
  • FIG. 1 is a block/flow diagram of a method/system for co-clustering data in accordance with an embodiment of the present invention;
  • FIG. 2 is a diagram of an exemplary neural network in accordance with an embodiment of the present invention;
  • FIG. 3 is a block/flow diagram of a method for classifying documents based on co-clustering in accordance with an embodiment of the present invention;
  • FIG. 4 is a block diagram of a data co-clustering system in accordance with an embodiment of the present invention; and
  • FIG. 5 is a block diagram of a processing system in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
  • In accordance with the present invention, systems and methods are provided that perform co-clustering using deep neural networks. The present embodiments use a deep autoencoder to generate low-dimensional representations for instances and features, which are then used as input to respective inference paths, each including an inference network and a Gaussian mixture model (GMM). The GMM outputs are cross-correlated using mutual information loss. The present embodiments can optimize the parameters of the deep-autoencoder, the inference neural network, and the GMM jointly.
  • Co-clustering, as described herein, is particularly advantageous for its identification of feature clusters based on instance clusters. One exemplary application for co-clustering is in text document classification, particularly when training labels are not used. Co-clustering identifies word clusters for each document cluster, making it easy to know the category of each document cluster from the words in the corresponding word cluster. Thus, once the major words in a document have been identified, co-clustering makes it possible to identify the category that a new document belongs to.
  • Referring now to FIG. 1, a block diagram is shown that illustrates the steps performed by the present embodiments. The instances and features are provided to separate paths. The duality between instances and features indicates that instances can be grouped based on features and that features can be grouped based on instances.
  • In each path, the raw input is provided to a deep autoencoder 102 that reduces the dimensionality of the input. The deep autoencoder 102 performs an encoding from the original high-dimensional space to a low-dimensional space. The deep autoencoder 102 then decodes the low-dimensional encoding to reproduce the high-dimensional input to verify that the low-dimensional encoding maintains the information of the original input data. The encoded instances and features are then output by their respective autoencoders 102.
  • An inference network 104 and a GMM 106 provide cluster assignments for the instances and the features, yielding a mutual information loss. Cross-correlation block 108 uses the mutual information loss to correlate the instances with the features, providing the co-clustered output.
  • To use one example, text document data can represent the documents as instances and the words within the documents as features. Similar documents usually share similar word distributions, so the instances of text document data can be grouped into clusters based on the features; likewise, similar words often appear in similar documents, so the features can be clustered based on the instances.
  • In some embodiments, the instances and features can be represented as a data matrix. After clustering, the instances and features can be reorganized into homogeneous blocks referred to herein as co-clusters. Co-clusters are subsets of an original data matrix and are characterized as a set of instances and a set of features, with values in a given subset being similar. Co-clusters reflect the structural information in the original data and can indicate relationships between instances and features. Besides identifying similar documents, the present embodiments can be of particular use in fields relating to bioinformatics, recommendation systems, and image segmentation. Co-clustering is superior to traditional clustering in these fields because of its ability to use the relationships between instances and features.
  • In the present embodiments, the instances are represented as $\{x_i\}_{i=1}^{n} = \{x_1, \dots, x_n\}$ and the features are represented as $\{y_j\}_{j=1}^{d} = \{y_1, \dots, y_d\}$, with $n$ being the number of instances and $d$ being the number of features. These instances and features are clustered into $g$ instance clusters and $m$ feature clusters. Co-clustering in the present embodiments therefore finds maps $C_r$ and $C_c$:

  • $$C_r: \{x_1, \dots, x_n\} \to \{\hat{x}_1, \dots, \hat{x}_g\}$$

  • $$C_c: \{y_1, \dots, y_d\} \to \{\hat{y}_1, \dots, \hat{y}_m\}$$
  • where r and c designate rows (instances) and columns (features). The instances can be reordered such that instances that are grouped into the same cluster are arranged to be adjacent. Similar arrangements can be applied to features.
  • The new data structure includes blocks of similar instances and features, referred to herein as co-clusters. If $X$ and $Y$ are two discrete random variables taking values from the sets $\{x_i\}_{i=1}^{n}$ and $\{y_j\}_{j=1}^{d}$, respectively, then the joint probability distribution between $X$ and $Y$ is denoted herein as $p(X, Y)$. Similarly, if $\hat{X}$ and $\hat{Y}$ are two discrete random variables taking values from the sets $\{\hat{x}_s\}_{s=1}^{g} = \{\hat{x}_1, \dots, \hat{x}_g\}$ and $\{\hat{y}_t\}_{t=1}^{m} = \{\hat{y}_1, \dots, \hat{y}_m\}$, the joint probability distribution between $\hat{X}$ and $\hat{Y}$ is denoted as $p(\hat{X}, \hat{Y})$. $\hat{X}$ and $\hat{Y}$ indicate the partitions induced by $X$ and $Y$: $\hat{X} = C_r(X)$ and $\hat{Y} = C_c(Y)$.
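  • The following sketch illustrates these maps on a toy example; the data matrix and cluster labels are hypothetical, and the reordering simply makes the co-cluster blocks contiguous:

```python
import numpy as np

# Hypothetical 6x6 data matrix with two instance clusters and two feature clusters.
X = np.array([
    [5, 1, 5, 1, 5, 1],
    [1, 5, 1, 5, 1, 5],
    [5, 1, 5, 1, 5, 1],
    [1, 5, 1, 5, 1, 5],
    [5, 1, 5, 1, 5, 1],
    [1, 5, 1, 5, 1, 5],
], dtype=float)

# Assumed cluster maps C_r and C_c, expressed as label vectors.
row_labels = np.array([0, 1, 0, 1, 0, 1])   # instance (row) clusters
col_labels = np.array([0, 1, 0, 1, 0, 1])   # feature (column) clusters

# Reorder rows and columns so that members of the same cluster are adjacent;
# the result shows homogeneous co-cluster blocks.
row_order = np.argsort(row_labels, kind="stable")
col_order = np.argsort(col_labels, kind="stable")
print(X[row_order][:, col_order])
```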
  • As described above, the first step in performing co-clustering is to reduce the dimensionality of the input data in block 102. Some embodiments of the present invention use deep stacked autoencoders that perform unsupervised representation learning. The autoencoders 102 reduce instances and features separately. Given the $i$th instance and the $j$th feature as $x_i$ and $y_j$, the lower-dimensional representations are denoted herein as:

  • $$z_i = f_r(x_i; \theta_r)$$

  • $$w_j = f_c(y_j; \theta_c)$$
  • where $f_r$ and $f_c$ denote encoding functions for instances and features, respectively, and $\theta_r$ and $\theta_c$ denote the parameters of the autoencoders 102. The encoding functions can be linear or nonlinear, depending on the domain data. The reconstruction losses of $x_i$ and $y_j$ are denoted as $l(x_i, g_r(z_i; \theta_r))$ and $l(y_j, g_c(w_j; \theta_c))$, respectively, where $g_r$ and $g_c$ are decoding functions for instances and features.
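  • A minimal sketch of one such autoencoder pair is shown below, assuming PyTorch; the layer sizes, activation choices, and mean-squared-error reconstruction loss are illustrative placeholders rather than the patent's exact architecture:

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Encoder f(.) and decoder g(.) for one branch (instances or features)."""
    def __init__(self, in_dim, hidden_dim, code_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim))
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # low-dimensional representation z_i (or w_j)
        x_hat = self.decoder(z)    # reconstruction, used to verify the code keeps the information
        return z, x_hat

# One autoencoder per branch: rows (instances) and columns (features).
n, d = 128, 500                    # illustrative dataset size
ae_r = StackedAutoencoder(in_dim=d, hidden_dim=64, code_dim=10)
ae_c = StackedAutoencoder(in_dim=n, hidden_dim=64, code_dim=10)

X = torch.randn(n, d)              # hypothetical data matrix
z, x_hat = ae_r(X)                 # encode/decode the instances (rows)
w, y_hat = ae_c(X.t())             # encode/decode the features (columns)
recon_loss_r = nn.functional.mse_loss(x_hat, X)      # l(x_i, g_r(z_i; theta_r))
recon_loss_c = nn.functional.mse_loss(y_hat, X.t())  # l(y_j, g_c(w_j; theta_c))
```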
  • Using the low-dimensional representations produced by the autoencoders 102, the present embodiments use variational inference to produce clustering assignment probabilities. Deep neural networks are used as the inference neural networks 104, taking the low-dimensional representations as inputs. The outputs of the inference networks 104 are new representations of the instances $x_i$ and features $y_j$, denoted as:

  • $$h_i = (h_{i1}, \dots, h_{ig})^T$$

  • $$v_j = (v_{j1}, \dots, v_{jm})^T$$
  • where g and m are the cluster numbers of instances and features, respectively. These representations can also be considered as clustering assignment probabilities when a softmax function is deployed as the last layer of the inference network.
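  • One plausible realization of an inference network, again sketched with PyTorch and placeholder sizes, is a small multilayer perceptron whose softmax output is read as the vector of cluster-assignment probabilities:

```python
import torch
import torch.nn as nn

def make_inference_net(code_dim, hidden_dim, num_clusters):
    """Maps an autoencoder code to a probability vector over clusters."""
    return nn.Sequential(
        nn.Linear(code_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, num_clusters),
        nn.Softmax(dim=-1))        # softmax last layer makes outputs usable as probabilities

g, m = 4, 4                        # illustrative numbers of instance/feature clusters
inf_r = make_inference_net(code_dim=10, hidden_dim=32, num_clusters=g)
inf_c = make_inference_net(code_dim=10, hidden_dim=32, num_clusters=m)

z = torch.randn(128, 10)           # codes z_i from the instance autoencoder
h = inf_r(z)                       # h_i = softmax(Inf(z_i; eta_r)); each row sums to 1
v = inf_c(torch.randn(500, 10))    # v_j for the feature branch
```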
  • These outputs are also modeled by GMM blocks 106. The posterior clustering assignment probability distributions of $h_i$ and $v_j$, based on the GMM, are denoted as $P_{\phi_r}(k \mid h_i)$ and $P_{\phi_c}(k \mid v_j)$, where $\phi_r$ and $\phi_c$ are the parameters of the GMMs for instances and features, respectively. The clustering assignment distributions of instances and features, based on the inference neural networks 104, are denoted as $Q_{\eta_r}(k \mid h_i)$ and $Q_{\eta_c}(k \mid v_j)$, where $\eta_r$ and $\eta_c$ denote the parameters of the inference networks 104.
  • Instead of applying a two-step strategy for GMM, the present embodiments jointly train the inference neural network 104 and GMM 106 in an end-to-end fashion. Similar training can be performed for both instances and features. Given the output of the autoencoders 102, new representations based on the inference neural network 104 can be expressed as:

  • $$h_i = \mathrm{softmax}(\mathrm{Inf}(z_i; \eta_r))$$
  • where Inf indicates the inference neural network 104. The mixture probability, mean, and covariance of the $k$th component in the GMM for instances, $\phi_r = \{\pi_r^k, \mu_r^k, \Sigma_r^k\}$, can be estimated as:
  • $$\pi_r^k = \frac{N_r^k}{N_r} \qquad \mu_r^k = \frac{1}{N_r^k}\sum_{i=1}^{N_r} h_{ik}\,h_i \qquad \Sigma_r^k = \frac{1}{N_r^k}\sum_{i=1}^{N_r} h_{ik}\,(h_i - \mu_r^k)(h_i - \mu_r^k)^T$$
  • where $N_r = n$ is the number of instances, $N_r^k = \sum_{i=1}^{N_r} h_{ik}$, and $h_{ik}$ is the value on the $k$th dimension of $h_i$. If $\pi_r^k$, $\mu_r^k$, and $\Sigma_r^k$ are given, the clustering probability of the $i$th instance belonging to the $k$th cluster is:
  • $$\gamma_{r(i)}^k = \frac{\pi_r^k\,\mathcal{N}(h_i \mid \mu_r^k, \Sigma_r^k)}{\sum_{k'=1}^{g}\pi_r^{k'}\,\mathcal{N}(h_i \mid \mu_r^{k'}, \Sigma_r^{k'})}$$
  • where $\mathcal{N}(\cdot)$ is the normal distribution probability density function. The log-likelihood can then be written as:

  • $$\log\left\{\prod_{i=1}^{N_r} P_{\phi_r}(h_i)\right\} = \sum_{i=1}^{N_r}\log P_{\phi_r}(h_i) = \sum_{i=1}^{N_r}\log\left\{\sum_{k=1}^{g}\pi_r^k\,\mathcal{N}(h_i \mid \mu_r^k, \Sigma_r^k)\right\}$$
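  • The numpy sketch below estimates the GMM parameters from the soft assignments and evaluates the responsibilities and log-likelihood following the formulas above; the variable names and the small diagonal ridge added to each covariance are implementation conveniences, not part of the patent:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_from_assignments(H, eps=1e-6):
    """Estimate (pi, mu, Sigma) for g GMM components from soft assignments H of shape (N, g)."""
    N, g = H.shape
    Nk = H.sum(axis=0)                         # N_r^k = sum_i h_ik
    pi = Nk / N                                # pi_r^k = N_r^k / N_r
    mu = (H.T @ H) / Nk[:, None]               # mu_r^k = (1/N_r^k) sum_i h_ik * h_i
    Sigma = np.zeros((g, g, g))
    for k in range(g):
        diff = H - mu[k]                       # h_i - mu_r^k
        Sigma[k] = ((H[:, k][:, None] * diff).T @ diff) / Nk[k] + eps * np.eye(g)
    return pi, mu, Sigma

def responsibilities(H, pi, mu, Sigma):
    """gamma_{r(i)}^k and the log-likelihood sum_i log sum_k pi_k N(h_i | mu_k, Sigma_k)."""
    dens = np.stack([multivariate_normal.pdf(H, mean=mu[k], cov=Sigma[k], allow_singular=True)
                     for k in range(len(pi))], axis=1)
    weighted = pi * dens                       # pi_r^k * N(h_i | mu_r^k, Sigma_r^k)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)
    loglik = float(np.log(weighted.sum(axis=1)).sum())
    return gamma, loglik

# Usage with hypothetical soft assignments (e.g., softmax outputs h_i of the inference network):
H = np.random.dirichlet(np.ones(4), size=128)  # 128 instances, g = 4 clusters
pi, mu, Sigma = gmm_from_assignments(H)
gamma_r, loglik = responsibilities(H, pi, mu, Sigma)
```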
  • Instead of maximizing the log-likelihood function directly, the present embodiments maximize a variational lower bound on the log-likelihood. The benefits are two-fold: the distribution $Q_{\eta_r}$ becomes a better approximation to the distribution $P_{\phi_r}$ by minimizing the KL divergence between them, and the bound on the log-likelihood function is tightened to make the training process more effective. The variational lower bound on the log-likelihood, $\mathcal{L}_r$, is defined as:

  • $$\mathcal{L}_r = \sum_{i=1}^{N_r}\left\{E_Q[\log(P(k \mid h_i))] + H(k \mid h_i)\right\}$$

  • where $H(k \mid h_i) = -E_Q(\log(Q(k \mid h_i)))$ is the Shannon entropy, and $P_{\phi_r}$ and $Q_{\eta_r}$ are represented as $P$ and $Q$ for brevity.
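  • Continuing the sketch, the bound can be evaluated from the two assignment matrices; here Q is taken to be the inference-network output and P the GMM responsibilities, which is one plausible reading of the formulas above:

```python
import numpy as np

def variational_lower_bound(Q, P, eps=1e-12):
    """L_r = sum_i { E_Q[log P(k|h_i)] + H(k|h_i) } for assignment matrices Q, P of shape (N, g)."""
    expected_log_p = (Q * np.log(P + eps)).sum(axis=1)   # E_Q[log P(k|h_i)]
    entropy = -(Q * np.log(Q + eps)).sum(axis=1)         # H(k|h_i) = -E_Q[log Q(k|h_i)]
    return float((expected_log_p + entropy).sum())

# -L_r (and likewise -L_c for the feature branch) serves as the clustering-assignment loss.
```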
  • The clustering assignment probability for the jth feature belonging to the kth cluster is expressed as:
  • $$\gamma_{c(j)}^k = \frac{\pi_c^k\,\mathcal{N}(v_j \mid \mu_c^k, \Sigma_c^k)}{\sum_{k'=1}^{m}\pi_c^{k'}\,\mathcal{N}(v_j \mid \mu_c^{k'}, \Sigma_c^{k'})}$$
  • where $\pi_c^k$, $\mu_c^k$, and $\Sigma_c^k$ are the mixture probability, mean, and covariance of the $k$th component in the GMM for the features, and $m$ is the number of feature clusters. The variational lower bound on the log-likelihood for features is:
  • $$\mathcal{L}_c = \sum_{j=1}^{N_c}\left\{E_Q[\log(P(k, v_j))] - E_Q(\log(Q(k \mid v_j)))\right\}$$
  • where $N_c = d$ is the number of features, and $P_{\phi_c}$ and $Q_{\eta_c}$ are denoted as $P$ and $Q$ for brevity. Finally, the present embodiments take $-\mathcal{L}_r$ and $-\mathcal{L}_c$ as the losses for the clustering assignments of instances and features.
  • The cross-loss block 108 uses mutual information to correlate the trainings of instances and features. Based on the clustering assignments, the present embodiments construct a joint probability distribution between instances and features, $p(X, Y)$, and a joint probability distribution between instance clusters and feature clusters, $p(\hat{X}, \hat{Y})$. Block 108 penalizes the mutual information loss between the two joint probability distributions.
  • Given the clustering assignment probability of the $i$th instance as $\gamma_{r(i)} = (\gamma_{r(i)}^1, \dots, \gamma_{r(i)}^g)^T$ and of the $j$th feature as $\gamma_{c(j)} = (\gamma_{c(j)}^1, \dots, \gamma_{c(j)}^m)^T$, the joint probability between the $i$th instance and the $j$th feature is denoted as $p(x_i, y_j) = \mathcal{F}(\gamma_{r(i)}, \gamma_{c(j)})$, where $\mathcal{F}(\cdot)$ is a function that calculates the joint probability, such as the dot product. The joint probability between the $s$th instance cluster, $\hat{x}_s$, and the $t$th feature cluster, $\hat{y}_t$, is calculated as:

  • $$p(\hat{x}_s, \hat{y}_t) = \sum\left\{p(x_i, y_j) \mid x_i \in \hat{x}_s,\, y_j \in \hat{y}_t\right\}$$
  • The dot product can be used for $\mathcal{F}(\cdot)$ because many use cases have equal numbers of instance clusters and feature clusters and because there is a corresponding relationship between instance clusters and feature clusters, where similar instances share similar features. Although the dot product is specifically contemplated, the function can be any appropriate function according to the needs of the application.
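  • A sketch of this joint-probability construction under the dot-product choice is shown below; it assumes equal cluster numbers ($g = m$) and uses hard argmax memberships for the cluster-level sum, both of which are illustrative assumptions:

```python
import numpy as np

def joint_distributions(Gr, Gc):
    """Gr: (n, g) instance assignments; Gc: (d, m) feature assignments; assumes g == m."""
    P = Gr @ Gc.T                          # p(x_i, y_j) from dot(gamma_r(i), gamma_c(j))
    P = P / P.sum()                        # normalize so the values form a joint distribution
    r_lab, c_lab = Gr.argmax(axis=1), Gc.argmax(axis=1)   # hard memberships for the cluster sum
    g, m = Gr.shape[1], Gc.shape[1]
    P_hat = np.zeros((g, m))
    for s in range(g):
        for t in range(m):
            P_hat[s, t] = P[np.ix_(r_lab == s, c_lab == t)].sum()   # p(x_hat_s, y_hat_t)
    return P, P_hat

# Usage with hypothetical soft assignments:
Gr = np.random.dirichlet(np.ones(4), size=128)   # n = 128 instances, g = 4 clusters
Gc = np.random.dirichlet(np.ones(4), size=500)   # d = 500 features, m = 4 clusters
P, P_hat = joint_distributions(Gr, Gc)
```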
  • Given the joint probability distributions $p(X, Y)$ and $p(\hat{X}, \hat{Y})$, the mutual information between $X$ and $Y$ and between $\hat{X}$ and $\hat{Y}$ is calculated as:

  • $$I(X;Y) = \sum_{x_i}\sum_{y_j} p(x_i, y_j)\log\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)} \qquad I(\hat{X};\hat{Y}) = \sum_{\hat{x}_s}\sum_{\hat{y}_t} p(\hat{x}_s, \hat{y}_t)\log\frac{p(\hat{x}_s, \hat{y}_t)}{p(\hat{x}_s)\,p(\hat{y}_t)}$$

  • where $p(x_i) = \sum_{y_j} p(x_i, y_j)$, $p(y_j) = \sum_{x_i} p(x_i, y_j)$, $p(\hat{x}_s) = \sum_{\hat{y}_t} p(\hat{x}_s, \hat{y}_t)$, and $p(\hat{y}_t) = \sum_{\hat{x}_s} p(\hat{x}_s, \hat{y}_t)$. The difference $I(X;Y) - I(\hat{X};\hat{Y})$ is:

  • $$I(X;Y) - I(\hat{X};\hat{Y}) = \mathrm{KL}\left(p(X, Y)\,\|\,q(X, Y)\right)$$

  • where $\mathrm{KL}(\cdot)$ is the Kullback-Leibler divergence and

  • $$q(x_i, y_j) = p(\hat{x}_s, \hat{y}_t)\left(\frac{p(x_i)}{p(\hat{x}_s)}\right)\left(\frac{p(y_j)}{p(\hat{y}_t)}\right).$$
  • This difference is greater than or equal to zero, and each joint probability distribution is also greater than or equal to zero, leaving the instance-feature cross loss as:

  • $$1 - \frac{I(\hat{X};\hat{Y})}{I(X;Y)}$$
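  • Given the two joint distributions (for example, from the joint_distributions sketch above), the mutual-information terms and the cross loss follow directly; this is again a hedged numpy sketch:

```python
import numpy as np

def mutual_information(P, eps=1e-12):
    """I(X;Y) for a joint distribution P whose entries sum to 1."""
    px = P.sum(axis=1, keepdims=True)      # p(x_i) = sum_j p(x_i, y_j)
    py = P.sum(axis=0, keepdims=True)      # p(y_j) = sum_i p(x_i, y_j)
    return float((P * np.log((P + eps) / (px @ py + eps))).sum())

def cross_loss(P, P_hat):
    """Instance-feature cross loss 1 - I(X_hat; Y_hat) / I(X; Y)."""
    return 1.0 - mutual_information(P_hat) / mutual_information(P)
```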
  • The cross loss term shows that the difference between the joint probability distributions should not be significant for an optimal co-clustering.
  • Co-clustering is then performed in block 110 using the cross loss. Co-clustering optimizes an objective function,
  • $$\min_{\theta_r, \theta_c, \eta_r, \eta_c} J = J_1 + J_2 + J_3,$$
  • over the parameters $\theta_r$, $\theta_c$, $\eta_r$, and $\eta_c$, where $J_1$ and $J_2$ are the losses for the training of instances and features, respectively, $J_3$ is the instance-feature cross loss, $\theta_r$ and $\theta_c$ are the parameters of the autoencoders 102, and $\eta_r$ and $\eta_c$ are the parameters of the inference neural networks 104. The parts of the objective function are broken down as follows:
  • $$J_1 = \frac{1}{n}\sum_{i=1}^{n} l(x_i, g_r(z_i)) + \lambda_1 P_{ae}(\theta_r) + \lambda_2(-\mathcal{L}_r) + \lambda_3 P_{inf}(\Sigma_r)$$

  • $$J_2 = \frac{\lambda_4}{d}\sum_{j=1}^{d} l(y_j, g_c(w_j)) + \lambda_5 P_{ae}(\theta_c) + \lambda_6(-\mathcal{L}_c) + \lambda_7 P_{inf}(\Sigma_c)$$

  • $$J_3 = \lambda_8\left(1 - \frac{I(\hat{X};\hat{Y})}{I(X;Y)}\right)$$
  • where $l(x_i, g_r(z_i))$ and $l(y_j, g_c(w_j))$ are the reconstruction losses for the autoencoders 102, $P_{ae}(\theta_r)$ and $P_{ae}(\theta_c)$ are the penalties for the parameters of the autoencoders 102, the $\lambda$ factors are parameters used to balance the different parts of the loss function, and $\mathcal{L}_r$ and $\mathcal{L}_c$ are the variational lower bounds. The $\lambda$ parameters are optimized by cross-validation. The terms $P_{inf}(\Sigma_r)$ and $P_{inf}(\Sigma_c)$ are the sums of the inverses of the diagonal entries of the covariance matrices:
  • $$P_{inf}(\Sigma_r) = \sum_{k=1}^{g}\sum_{i=1}^{d_r}\frac{1}{\Sigma_{r,ii}^{k}} \qquad P_{inf}(\Sigma_c) = \sum_{k=1}^{m}\sum_{j=1}^{d_c}\frac{1}{\Sigma_{c,jj}^{k}}$$
  • where $d_r$ and $d_c$ are the dimensionalities of the outputs of the autoencoders 102. The $P_{inf}$ terms are used to avoid trivial solutions where diagonal entries in the covariance matrices degenerate to zero. The output of the optimization is the clustering assignments of both instances and features.
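  • The covariance penalty and the assembly of the overall objective can be sketched as follows; the placeholder values stand in for the terms computed by the earlier sketches, and the $\lambda$ settings are arbitrary examples to be replaced by cross-validated values:

```python
import numpy as np

def p_inf(Sigma):
    """P_inf = sum_k sum_i 1 / Sigma^k_ii; discourages covariance diagonals collapsing to zero."""
    return float(sum((1.0 / np.clip(np.diag(S), 1e-12, None)).sum() for S in Sigma))

# Placeholder scalars standing in for terms computed by the earlier sketches.
recon_r, recon_c = 0.42, 0.37          # autoencoder reconstruction losses
pen_ae_r, pen_ae_c = 0.01, 0.01        # weight penalties P_ae(theta_r), P_ae(theta_c)
L_r, L_c = -5.0, -4.0                  # variational lower bounds
cross = 0.2                            # 1 - I(X_hat;Y_hat) / I(X;Y)
Sigma_r = [np.eye(4) * 0.1] * 4        # per-component covariances, instance branch
Sigma_c = [np.eye(4) * 0.1] * 4        # per-component covariances, feature branch

lam = {i: 1.0 for i in range(1, 9)}    # lambda_1..lambda_8, tuned by cross-validation
J1 = recon_r + lam[1]*pen_ae_r + lam[2]*(-L_r) + lam[3]*p_inf(Sigma_r)
J2 = lam[4]*recon_c + lam[5]*pen_ae_c + lam[6]*(-L_c) + lam[7]*p_inf(Sigma_c)
J3 = lam[8]*cross
J = J1 + J2 + J3                       # objective minimized over theta_r, theta_c, eta_r, eta_c
```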
  • Referring now to FIG. 2, an artificial neural network (ANN) architecture 200 is shown. It should be understood that the present architecture is purely exemplary and that other architectures or types of neural network may be used instead. In the context of the present embodiments, it should be understood that additional layers will be used for the autoencoders 102, inference networks 104, and GMM networks 106. The ANN embodiment described herein is included with the intent of illustrating general principles of neural network computation at a high level of generality and should not be construed as limiting in any way.
  • Furthermore, the layers of neurons described below and the weights connecting them are described in a general manner and can be replaced by any type of neural network layers with any appropriate degree or type of interconnectivity. For example, layers can include convolutional layers, pooling layers, fully connected layers, softmax layers, or any other appropriate type of neural network layer. Furthermore, layers can be added or removed as needed and the weights can be omitted for more complicated forms of interconnection.
  • During feed-forward operation, a set of input neurons 202 each provide an input signal in parallel to a respective row of weights 204. In the hardware embodiment described herein, the weights 204 each have a respective settable value, such that a weight output passes from the weight 204 to a respective hidden neuron 206 to represent the weighted input to the hidden neuron 206. In software embodiments, the weights 204 may simply be represented as coefficient values that are multiplied against the relevant signals. The signals from the weights add column-wise and flow to the hidden neurons 206.
  • The hidden neurons 206 use the signals from the array of weights 204 to perform some calculation. The hidden neurons 206 then output a signal of their own to another array of weights 204. This array performs in the same way, with a column of weights 204 receiving a signal from their respective hidden neuron 206 to produce a weighted signal output that adds row-wise and is provided to the output neuron 208.
  • It should be understood that any number of these stages may be implemented, by interposing additional layers of arrays and hidden neurons 206. It should also be noted that some neurons may be constant neurons 209, which provide a constant output to the array. The constant neurons 209 can be present among the input neurons 202 and/or hidden neurons 206 and are only used during feed-forward operation.
  • During back propagation, the output neurons 208 provide a signal back across the array of weights 204. The output layer compares the generated network response to training data and computes an error. The error signal can be made proportional to the error value. In this example, a row of weights 204 receives a signal from a respective output neuron 208 in parallel and produces an output which adds column-wise to provide an input to hidden neurons 206. The hidden neurons 206 combine the weighted feedback signal with a derivative of their feed-forward calculations and store an error value before outputting a feedback signal to their respective columns of weights 204. This back propagation travels through the entire network 200 until all hidden neurons 206 and the input neurons 202 have stored an error value.
  • During weight updates, the stored error values are used to update the settable values of the weights 204. In this manner the weights 204 can be trained to adapt the neural network 200 to errors in its processing. It should be noted that the three modes of operation, feed forward, back propagation, and weight update, do not overlap with one another.
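  • The toy numpy example below walks through the three phases just described (feed-forward, back propagation, weight update) for a single pair of fully connected layers; it is a generic illustration of the principle, not the patent's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))               # input neurons
t = np.array([[1.0]])                     # training target
W1 = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(3, 1))   # hidden -> output weights
lr = 0.1

for _ in range(3):
    # Feed-forward: weighted sums flow from inputs to hidden to output neurons.
    h = np.tanh(x @ W1)
    y = h @ W2
    # Back propagation: the output error flows back across the weights, and each
    # hidden neuron combines it with the derivative of its feed-forward activation.
    err_out = y - t
    err_hidden = (err_out @ W2.T) * (1.0 - h**2)
    # Weight update: the stored error values adjust the settable weight values.
    W2 -= lr * h.T @ err_out
    W1 -= lr * x.T @ err_hidden
```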
  • Referring now to FIG. 3, a method for co-clustering data is shown. Block 302 trains a co-clustering network in an end-to-end fashion. The network is described above, with separate branches being trained for the respective instances and features using an autoencoder 102, an inference network 104, and a GMM network 106. The two branches are then cross-correlated in block 108, and the cross-correlation loss information is used in co-clustering to generate an output. The training process uses training data that includes a set of known inputs and their corresponding known co-clustered outputs, which can be supplied by any appropriate means. The training 302 uses discrepancies between the network's generated output and the expected output to adjust the weights 204 of the network.
  • It is specifically contemplated that the entire co-clustering process is trained end-to-end, rather than training each segment in a piecewise fashion. This advantageously prevents the training process from stopping in local optima in the autoencoders 102, helping improve overall co-clustering performance.
  • Block 304 then uses the trained network to perform clustering on input data that has dependencies between its rows and columns. As noted above, block 304 reduces the dimensionality of the data and then performs inferences on the rows and the columns before identifying a mutual information loss between the rows and the columns that can be used to co-cluster them. The output can be, for example, a matrix having one or more co-clusters within it, with the co-clusters representing groupings of data that have relationships between their column and row information.
  • Block 306 then uses the trained co-clustering network to identify clustered features of a new document. In some embodiments, the new document can represent textual data, but it should be understood that other embodiments can include documents that represent any kind of data, such as graphical data, audio data, binary data, executable data, etc. Block 308 uses the network to identify document clusters based on how the identified features of the new document align with known feature clusters. Thus, in one example, the words in a text document can be mapped to word clusters for known documents. The word clusters thereby identify corresponding co-clustered document clusters, such that block 308 finds a classification for the new document.
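  • As a hedged illustration of blocks 306 and 308, the sketch below assigns a new document to a category by looking up which word clusters its words fall into; the word-to-cluster map and the cluster-to-category correspondence are hypothetical stand-ins for an actual co-clustering output:

```python
from collections import Counter

# Hypothetical output of co-clustering: each known word belongs to a word (feature) cluster,
# and each word cluster corresponds to a co-clustered document cluster / category.
word_to_cluster = {"goal": 0, "match": 0, "league": 0, "stock": 1, "market": 1, "shares": 1}
cluster_to_category = {0: "sports", 1: "finance"}

def classify(document_words):
    """Vote over the word clusters hit by the document's major words."""
    votes = Counter(word_to_cluster[w] for w in document_words if w in word_to_cluster)
    if not votes:
        return None
    return cluster_to_category[votes.most_common(1)[0][0]]

print(classify(["the", "stock", "market", "shares", "rose"]))   # -> "finance"
```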
  • Embodiments described herein may be entirely hardware, entirely software or including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
  • Each computer program may be tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
  • A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
  • Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
  • Referring now to FIG. 4, a co-clustering system 400 is shown. The system 400 includes a hardware processor 402 and memory 404. A co-clustering neural network 406 is implemented as described above, with autoencoders 102, inference networks 104, and GMM networks 106. The co-clustering neural network 406 also includes static functions, such as the cross-loss block 108 and the joint optimization performed by co-clustering 110.
  • A training module 408 can be implemented as software that is stored in the memory 404 and that is executed by the hardware processor 402. In other embodiments, the training module 408 can be implemented in one or more discrete hardware components such as, e.g., an application-specific integrated circuit or a field-programmable gate array. The training module 408 trains the neural network 406 in an end-to-end fashion using a provided set of training data.
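  • A minimal end-to-end training sketch, assuming the CoClusteringNetwork classes sketched above, might look like the following. The cross-correlation loss shown here is an illustrative stand-in that estimates the mutual information between row-cluster and column-cluster assignments directly from the data matrix; the joint objective is the sum J = J1 + J2 + J3 recited in the claims.

```python
# Minimal end-to-end training sketch (assumes the CoClusteringNetwork classes
# sketched above). The cross-correlation loss is an illustrative stand-in that
# estimates mutual information between row and column cluster assignments.
import torch
import torch.nn.functional as F

def cross_correlation_loss(X, P_row, P_col, eps=1e-10):
    """Negative mutual information between instance and feature clusters.

    X:     (n, d) non-negative data matrix (e.g., document-word counts)
    P_row: (n, k_row) soft instance-cluster assignments
    P_col: (d, k_col) soft feature-cluster assignments
    """
    joint = P_row.t() @ X @ P_col           # (k_row, k_col) co-occurrence mass
    joint = joint / (joint.sum() + eps)     # joint distribution over co-clusters
    p_r = joint.sum(dim=1, keepdim=True)    # row-cluster marginal
    p_c = joint.sum(dim=0, keepdim=True)    # column-cluster marginal
    mi = (joint * (torch.log(joint + eps) - torch.log(p_r @ p_c + eps))).sum()
    return -mi                              # minimizing this maximizes MI

def train(model, X, epochs=100, lr=1e-3):
    """Jointly optimize J = J1 + J2 + J3 over all network parameters."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        z_r, X_hat_rows = model.row_autoencoder(X)       # instances (rows)
        z_c, X_hat_cols = model.col_autoencoder(X.t())   # features (columns)
        j1 = F.mse_loss(X_hat_rows, X)                   # instance reconstruction
        j2 = F.mse_loss(X_hat_cols, X.t())               # feature reconstruction
        j3 = cross_correlation_loss(X, model.row_inference(z_r),
                                    model.col_inference(z_c))
        loss = j1 + j2 + j3                              # J = J1 + J2 + J3
        loss.backward()
        optimizer.step()
    return model
```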
  • Referring now to FIG. 5, an exemplary processing system 500 is shown which may represent the co-clustering system 400. The processing system 500 includes at least one processor (CPU) 504 operatively coupled to other components via a system bus 502. A cache 506, a Read Only Memory (ROM) 508, a Random Access Memory (RAM) 510, an input/output (I/O) adapter 520, a sound adapter 530, a network adapter 540, a user interface adapter 550, and a display adapter 560 are operatively coupled to the system bus 502.
  • A first storage device 522 is operatively coupled to system bus 502 by the I/O adapter 520. The storage device 522 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth. If additional storage devices are included, they can be the same type of storage device or different types of storage devices.
  • A speaker 532 is operatively coupled to system bus 502 by the sound adapter 530. A transceiver 542 is operatively coupled to system bus 502 by network adapter 540. A display device 562 is operatively coupled to system bus 502 by display adapter 560.
  • A first user input device 552 is operatively coupled to system bus 502 by user interface adapter 550. The user input device 552 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present principles. If additional user input devices are included, they can be the same type of user input device or different types of user input devices. The user input device 552 is used to input and output information to and from system 500.
  • Of course, the processing system 500 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in processing system 500, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system 500 are readily contemplated by one of ordinary skill in the art given the teachings of the present principles provided herein.
  • The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

Claims (19)

What is claimed is:
1. A method for co-clustering data, comprising:
reducing dimensionality for instances and features of an input dataset independently of one another;
determining a mutual information loss for the instances and the features independently of one another;
cross-correlating the instances and the features, using a processor, based on the mutual information loss, to determine a cross-correlation loss; and
determining co-clusters in the input data based on the cross-correlation loss.
2. The method of claim 1, further comprising classifying a new instance based on associated new features.
3. The method of claim 1, wherein the instances include documents and the features include words associated with respective documents.
4. The method of claim 1, wherein determining the mutual information loss includes an inference neural network step and a Gaussian mixture model step.
5. The method of claim 4, further comprising training an inference neural network and a Gaussian mixture model in an end-to-end fashion.
6. The method of claim 1, wherein determining co-clusters includes optimizing an objective function that includes a respective dimension reconstruction loss term for the instances and for the features and a cross-correlation loss term that includes the determined cross-correlation loss.
7. The method of claim 6, wherein the objective function is:
$\min_{\theta_r, \theta_c, \eta_r, \eta_c} J = J_1 + J_2 + J_3$
where J1 is the reconstruction loss term for the instances, J2 is the reconstruction loss term for the features, J3 is the cross-correlation loss term, θr and θc are dimension reduction parameters for the instances and the features, respectively, and ηr and ηc are mutual information loss parameters for the instances and the features, respectively.
8. The method of claim 6, wherein reducing the dimensionality of the instances and the features comprises applying respective autoencoders to the input data.
9. The method of claim 8, wherein each autoencoder determines a dimension reconstruction loss by reducing the dimensionality of data and then restoring the reduced dimensionality data to an original dimensionality.
10. The method of claim 1, further comprising performing text classification using the determined co-clusters.
11. A data co-clustering system, comprising:
an instance autoencoder configured to reduce a dimensionality for instances of an input dataset;
a feature autoencoder configured to reduce a dimensionality for features of an input dataset;
an instance mutual information loss branch configured to determine a mutual information loss for the instances;
a feature mutual information loss branch configured to determine a mutual information loss for the features;
a processor configured to cross-correlate the instances and the features, based on the mutual information loss, to determine a cross-correlation loss, and to determine co-clusters in the input data based on the cross-correlation loss.
12. The system of claim 11, wherein the processor is further configured to classify a new instance based on associated new features.
13. The system of claim 11, wherein the instances include documents and the features include words associated with respective documents.
14. The system of claim 11, wherein the input dataset comprises a matrix having columns that represent one of the features and the instances and rows that represent the other of the features and the instances.
15. The system of claim 11, wherein each mutual information loss branch determines a respective mutual information loss using an inference neural network and a Gaussian mixture model.
16. The system of claim 15, further comprising a training module configured to train the inference neural network and a Gaussian mixture model in an end-to-end fashion.
17. The system of claim 11, wherein the processor is further configured to determine co-clusters by optimizing an objective function that includes a respective dimension reconstruction loss term for the instances and for the features and a cross-correlation loss term that includes the determined cross-correlation loss.
18. The system of claim 17, wherein the objective function is:
$\min_{\theta_r, \theta_c, \eta_r, \eta_c} J = J_1 + J_2 + J_3$
where J1 is the reconstruction loss term for the instances, J2 is the reconstruction loss term for the features, J3 is the cross-correlation loss term, θr and θc are dimension reduction parameters for the instances and the features, respectively, and ηr and ηc are mutual information loss parameters for the instances and the features, respectively.
20. The system of claim 17, wherein each autoencoder determines a dimension reconstruction loss by reducing the dimensionality of data and then restoring the reduced dimensionality data to an original dimensionality.
US16/429,425 2018-06-01 2019-06-03 Deep Co-Clustering Abandoned US20190370651A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/429,425 US20190370651A1 (en) 2018-06-01 2019-06-03 Deep Co-Clustering

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862679749P 2018-06-01 2018-06-01
US16/429,425 US20190370651A1 (en) 2018-06-01 2019-06-03 Deep Co-Clustering

Publications (1)

Publication Number Publication Date
US20190370651A1 true US20190370651A1 (en) 2019-12-05

Family

ID=68693557

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/429,425 Abandoned US20190370651A1 (en) 2018-06-01 2019-06-03 Deep Co-Clustering

Country Status (1)

Country Link
US (1) US20190370651A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090300547A1 (en) * 2008-05-30 2009-12-03 Kibboko, Inc. Recommender system for on-line articles and documents
US20200327404A1 (en) * 2016-03-28 2020-10-15 Icahn School Of Medicine At Mount Sinai Systems and methods for applying deep learning to data
US20180024968A1 (en) * 2016-07-22 2018-01-25 Xerox Corporation System and method for domain adaptation using marginalized stacked denoising autoencoders with domain prediction regularization
US20180129906A1 (en) * 2016-11-07 2018-05-10 Qualcomm Incorporated Deep cross-correlation learning for object tracking
US20200065656A1 (en) * 2016-11-15 2020-02-27 Google Llc Training neural networks using a clustering loss
US20180165554A1 (en) * 2016-12-09 2018-06-14 The Research Foundation For The State University Of New York Semisupervised autoencoder for sentiment analysis
US20190228312A1 (en) * 2018-01-25 2019-07-25 SparkCognition, Inc. Unsupervised model building for clustering and anomaly detection

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BADINO, L. et al., "An Auto-encoder based Approach to Unsupervised Learning of Subword Units" (Year: 2014) *
PENG, H. et al., "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy" (Year: 2005) *
QIU, G., "Image and Feature Co-Clustering" (Year: 2004) *
USAMA, M. et al., "Unsupervised Machine Learning for Networking: Techniques, Applications and Research Challenges" (Year: 2017) *
VINCENT, P. et al., "Extracting and Composing Robust Features with Denoising Autoencoders" (Year: 2008) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898635A (en) * 2020-06-24 2020-11-06 华为技术有限公司 Neural network training method, data acquisition method and device
US20220019888A1 (en) * 2020-07-20 2022-01-20 Adobe Inc. Unified framework for dynamic clustering and discrete time event prediction

Similar Documents

Publication Publication Date Title
US11770571B2 (en) Matrix completion and recommendation provision with deep learning
US11941523B2 (en) Stochastic gradient boosting for deep neural networks
US20200184339A1 (en) Representation learning for input classification via topic sparse autoencoder and entity embedding
Gribonval et al. Sample complexity of dictionary learning and other matrix factorizations
US8489529B2 (en) Deep convex network with joint use of nonlinear random projection, Restricted Boltzmann Machine and batch-based parallelizable optimization
CN112966114B (en) Literature classification method and device based on symmetrical graph convolutional neural network
US11996116B2 (en) Methods and systems for implementing on-device non-semantic representation fine-tuning for speech classification
CN109063719B (en) Image classification method combining structure similarity and class information
US20200065656A1 (en) Training neural networks using a clustering loss
CN108921342B (en) Logistics customer loss prediction method, medium and system
US20220207352A1 (en) Methods and systems for generating recommendations for counterfactual explanations of computer alerts that are automatically detected by a machine learning algorithm
CN112861936A (en) Graph node classification method and device based on graph neural network knowledge distillation
CN111125520B (en) Event line extraction method based on deep clustering model for news text
US11636667B2 (en) Pattern recognition apparatus, pattern recognition method, and computer program product
US20150161232A1 (en) Noise-enhanced clustering and competitive learning
US11475236B2 (en) Minimum-example/maximum-batch entropy-based clustering with neural networks
US20190370651A1 (en) Deep Co-Clustering
US11886955B2 (en) Self-supervised data obfuscation in foundation models
Vialatte et al. A study of deep learning robustness against computation failures
CN114329233A (en) Cross-region cross-scoring collaborative filtering recommendation method and system
CN112148931A (en) Meta path learning method for high-order abnormal picture classification
WO2023000165A1 (en) Method and apparatus for classifying nodes of a graph
US20220309292A1 (en) Growing labels from semi-supervised learning
US20220207353A1 (en) Methods and systems for generating recommendations for counterfactual explanations of computer alerts that are automatically detected by a machine learning algorithm
US20220367051A1 (en) Methods and systems for estimating causal effects from knowledge graphs

Legal Events

Date Code Title Description
AS Assignment

Owner name: NEC LABORATORIES AMERICA, INC., NEW JERSEY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHENG, WEI;CHEN, HAIFENG;NI, JINGCHAO;SIGNING DATES FROM 20190529 TO 20190530;REEL/FRAME:049346/0199

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION