WO2019219198A1 - Device and method for clustering input data - Google Patents

Device and method for clustering input data

Info

Publication number
WO2019219198A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
clustering
auto
cluster
points
Prior art date
Application number
PCT/EP2018/062961
Other languages
English (en)
Inventor
Elad TZOREFF
Olga KOGAN
Yoni CHOUKROUN
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN201880093500.0A priority Critical patent/CN112154453A/zh
Priority to PCT/EP2018/062961 priority patent/WO2019219198A1/fr
Publication of WO2019219198A1 publication Critical patent/WO2019219198A1/fr


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Definitions

  • the present invention relates to the field of machine learning and clustering, i.e. a process of discovering similarity structures in large data-sets. More specifically, the present invention relates to a device for clustering of input data and a corresponding method, wherein the device comprises an auto-encoding unit and a clustering unit.
  • clustering is one of the most fundamental unsupervised machine learning problems. Its main goal is to separate data-sets of input data into clusters of similar data-points. For example, clustering can be used to cluster users based on user behavior, e.g. for cybersecurity purposes, event clustering for IT operations, or clustering and anomaly detection for healthcare or industrial monitoring applications. Besides these applications, clustering is beneficial for multiple other fundamental tasks. For instance, it can serve for automatic data labeling for supervised learning and as a pre-processing step for data visualization and analysis. In the prior art, dimensionality reduction and feature extraction have been used along with clustering, to map input data into a feature space in which separation into clusters is easier with respect to the present problem's context. Using deep neural networks (DNNs), it is possible to learn non-linear mappings, which make it possible to transform input data into more clustering-friendly representations.
  • DNNs: deep neural networks.
  • in conventional solutions, dimension reduction/feature selection and clustering are treated separately in a two-phase process, as e.g. illustrated in Fig. 7.
  • the auto-encoder chooses features that are optimized for low-loss reconstruction of all variations in the input data, while clustering requires features that can reduce all data variations to a single template (i.e. a single class, or a single cluster).
  • the auto-encoder output (obtained in the first phase) loses features that are important for the clustering (which is done in the second phase).
  • the present invention aims to improve the conventional clustering devices.
  • the present invention has the objective to provide a device for clustering of input data.
  • the device comprises an auto-encoding unit and a clustering unit.
  • the auto-encoding unit employs an auto-encoding algorithm that reaches high separability of data (resulting in higher clustering accuracy) by optimizing data even before it is processed in a clustering step in the clustering unit.
  • the auto-encoding unit receives input data (which can be regarded as a data-set comprising data-points) and provides optimized output data, being low dimensional output data, to the clustering unit.
  • the auto-encoder in particular aims at subspace dimension maximization regularized by reconstruction loss.
  • a dimension of the input data is reduced, based on a reconstruction loss parameter of the input data.
  • the dimension is only reduced to such an extent that reconstruction loss in the reduced data is minimized and that the low dimensional data forwarded by the auto-encoding unit to the clustering unit is optimized for achieving high clustering accuracy.
  • the encoder output is used to obtain at least one cluster, based on the low dimensional data, and associate each data-point in the low dimensional data to one cluster of the at least one cluster.
  • in the second phase, by a two-step alternating maximization of clustering and encoder parameters (i.e. of operating parameters of the clustering unit and the auto-encoding unit), higher accuracy can be achieved compared to the results reached by prior art solutions.
  • a similarity maximization of data-points is achieved by subspace dimension maximization, i.e. by further maximizing the dimension of the low dimensional data output by the auto-encoding unit.
  • This step is in particular performed in a clustering phase by the clustering unit.
  • similarity maximization regularized by cross-cluster similarity minimization and within-cluster similarity maximization is achieved. In other words, similarities of data-points associated with different clusters are minimized, and similarities of data-points associated with a same cluster are maximized.
  • a first aspect of the present invention provides a device for clustering of input data being a data-set comprising data-points, wherein the device comprises an auto-encoding unit, configured to, in a first operating phase of the device, reduce a dimension of the input data and/or extract features relevant for clustering from the input data, thereby creating low dimensional data; and a clustering unit, configured to, in a second operating phase of the device, obtain at least one cluster, based on the low dimensional data, and configured to associate each data-point in the low dimensional data to one cluster of the at least one cluster; wherein the low dimensional data is optimized, by the auto-encoding unit, for being reconstructed without loss.
  • the low dimensional data comprises linearly independent code rows, thereby minimizing reconstruction loss.
  • Linearly independent code rows in the low dimensional data output by the auto-encoder ensure that the reconstruction loss, which occurs when reconstructing the input data from the low dimensional data, is minimized. This also helps to increase the clustering accuracy of the clustering unit.
  • the dimensional reduction of the input data includes applying a first function to the input data, which is configured to minimize pairwise similarity of data-points in the input data, to provide the low dimensional data.
  • Minimizing pairwise similarity of data-points in the auto-encoding step helps to improve accuracy of clustering in the clustering unit, and overall clustering accuracy.
  • a similarity metric is applied by the first function to the data-points in the input data.
  • Using a similarity metric by the first function ensures that minimizing the pairwise similarity can be precisely controlled according to the similarity metric.
  • the similarity metric applied by the first function is a cosine similarity.
  • the device further comprises a decoder, configured to decode the low dimensional data and compare it to the input data, to measure a reconstruction loss and adjust operating parameters of the auto-encoding unit, to minimize reconstruction loss.
  • the clustering unit is further configured to obtain a centroid parameter for each cluster.
  • the clustering unit is further configured to determine, to which cluster a data-point is assigned, based on a centroid parameter of the cluster.
  • the clustering unit is further configured to apply a second function, to minimize pairwise similarity of data-points, and to improve separability of the data-points.
  • a similarity metric applied by the second function is a same similarity metric as applied by the first function, in particular a cosine similarity of data-points.
  • the clustering unit is further configured to minimize similarities of data-points that are associated to different clusters.
  • the clustering unit is further configured to maximize similarities of data-points that are associated to a same cluster.
  • the device, in the second phase, is further configured to optimize operating parameters of the auto-encoding unit based on operating parameters of the clustering unit, and to optimize the operating parameters of the clustering unit based on the operating parameters of the auto-encoding unit.
  • the device is further configured to simultaneously optimize the operating parameters of the auto-encoding unit and the operating parameters of the clustering unit.
  • a second aspect of the present invention provides a method for clustering of input data being a data-set comprising data-points, wherein the method comprises the steps of, in a first operating phase of the device, reducing, by an auto-encoding unit, a dimension of the input data and/or extracting, by the auto-encoding unit, features relevant for clustering from the input data, thereby creating low dimensional data; and, in a second operating phase of the device, obtaining, by a clustering unit, at least one cluster, based on the low dimensional data, and associating, by the clustering unit, each data-point in the low dimensional data to one cluster of the at least one cluster; wherein the low dimensional data is optimized, by the auto-encoding unit, for being reconstructed without loss.
  • the low dimensional data comprises linearly independent code rows, thereby minimizing reconstruction loss.
  • the dimensional reduction of the input data includes applying a first function to the input data, which is configured to minimize pairwise similarity of data-points in the input data, to provide the low dimensional data.
  • a similarity metric is applied by the first function to the data-points in the input data.
  • the similarity metric applied by the first function is a cosine similarity.
  • the method further comprises decoding, by a decoder, the low dimensional data, and comparing it to the input data, to measure a reconstruction loss and adjust operating parameters of the auto-encoding unit, to minimize reconstruction loss.
  • the method further includes obtaining, by the clustering unit, a centroid parameter for each cluster. In an implementation form of the second aspect, the method further includes determining, by the clustering unit, to which cluster a data-point is assigned, based on a centroid parameter of the cluster.
  • the method further includes applying, by the clustering unit, a second function, to minimize pairwise similarity of data-points, and to improve separability of the data-points.
  • a similarity metric applied by the second function is a same similarity metric as applied by the first function, in particular a cosine similarity of data-points.
  • the method further includes minimizing, by the clustering unit, similarities of data-points that are associated to different clusters.
  • the method further includes, by the clustering unit, maximizing similarities of data-points that are associated to a same cluster.
  • the method further includes optimizing operating parameters of the auto-encoding unit based on operating parameters of the clustering unit, and optimizing the operating parameters of the clustering unit based on the operating parameters of the auto-encoding unit.
  • the method further includes simultaneously optimizing the operating parameters of the auto-encoding unit and the operating parameters of the clustering unit.
  • the method of the second aspect and its implementation forms include the same advantages as the device according to the first aspect and its implementation forms.
  • FIG. 1 shows a schematic view of a clustering device according to an embodiment of the present invention.
  • FIG. 2 shows a schematic view of a clustering device according to an embodiment of the present invention in more detail.
  • FIG. 3 shows a schematic view of a method according to an embodiment of the present invention.
  • FIG. 4 shows a schematic view of an auto-encoder.
  • FIG. 5 shows a schematic view of algorithms implemented by the present invention.
  • FIG. 6 shows a schematic view of operating results of the clustering device according to the present invention, compared to the prior art.
  • FIG. 7 shows a schematic view of clustering according to the prior art.
  • FIG. 8 shows another schematic view of clustering according to the prior art.
  • FIG. 1 shows a schematic view of a clustering device 100 according to an embodiment of the present invention.
  • the device 100 is configured for clustering of input data 101 being a data- set comprising data-points.
  • the device 100 comprises an auto-encoding unit 102 and a clustering unit 104.
  • the auto-encoding unit 102 is configured to, in a first operating phase of the device 100, reduce a dimension of the input data 101 and/or extract features relevant for clustering from the input data 101, thereby creating low dimensional data 103.
  • the low dimensional data 103 specifically includes a data-set with data-points of reduced dimension.
  • the auto-encoding unit 102 makes it possible to extract a discriminative and informative latent space (i.e. features) that is clustering oriented from the input data 101. That is, in the first operating phase, the input data 101 is reduced to a low dimensional space, providing good initial conditions for processing in the clustering unit 104. In particular, keeping linear independence between code rows in the low dimensional data 103 helps in preserving essential features for clustering in the clustering unit 104.
  • the low dimensional data 103 optionally can comprise linearly independent code rows, thereby minimizing reconstruction loss.
  • the first operating phase can also be called a learning phase of the auto-encoding unit 102.
  • the device 100 further includes the clustering unit 104, which is configured to, in a second operating phase of the device 100, obtain at least one cluster 105, based on the low dimensional data 103, and associate each data-point in the low dimensional data 103 to one cluster of the at least one cluster 105.
  • the low dimensional data 103 that is processed in the clustering unit 104, is optimized, by the auto-encoding unit 102, for being reconstructed without loss.
  • the present invention provides an increased overall clustering accuracy, by means of the device 100.
  • although there are three clusters 105 shown in Fig. 1, there can be an arbitrary number of clusters obtained by the device 100, as long as there is at least one cluster 105.
  • clustering in the clustering unit 104 relates to grouping data-points (obtained from the low dimensional data 103) with similar patterns, based on their latent space representation obtained in the first phase in the auto-encoding unit 102. The goal is to obtain several centroids (which may also be called centroid parameters) that characterize the data-set, such that each data-point is represented by the single centroid to which it is assigned.
  • the clustering unit 104 is further configured to obtain a centroid parameter for each cluster 105. In this manner, each new data-point can be compared exclusively to the centroids, to decide to which cluster it is assigned, instead of comparing it to the entire data-set. That is, the clustering unit 104 is further configured to determine to which cluster a data-point is assigned, based on a centroid parameter of a cluster. This improves the efficiency of classification of new data-points.
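  • As a rough illustration (not taken from the patent text), the following Python sketch shows how a new data-point's latent representation could be compared only against the cluster centroids using cosine similarity; the centroid values and dimensions are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between a single vector a and each row of matrix b."""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b_norm @ a_norm

def assign_to_centroid(z_new, centroids):
    """Assign a latent representation to the centroid with the highest cosine similarity."""
    sims = cosine_similarity(z_new, centroids)
    return int(np.argmax(sims)), float(np.max(sims))

# Hypothetical usage: 3 centroids in a 10-dimensional latent space.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(3, 10))
z_new = rng.normal(size=10)
cluster_id, similarity = assign_to_centroid(z_new, centroids)
print(cluster_id, similarity)
```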
  • the first and the second operating phase may also be regarded as training phases of the device 100.
  • the first operating phase may be called a training phase of the auto-encoding unit 102, while the second phase may be called a clustering phase.
  • the training phase of the auto-encoding unit 102 is motivated by the need to provide the clustering phase with a good initial separation of data-points in the low dimensional data 103.
  • a latent space with discriminative attributes is learned by using a discriminative loss function.
  • the discriminative loss function enables minimization of pairwise similarity between each pair of data-points in the data-set (which is comprised by the input data 101).
  • the dimensional reduction of the input data 101 includes applying a first function to the input data 101, which is configured to minimize pairwise similarity of data-points in the input data 101, to provide the low dimensional data 103.
  • the first function can in particular be the discriminative loss function.
  • a similarity metric can be applied by the first function to the data-points in the input data, to precisely adjust an outcome of the first function.
  • the similarity metric applied by the first function in particular can be a cosine similarity.
  • the clustering phase is initialized with the latent space obtained by the auto-encoding unit 102 in the previous phase.
  • parameters of centroids are now jointly optimized together with the auto-encoder parameters (i.e. the operating parameters of the auto-encoding unit 102).
  • the latent space is improved to achieve a higher accuracy of the overall clustering task in the device 100 while optimizing for the best representative centroids of each cluster.
  • a metric chosen as an objective for the clustering task is the sum of cosine similarities between each data-point and the centroid to which the data-point is assigned.
  • the processing of the clustering unit 104 in particular can comprise the following two optional steps: In the first step, cross-cluster separation of data-points is maximized by controlling operating parameters of the auto-encoding unit 102 by the clustering unit 104. In the second step, cross-cluster separation and within-cluster similarity are maximized by controlling operating parameters of the clustering unit 104.
  • the clustering unit 104 being further configured to apply a second function, to minimize pairwise similarity of data-points, and to improve separability of the data-points.
  • the second function can either be applied to data-points in the input data 101, or to data-points in the low dimensional data 103.
  • a similarity metric applied by the second function can be a same similarity metric as applied by the first function, in particular a cosine similarity of data-points.
  • the centroids and cluster assignments obtained during the optimization in the auto-encoding unit 102 are not yet trusted. Therefore, regularizing the clustering task with the pairwise discriminative loss function from the auto-encoder training phase continues. That is, no distinction is made between pairs of data-points that are considered similar (i.e. that are assigned to a same cluster) and the ones that are considered dissimilar (i.e. that are assigned to different clusters).
  • in the second step, the discriminative loss function (i.e. the second function) is split such that pairwise similarities of "between cluster" data-points (i.e. data-points associated to different clusters) are minimized, while pairwise similarities of "within cluster" data-points (i.e. data-points that are associated to a same cluster) are maximized.
  • the clustering unit 104 optionally can be further configured to minimize similarities of data-points that are associated to different clusters 105.
  • the clustering unit 104 can also optionally be configured to maximize similarities of data-points that are associated to a same cluster 105.
  • the device 100 can also be configured to perform a third operating phase, which can be called an inference phase.
  • each new data-point that arrives at the device 100 passes through the auto-encoding unit 102 in order to extract the new data-point's latent representation.
  • a cosine similarity between the latent representation of the new data-point and each cluster centroid is calculated.
  • the new data-point is then assigned to the cluster with the highest cosine similarity. If a low cosine similarity with respect to all clusters is observed (based on typical values of similarity obtained from the training set), the data-point shall be considered an anomaly.
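  • A minimal Python sketch of such an inference step is given below, assuming a trained encoder callable, a set of centroids and an anomaly threshold estimated from the training set; all names and values here are illustrative, not taken from the patent.

```python
import numpy as np

def infer_cluster(x_new, encoder, centroids, anomaly_threshold):
    """Pass a new data-point through the (trained) encoder, compute the cosine
    similarity of its latent representation to each cluster centroid, and either
    return the best-matching cluster or flag the point as an anomaly when all
    similarities fall below the threshold."""
    z = np.asarray(encoder(x_new), dtype=float)
    z /= np.linalg.norm(z)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ z
    best = int(np.argmax(sims))
    if sims[best] < anomaly_threshold:
        return None, float(sims[best])   # treat as anomaly
    return best, float(sims[best])

# Hypothetical usage with an identity "encoder" and two centroids.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
print(infer_cluster([0.9, 0.1], lambda x: x, centroids, anomaly_threshold=0.5))
```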
  • FIG. 2 shows a schematic view of the clustering device 100 according to an embodiment of the present invention in more detail.
  • the device 100 of Fig. 2 is based on the device 100 of Fig. 1 and therefore includes all of its functions and features. To this end, identical features are labelled with identical reference signs. All features that are now going to be described in view of Fig. 2 are optional features of the device 100.
  • the device 100 optionally further can comprise a decoder 201, configured to decode the low dimensional data 103 and compare it to the input data 101, to measure a reconstruction loss and adjust operating parameters of the auto-encoding unit 102, to minimize reconstruction loss.
  • the device 100 optionally enables joint learning of clustering and auto-encoder parameters (i.e. operating parameters of the clustering unit 104 and the auto-encoding unit 102). That is, in particular in the second phase, the device 100 is optionally further configured to optimize the operating parameters of the auto-encoding unit 102 based on operating parameters of the clustering unit 104, and optimize the operating parameters of the clustering unit 104 based on the operating parameters of the auto-encoding unit 102. Further optionally, the device 100 can be configured to simultaneously optimize the operating parameters of the auto-encoding unit 102 and the operating parameters of the clustering unit 104.
  • the following loss function is minimized with respect to the parameters θ_e, θ_d of the auto-encoding unit 102.
  • θ_e, θ_d are operating parameters of the encoder and the decoder, respectively.
  • x_i denotes a raw data representation.
  • z_i(θ_e) denotes a latent space representation (i.e. the output of the auto-encoding unit 102).
  • d is a distance between a data-point and its reconstruction (which can e.g. be L2, L1 or any other distance measure). λ is a regularization strength.
  • the first component of the loss function L(θ_e, θ_d) stands for the reconstruction loss.
  • the second component of the loss function stands for the maximal separability (which is obtained by minimizing the pairwise cosine similarity among all data-points' latent representations).
  • the solution of the present invention maximizes the dimensionality of the encoder sub-space by minimizing an L1 norm of the cosine similarities between all pairs of samples in a batch matrix (i.e. a matrix comprising input data 101 for the device 100). This encourages linear independence between matrix rows in the low dimensional data 103, and therefore maximizes the rank of the matrix.
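  • The following PyTorch sketch illustrates a loss of this form, assuming mean squared error as the reconstruction distance d and an L1 penalty on the off-diagonal pairwise cosine similarities of the latent batch matrix; the weighting and function names are illustrative assumptions, not the patent's exact formula.

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_rec, z, lam=0.1):
    """Sketch of the first-phase loss: a reconstruction term plus a discriminative
    term that minimizes the L1 norm of pairwise cosine similarities of the latent
    batch matrix Z, encouraging (nearly) linearly independent code rows."""
    rec_loss = F.mse_loss(x_rec, x)                      # reconstruction loss
    z_norm = F.normalize(z, dim=1)                       # row-normalize latent codes
    sim = z_norm @ z_norm.t()                            # pairwise cosine similarities
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)  # drop self-similarities
    disc_loss = off_diag.abs().sum() / (sim.size(0) * (sim.size(0) - 1))
    return rec_loss + lam * disc_loss
```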
  • within-cluster pairs versus cross-cluster pairs.
  • the loss function is composed of three (3) components (from left to right): A cosine similarity between centroids and their assigned data-points which is to be maximized. A reconstruction loss. A separability (which is obtained by minimizing the pairwise cosine similarity among all data-points’ latent representation, similar to the term that is used in the training phase of the auto-encoding unit 102).
  • in the second step, the centroids and assignments are trusted, and the separability term is divided into two terms: within-cluster similarity, which is to be maximized, and between-cluster similarity, which is to be minimized. That is, the loss function of the second step in the second phase becomes:
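  • Since the formula itself is not reproduced in this text, the following PyTorch sketch merely illustrates one possible form of such a second-step loss, with a point-to-centroid cosine term, a within-/between-cluster split of the separability term and a reconstruction regularizer; the sign conventions and weights are assumptions.

```python
import torch
import torch.nn.functional as F

def second_step_loss(z, x, x_rec, centroids, assign, lam_d=1.0, lam_r=0.5):
    """Illustrative second-step clustering loss (written as a minimization):
    maximize cosine similarity between each latent point and its centroid,
    maximize within-cluster similarity, minimize between-cluster similarity,
    and keep a reconstruction term as regularization.
    `assign` is a LongTensor of hard cluster assignments for the batch."""
    z_n = F.normalize(z, dim=1)
    c_n = F.normalize(centroids, dim=1)
    point_to_centroid = (z_n * c_n[assign]).sum(dim=1).mean()        # maximize
    sim = z_n @ z_n.t()                                               # pairwise cosine similarities
    same = (assign.unsqueeze(0) == assign.unsqueeze(1)).float()
    eye = torch.eye(z.size(0), device=z.device)
    within = (sim * (same - eye)).sum() / ((same - eye).sum() + 1e-8)      # maximize
    between = (sim * (1 - same)).abs().sum() / ((1 - same).sum() + 1e-8)   # minimize
    rec = F.mse_loss(x_rec, x)
    return -(point_to_centroid + lam_d * (within - between)) + lam_r * rec
```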
  • FIG. 3 shows a schematic view of a method 300 according to an embodiment of the present invention.
  • the method 300 is for operating a device 100 and thus enables clustering of input data 101.
  • the method 300 comprises a first step of, in a first operating phase of the device 100, reducing 301, by an auto-encoding unit 102 of the device 100, a dimension of the input data 101 and/or extracting 301, by the auto-encoding unit 102, features relevant for clustering from the input data 101, thereby creating low dimensional data 103.
  • the method 300 further comprises a second step of, in a second operating phase of the device 100, obtaining 302, by a clustering unit 104 of the device 100, at least one cluster 105, based on the low dimensional data 103, and associating 302, by the clustering unit 104, each data-point in the low dimensional data 103 to one cluster of the at least one cluster 105; wherein the low dimensional data 103 is optimized, by the auto-encoding unit 102, for being reconstructed without loss.
  • the present invention teaches a paradigm in which a latent space representation of a certain data-set is trained jointly with clustering parameters.
  • This end-to-end approach achieves a clustering oriented representation and therefore better clustering accuracy.
  • a conventional auto-encoding unit is first trained to minimize reconstruction loss and then applied as an initial condition to the joint optimization problem in which both clustering and encoder parameters are jointly optimized to minimize the clustering error.
  • although major attention is dedicated to the clustering phase, it can be observed that in most cases the improvement of the clustering phase over the initial phase amounts to a maximum of 15% to 20% in accuracy. This leads to the conclusion that the initial phase has a significant effect on the overall accuracy, and therefore a focus shall be put on this step.
  • L_d(z_i, z_j) = sim(z_i, z_j), where sim(·) stands for any similarity measure between a pair of data-points.
  • formula (4) shall be approximated using the batch matrix of the latent representation Z ∈ R^(|B|×d), where |B| denotes the batch size and d the latent dimension.
  • a k-nearest neighbors graph is constructed and a set A_B is determined, yielding the following approximation to formula (4):
  • L(X; θ_e, θ_d) = L_d(Z; θ_e) + λ·L_r(X, X̂)    (6)
  • λ stands for the regularization strength.
  • L_r denotes the reconstruction loss.
  • X stands for the raw input batch matrix.
  • ‖·‖_F stands for the Frobenius norm.
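  • The k-nearest-neighbors approximation mentioned above could, for example, be sketched as follows in Python (NumPy); restricting the pairwise penalty to each point's k most similar neighbours is an illustrative reading of the set A_B, not the patent's exact construction.

```python
import numpy as np

def knn_discriminative_loss(Z, k=5):
    """Sketch: approximate the pairwise discriminative loss over a batch by
    restricting it to each point's k nearest (most cosine-similar) neighbours."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Zn @ Zn.T                                   # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    nn_idx = np.argsort(-sim, axis=1)[:, :k]          # k most similar neighbours per point
    rows = np.repeat(np.arange(Z.shape[0]), k)
    return np.abs(sim[rows, nn_idx.ravel()]).sum() / (Z.shape[0] * k)
```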
  • a natural candidate for the objective function for the clustering phase is the cosine similarity between the learned centroids and each data-point, since the cosine similarity is applied to discriminate between pairs of data-points in the initial phase.
  • the primary goal of the clustering phase is to maximize the following objective function, where S stands for the assignment matrix and S_ik ∈ {0,1} are the hard decisions of the clustering procedure.
  • the clustering phase is divided into two steps. In a first step, the clustering assignments are not yet trusted, and the optimization continues to be regularized with formula (5) and the reconstruction loss. In a second step, the clustering assignments are trusted, and the discriminative objective is split into between-cluster and within-cluster terms, such that the assignments of each data-point to the associated cluster are derived from the assignment matrix S. Therefore, the optimization problem solved in the first step is given by formula (8), where λ_d, λ_r stand for the regularization strengths of the discriminative and reconstruction losses, respectively.
  • the device 100 comprises two parts: an auto-encoder (or encoder, or auto-encoding unit 102) and clustering (or clustering unit 104).
  • An auto-encoder network of the auto-encoder can be a fully convolutional neural network with alternating convolution with batch normalization and max pooling layers.
  • the decoder applies upsampling to higher resolution using resize with nearest-neighbor interpolation, followed by convolution layers, also with layer-wise batch normalization.
  • the auto-encoder architecture is depicted in Fig. 4.
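  • A rough PyTorch sketch of an architecture of this kind is shown below; the channel sizes, depth and input resolution are illustrative assumptions and are not specified by this text.

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Sketch of a fully convolutional auto-encoder: the encoder alternates
    convolution + batch normalization + max pooling, and the decoder resizes
    with nearest-neighbour upsampling followed by convolutions with batch
    normalization. Sizes are illustrative, not taken from the patent."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(32, in_channels, 3, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent representation
        x_rec = self.decoder(z)      # reconstruction
        return z, x_rec

# Example: a batch of 28x28 single-channel images (e.g. MNIST-sized input).
model = ConvAutoEncoder()
z, x_rec = model(torch.randn(8, 1, 28, 28))
```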
  • Training the auto-encoder in the initial phase begins with the minimization of formula (6).
  • the regularization strength of the discriminative loss is a hyper-parameter that is determined in the region λ ∈ (0, 1].
  • the value of λ differs among data-sets: data-sets that are more complex require more aggressive discrimination, while the strength of the reconstruction loss is kept constant.
  • the training is done on large batches to ensure
  • λ is increased in order to optimize for a discrimination in directions that enable a reasonable reconstruction.
  • the training scheme for the auto-encoder is summarized by the algorithm shown in Fig. 5A.
  • An alternating maximization scheme is applied, in which each set of variables is optimized while the other sets remain fixed.
  • the optimization procedure iterates until convergence.
  • the clustering phase is divided into two stages (i.e. the two steps of the second phase) which differ from each other by the objective functions they aim to maximize.
  • the pseudo-code for the first stage (of the second phase) is summarized in the algorithm shown in Fig. 5B.
  • in the first stage, formula (8) is optimized, using a relatively large regularization strength λ_d ∈ [1, 5] for the discriminative loss and a lower regularization strength λ_r ∈ (0, 1] for the reconstruction loss.
  • the while loop in the algorithm refers to the alternating maximization scheme, in which each set of parameters is maximized over several epochs, where the optimization is carried out using back-propagation for each Z, X ∈ B. Termination of the inner loop occurs either when the maximal number of iterations N_t is exceeded or when the clustering objective L_c does not improve over consecutive iterations above a predefined tolerance tol. Note that λ_d, λ_r, λ_min, β are hyper-parameters and are data-set dependent.
  • the maximization of each set of variables is carried out using back-propagation of large batches over several epochs. In both stages, large batches are used, as in the auto-encoder training phase, and for several epochs.
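  • The alternating scheme could be sketched in PyTorch roughly as follows, where `loss_fn`, the iteration count and the tolerance are placeholders for the quantities described above; this is a skeleton under those assumptions, not the algorithm of Figs. 5A/5B.

```python
import torch

def alternate_optimize(loss_fn, encoder_params, clustering_params,
                       data_loader, n_iters=50, epochs_per_set=2, lr=1e-3, tol=1e-4):
    """Skeleton of an alternating optimization scheme: optimize one set of
    parameters over several epochs while the other is held fixed, then swap,
    until the objective stops improving or n_iters is reached.
    `loss_fn(batch)` is assumed to return the scalar objective to minimize."""
    opt_enc = torch.optim.Adam(encoder_params, lr=lr)
    opt_clu = torch.optim.Adam(clustering_params, lr=lr)
    prev = float("inf")
    for _ in range(n_iters):
        for opt in (opt_clu, opt_enc):              # alternate the parameter sets
            for _ in range(epochs_per_set):
                total = 0.0
                for batch in data_loader:
                    opt.zero_grad()                 # clear stale gradients of the active set
                    loss = loss_fn(batch)
                    loss.backward()
                    opt.step()
                    total += loss.item()
        if abs(prev - total) < tol:                 # stop when the objective stalls
            break
        prev = total
    return prev
```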
  • the entire procedure of the clustering step is similar to the pseudo-code of the algorithm shown in Fig.
  • FIG. 6 shows a schematic view of operating results of the clustering device 100 according to the present invention, compared to the prior art. Assuming that an output of the auto-encoder phase can be evaluated, and a singular value histogram of batch output can be calculated, it can be demonstrated how the device 100 performs when being run on an MNIST data-set.
  • the solution according to the present invention demonstrates an easily identifiable signature of singular value decomposition values as shown in Figs. 6B and 6C.
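  • The singular value signature mentioned here can be computed from the batch latent matrix with a few lines of NumPy; the matrices below are synthetic stand-ins for illustration, not results from the patent.

```python
import numpy as np

def latent_singular_values(Z):
    """Sketch: singular value spectrum of a batch latent matrix Z (batch_size x code_dim).
    A flat spectrum indicates nearly linearly independent code rows, i.e. a
    high-rank, clustering-friendly representation."""
    return np.linalg.svd(Z, compute_uv=False)

# Hypothetical comparison: a near-full-rank code vs. a highly redundant one.
rng = np.random.default_rng(0)
Z_independent = rng.normal(size=(256, 10))
Z_redundant = np.outer(rng.normal(size=256), rng.normal(size=10))  # rank-1 code
print(latent_singular_values(Z_independent)[:5])
print(latent_singular_values(Z_redundant)[:5])
```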
  • the results of the present invention can be detected and are identifiable regardless of a layer size of the auto-encoding unit 102. In Fig.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a device (100) for clustering input data (101) being a data-set comprising data-points, the device (100) comprising an auto-encoding unit (102), configured to, in a first operating phase of the device (100), reduce a dimension of the input data (101) and/or extract features relevant for clustering from the input data (101), thereby creating low dimensional data (103), and a clustering unit (104), configured to, in a second operating phase of the device (100), obtain at least one cluster (105), based on the low dimensional data (103), and associate each data-point in the low dimensional data (103) to one cluster of the at least one cluster (105), the low dimensional data (103) being optimized, by the auto-encoding unit (102), for being reconstructed without loss.
PCT/EP2018/062961 2018-05-17 2018-05-17 Dispositif et procédé pour regrouper des données d'entrée WO2019219198A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880093500.0A CN112154453A (zh) 2018-05-17 2018-05-17 用于对输入数据进行聚类的设备和方法
PCT/EP2018/062961 WO2019219198A1 (fr) 2018-05-17 2018-05-17 Dispositif et procédé pour regrouper des données d'entrée

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/062961 WO2019219198A1 (fr) 2018-05-17 2018-05-17 Dispositif et procédé pour regrouper des données d'entrée

Publications (1)

Publication Number Publication Date
WO2019219198A1 true WO2019219198A1 (fr) 2019-11-21

Family

ID=62222659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/062961 WO2019219198A1 (fr) 2018-05-17 2018-05-17 Dispositif et procédé pour regrouper des données d'entrée

Country Status (2)

Country Link
CN (1) CN112154453A (fr)
WO (1) WO2019219198A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610107A (zh) * 2021-07-02 2021-11-05 同盾科技有限公司 特征优化方法及装置
WO2023065696A1 (fr) * 2021-10-21 2023-04-27 深圳云天励飞技术股份有限公司 Procédé et appareil de recherche de plus proche voisin, terminal et support de stockage

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6134541A (en) * 1997-10-31 2000-10-17 International Business Machines Corporation Searching multidimensional indexes using associated clustering and dimension reduction information

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6134541A (en) * 1997-10-31 2000-10-17 International Business Machines Corporation Searching multidimensional indexes using associated clustering and dimension reduction information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BO YANG ET AL: "Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering", COPYRIGHT, 13 June 2017 (2017-06-13), XP055452200, Retrieved from the Internet <URL:https://arxiv.org/pdf/1610.04794.pdf> [retrieved on 20180219] *
ELIE ALJALBOUT ET AL: "Clustering with Deep Learning: Taxonomy and New Methods", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 January 2018 (2018-01-23), XP080854112 *
LEYLI-ABADI MILAD ET AL: "Denoising Autoencoder as an Effective Dimensionality Reduction and Clustering of Text Data", 23 April 2017, MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION - MICCAI 2015 : 18TH INTERNATIONAL CONFERENCE, MUNICH, GERMANY, OCTOBER 5-9, 2015; PROCEEDINGS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CH, ISBN: 978-3-642-38287-1, ISSN: 0302-9743, XP047410529 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610107A (zh) * 2021-07-02 2021-11-05 同盾科技有限公司 特征优化方法及装置
WO2023065696A1 (fr) * 2021-10-21 2023-04-27 深圳云天励飞技术股份有限公司 Procédé et appareil de recherche de plus proche voisin, terminal et support de stockage

Also Published As

Publication number Publication date
CN112154453A (zh) 2020-12-29

Similar Documents

Publication Publication Date Title
Liu et al. Localized sparse incomplete multi-view clustering
Zhu et al. Robust joint graph sparse coding for unsupervised spectral feature selection
Deng et al. A survey on soft subspace clustering
Ding et al. Low-rank common subspace for multi-view learning
Cheng et al. k NN algorithm with data-driven k value
Peng et al. Learning locality-constrained collaborative representation for robust face recognition
Sun et al. Task-driven dictionary learning for hyperspectral image classification with structured sparsity constraints
CN109543727B (zh) 一种基于竞争重构学习的半监督异常检测方法
US20150039538A1 (en) Method for processing a large-scale data set, and associated apparatus
JP2011008631A (ja) 画像変換方法及び装置並びにパターン識別方法及び装置
Peng et al. Integrate and conquer: Double-sided two-dimensional k-means via integrating of projection and manifold construction
Wu et al. Heterogeneous feature selection by group lasso with logistic regression
Bahrami et al. Joint auto-weighted graph fusion and scalable semi-supervised learning
WO2019219198A1 (fr) Dispositif et procédé pour regrouper des données d'entrée
Mautz et al. Non-redundant subspace clusterings with nr-kmeans and nr-dipmeans
Shi et al. Personalized pca: Decoupling shared and unique features
Hoang et al. Simultaneous compression and quantization: A joint approach for efficient unsupervised hashing
Liu et al. Class specific centralized dictionary learning for face recognition
Tran et al. Improving the face recognition accuracy under varying illumination conditions for local binary patterns and local ternary patterns based on weber-face and singular value decomposition
Wei et al. Self-regularized fixed-rank representation for subspace segmentation
Bhattacharya et al. A Generic Active Learning Framework for Class Imbalance Applications.
Asteris et al. The sparse principal component of a constant-rank matrix
Du et al. Consensus graph weighting via trace ratio criterion for multi-view unsupervised feature selection
Islam et al. Class aware auto encoders for better feature extraction
Gunasundari et al. Ensemble classifier with hybrid feature transformation for high dimensional data in healthcare

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18726429

Country of ref document: EP

Kind code of ref document: A1