CN112154453A - Apparatus and method for clustering input data - Google Patents

Apparatus and method for clustering input data

Info

Publication number
CN112154453A
Authority
CN
China
Prior art keywords
data
clustering
cluster
input data
low
Prior art date
Legal status
Pending
Application number
CN201880093500.0A
Other languages
Chinese (zh)
Inventor
Elad Tzoreff
Olga Kogan
Yoni Choukroun
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN112154453A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning


Abstract

The invention provides a device (100) for clustering input data (101). The input data is a data set comprising data points. The apparatus (100) comprises: an automatic encoding unit (102) for, in a first operating phase of the device (100), reducing the dimensionality of the input data (101) and/or extracting features related to clustering from the input data (101) so as to generate low-dimensional data (103); and a clustering unit (104) for, in a second operating phase of the apparatus (100), obtaining at least one cluster (105) from the low-dimensional data (103) and associating each data point in the low-dimensional data (103) with one of the at least one cluster (105), wherein the automatic encoding unit (102) optimizes the low-dimensional data (103) for lossless reconstruction of the low-dimensional data (103).

Description

Apparatus and method for clustering input data
Technical Field
The present invention relates to the field of machine learning and clustering, i.e., the process of finding similar structures in large datasets. More particularly, the present invention relates to an apparatus for clustering input data and a corresponding method, wherein the apparatus comprises an automatic encoding unit and a clustering unit.
Background
Clustering is currently one of the most basic unsupervised machine learning problems. Its main goal is to divide a data set of input data into clusters comprising similar data points. For example, clustering may be used to cluster users according to user behavior, such as clustering of events for network security purposes, clustering for IT operations, clustering for healthcare or industrial monitoring applications, and anomaly detection. Beyond these applications, clustering is also beneficial for a variety of other basic tasks. For example, clustering can be used for automatic data labeling for supervised learning, and as a preprocessing step for data visualization and analysis. In the prior art, dimensionality reduction and feature extraction are used together with clustering to map the input data into a feature space in which, given the context of the problem at hand, separation into clusters is easier to achieve. The non-linear mapping can be learned using deep neural networks (DNNs), so that the input data can be converted into a representation that is easier to cluster.
In the prior art, dimensionality reduction/feature selection and clustering are processed separately in a two-stage process, as shown in FIG. 7. First, the dimensionality of the input data is reduced and informative features are extracted by the auto-encoder. Second, these features are clustered. However, there is an inherent conflict between the auto-encoder component and the clustering component: the auto-encoder selects features across all variations of the input data so as to optimize for lossless reconstruction, while clustering requires features that reduce all data variations to a single template (i.e., a single class or a single cluster).
In many cases, the auto-encoder output (obtained in the first stage) loses features that are important for clustering (performed in the second stage). Once this information is lost, the accuracy of the overall clustering deteriorates. For example, as described with reference to fig. 8, when running an auto-encoder on the Modified National Institute of Standards and Technology (MNIST) database (i.e., dataset), important features are lost that are critical, for example, for distinguishing "9" from "4" (as shown in fig. 8A). The MNIST database is a large database of handwritten digits that is commonly used to train various image processing systems. Consequently, the centroid reconstructions obtained when running clustering on data generated by a conventional auto-encoder (see fig. 8B) show two centroids for "9" and no centroid for "4". A t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization also confirms this error.
That is, there is a need for a clustering scheme with higher accuracy than the prior art.
Disclosure of Invention
In view of the above problems and disadvantages, the present invention is directed to improving conventional clustering apparatuses. The invention aims to provide a device for clustering input data. The apparatus includes an automatic encoding unit and a clustering unit. The automatic encoding unit employs an automatic encoding algorithm. The algorithm optimizes the data even before the data is processed in the clustering step of the clustering unit, achieving high separability of the data. To this end, the automatic encoding unit receives input data (which may be considered a data set comprising data points) and provides optimized output data, i.e. low-dimensional output data, to the clustering unit. The auto-encoder specifically aims at subspace dimension maximization regularized by the reconstruction loss. In other words, the dimensionality of the input data is reduced subject to the reconstruction loss of the input data. The dimensionality is reduced only to the extent that the reconstruction loss of the reduced data is minimized, and the low-dimensional data forwarded by the automatic encoding unit to the clustering unit is optimized to achieve high clustering accuracy.
In a clustering step performed in the subsequent clustering unit, at least one cluster is obtained from the low-dimensional data using the encoder output, and each data point in the low-dimensional data is associated with one of the at least one cluster. In particular, in the second phase, a two-step alternating maximization of the clustering parameters and the encoder parameters (i.e., the operating parameters of the clustering unit and of the automatic encoding unit) yields a higher accuracy than the results obtained by prior art solutions. In a first optional step, separability maximization of the data points is achieved by subspace dimension maximization, i.e. by further maximizing the dimensions of the low-dimensional data output by the automatic encoding unit.
This step is specifically performed by the clustering unit in a clustering phase. In a second optional step, a similarity maximization regularized by inter-cluster similarity minimization and intra-cluster similarity maximization is achieved. In other words, the similarity of data points associated with different clusters is minimized, and the similarity of data points associated with the same cluster is maximized.
The object of the invention is achieved by the solution presented in the attached independent claims. Advantageous implementations of the invention are further defined in the dependent claims.
A first aspect of the invention provides an apparatus for clustering input data. The input data is a data set comprising data points. The apparatus comprises: an automatic encoding unit for reducing the dimensionality of the input data and/or extracting features related to clustering from the input data, in a first operating phase of the device, thereby producing low-dimensional data; a clustering unit for obtaining at least one cluster from the low-dimensional data and associating each data point in the low-dimensional data with one of the at least one cluster in a second operational phase of the apparatus, wherein the automatic encoding unit optimizes the low-dimensional data for lossless reconstruction of the low-dimensional data.
This is beneficial because high-accuracy unsupervised deep learning can be used for different types of data (not necessarily visual data) to address clustering and anomaly detection problems, such as user behavior analysis for network security, event correlation for IT operations and maintenance, and anomaly detection for healthcare or industrial monitoring applications. Furthermore, fast convergence of the auto-encoder algorithm (within a few epochs) can be achieved. In addition, the auto-encoder code (i.e., the latent representation) is shorter than in the prior art, and therefore less memory is required for inference.
In an implementation form of the first aspect, the low-dimensional data comprises linearly independent code lines, thereby minimizing reconstruction losses.
The linearly independent lines of code in the low-dimensional data output by the auto-encoder ensure that reconstruction losses incurred in reconstructing the input data from the low-dimensional data can be minimized. This also helps to improve the clustering accuracy of the clustering unit.
In an implementation form of the first aspect, the dimensionality reduction of the input data comprises applying a first function to the input data, wherein the first function is used to minimize pairwise similarities of data points in the input data to provide the low-dimensional data.
Minimizing the pairwise similarity of data points during the automatic encoding step helps to improve the clustering accuracy of the clustering unit as well as the overall clustering accuracy.
In one implementation form of the first aspect, the first function applies a similarity measure to data points in the input data.
The first function uses a similarity measure to ensure that the pairwise similarity can be accurately controlled to be minimized according to the similarity measure.
In one implementation form of the first aspect, the similarity measure applied by the first function is cosine similarity.
The use of the cosine similarity measure by the first function ensures that an effective measure can be used to minimize pairwise similarity.
In an implementation form of the first aspect, the apparatus further comprises a decoder for decoding the low-dimensional data and comparing the low-dimensional data with the input data to measure reconstruction losses and to adjust operational parameters of the automatic encoding unit to minimize reconstruction losses.
Comparing the decoding result of the decoder with the input data may determine whether reconstruction loss is effectively minimized and adjust the operating parameters accordingly.
In an implementation form of the first aspect, the clustering unit is further configured to obtain a centroid parameter for each cluster.
Obtaining the centroid parameter for each cluster ensures that processing efficiency can be improved because only the centroid parameter needs to be evaluated during operation of the device, rather than evaluating the attributes of all data points associated with the cluster.
In an implementation form of the first aspect, the clustering unit is further configured to determine a cluster to which a data point is assigned according to a centroid parameter of the cluster.
This ensures that the clustering efficiency of the clustering unit can be further improved.
In an implementation form of the first aspect, the clustering unit is further configured to apply a second function to minimize pairwise similarity of data points and to improve separability of the data points.
Therefore, the clustering accuracy of the clustering unit can be further improved by minimizing the pairwise similarity of the data points in the clustering unit.
In an implementation form of the first aspect, the second function applies the same similarity measure as the first function, in particular a cosine similarity of the data points.
This ensures that the clustering accuracy of the clustering unit can be further improved by minimizing the pairwise similarity of the data points in the clustering unit.
In an implementation form of the first aspect, the clustering unit is further configured to minimize a similarity of data points associated with different clusters.
This ensures that the clustering accuracy of the clustering unit can be further improved.
In an implementation form of the first aspect, the clustering unit is further configured to maximize a similarity of data points associated with a same cluster.
This ensures that the clustering accuracy of the clustering unit can be further improved.
In an implementation form of the first aspect, in the second phase, the apparatus is further configured to optimize an operating parameter of the automatic encoding unit according to the operating parameter of the clustering unit, and optimize an operating parameter of the clustering unit according to the operating parameter of the automatic encoding unit.
This ensures that the clustering accuracy of the automatic encoding unit and the clustering unit can be further improved at the same time.
In an implementation form of the first aspect, the apparatus is further configured to optimize an operating parameter of the automatic encoding unit and an operating parameter of the clustering unit simultaneously.
This ensures that the clustering accuracy of the automatic encoding unit and the clustering unit can be further improved at the same time.
A second aspect of the invention provides a method for clustering input data. The input data is a data set comprising data points. The method comprises the following steps: in a first operational phase of the device, the automatic encoding unit reduces the dimensionality of the input data and/or the automatic encoding unit extracts features related to clustering from the input data, thereby producing low-dimensional data; in a second operating phase of the device, a clustering unit obtains at least one cluster from the low-dimensional data, and the clustering unit associates each data point in the low-dimensional data with one of the at least one cluster, wherein the automatic encoding unit optimizes the low-dimensional data for lossless reconstruction of the low-dimensional data.
In an implementation form of the second aspect, the low-dimensional data comprises linearly independent code lines, thereby minimizing reconstruction losses.
In one implementation form of the second aspect, the dimensionality reduction of the input data comprises applying a first function to the input data, wherein the first function is to minimize pairwise similarities of data points in the input data to provide the low-dimensional data.
In one implementation form of the second aspect, the first function applies a similarity measure to data points in the input data.
In one implementation form of the second aspect, the similarity measure applied by the first function is cosine similarity.
In one implementation form of the second aspect, the method further comprises: a decoder decodes the low-dimensional data and compares the low-dimensional data with the input data to measure reconstruction loss and adjust operating parameters of the automatic encoding unit to minimize reconstruction loss.
In one implementation form of the second aspect, the method further comprises: the clustering unit obtains a centroid parameter for each cluster.
In one implementation form of the second aspect, the method further comprises: the clustering unit determines a cluster to which a data point is assigned according to a centroid parameter of the cluster.
In one implementation form of the second aspect, the method further comprises: the clustering unit applies a second function to minimize pairwise similarity of data points and improve separability of the data points.
In one implementation form of the second aspect, the second function applies the same similarity measure as the first function, in particular a cosine similarity of the data points.
In one implementation form of the second aspect, the method further comprises: the clustering unit minimizes the similarity of data points associated with different clusters.
In one implementation form of the second aspect, the method further comprises: the clustering unit maximizes the similarity of data points associated with the same cluster.
In one implementation form of the second aspect, in the second stage, the method further comprises: optimizing the operating parameters of the automatic encoding unit according to the operating parameters of the clustering unit, and optimizing the operating parameters of the clustering unit according to the operating parameters of the automatic encoding unit.
In an implementation form of the second aspect, the method further comprises optimizing the operating parameters of the automatic encoding unit and the operating parameters of the clustering unit simultaneously.
The method of the second aspect and its implementation forms provides the same advantages as the apparatus of the first aspect and its respective implementation forms.
It should be noted that all devices, elements, units and components described in the present application may be implemented in software or hardware elements or any type of combination thereof. All steps performed by the various entities described in the present application and the functions described to be performed by the various entities are intended to indicate that the respective entities are adapted or arranged to perform the respective steps and functions.
Although in the following description of specific embodiments specific functions or steps performed by an external entity are not illustrated in the description of specific elements of that entity performing the specific steps or functions, it should be clear to a skilled person that these methods and functions may be implemented in respective hardware or software elements or any combination thereof.
Drawings
The above-described aspects of the invention and the manner of attaining them will become apparent from the following description of specific embodiments, taken in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram of a clustering device according to an embodiment of the present invention.
Fig. 2 is a detailed schematic diagram of a clustering device according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a method according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of an auto encoder.
Fig. 5 is a schematic diagram of an algorithm implemented by the present invention.
Fig. 6 is a schematic diagram of an operation result of the cluster device provided by the present invention compared with the prior art.
Fig. 7 is a schematic diagram of clustering provided by the prior art.
Fig. 8 is another schematic diagram of clustering provided by the prior art.
Detailed Description
Fig. 1 is a schematic diagram of a clustering apparatus 100 according to an embodiment of the present invention. The apparatus 100 is used for clustering input data 101. The input data 101 is a data set comprising data points. The apparatus 100 comprises an automatic encoding unit 102 and a clustering unit 104.
The automatic encoding unit 102 is arranged to reduce the dimensionality of the input data 101 and/or extract cluster-related features from the input data 101 in a first operational phase of the apparatus 100, thereby generating low-dimensional data 103. The low-dimensional data 103 specifically includes a data set comprising data points with reduced dimensionality.
In other words, the automatic encoding unit 102 may extract a cluster-oriented, discriminative and informative latent space (i.e., features) from the input data 101. That is, in the first operation stage, the input data 101 is mapped into a low-dimensional space, thereby providing a good initial condition for the processing in the clustering unit 104. In particular, maintaining linear independence between the lines of code in the low-dimensional data 103 helps to preserve features that are important for clustering in the clustering unit 104. Thus, the low-dimensional data 103 may optionally include linearly independent lines of code, thereby minimizing reconstruction loss. Furthermore, the first operation phase may also be referred to as a learning phase of the automatic encoding unit 102.
The device 100 further comprises a clustering unit 104. The clustering unit 104 is configured to, in a second operational phase of the device 100, obtain at least one cluster 105 from the low-dimensional data 103 and associate each data point in the low-dimensional data 103 with one of the at least one cluster 105. The automatic encoding unit 102 optimizes the low-dimensional data 103 processed in the clustering unit 104 to perform lossless reconstruction of the low-dimensional data 103. Thus, by optimizing the low-dimensional data 103 present in the automatic encoding unit 102, the present invention improves overall clustering accuracy by the apparatus 100.
Although three clusters 105 are shown in fig. 1, device 100 may acquire any number of clusters as long as there is at least one cluster 105.
In other words, clustering in the clustering unit 104 involves grouping data points (obtained from the low-dimensional data 103) with similar patterns, according to the latent-space representation obtained in the first stage by the automatic encoding unit 102, to obtain several centroids (which may also be referred to as centroid parameters) that characterize the data set, such that each data point is represented by the single centroid assigned to it. In other words, the clustering unit 104 is further configured to obtain a centroid parameter for each cluster 105.
In this way, each new data point can be compared to the centroid individually to decide which cluster to assign, rather than to the entire data set. That is, the clustering unit 104 is further configured to determine a cluster to which the data point is assigned according to the centroid parameter of the cluster. This improves the efficiency of classification of new data points.
The first and second operational phases may also be referred to as training phases of the apparatus 100. In particular, the first phase of operation may be referred to as a training phase of the automatic encoding unit 102, and the second phase may be referred to as a clustering phase.
The training phase of the automatic encoding unit 102 is performed because a good initial separation of the data points in the low-dimensional data 103 is needed in the clustering phase. In this training phase, a discriminant loss function is used to learn a latent space that includes discriminant attributes. The discriminant loss function minimizes the pairwise similarity between each pair of data points in the data set (constituted by the input data 101). In other words, the dimensionality reduction of the input data 101 includes applying a first function to the input data 101 that serves to minimize the pairwise similarity of data points in the input data 101, thereby providing the low-dimensional data 103. The first function may specifically be a discriminant loss function. The first function may apply a similarity metric to data points in the input data to accurately adjust the result of the first function. The similarity measure applied by the first function may be cosine similarity. Thus, in the first stage, the operating parameters of the automatic encoding unit 102 are optimized.
The clustering phase is then initialized using the latent space obtained by the automatic encoding unit 102 in the first phase. In particular, the centroid parameters are now optimized together with the auto-encoder parameters (i.e., the operating parameters of the automatic encoding unit 102). In the clustering phase, the latent space is refined to make the overall clustering task in the device 100 more accurate, while simultaneously optimizing the most representative centroid of each cluster.
Specifically, the metric selected for the goal of the clustering task is the sum of the cosine similarity between each data point and the centroid to which the data point is assigned.
In the second operation phase, the processing of the clustering unit 104 may specifically include the following two optional steps: in a first step, the clustering unit 104 controls the operating parameters of the automatic encoding unit 102 to maximize inter-cluster separation of data points. In a second step, inter-cluster separation and intra-cluster similarity are maximized by controlling the operating parameters of the clustering unit 104.
In other words, whether the first step or the second step is performed, it can be considered that the clustering unit 104 is further configured to apply a second function to minimize the pairwise similarity of the data points and to improve the separability of the data points. The second function may be applied to data points in the input data 101 and also to data points in the low-dimensional data 103. In particular, when the second function involves the first step described above, the second function may apply the same similarity measure as the first function, in particular the cosine similarity of the data points.
The first and second steps described above may optionally include at least one of the following details: in a first step, the centroid and cluster assignments obtained during the optimization in the automatic encoding unit 102 are not yet trusted. Thus, the clustering tasks continue to be regularized using pairwise discriminant loss functions in the auto-encoder training phase. That is, there is no distinction between pairs of data points that are considered similar (i.e., assigned to the same cluster) and dissimilar (i.e., assigned to different clusters).
In a second step, the assignment of data points to centroids is considered sufficiently reliable and is used to split the discriminant loss function (i.e., the second function) into two objectives: minimizing the pairwise similarity of "inter-cluster" data points (i.e., data points associated with different clusters), and maximizing the pairwise similarity of "intra-cluster" data points (i.e., data points associated with the same cluster).
That is, the clustering unit 104 may optionally also be used to minimize the similarity of data points associated with different clusters 105. The clustering unit 104 may optionally also be used to maximize the similarity of data points associated with the same cluster 105.
Optionally, the apparatus 100 may also be used to perform a third operational phase. The third operational phase may be referred to as an inference phase. In the inference phase, each new data point arriving at the device 100 passes through the automatic encoding unit 102 in order to extract a latent representation of the new data point. Then, the cosine similarity between the latent representation of the new data point and each cluster centroid is calculated. The new data point is then assigned to the cluster with the highest cosine similarity. A data point is considered anomalous if its cosine similarity to all clusters is low (relative to typical similarity values obtained from the training set).
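By way of illustration only, the inference phase described above can be sketched as follows; the encoder function, the centroid array and the anomaly threshold are assumed placeholders rather than part of the claimed apparatus.

```python
import numpy as np

def assign_or_flag(x_new, encode, centroids, anomaly_threshold=0.2):
    """Assign a new data point to the most similar cluster or flag it as anomalous.

    encode            -- assumed trained encoder mapping a raw sample to its latent vector
    centroids         -- array of shape (K, d) holding the learned cluster centroids
    anomaly_threshold -- illustrative cosine-similarity cutoff derived from the training set
    """
    z = encode(x_new)
    z = z / np.linalg.norm(z)                                  # normalized latent representation
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ z                                               # cosine similarity to every centroid
    best = int(np.argmax(sims))
    if sims[best] < anomaly_threshold:                         # low similarity to all clusters
        return None, sims                                      # -> treat as anomaly
    return best, sims
```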
Fig. 2 is a detailed schematic diagram of the clustering apparatus 100 according to an embodiment of the present invention. The device 100 in fig. 2 is based on the device 100 in fig. 1 and therefore comprises all its functions and features. For this reason, like features are labeled with like reference numerals. All of the features described below with reference to fig. 2 are optional features of the device 100.
As shown in fig. 2, the apparatus 100 may optionally further comprise a decoder 201 for decoding the low-dimensional data 103 and comparing the low-dimensional data 103 with the input data 101 to measure reconstruction loss and adjust operational parameters of the automatic encoding unit 102 to minimize the reconstruction loss.
The double-headed dashed arrow shown in fig. 2 connects the automatic encoding unit 102 and the clustering unit 104. To further improve the clustering accuracy, the apparatus 100 is optionally able to learn the clustering parameters and the auto-encoder parameters (i.e., the operating parameters of the clustering unit 104 and the auto-encoding unit 102) simultaneously. That is, in particular in the second phase, the apparatus 100 is optionally further configured to optimize the operating parameters of the automatic encoding unit 102 in dependence on the operating parameters of the clustering unit 104, and to optimize the operating parameters of the clustering unit 104 in dependence on the operating parameters of the automatic encoding unit 102. Further, optionally, the apparatus 100 may be used to optimize the operating parameters of the automatic encoding unit 102 and the operating parameters of the clustering unit 104 simultaneously.
The following sections describe in further detail the functionality of the device 100 shown in fig. 1 or 2. In particular, implementation aspects of the processing performed in the first and second operational phases of the apparatus 100 are described.
In the first operation phase (i.e., the training phase of the automatic encoding unit 102), the following loss function is minimized over the parameters θ_e, θ_d of the automatic encoding unit 102:

L(θ_e, θ_d) = Σ_i d(x_i, x̂_i(θ_e, θ_d)) + λ Σ_{i,j} |sim(z_i(θ_e), z_j(θ_e))|

In the loss function, θ_e and θ_d are the operating parameters of the encoder and the decoder, respectively. x_i denotes the original data representation, z_i(θ_e) denotes the latent-space representation (i.e., the output of the automatic encoding unit 102), and x̂_i(θ_e, θ_d) denotes the reconstructed data point. sim(z_i, z_j) denotes the cosine similarity between a pair of latent-space representations of the data points. d(x_i, x̂_i) is the distance between a data point and its reconstruction (e.g., the L2 or L1 distance, or any other distance metric). λ is the regularization strength.

The first term of the loss function L(θ_e, θ_d) represents the reconstruction loss, and the second term (λ Σ_{i,j} |sim(z_i, z_j)|) represents maximum separability (obtained by minimizing the pairwise cosine similarity between the latent representations of all data points).
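For illustration, the first-phase loss described above can be sketched as follows, assuming a PyTorch-style encoder and decoder; the L2 reconstruction distance, the regularization strength and the averaging over pairs are illustrative choices, not the exact formulation of the invention.

```python
import torch
import torch.nn.functional as F

def first_phase_loss(x, encoder, decoder, lam=0.1):
    """Sketch of the first-phase loss: reconstruction distance plus lambda times the
    L1 norm of all pairwise cosine similarities of the latent batch (separability)."""
    z = encoder(x)                                    # latent batch, shape (B, d)
    x_hat = decoder(z)                                # reconstructed batch
    recon = F.mse_loss(x_hat, x)                      # d(x_i, x_hat_i), here an L2 distance
    z_n = F.normalize(z, dim=1)                       # row-wise normalization
    cos = z_n @ z_n.t()                               # pairwise cosine similarity matrix
    separability = cos.abs().mean()                   # L1 term (averaged over all pairs)
    return recon + lam * separability
```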
By minimizing the L1 norm of the cosine similarities between all pairs of samples in the batch matrix (i.e., the matrix comprising the input data 101 of the device 100), the technical scheme of the invention maximizes the dimension of the encoder subspace. This helps to achieve linear independence between the matrix rows in the low-dimensional data 103, thus maximizing the rank of the matrix.
Since the training phase of the automatic encoding unit 102 is unsupervised, intra-cluster pairs for which the cosine similarity is to be maximized cannot be distinguished from inter-cluster pairs for which the cosine similarity is to be minimized.
However, assuming that there are no dominant clusters in the data set, and considering that the batch size is larger than the number of clusters, the number of intra-cluster pairs in each batch matrix is significantly smaller than the number of inter-cluster pairs. For example, consider the MNIST dataset (comprising 10 clusters) and a batch size of 1000 >> 10: there are then approximately 10 · (100 choose 2) ≈ 5 · 10^4 intra-cluster pairs, but approximately 4.5 · 10^5 inter-cluster pairs.
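The pair counts used in the example above can be checked with a few lines of Python; the batch size and the number of clusters are simply the illustrative values from the MNIST example.

```python
from math import comb

batch, clusters = 1000, 10                 # MNIST-like setting: 10 balanced clusters
per_cluster = batch // clusters
intra = clusters * comb(per_cluster, 2)    # pairs drawn from the same cluster
inter = comb(batch, 2) - intra             # pairs drawn from different clusters
print(intra, inter)                        # 49500 vs. 450000: inter-cluster pairs dominate
```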
In addition, since a sparse cosine similarity vector is the expected result (i.e., all inter-cluster cosine similarities are expected to be 0 and all intra-cluster cosine similarities are expected to be 1), the L1 norm of the cosine similarity vector is minimized. The L1 norm favors vector sparsity while shrinking the larger entries of the vector less (in contrast to, e.g., the L2 norm).
With this method, an accuracy of more than 85% can already be reached on MNIST in the first operation phase, which is superior to prior art results.
As described above, the clustering phase is a two-step process. In the first step, the following loss function is maximized over the auto-encoder parameters θ_e, θ_d, the centroids {μ_k}_{k=1..K} and the assignment matrix S ∈ R^{B×K}:

L_1(θ_e, θ_d, {μ_k}, S) = Σ_i Σ_k s_ik · sim(μ_k, z_i) − λ_rec Σ_i d(x_i, x̂_i) − λ_sep Σ_{i,j} |sim(z_i, z_j)|

The loss function includes three terms (left to right): the cosine similarity, to be maximized, between each centroid and its assigned data points; the reconstruction loss; and the separability term (obtained by minimizing the pairwise cosine similarity between the latent representations of all data points, analogous to the term used in the training phase of the automatic encoding unit 102).

In the second step of the clustering phase, the centroids and the assignments are trusted, and the separability term is split into two terms: the intra-cluster similarity, to be maximized, and the inter-cluster similarity, to be minimized. That is, the loss function of the second step of the second stage becomes:

L_2(θ_e, θ_d, {μ_k}, S) = Σ_i Σ_k s_ik · sim(μ_k, z_i) − λ_rec Σ_i d(x_i, x̂_i) + λ_w · L_w − λ_b · L_b

The first and second (leftmost) terms in the loss function L_2 are the same as in L_1. The third term, L_w, is the intra-cluster similarity. The fourth term, L_b, is the inter-cluster similarity, taken as the infinity norm over all average pairwise similarities between data points assigned to disjoint clusters; minimizing it amounts to reducing the worst case, in which two clusters have the highest similarity among all clusters. λ_rec, λ_w and λ_b denote the regularization strengths of the different terms in the loss function.
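A minimal sketch of the first-step clustering objective is given below, assuming a PyTorch-style encoder/decoder, a one-hot assignment matrix and illustrative regularization weights; it is an illustration of the structure of L_1, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def clustering_step1_loss(x, encoder, decoder, centroids, assign,
                          lam_rec=0.5, lam_sep=1.0):
    """Sketch of the first clustering-step objective (to be maximized): similarity of
    each latent point to its assigned centroid, minus reconstruction loss and the
    pairwise-separability regularizer. `assign` is a one-hot (B, K) matrix."""
    z_raw = encoder(x)                                   # (B, d) latent batch
    z = F.normalize(z_raw, dim=1)
    mu = F.normalize(centroids, dim=1)                   # (K, d) normalized centroids
    sim_to_centroid = (assign * (z @ mu.t())).sum(dim=1) # similarity to the assigned centroid
    recon = F.mse_loss(decoder(z_raw), x)
    cos = z @ z.t()
    separability = cos.abs().mean()
    return sim_to_centroid.mean() - lam_rec * recon - lam_sep * separability
```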
Fig. 3 is a schematic diagram of a method 300 according to an embodiment of the invention. The method 300 is used to operate the device 100 and thus enables clustering of the input data 101.
The method 300 comprises a first step: in a first operational phase of the device 100, the automatic encoding unit 102 in the device 100 reduces 301 the dimensionality of the input data 101 and/or the automatic encoding unit 102 extracts 301 cluster-related features from the input data 101, thereby generating the low-dimensional data 103.
The method 300 further comprises a second step of: in a second operational phase of the apparatus 100, the clustering unit 104 in the apparatus 100 obtains at least one cluster 105 from the low-dimensional data 103, and the clustering unit 104 associates each data point in the low-dimensional data 103 with one of the at least one cluster 105, wherein the automatic encoding unit 102 optimizes the low-dimensional data 103 for lossless reconstruction of the low-dimensional data 103. The following sections describe the invention in more detail with particular reference to fig. 4 and 5. The details provided below may be considered as optional features, which may be combined arbitrarily with the features of the device 100 described in fig. 1 or fig. 2.
The present invention teaches a paradigm in which the latent-space representation of a data set is trained together with the clustering parameters. This end-to-end approach yields a cluster-oriented representation and thus improves clustering accuracy. In most prior art solutions, a conventional automatic encoding unit is first trained to minimize the reconstruction loss, and its output is then used as the initial condition of a joint optimization problem in which the clustering parameters and the encoder parameters are optimized together to minimize the clustering error. However, although the main focus is on the clustering phase, it can be observed that in most cases the accuracy improves by at most 15% to 20% compared to the initial phase. It follows that the initial phase has a significant impact on the overall accuracy, and therefore this step deserves particular attention.
How to obtain a discriminative automatic encoder in the first operational phase is described below.
Let D denote a data set grouped into K clusters {C_k}_{k=1}^{K}, and let x_i ∈ R^p denote a data point in the data set with feature dimension p. Let z_i = f(x_i; θ_e) ∈ R^d denote the latent representation of x_i, where the encoder parameters are denoted by θ_e and p > d. Let x̂_i = g(z_i; θ_d) denote the reconstructed data point, i.e., the output of the decoder, where the decoder parameters are denoted by θ_d.
According to the present invention, a family of pairwise discriminant functions L_d(z_i, z_j): R^{d×d} → R is provided, where L_d(z_i, z_j) = sim(z_i, z_j) and sim(·,·) denotes any similarity measure between a pair of data points. It should be noted that the desired properties of the similarity measure are sim(z_i, z_i) = 1 and 0 ≤ sim(z_i, z_j) ≤ 1. Applying the family of functions to the entire data set yields:

L_d = Σ_{i,j} w_ij · sim(z_i, z_j)    (1)

where w_ij is correlated with the similarity between the raw data points. When this is not known in advance, w_ij = |D|^{-2} is set, where |D| denotes the cardinality of the data set. It should be noted that the objective function in equation (1) is not optimal, because equation (1) penalizes all similarities regardless of whether the data points belong to the same cluster. Obviously, if an assignment were available for each data point, equation (1) would split into two terms, one minimizing the corresponding inter-cluster similarity and the other maximizing the corresponding intra-cluster similarity, thereby obtaining the following objective function:

L_d = (1/N_b) Σ_{(i,j) inter-cluster} sim(z_i, z_j) − (1/N_w) Σ_{(i,j) intra-cluster} sim(z_i, z_j)    (2)

where N_b and N_w denote the number of inter-cluster pairs and the number of intra-cluster pairs, respectively. However, the rationality of equation (1) stems from the following observation: considering a balanced data set that is not dominated by a single or a few clusters, the number of pairs in D is |D|(|D| − 1)/2, and the cardinalities of the intra-cluster pairs and the inter-cluster pairs are approximately |D|²/(2K) and |D|²(K − 1)/(2K), respectively. Therefore, the number of dissimilar pairs is larger than the number of similar pairs. Moreover, N_b increases with increasing |D| and K, so that for |D| >> K the inter-cluster pairs account for approximately all data pairs, |D|²/2.
In view of equation (2), a k-nearest-neighbor graph is generated between the data points according to their original representation. Then, the pairs of data points with the largest similarity in the k-nearest-neighbor graph are used in equation (1) as anchor pairs whose similarity is to be maximized. Defining A as the set of anchor pairs yields:

L_d = Σ_{(i,j)∉A} w_ij · sim(z_i, z_j) − α Σ_{(i,j)∈A} sim(z_i, z_j)    (3)

where α < 1 is used to compensate for the untrustworthiness of the similarity between anchor points. Let z̃_i = z_i / ||z_i|| be the normalized latent-space representation, and use the similarity measure sim(z_i, z_j) = |z̃_i^T · z̃_j|, where |·| denotes the absolute value, so that:

L_d = Σ_{i,j} |z̃_i^T · z̃_j|    (4)

It should be noted that equation (4) is the L1 norm of all pairwise cosine similarities. Because of the expected sparsity of the similarities (only approximately |D|²/(2K) non-zero elements) and because the L1 norm supports the expected sparsity, the L1 norm is used rather than, for example, the L2 norm.
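A possible construction of the anchor set from a k-nearest-neighbor graph on the raw representations is sketched below; the choice of k and the exact selection rule are assumptions made for illustration.

```python
import numpy as np

def anchor_pairs(X, k=5):
    """Build an anchor set A from a k-nearest-neighbor graph over the raw
    representations, keeping for every point its k most cosine-similar neighbors."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                              # pairwise cosine similarity on raw data
    np.fill_diagonal(sim, -np.inf)               # exclude self-pairs
    anchors = set()
    for i in range(len(X)):
        for j in np.argsort(sim[i])[-k:]:        # the k most similar neighbors of point i
            anchors.add((min(i, int(j)), max(i, int(j))))
    return anchors
```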
Since data sets are generally not stored in main memory and training is performed on batches B by stochastic gradient descent (SGD), equation (4) is approximated using the batch matrix of latent representations Z ∈ R^{|B|×d}, where |B| denotes the cardinality of the batch. Let Z̃ denote the row-wise normalized batch matrix (row i being the row vector z̃_i^T), and let C = Z̃ · Z̃^T denote the pairwise cosine similarity matrix, i.e., C_ij = z̃_i^T · z̃_j. In addition, a k-nearest-neighbor graph is constructed for each batch and an anchor set A_B is determined, which yields the following approximation of equation (4):

L_d ≈ Σ_{i,j} |C_ij| − α Σ_{(i,j)∈A_B} C_ij    (5)

It should be noted that the diagonal entries of C are constantly 1 and therefore do not affect the optimization. To avoid any discrimination of the data points that is not consistent with the actual separation, it is proposed to regularize equation (5) using the reconstruction loss, resulting in:

L(θ_e, θ_d) = L_d + λ · ||X − X̂||_F²    (6)

where λ represents the regularization strength, ||X − X̂||_F² represents the reconstruction loss, X represents the original input batch matrix, X̂ represents the reconstructed batch matrix, and ||·||_F denotes the Frobenius norm.
The following describes how the discriminability is maintained during the clustering phase (i.e., the second operational phase).
After obtaining the discriminative latent space in the first operational phase of the apparatus 100, it is used as the initialization of the clustering phase (i.e., the second operational phase of the apparatus 100). In this step, the encoder-decoder parameters θ_e, θ_d and the centroids {μ_k}_{k=1}^{K} are optimized together; they constitute the optimization variables of the clustering objective. A natural candidate for the objective function of the clustering stage is the cosine similarity between the learned centroids and each data point, since cosine similarity is used in the initial stage to discriminate between paired data points. Thus, the main goal of the clustering phase is to maximize the following objective function:

L_c(θ_e, θ_d, {μ_k}, S) = Σ_i Σ_k s_ik · sim(μ_k, z_i)    (7)

where S denotes the assignment matrix and s_ik ∈ {0, 1} is the hard assignment decision of the clustering procedure. The clustering stage is divided into two steps. In the first step, the cluster assignments are not yet trusted, and regularization continues using equation (5) and the reconstruction loss. In the second step, the cluster assignments are trusted, and the discriminative objective is split into an inter-cluster term and an intra-cluster term, the assignment of each data point to its associated cluster being taken from the assignment matrix S. Thus, the optimization problem solved in the first step is given by:

max L_c − λ_d · L_d − λ_r · ||X − X̂||_F²    (8)

where λ_d and λ_r denote the regularization strengths of the discriminant loss and of the reconstruction loss, respectively. In the second step, a similarity measure between each pair of clusters is defined as the average pairwise similarity between the data points assigned to them, and the inter-cluster term is taken as the worst case over all pairs of disjoint clusters:

L_b = max_{k ≠ l} (1 / (|C_k| · |C_l|)) Σ_{i ∈ C_k} Σ_{j ∈ C_l} |z̃_i^T · z̃_j|    (9)

It should be noted that equation (9) penalizes the worst case, i.e., the pair of clusters with the largest similarity. Likewise, the corresponding intra-cluster term is defined over the data points within each cluster:

L_w = Σ_k (1 / |C_k|²) Σ_{i,j ∈ C_k} z̃_i^T · z̃_j    (10)

Note that in equation (10) the absolute value is omitted, since for the cosine similarity between intra-cluster pairs the value 1 is preferred over −1. The optimization problem in the second step then becomes:

max L_c − λ_b · L_b + λ_w · L_w − λ_r · ||X − X̂||_F²    (11)

where λ_b, λ_w and λ_r denote the regularization strengths of the inter-cluster term, the intra-cluster term and the reconstruction loss, respectively.
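For illustration, the intra-cluster term of equation (10) and the worst-case inter-cluster term of equation (9) could be computed on a batch as sketched below; the per-cluster normalization and the handling of small clusters are assumptions.

```python
import torch
import torch.nn.functional as F

def intra_inter_terms(z, labels, num_clusters):
    """Sketch of the step-two terms: average intra-cluster similarity (to be maximized)
    and worst-case inter-cluster similarity (infinity norm over cluster pairs,
    to be minimized). `labels` holds the hard assignment of every latent point."""
    zn = F.normalize(z, dim=1)
    cos = zn @ zn.t()                                        # pairwise cosine similarities
    intra, inter = [], []
    for a in range(num_clusters):
        ia = (labels == a).nonzero(as_tuple=True)[0]
        if len(ia) > 1:
            intra.append(cos[ia][:, ia].mean())              # average within cluster a
        for b in range(a + 1, num_clusters):
            ib = (labels == b).nonzero(as_tuple=True)[0]
            if len(ia) > 0 and len(ib) > 0:
                inter.append(cos[ia][:, ib].abs().mean())    # average across clusters a, b
    l_w = torch.stack(intra).mean()                          # intra-cluster term, equation (10)
    l_b = torch.stack(inter).max()                           # inter-cluster term, equation (9)
    return l_w, l_b
```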
The system architecture of the device 100 is described below in one specific example.
The device 100 comprises two parts: an auto-encoder (the automatic encoding unit 102) and a clusterer (the clustering unit 104). The auto-encoder network may be a fully convolutional neural network comprising alternating convolutional layers with batch normalization and max-pooling layers. The decoder applies upsampling to recover the higher resolution, using nearest-neighbor interpolation for resizing followed by convolutional layers that likewise use batch normalization. Fig. 4 shows the auto-encoder architecture.
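A minimal sketch of an auto-encoder of the kind described above (convolution, batch normalization and max pooling in the encoder; nearest-neighbor upsampling and convolution in the decoder) is given below; the channel sizes and the single-channel 28×28 input are illustrative assumptions, not the architecture of FIG. 4.

```python
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Minimal convolutional auto-encoder in the spirit described above."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                                  # e.g. 28x28 -> 14x14
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 14x14 -> 7x7
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent representation used for clustering
        return self.decoder(z), z
```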
The training strategy for an autoencoder is described below.
Training the auto-encoder in the initial phase (i.e., the first operational phase) starts by minimizing equation (6). The regularization strength of the discriminant loss is a hyper-parameter chosen in the range λ ∈ (0, 1]. The value of λ differs between data sets: more complex data sets require stronger discrimination, while the strength of the reconstruction loss is kept constant.
In the clustering phase, the clustering variables {μ_k}_{k=1}^{K}, S and the auto-encoder parameters θ_e, θ_d are optimized together. An alternating maximization scheme is used, in which each set of variables is optimized while the other sets remain fixed. The optimization process begins by initializing the centroids {μ_k} over the entire data set D by optimizing equation (7). Next, the assignment matrix S is maximized, then the centroids {μ_k}, and finally the auto-encoder parameters. The optimization process iterates until convergence. The clustering phase is divided into two stages (i.e., the two steps of the second operational phase), the difference being the objective function that is maximized.

The pseudo-code of the first stage (and of the second stage) is outlined in the algorithm shown in FIG. 5B. In the first stage, equation (8) is optimized, with a relatively large regularization strength for the discriminant loss, λ_d ∈ [1, 5], and a smaller regularization strength for the reconstruction loss, λ_r ∈ (0, 1]. The while loop in the algorithm corresponds to the alternating maximization scheme, in which each parameter set is maximized over several epochs, optimized by back-propagation for each Z, X ∈ B. The inner loop terminates when the maximum number of iterations N_i is exceeded, or when the clustering objective L_c shows no improvement beyond a predefined tolerance tol over successive iterations. It should be noted that λ_d, λ_r, λ_min and β are hyper-parameters related to the data set.

The second stage is initialized with the parameters θ_e, θ_d, {μ_k} and S obtained in the first stage. Equation (11) is then optimized, where, similarly to the previous stage, the discriminative regularization strengths are set to relatively high values, λ_b, λ_w ∈ [1, 5], while the regularization strength of the reconstruction loss remains unchanged. The process iterates until convergence. Each set of variables is maximized by batch back-propagation over several epochs. In both stages, as in the auto-encoder training phase, large batches are used over several epochs. The overall flow of the clustering step is similar to the pseudo-code of the algorithm shown in FIG. 5B, but now uses equation (11) and its associated hyper-parameters.
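The alternating maximization scheme can be sketched as follows for a data set held in memory; the objective callable, the optimizer choice and all hyper-parameter values are illustrative assumptions rather than the exact procedure of FIG. 5B.

```python
import torch
import torch.nn.functional as F

def alternating_maximization(x, encoder, decoder, num_clusters, objective,
                             outer_iters=20, inner_epochs=2, lr=1e-3, tol=1e-4):
    """Alternate between hard assignments, centroid updates and auto-encoder updates
    until the clustering objective stops improving. `objective(x, encoder, decoder,
    centroids, assign)` is an assumed callable returning the value to maximize."""
    with torch.no_grad():
        z = F.normalize(encoder(x), dim=1)
    init = torch.randperm(len(z))[:num_clusters]
    centroids = z[init].clone().requires_grad_(True)            # centroid initialization
    opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    opt_mu = torch.optim.Adam([centroids], lr=lr)
    prev = float("-inf")
    for _ in range(outer_iters):
        with torch.no_grad():                                   # step 1: hard assignments S
            z = F.normalize(encoder(x), dim=1)
            sims = z @ F.normalize(centroids, dim=1).t()
            assign = F.one_hot(sims.argmax(dim=1), num_clusters).float()
        for _ in range(inner_epochs):                           # step 2: centroids
            opt_mu.zero_grad()
            (-objective(x, encoder, decoder, centroids, assign)).backward()
            opt_mu.step()
        for _ in range(inner_epochs):                           # step 3: auto-encoder parameters
            opt_ae.zero_grad()
            (-objective(x, encoder, decoder, centroids, assign)).backward()
            opt_ae.step()
        with torch.no_grad():
            cur = objective(x, encoder, decoder, centroids, assign).item()
        if cur - prev < tol:                                    # no further improvement
            break
        prev = cur
    return centroids.detach(), assign
```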
Fig. 6 is a schematic diagram comparing the operation results of the clustering apparatus 100 provided by the present invention with the prior art. Assuming that the output of the auto-encoder stage can be evaluated and a singular-value histogram of the batch output can be computed, the performance of device 100 when running on the MNIST dataset can be displayed.
The scheme provided by the invention shows the easily identifiable signature of the singular value decomposition values shown in fig. 6B and 6C. The result of the present invention can be detected and recognized regardless of the layer size of the automatic encoding unit 102. In fig. 6, "corr" is a hyper-parameter used by the auto-encoding algorithm to define the required strength of the linear independence of the data (the larger the value, the stronger the independence requirement). In fig. 6A, all values are spread over multiple singular values, from which it can be seen that the device provided by the present invention is not used. In fig. 6B and 6C, most of the values coincide and a distinct non-zero peak can be seen, from which it is readily apparent that the solution provided by the present invention is used.
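As an illustration of the evaluation just described, the singular-value histogram of a batch of encoder outputs could be computed as follows; the random matrix merely stands in for real latent codes.

```python
import numpy as np
import matplotlib.pyplot as plt

Z = np.random.randn(1000, 32)              # stand-in for a batch of encoder outputs
s = np.linalg.svd(Z, compute_uv=False)     # singular values of the batch matrix
plt.hist(s, bins=30)                       # with the proposed scheme the values
plt.xlabel("singular value")               # concentrate around a distinct non-zero peak
plt.ylabel("count")
plt.show()
```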
The invention has been described in connection with various embodiments and implementations as examples. However, other variations can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps, and the terms "a" or "an" do not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (15)

1. An apparatus (100) for clustering input data (101), the input data (101) being a data set comprising data points, the apparatus (100) comprising:
-an automatic encoding unit (102) for, in a first operational phase of the device (100), reducing the dimensionality of the input data (101) and/or extracting cluster-related features from the input data (101) resulting in low-dimensional data (103);
a clustering unit (104) for obtaining at least one cluster (105) from the low dimensional data (103) and associating each data point in the low dimensional data (103) with one cluster of the at least one cluster (105) in a second operational phase of the device (100),
wherein the automatic encoding unit (102) optimizes the low dimensional data (103) for lossless reconstruction of the low dimensional data (103).
2. The apparatus (100) of claim 1, wherein the low-dimensional data (103) comprises linearly independent lines of code, thereby minimizing reconstruction loss.
3. The apparatus (100) of any of the preceding claims, wherein the dimensionality reduction of the input data (101) comprises applying a first function to the input data (101), wherein the first function is to minimize the pairwise similarity of data points in the input data (101) to provide the low-dimensional data (103).
4. The apparatus (100) of claim 3, wherein the first function applies a similarity measure to data points in the input data (101).
5. The apparatus (100) of claim 4, wherein the similarity measure applied by the first function is cosine similarity.
6. The apparatus (100) according to any of the preceding claims, wherein the apparatus (100) further comprises a decoder (201) for decoding the low dimensional data (103) and comparing the low dimensional data (103) with the input data (101) to measure reconstruction losses and to adjust operational parameters of the automatic encoding unit (102) to minimize reconstruction losses.
7. The device (100) according to any of the preceding claims, wherein the clustering unit (104) is further configured to obtain a centroid parameter for each cluster (105).
8. The apparatus (100) of claim 7, wherein the clustering unit (104) is further configured to determine the cluster to which the data point is assigned based on a centroid parameter of the cluster.
9. The apparatus (100) of any of the preceding claims, wherein the clustering unit (104) is further configured to apply a second function to minimize pairwise similarity of data points and to improve separability of the data points.
10. The apparatus (100) according to claim 9, wherein the second function applies the same similarity measure as the first function, in particular a cosine similarity of the data points.
11. The device (100) according to any of the preceding claims, wherein the clustering unit (104) is further configured to minimize similarity of data points associated with different clusters (105).
12. The device (100) according to any of the preceding claims, wherein the clustering unit (104) is further configured to maximize similarity of data points associated with one and the same cluster (105).
13. The apparatus (100) according to any one of the preceding claims, wherein in the second phase, the apparatus (100) is further configured to: optimizing the operating parameters of the automatic encoding unit (102) in dependence on the operating parameters of the clustering unit (104), and optimizing the operating parameters of the clustering unit (104) in dependence on the operating parameters of the automatic encoding unit (102).
14. The apparatus (100) according to claim 13, wherein the apparatus (100) is further configured to optimize the operating parameters of the automatic encoding unit (102) and the clustering unit (104) simultaneously.
15. A method (300) for clustering input data (101), the input data (101) being a data set comprising data points, the method (300) comprising the steps of:
-in a first operational phase of the device (100), the automatic encoding unit (102) reduces (301) the dimensionality of the input data (101) and/or the automatic encoding unit (102) extracts (301) cluster-related features from the input data (101) resulting in low-dimensional data (103);
-in a second operational phase of the device (100), a clustering unit (104) obtains (302) at least one cluster (105) from the low dimensional data (103), and the clustering unit (104) associates (302) each data point in the low dimensional data (103) with one of the at least one cluster (105),
wherein the automatic encoding unit (102) optimizes the low dimensional data (103) for lossless reconstruction of the low dimensional data (103).
CN201880093500.0A 2018-05-17 2018-05-17 Apparatus and method for clustering input data Pending CN112154453A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/062961 WO2019219198A1 (en) 2018-05-17 2018-05-17 Device and method for clustering of input-data

Publications (1)

Publication Number Publication Date
CN112154453A true CN112154453A (en) 2020-12-29

Family

ID=62222659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880093500.0A Pending CN112154453A (en) 2018-05-17 2018-05-17 Apparatus and method for clustering input data

Country Status (2)

Country Link
CN (1) CN112154453A (en)
WO (1) WO2019219198A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610107A * 2021-07-02 2021-11-05 Tongdun Technology Co Ltd Feature optimization method and device
CN113868291A * 2021-10-21 2021-12-31 Shenzhen Intellifusion Technologies Co Ltd Nearest neighbor searching method, device, terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6134541A (en) * 1997-10-31 2000-10-17 International Business Machines Corporation Searching multidimensional indexes using associated clustering and dimension reduction information

Also Published As

Publication number Publication date
WO2019219198A1 (en) 2019-11-21


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20220330

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Applicant after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.

WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201229