CN110705618A - Self-encoder network optimization method serving clustering tasks

Info

Publication number: CN110705618A
Application number: CN201910903391.0A
Authority: CN
Inventors: 王树良; 李琦; 耿晶; 代天茹
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology; Beijing Institute of Technology BIT
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2020-01-17

Abstract

The invention discloses a self-encoder (autoencoder) network optimization method serving clustering tasks, which belongs to the technical field of cluster analysis. The method re-represents a data set to be clustered before cluster analysis so that the clustering task can achieve better accuracy. The method comprises the following steps. A standard self-encoder network is constructed. An objective function is constructed, the current objective function being Loss₁, and the self-encoder network is pre-trained with the data set to be clustered as input to obtain the pre-trained optimal network parameters. Starting from the pre-trained optimal network parameters, the objective function is modified to Loss, and the self-encoder network is trained a second time until it is optimal. The encoder part of the self-encoder network that is optimal after the secondary training is taken out as a clustering data preprocessing model; the data set to be clustered is used as the input of the clustering data preprocessing model, and the output of the clustering data preprocessing model is used as the input of the clustering task.

Description

Self-encoder network optimization method serving clustering tasks

Technical Field

The invention relates to the technical field of clustering analysis, in particular to a self-encoder network optimization method serving clustering tasks.

Background

Clustering is an important means of data analysis. It distinguishes and classifies the data in a data set according to certain requirements and rules, dividing a data set without class labels into several subsets (classes) so that similar data objects fall into the same class as far as possible and dissimilar data objects fall into different classes as far as possible.

With the rapid development of information technology and the arrival of the big data era, internet technology and storage equipment have been continuously improved. The form, kind and dimension of data are no longer uniform. As data become more diverse, it becomes increasingly difficult to learn the characteristics of big data and to mine its latent information, and the massive, heterogeneous and real-time nature of big data challenges feature learning methods. Clustering faces not only growing data volume but, more importantly, high data dimensionality. Clustering is a fundamental problem in data-driven applications, and its performance depends to a large extent on the representation of the data.

Therefore, linear and nonlinear feature transformation methods have been widely applied to the clustering problem in order to learn a better feature representation.

In recent years, much research effort has been devoted to learning features suitable for clustering with deep neural networks, which improves clustering performance to some extent. Neural networks such as autoencoders are used to learn the inherent characteristics of the data and to embed the data into a low-dimensional space, which alleviates the "curse of dimensionality" to a certain extent. By combining the feature extraction capability of convolutional neural networks and self-encoders, strongly robust features are learned through the convolution and pooling operations of convolutional neural networks, and the internal structure and manifold space of the original data are learned as fully as possible.

Existing methods mainly adopt joint clustering: while features are extracted, initial cluster centers are obtained with some clustering method, and the cluster centers are iteratively updated while the model parameters and structure are continuously optimized, until a satisfactory clustering result is achieved. Joint clustering improves the clustering result but also inevitably increases the time complexity of clustering.

Existing technical solutions include the following.

In their paper, Huang P et al. propose the Deep Embedded Network (DEN), a model based on a deep neural network for feature learning. By imposing two constraints on the feature representation, the model learns features that are more favorable for clustering. The method first uses a deep autoencoder to learn a reduced representation from the raw data. To make the learned features suitable for clustering, DEN first imposes a locality-preserving constraint on the learned features, with the purpose of embedding the raw data in its underlying manifold space. Then, unlike spectral clustering, where features are extracted from a block-diagonal similarity matrix, DEN applies a group sparsity constraint to the learned features so as to learn a block-diagonal representation whose non-zero groups correspond to its clusters.

In their paper, Dizaji et al. propose a clustering model called Deep Embedded Regularized Clustering (DEPICT), which effectively maps data into a discriminative embedding subspace and accurately predicts cluster assignments. The model consists of a multinomial logistic regression function stacked on top of a multi-layer convolutional self-encoder network. The top multinomial logistic regression layer (softmax layer) together with the encoder part can be regarded as a discriminative clustering model. DEPICT defines the clustering objective function by minimizing a cross-entropy function and adds to it a regularization term based on a prior on the cluster frequency distribution; this term penalizes unbalanced cluster assignments and prevents clusters from being allocated to outlier samples. Although the deep clustering model is flexible enough to discriminate complex real-world input data, it easily falls into poor local minima during training, which produces undesirable clustering results. To avoid the overfitting problem faced by deep clustering models, DEPICT uses the reconstruction loss function of the self-encoder model as a data-dependent regularization term during training. To exploit the advantages of end-to-end optimization and eliminate the need for layer-wise pre-training, the model provides a joint learning framework that trains all network layers simultaneously by minimizing the clustering objective function and the reconstruction error together.

Some studies use dimensionality reduction as a preprocessing step before clustering. With the reduced-dimension features, high-dimensional, large-volume data sets can be used in clustering problems. These dimensionality reduction techniques use simple linear or nonlinear models rather than deep convolutional self-encoder networks.

Some studies exploit the unsupervised and nonlinear nature of deep learning to perform dimensionality reduction before cluster analysis. Although dimensionality reduction does alleviate the "curse of dimensionality" faced by clustering to some extent, the reduced-dimension features do not improve the accuracy or performance of clustering.

Some studies combine deep learning with clustering and apply cluster-oriented processing to the reduced-dimension features. The clustering algorithms addressed by this research are mainly K-means, Gaussian mixture clustering and spectral clustering; other clustering methods have not been studied in depth. The main reason is that the principles of these three algorithms are similar in many respects and that they place low demands on parameter setting. Many studies therefore aim at improving the performance and accuracy of these three clustering methods.

In summary, there is an urgent need for a dimensionality reduction technique based on a deep convolutional self-encoder network that performs dimensionality reduction before cluster analysis in such a way that the reduced-dimension features improve the accuracy and performance of clustering.

Disclosure of Invention

In view of this, the present invention provides a self-encoder network optimization method serving clustering tasks. It is a dimensionality reduction technique based on a self-encoder network that performs dimensionality reduction before cluster analysis, so that the reduced-dimension features further improve the accuracy and performance of clustering.

In order to achieve the above object, the present invention provides a self-encoder network optimization method for serving a clustering task, including:

A standard self-encoder network is constructed.

An objective function is constructed, the current objective function being Loss₁, and the self-encoder network is pre-trained with the data set to be clustered as input to obtain the pre-trained optimal network parameters.

Here x_i is the i-th data item, used as the input of the self-encoder network; f(x_i) is the output of the self-encoder network; N is the total number of input data; and i is the index of the data.

Starting from the pre-trained optimal network parameters, the objective function is modified to Loss, and the self-encoder network is trained a second time until it is optimal.

Loss = αLoss₁ + βLoss₂

Here Loss₂ is a cluster-oriented fine-tuning objective function, and α and β are weight coefficients.

h_i and h_j are the hidden-layer feature data in the self-encoder network corresponding to the i-th and j-th data in the data set to be clustered, and N is the number of data in the data set to be clustered. U_i(θ) is the neighborhood of h_i. θ is the neighborhood radius, θ = (D_K1 + D_K2 + … + D_KN)/N, where D_Ki is the distance between h_i and the K-th nearest data to h_i, K being an adjustable parameter. j is the index of the data in U_i(θ).

The encoder part of the self-encoder network that is optimal after the secondary training is taken out as a clustering data preprocessing model; the data set to be clustered is used as the input of the clustering data preprocessing model, and the output of the clustering data preprocessing model is used as the input of the clustering task.

Beneficial effects:

1. The invention provides a self-encoder network optimization method serving clustering tasks. Its principle is to use the characteristics of a neural network to learn the most representative feature representation from high-dimensional data: while reducing dimensionality, it learns the important features of the original data set that are beneficial to clustering and removes unnecessary interfering parts of the high-dimensional data set.

2. The objective function Loss₂ constructed for the self-encoder model in the invention is a density-based objective function and can be used to guide the optimization of the self-encoder network. Embedding this objective function into the secondary adjustment of the network allows cluster-oriented data features to be learned effectively.

Drawings

FIG. 1 is a flow chart of a method for optimizing a self-encoder network serving clustering tasks according to the present invention;

FIG. 2 is a schematic diagram of a standard self-encoder network structure constructed in an embodiment of the present invention.

Detailed Description

The invention is described in detail below by way of example with reference to the accompanying drawings.

The invention provides a self-encoder network optimization method serving clustering tasks. Its flow is shown in FIG. 1, and the method comprises the following steps:

S1, constructing a standard self-encoder network.

A standard self-encoder network constructed in the embodiment of the present invention is shown in FIG. 2. A self-encoder is a network architecture for data compression. The self-encoder network compresses the input data into a latent hidden layer and then reconstructs the data from the features in the hidden layer to obtain the final output data. The whole network is built so that the best network is obtained by minimizing the difference between the input data and the reconstructed data. Using the feature vectors in the hidden layer of the self-encoder achieves compression and dimensionality reduction of the data.
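
The compress-then-reconstruct structure described above can be illustrated with a minimal Python sketch. It is not the architecture of FIG. 2: the single dense layer on each side, the tanh activation, the layer sizes (4 inputs, 2 hidden units) and the random initial weights are all assumptions chosen only to keep the example short, and the helper names are hypothetical.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(layer, x, activation):
    """Apply one dense layer: activation(W x + b)."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def encode(params, x):
    """Compress the input x into the hidden-layer feature h."""
    return dense(params["encoder"], x, math.tanh)

def decode(params, h):
    """Reconstruct the input from the hidden-layer feature h (linear output)."""
    return dense(params["decoder"], h, lambda z: z)

def autoencode(params, x):
    """Full pass f(x) = decode(encode(x))."""
    return decode(params, encode(params, x))

rng = random.Random(0)
# Assumed sizes: 4-dimensional input, 2-dimensional hidden layer.
params = {"encoder": make_layer(4, 2, rng), "decoder": make_layer(2, 4, rng)}

x = [0.1, 0.4, -0.3, 0.8]
h = encode(params, x)          # hidden feature used for dimensionality reduction
x_hat = autoencode(params, x)  # reconstruction compared with x during training
```

Only `encode` is needed once training is finished; `decode` exists so that a reconstruction error can be measured while training.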

The invention trains the self-encoder model in stages and then takes the final encoder as a data processing model. The self-encoder network is optimized in two parts, pre-training and secondary adjustment. Pre-training is step S2 below, and secondary adjustment is step S3 below.
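
The two-part schedule can be sketched as follows: train with the first objective alone, then continue from the resulting parameters with the weighted sum of both objectives. Everything concrete here is an assumption for illustration: the two toy quadratic losses merely stand in for Loss₁ and Loss₂, the optimizer is plain gradient descent with a numerical gradient, and the step count, learning rate and weights α = β = 1 are arbitrary.

```python
def numeric_grad(loss, params, eps=1e-6):
    """Central-difference gradient of loss at params (a list of floats)."""
    grad = []
    for k in range(len(params)):
        up = params[:k] + [params[k] + eps] + params[k + 1:]
        down = params[:k] + [params[k] - eps] + params[k + 1:]
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad

def train(loss, params, steps=200, lr=0.1):
    """Plain gradient descent; returns the updated parameter list."""
    for _ in range(steps):
        grad = numeric_grad(loss, params)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

# Toy stand-ins (hypothetical): simple quadratics, not the patent's losses.
def toy_loss1(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

def toy_loss2(p):
    return (p[0] - 3.0) ** 2

alpha, beta = 1.0, 1.0

# Pre-training (S2): optimize the first objective only.
pretrained = train(toy_loss1, [0.0, 0.0])

# Secondary adjustment (S3): start from the pre-trained parameters and
# optimize the weighted sum of both objectives.
finetuned = train(lambda p: alpha * toy_loss1(p) + beta * toy_loss2(p), pretrained)
```

In this toy setting the second phase moves the first parameter from the minimizer of the first loss toward a compromise between both losses, while the second parameter, which the second loss does not touch, stays where pre-training left it.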

S2, constructing an objective function, wherein the current objective function is Loss₁, and pre-training the self-encoder network by taking the data set to be clustered as input to obtain the pre-trained optimal network parameters.

This step is the pre-training process.

In the current objective function Loss₁, x_i is the i-th data item, used as the input of the self-encoder network; f(x_i) is the output of a fully connected convolutional self-encoder network; N is the total number of input data; and i is the index of the data.
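
As an illustration of an objective built from x_i, f(x_i) and N, the sketch below computes a mean squared reconstruction error. This specific form, Loss₁ = (1/N) Σ ‖x_i − f(x_i)‖², is an assumption: it is a common choice for self-encoders, but the text here gives only the symbols, not the formula. The function name and the sample values are hypothetical.

```python
def reconstruction_loss(inputs, outputs):
    """ASSUMED form of Loss1: (1/N) * sum_i ||x_i - f(x_i)||^2."""
    n = len(inputs)
    total = 0.0
    for x_i, fx_i in zip(inputs, outputs):
        # Squared Euclidean distance between an input and its reconstruction.
        total += sum((a - b) ** 2 for a, b in zip(x_i, fx_i))
    return total / n

# Two 2-dimensional inputs and their (made-up) reconstructions.
inputs = [[1.0, 2.0], [0.0, 0.0]]
outputs = [[1.0, 1.0], [0.0, 2.0]]
loss1 = reconstruction_loss(inputs, outputs)  # (1 + 4) / 2
```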

S3, starting from the pre-trained optimal network parameters, modifying the objective function to Loss and training the self-encoder network a second time until the self-encoder network is optimal.

This step is the secondary adjustment part.

Loss = αLoss₁ + βLoss₂

Here Loss₂ is a cluster-oriented fine-tuning objective function, and α and β are weight coefficients. h_i and h_j are the hidden-layer feature data in the self-encoder network corresponding to the i-th and j-th data in the data set to be clustered, and N is the number of data in the data set to be clustered. U_i(θ) is the neighborhood of h_i. θ is the neighborhood radius, θ = (D_K1 + D_K2 + … + D_KN)/N, where D_Ki is the distance between h_i and the K-th nearest data to h_i, K being an adjustable parameter. j is the index of the data in U_i(θ).
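
The neighborhood radius θ can be computed directly from its definition as the average, over all hidden features, of the distance to the K-th nearest other feature. The sketch below assumes the Euclidean distance and assumes that a point is not counted as its own neighbor; the function names and the sample features are hypothetical.

```python
import math

def kth_neighbor_distance(features, i, k):
    """D_Ki: distance from h_i to its k-th nearest other point (Euclidean assumed)."""
    dists = sorted(math.dist(features[i], h_j)
                   for j, h_j in enumerate(features) if j != i)
    return dists[k - 1]

def neighborhood_radius(features, k):
    """theta = (D_K1 + D_K2 + ... + D_KN) / N."""
    n = len(features)
    return sum(kth_neighbor_distance(features, i, k) for i in range(n)) / n

# Four made-up 2-dimensional hidden features: three close together, one far away.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
theta = neighborhood_radius(features, k=1)
```

With K = 1 the three clustered points each contribute a distance of 1 and the isolated point contributes √41, so a larger K or a more scattered data set yields a larger radius.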

For each data h_i in the data set, given the neighborhood radius θ, the distance d_ij between h_i and every other point h_j (1 ≤ j ≤ N, j ≠ i) in the data set is calculated; the distance may be, for example, the Euclidean distance. If d_ij is less than θ, then h_j is a neighbor of h_i.

For each data h_i (1 ≤ i ≤ N) in the data set, the set of all its neighbors forms the neighborhood U_i(θ) corresponding to h_i.
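
The neighborhood construction follows directly from the rule d_ij < θ and is sketched below with the Euclidean distance. The `density_loss` function is only an assumed stand-in for Loss₂ (mean, over all points, of the summed squared distances to their neighbors); the exact formula is not given in the text here, so this form should not be read as the patented objective. `combined_loss` is the stated weighted sum Loss = αLoss₁ + βLoss₂. All names and sample values are hypothetical.

```python
import math

def neighborhoods(features, theta):
    """U_i(theta): for every i, the indices j != i with d_ij < theta."""
    n = len(features)
    return [[j for j in range(n)
             if j != i and math.dist(features[i], features[j]) < theta]
            for i in range(n)]

def density_loss(features, theta):
    """ASSUMED stand-in for Loss2: pulls each h_i toward its neighbors."""
    total = 0.0
    for i, neighbors in enumerate(neighborhoods(features, theta)):
        total += sum(math.dist(features[i], features[j]) ** 2 for j in neighbors)
    return total / len(features)

def combined_loss(loss1, loss2, alpha, beta):
    """Loss = alpha * Loss1 + beta * Loss2."""
    return alpha * loss1 + beta * loss2

features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
U = neighborhoods(features, theta=1.5)
loss2 = density_loss(features, theta=1.5)
total = combined_loss(2.5, loss2, alpha=1.0, beta=0.5)
```

The isolated fourth point has an empty neighborhood, so under this assumed form it contributes nothing to the density term.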

S4, taking out the encoder part of the self-encoder network that is optimal after the secondary training as a clustering data preprocessing model, taking the data set to be clustered as the input of the clustering data preprocessing model, and taking the output of the clustering data preprocessing model as the input of the clustering task.
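
Step S4 amounts to a simple composition: the trained encoder maps every data item to its hidden feature, and those features, not the raw data, are handed to whatever clustering algorithm is used. The sketch below shows only that wiring. The encoder and the clustering routine are toy placeholders (hypothetical), not a trained network or a real clustering algorithm.

```python
def preprocess(encode, dataset):
    """Map every data item to its hidden-layer feature with the trained encoder."""
    return [encode(x) for x in dataset]

def run_clustering(encode, cluster, dataset):
    """The encoder output becomes the input of the clustering task."""
    return cluster(preprocess(encode, dataset))

# Toy placeholders (hypothetical): an "encoder" that reduces each item to its
# mean, and a "clustering" routine that splits the features at a threshold.
def toy_encode(x):
    return [sum(x) / len(x)]

def toy_cluster(features):
    return [0 if f[0] < 1.0 else 1 for f in features]

dataset = [[0.0, 0.2], [0.1, 0.1], [2.0, 2.2], [1.9, 2.1]]
labels = run_clustering(toy_encode, toy_cluster, dataset)
```

Because the clustering routine is passed in as a parameter, the same preprocessing model can serve any clustering algorithm that accepts a list of feature vectors.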

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for self-encoder network optimization to serve clustering tasks, comprising:

constructing a standard self-encoder network;

constructing an objective function, wherein the current objective function is Loss₁, and pre-training the self-encoder network by taking a data set to be clustered as input to obtain pre-trained optimal network parameters;

wherein x_i is the i-th data item, used as the input of the self-encoder network; f(x_i) is the output of the self-encoder network; N represents the total number of input data; and i represents the index of the data;

starting from the pre-trained optimal network parameters, modifying the objective function to Loss and training the self-encoder network a second time until the self-encoder network is optimal;

Loss = αLoss₁ + βLoss₂

wherein Loss₂ is a cluster-oriented fine-tuning objective function; α and β are weight coefficients; h_i and h_j are the hidden-layer feature data in the self-encoder network corresponding to the i-th and j-th data in the data set to be clustered; N is the number of data in the data set to be clustered; U_i(θ) is the neighborhood of h_i; θ is the neighborhood radius, θ = (D_K1 + D_K2 + … + D_KN)/N; D_Ki is the distance between h_i and the K-th nearest data to h_i, K being an adjustable parameter; and j is the index of the data in U_i(θ);

and taking out the encoder part in the self-encoder network which is optimal after secondary training as a clustering data preprocessing model, taking the data set to be clustered as the input of the clustering data preprocessing model, and taking the output of the clustering data preprocessing model as the input of the clustering task.