CN115311483A

CN115311483A - Incomplete multi-view clustering method and system based on local structure and balance perception

Info

Publication number: CN115311483A
Application number: CN202210979979.6A
Authority: CN
Inventors: 文杰; 刘成亮; 刘毅成; 邓世杰
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2022-08-16
Filing date: 2022-08-16
Publication date: 2022-11-08

Abstract

The invention discloses an incomplete multi-view clustering method and system based on local structure and balance perception. For the clustering task of incomplete multi-view data, the method comprises: designing an incomplete multi-view consistent clustering characterization learning model with probability characteristics based on local structure and balance perception; preprocessing the incomplete multi-view data given a view-missing prior position index matrix; solving the variables contained in the model from the preprocessed data by an alternating iterative optimization method, thereby optimizing the model; and obtaining the clustering results of all samples from the optimal shared consistent characterization matrix produced by the optimization. The resulting model is an interpretable and efficient incomplete multi-view clustering model with stable clustering results.

Description

Incomplete multi-view clustering method and system based on local structure and balance perception

Technical Field

The application relates to the technical field of machine learning and pattern recognition, in particular to an incomplete multi-view clustering method and system based on local structure and balanced perception.

Background

In recent years, large amounts of multi-view data, collected by different sensors or in different ways, have emerged in many fields and industries, and with them a demand for Multi-view Clustering (MVC) in various applications. For example, to predict the likely progression of Alzheimer's disease, a multi-view clustering model based on consistent representation learning has been proposed, in which each sample is represented by two types of brain magnetic resonance imaging data; in addition, multi-view clustering models based on Non-negative Matrix Factorization (NMF) have achieved good results in web page item recommendation. Conventional MVC methods generally rely on a complete-view assumption, namely that the feature information of every view is fully observed for every sample. When such methods are applied to incomplete multi-view data, samples with missing views must be removed in advance, so only the samples with complete views can be clustered. In many practical applications, however, such as recommendation systems and Alzheimer's disease diagnosis, the collected data are often incomplete, with some views missing. Research on incomplete multi-view clustering is therefore important, and such methods have wider applicability than conventional multi-view clustering methods.

In recent years, researchers in China and abroad have proposed a number of incomplete multi-view clustering models. For example, a canonical correlation analysis strategy has been introduced for incomplete two-view clustering, in which the complete kernel matrix of one complete view is used to fill in the missing information of the incomplete kernel matrix of the other view. An incomplete multiple kernel k-means method with a kernel completion property (IMKKM-MKC) has also been proposed to solve the incomplete multi-view clustering problem under arbitrary view missing. Besides the kernel-completion methods, graph-completion-based incomplete multi-view clustering methods design the model from the perspective of graph learning, recovering the missing information of multiple incomplete graph matrices and generating a consistent representation of the views. The Unified Embedding Alignment Framework (UEAF), based on matrix factorization, designs a joint optimization model that simultaneously recovers missing view information and learns a consistent representation. These methods generally solve the incomplete multi-view clustering problem by recovering the missing information.

Although the missing-information completion methods can, to some extent, solve the incomplete multi-view clustering problem when part of the view information is missing, they still have the following defects: 1) Almost all of them divide the clustering task into two independent and unrelated stages: graph or representation learning is carried out first, and k-means or spectral clustering is then applied to obtain the final incomplete multi-view clustering result. On the one hand, these methods do not guarantee that the learned graph or representation is cluster-friendly, that is, one that achieves optimal clustering performance; on the other hand, because they all rely on k-means to generate the final clustering result, and running k-means k times generally produces k different clustering results, they cannot obtain a stable and unique clustering solution. 2) These methods generally have high computational complexity and memory consumption, which makes them unsuitable for clustering "large-scale" incomplete multi-view data.

As described above, in recent years, many incomplete multi-view clustering methods have been proposed to solve the challenging problem of multi-view data clustering in which a missing view exists. However, most of the existing methods are not suitable for large-scale incomplete multi-view data clustering tasks, and clustering performance is unstable.

Disclosure of Invention

To address the above problems, the invention provides an incomplete multi-view clustering method and system based on local structure and balance perception. For the problem of efficient learning from incomplete multi-view data, it designs an incomplete multi-view consistent clustering characterization learning model, which obtains a unique clustering result by learning a consistent representation with probability characteristics across views.

In a first aspect of the present invention, an incomplete multi-view clustering method based on local structure and balanced sensing includes the following steps:

establishing a model: for the clustering task of incomplete multi-view data, designing an incomplete multi-view consistent clustering characterization learning model with probability characteristics based on local structure and balance perception, the model being specifically as follows:

wherein $U^{(v)} \in \mathbb{R}^{m_v \times d}$ represents the basis matrix of the v-th view, $m_v$ represents the feature dimension of the v-th view, d represents the dimension of the consistent representation space, $P \in \mathbb{R}^{d \times n}$ represents the shared consistent characterization matrix of the incomplete multi-view data, n represents the total number of samples of the incomplete multi-view data, $\alpha = [\alpha_1, \ldots, \alpha_l]$ is a learnable weight vector, $\mathbf{1} \in \mathbb{R}^{d}$ represents a d-dimensional column vector whose elements are all 1, $\alpha_v$ represents the v-th element of the vector α, r is a positive integer not less than 2, $\alpha_v^{r}$ represents the r-th power of the element $\alpha_v$, λ is a penalty parameter, l represents the number of views, $n_v$ represents the number of non-missing samples in the v-th view, I is an identity matrix, $I_{i,j}$ represents the element in row i and column j of the identity matrix, $W^{(v)}_{i,j}$ represents the similarity between the i-th sample and the j-th sample in the v-th view, $x_i^{(v)}$ represents the i-th column vector of the matrix $X^{(v)}$, $X^{(v)} \in \mathbb{R}^{m_v \times n_v}$ represents the matrix formed by the non-missing samples of the v-th view, $g_j^{(v)}$ represents the j-th column vector of the matrix $G^{(v)}$, and $G^{(v)} \in \{0,1\}^{n \times n_v}$ is a binary matrix of 0 and 1;

data preprocessing: preprocessing the incomplete multi-view data $\{Y^{(v)}\}_{v=1}^{l}$ given the view-missing prior position index matrix Z;

optimizing the model: according to the preprocessed data and the incomplete multi-view consistent clustering characterization learning model, solving the variables contained in the model, namely $\{U^{(v)}\}_{v=1}^{l}$, P and α, together with an introduced auxiliary variable Q, a Lagrange multiplier C and a positive penalty parameter μ, by an alternating iterative optimization method, thereby optimizing the model, wherein:

solving the optimization problem of $U^{(v)}$:

the optimal solution of the variable $U^{(v)}$ is obtained as $U^{(v)} = M^{(v)} N^{(v)T}$, wherein $M^{(v)} \Sigma^{(v)} N^{(v)T}$ is the singular value decomposition of $X^{(v)} S^{(v)T} G^{(v)T} P^{T}$, $S^{(v)} = W^{(v)} + I$, and $W^{(v)}$ is a pre-constructed similarity graph matrix;

solving an optimization problem of P:

the optimal solution for the variable P is obtained as:

wherein

μ > 0 is a positive penalty parameter, C is a Lagrange multiplier, Q is an auxiliary variable with P = Q, and c represents the number of rows of the matrix P;

solving the optimization problem of Q:

the optimal solution of the variable Q is obtained as: $Q = (\mu P + C)(\mathbf{1}\mathbf{1}^{T} + \mu I)^{-1}$;

Solving the optimization problem of α:

the optimal solution for the variable α is obtained as:

wherein

the update equations for C and μ are:

where ρ and $\mu_0$ are constants;

clustering: obtaining the clustering result of the data by using the optimized optimal shared consistent characterization matrix P, which specifically comprises: according to

if the j-th element of the i-th column $P_{:,i}$ is the largest, the i-th sample is assigned to the j-th class; the clustering results of all samples are obtained by finding the position of the maximum element in each column of the characterization matrix P.

Further, $G^{(v)} \in \{0,1\}^{n \times n_v}$ is a binary matrix of 0 and 1 used to retain, from the matrix P, the sample representations corresponding to $X^{(v)}$; the matrix $G^{(v)}$ is constructed from the view-missing prior position index matrix Z as follows:

further, preprocessing the incomplete multi-view data $\{Y^{(v)}\}_{v=1}^{l}$ given the view-missing prior position index matrix Z specifically comprises:

deletion of missing views: deleting the missing samples in each view according to the view-missing prior position index matrix Z to obtain the non-missing data set $\{X^{(v)}\}_{v=1}^{l}$;

data normalization: normalizing $X^{(v)}$, computed as

where $x_i^{(v)}$ represents the i-th column vector of the matrix $X^{(v)}$;

local neighbor graph $W^{(v)}$ construction: for the non-missing data $X^{(v)}$ of each view, computing with a Gaussian kernel the distance between each sample and its k nearest neighbor samples, as

where $x_j^{(v)}$ is one of the k nearest neighbors of the sample $x_i^{(v)}$, and the other, non-neighbor elements of $W^{(v)}$ are set to 0;

constructing the conversion matrix $G^{(v)}$ according to the view-missing prior position index matrix Z.

In a second aspect of the present invention, an incomplete multi-view clustering system based on local structure and balance perception is provided, the system comprising:

a model establishing unit, configured to design an incomplete multi-view consistent clustering characterization learning model with probability characteristics based on local structure and balance perception, the model being specifically as follows:

wherein $U^{(v)} \in \mathbb{R}^{m_v \times d}$ represents the basis matrix of the v-th view, $m_v$ represents the feature dimension of the v-th view, d represents the dimension of the consistent representation space, $P \in \mathbb{R}^{d \times n}$ represents the shared consistent characterization matrix of the incomplete multi-view data, n represents the total number of samples of the incomplete multi-view data, $\alpha = [\alpha_1, \ldots, \alpha_l]$ is a learnable weight vector, $\mathbf{1} \in \mathbb{R}^{d}$ represents a d-dimensional column vector whose elements are all 1, λ is a penalty parameter, l represents the number of views, $n_v$ represents the number of non-missing samples in the v-th view, I is an identity matrix, $I_{i,j}$ represents the element in row i and column j of the identity matrix, $x_i^{(v)}$ represents the i-th column vector of the matrix $X^{(v)}$, $X^{(v)} \in \mathbb{R}^{m_v \times n_v}$ represents the matrix formed by the non-missing samples of the v-th view, $g_j^{(v)}$ represents the j-th column vector of the matrix $G^{(v)}$, and $G^{(v)} \in \{0,1\}^{n \times n_v}$ is a binary matrix of 0 and 1;

a data preprocessing unit, configured to preprocess the incomplete multi-view data $\{Y^{(v)}\}_{v=1}^{l}$ given the view-missing prior position index matrix Z;

an optimization model unit, configured to solve, according to the preprocessed data and the incomplete multi-view consistent clustering characterization learning model, the variables contained in the model, namely $\{U^{(v)}\}_{v=1}^{l}$, P and α, together with an introduced auxiliary variable Q, a Lagrange multiplier C and a positive penalty parameter μ, by an alternating iterative optimization method, thereby optimizing the model, wherein:

solving the optimization problem of $U^{(v)}$:

the optimal solution of the variable $U^{(v)}$ is obtained as $U^{(v)} = M^{(v)} N^{(v)T}$, wherein $M^{(v)} \Sigma^{(v)} N^{(v)T}$ is the singular value decomposition of $X^{(v)} S^{(v)T} G^{(v)T} P^{T}$, $S^{(v)} = W^{(v)} + I$, and $W^{(v)}$ is a pre-constructed similarity graph matrix;

solving the optimization problem of P:

the optimal solution for the variable P is obtained as:

wherein

μ > 0 is a positive penalty parameter, C is a Lagrange multiplier, Q is an auxiliary variable with P = Q, and c represents the number of rows of the matrix P;

solving the optimization problem of Q:

Solving the optimization problem of α:

the optimal solution for the variable α is obtained as:

wherein

the update equations for C and μ are:

where ρ and $\mu_0$ are constants;

a clustering unit, configured to obtain the clustering result of the data by using the optimized optimal shared consistent characterization matrix P, which specifically comprises: according to

if the j-th element of the i-th column $P_{:,i}$ is the largest, the i-th sample is assigned to the j-th class; the clustering results of all samples are obtained by finding the position of the maximum element in each column of the characterization matrix P.

In a further aspect of the present invention,

further, the data preprocessing unit specifically performs the following steps:

deletion of missing views: deleting the missing samples in each view according to the view-missing prior position index matrix Z to obtain the non-missing data set $\{X^{(v)}\}_{v=1}^{l}$;

data normalization: normalizing $X^{(v)}$, computed as

where $x_i^{(v)}$ represents the i-th column vector of the matrix $X^{(v)}$;

local neighbor graph $W^{(v)}$ construction: for the non-missing data $X^{(v)}$ of each view, computing with a Gaussian kernel the distance between each sample and its k nearest neighbor samples, as

where $x_j^{(v)}$ is one of the k nearest neighbors of the sample $x_i^{(v)}$, and the other, non-neighbor elements of $W^{(v)}$ are set to 0;

In a third aspect of the present invention, an incomplete multi-view clustering system based on local structure and balance perception is provided, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the above-described incomplete multi-view clustering method based on local structure and balance perception.

In a fourth aspect of the present invention, a computer-readable storage medium is provided, on which instructions are stored, and when executed by a processor, the instructions cause the processor to perform the above incomplete multi-view clustering method based on local structure and balance perception.

The invention provides an incomplete multi-view clustering method and system based on local structure and balance perception. For the problem of efficient learning from incomplete multi-view data, the method and system design an incomplete multi-view consistent clustering characterization learning model with probability characteristics. The model obtains a unique clustering result by learning a consistent representation with probability characteristics across views, in which each element of a consistent probability representation vector directly reflects the probability that the corresponding sample belongs to a certain class. In addition, the model integrates geometric structure preservation and consistent representation learning into a very compact model; introducing the geometric structure preservation property requires no additional constraint term or penalty parameter, which keeps the model compact and reduces the parameter tuning burden. Furthermore, to avoid assigning the samples excessively to a few classes, a balance perception learning technique is introduced. The method not only has the best and most stable clustering performance, but also has higher computational efficiency than current advanced incomplete multi-view clustering methods. Specifically, the beneficial effects of the invention include:

LUBA_EIMVC designs a novel balance-aware, graph-regularized incomplete multi-view orthogonal matrix factorization model, which not only mines and uses the local structure information of the views to guide the optimization of the model, but also makes full use of the non-missing view information to learn a consistent clustering representation with probability characteristics;

unlike existing methods that obtain the clustering result with k-means, LUBA_EIMVC directly obtains a unique positive probability matrix shared by all views, each element of which can be regarded as the probability that a sample belongs to a certain class, so that the inaccurate clustering caused by k-means is avoided;

to prevent the samples from being excessively concentrated in a few classes, or even in a single class, while the model optimizes the clustering result, a balance perception constraint on the probability matrix is introduced, and a cluster-friendly consistent characterization matrix with probability characteristics is jointly learned, from which the clustering result of the incomplete multi-view data can be obtained directly;

because it learns a consistent representation with probability characteristics, the model designed by LUBA_EIMVC is an interpretable and efficient incomplete multi-view clustering model with stable clustering results.

Drawings

FIG. 1 is a schematic diagram of an incomplete multi-view clustering method based on local structure and balanced sensing in an embodiment of the present invention;

FIG. 2 is a schematic diagram of an optimization learning process of an incomplete multi-view consistent clustering characterization learning model in the embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an incomplete multi-view clustering system based on local structure and balance perception in an embodiment of the present invention;

FIG. 4 is an architecture diagram of a computer device in an embodiment of the invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures associated with the present invention are shown in the drawings, not all of them.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

For the incomplete multi-view clustering method and system based on local structure and balance perception, the invention provides the following embodiments:

Example 1 based on the invention

Fig. 1 shows a schematic diagram of the incomplete multi-view clustering method based on local structure and balance perception in embodiment 1 of the present invention (LUBA_EIMVC, Local strUcture- and BAlance-aware Efficient Incomplete Multi-View Clustering). The method aims to obtain, from incomplete multi-view data, a positive probabilistic consistent representation matrix with balanced clustering results, so as to output clustering results that are interpretable and unique. The specific steps are as follows:

establishing a model: for the clustering task of incomplete multi-view data, designing an incomplete multi-view consistent clustering characterization learning model with probability characteristics based on local structure and balance perception, the model being specifically as follows:

wherein $U^{(v)} \in \mathbb{R}^{m_v \times d}$ represents the basis matrix of the v-th view, $m_v$ represents the feature dimension of the v-th view, d represents the dimension of the consistent representation space, $P \in \mathbb{R}^{d \times n}$ represents the shared consistent characterization matrix of the incomplete multi-view data, and n represents the total number of samples of the incomplete multi-view data; $\alpha = [\alpha_1, \ldots, \alpha_l]$ is a learnable weight vector, $\mathbf{1} \in \mathbb{R}^{d}$ represents a d-dimensional column vector whose elements are all 1, $\alpha_v$ represents the v-th element of the vector α, r is a positive integer not less than 2, and $\alpha_v^{r}$ represents the r-th power of the element $\alpha_v$; λ is a penalty parameter, l represents the number of views, $n_v$ represents the number of non-missing samples in the v-th view, I is an identity matrix, $I_{i,j}$ represents the element in row i and column j of the identity matrix, $x_i^{(v)}$ represents the i-th column vector of the matrix $X^{(v)}$, $g_j^{(v)}$ represents the j-th column vector of the matrix $G^{(v)}$, and $G^{(v)} \in \{0,1\}^{n \times n_v}$ is a binary matrix of 0 and 1;

In a specific implementation, incomplete multi-view data $\{Y^{(v)}\}_{v=1}^{l}$ and a view-missing position information matrix $Z \in \mathbb{R}^{l \times n}$ are given, where $Y^{(v)}$ is the matrix formed by the n samples of the v-th view and its column vector $y_j^{(v)}$ represents the v-th view feature of the j-th sample. If the v-th view of the j-th sample is missing, the corresponding element of Z is $Z_{v,j} = 0$; otherwise $Z_{v,j} = 1$, indicating that the view of the corresponding sample is not missing. In the original data $\{Y^{(v)}\}_{v=1}^{l}$, all feature elements of a feature vector corresponding to a missing view can be marked as "NaN". For the clustering task of such incomplete multi-view data, this embodiment designs the incomplete multi-view consistent clustering characterization learning model with probability characteristics based on local structure and balance perception given in formula (1). In formula (1), the matrix $X^{(v)}$ is obtained directly by deleting from the original data $Y^{(v)}$ the columns marked "NaN". $U^{(v)}$ is the basis matrix of the v-th view, whose values are obtained by optimizing model (1), and d is the dimension of the consistent representation space, usually set to the number of classes into which the data are expected to be partitioned. The constraint $1 \geq P \geq 0$ indicates that all elements of the matrix P lie in the range [0, 1].
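The relationship between the raw data, the matrix Z and the per-view matrices can be sketched in NumPy as follows. This is a minimal illustration, not the patent's formula (2), which is not reproduced in this text: the function name `split_view` is hypothetical, and the layout of $G^{(v)}$ (one column per non-missing sample, with a single 1 at that sample's global index) is an assumption chosen so that $P G^{(v)}$ has the dimensions required by the SVD step described in the source.

```python
import numpy as np

def split_view(Y, z):
    """Return (X, G) for one view (hypothetical helper).

    Y: (m_v, n) raw features; columns of missing samples are filled with NaN.
    z: (n,) row v of Z, with 1 = observed and 0 = missing.
    """
    observed = np.flatnonzero(z == 1)         # global indices of non-missing samples
    X = Y[:, observed]                        # (m_v, n_v) non-missing data
    n, n_v = Y.shape[1], observed.size
    G = np.zeros((n, n_v))
    G[observed, np.arange(n_v)] = 1.0         # column j selects the j-th observed sample
    return X, G

# Toy view with n = 4 samples, the second one missing.
Y = np.array([[1.0, np.nan, 3.0, 4.0],
              [5.0, np.nan, 7.0, 8.0]])
z = np.array([1, 0, 1, 1])
X, G = split_view(Y, z)

P = np.arange(8, dtype=float).reshape(2, 4)   # stand-in for the d x n matrix P
PG = P @ G                                    # d x n_v: columns of P for observed samples
```

Under this assumed layout, multiplying P by $G^{(v)}$ simply keeps the columns of P that belong to samples observed in view v.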

$W^{(v)}$ is a pre-constructed similarity graph matrix whose element $W^{(v)}_{i,j}$ represents the similarity between the i-th sample and the j-th sample in the v-th view. It is constructed as follows: 1) The Gaussian distance between the non-missing samples in the v-th view is computed as:

2) For the i-th non-missing sample, its distances to the other $n_v - 1$ samples are sorted, and $W^{(v)}$ keeps only the Gaussian distances between each sample and its k nearest samples; all other elements are set to 0.
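The two construction steps can be sketched as below. The Gaussian formula itself is not reproduced in this text, so the sketch assumes the common form $\exp(-\|x_i - x_j\|_2^2 / (2\sigma^2))$; the bandwidth `sigma` and the function name `knn_gaussian_graph` are assumptions introduced for illustration only.

```python
import numpy as np

def knn_gaussian_graph(X, k, sigma=1.0):
    """Sketch of a k-nearest-neighbor Gaussian graph.

    X: (m_v, n_v) matrix whose columns are the non-missing samples of one view.
    Returns W of shape (n_v, n_v) with k non-zero entries per row.
    """
    sq = np.sum(X ** 2, axis=0)
    # Pairwise squared Euclidean distances, clipped against round-off.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    K = np.exp(-D2 / (2.0 * sigma ** 2))      # assumed Gaussian similarity
    np.fill_diagonal(D2, np.inf)              # a sample is not its own neighbor
    idx = np.argsort(D2, axis=1)[:, :k]       # k nearest neighbors of each sample
    W = np.zeros_like(K)
    rows = np.arange(X.shape[1])[:, None]
    W[rows, idx] = K[rows, idx]               # keep neighbor entries, zero elsewhere
    return W

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3, 6))              # 6 samples with 3 features
W_demo = knn_gaussian_graph(X_demo, k=2)
```

A graph built this way is not symmetric in general. The optimization section of the source treats $S^{(v)} = W^{(v)} + I$ as symmetric, so an implementation would probably symmetrize it, for example with `np.maximum(W, W.T)`; that step is likewise an assumption.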

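In a preferred embodiment, in formula (1), $G^{(v)} \in \{0,1\}^{n \times n_v}$ is a binary matrix of 0 and 1 used to retain, from the matrix P, the sample representations corresponding to $X^{(v)}$; it is constructed from the view-missing prior position index matrix Z according to formula (2):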

In model (1), the orthogonality constraint $U^{(v)T} U^{(v)} = I$ on the basis matrix $U^{(v)}$ can avoid the problem of cluster center degeneration. The constraint term

is a balance perception constraint, which avoids the problem of over-clustering the data into a small number of classes, that is, partitioning data that has c classes into

classes. Compared with the binary cluster labels obtained by k-means, the constraint $P^{T}\mathbf{1} = \mathbf{1}$ introduced in model (1) yields a consistent probability matrix, which increases the degrees of freedom of basis matrix learning and consistent representation learning. In model (1),

is the novel graph-embedded multi-view consistent representation learning term designed in this method. Its novelty and its distinction from other methods lie mainly in the following: the invention combines the structure-preserving property of graph embedding and the shared consistent representation learning of incomplete multi-view data in a very compact term, which has no hyper-parameters and yields a structured, discriminative consistent representation.

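Data preprocessing: the incomplete multi-view data $\{Y^{(v)}\}_{v=1}^{l}$ are preprocessed given the view-missing prior position index matrix Z;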

In a preferred embodiment, preprocessing the incomplete multi-view data $\{Y^{(v)}\}_{v=1}^{l}$ given the view-missing prior position index matrix Z specifically comprises the following steps:

data normalization: normalizing the non-missing data $X^{(v)}$, computed as

where $x_i^{(v)}$ represents the i-th column vector of the matrix $X^{(v)}$;

local neighbor graph $W^{(v)}$ construction:

where $x_j^{(v)}$ is one of the k nearest neighbors of the sample $x_i^{(v)}$, and the other, non-neighbor elements of $W^{(v)}$ are set to 0;

constructing the conversion matrix $G^{(v)}$ from the view-missing prior position index matrix Z by formula (2).

Optimizing the model: according to the preprocessed data and the designed incomplete multi-view consistent clustering characterization learning model (1), the variables contained in the model, namely $\{U^{(v)}\}_{v=1}^{l}$, P and α, together with an introduced auxiliary variable Q, a Lagrange multiplier C and a positive penalty parameter μ, are solved by an alternating iterative optimization method, thereby optimizing the model.

In the specific implementation, model (1) contains several variables, $\{U^{(v)}\}_{v=1}^{l}$, P and α, and an alternating iterative optimization method is designed to solve them. First, let $S^{(v)} = W^{(v)} + I$, introduce an auxiliary variable Q, and let P = Q, which gives the following problem:

The augmented Lagrangian function of problem (3) can be expressed as:

where μ > 0 is a positive penalty parameter and C is the Lagrange multiplier. $\|A\|_F$ denotes the Frobenius norm of a matrix $A \in \mathbb{R}^{m \times n}$, computed as $\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} A_{i,j}^{2}}$, where $A_{i,j}$ is the (i, j)-th element of the matrix A.

Then, the following five sub-problems are solved one by one in an iterative manner to obtain the optimal solutions of the variables:

Step 1: solving $U^{(v)}$. When solving $U^{(v)}$, the other variables to be solved are regarded as known, which gives the following optimization sub-problem with respect to $U^{(v)}$:

According to the constraint $U^{(v)T} U^{(v)} = I$, problem (5) can be simplified as:

where $D^{(v)}$ is a diagonal matrix; because the matrix $S^{(v)}$ is symmetric, $D^{(v)}$ is computed as

From formula (6), the following optimization problem equivalent to problem (5) can be obtained:

Let the singular value decomposition (SVD) of $X^{(v)} S^{(v)T} G^{(v)T} P^{T}$ be $M^{(v)} \Sigma^{(v)} N^{(v)T}$; then the optimal solution of problem (7) is $U^{(v)} = M^{(v)} N^{(v)T}$. The singular value decomposition can be computed by directly calling the "svd" function in MATLAB.
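As an illustration of Step 1, the closed-form update $U^{(v)} = M^{(v)} N^{(v)T}$ can be written with NumPy's `np.linalg.svd` in place of the MATLAB call. This is a sketch under stated assumptions: the function name `update_U` and the random test data are hypothetical, and the thin SVD assumes $m_v \geq d$ so that $U^{(v)}$ can have orthonormal columns.

```python
import numpy as np

def update_U(X, S, G, P):
    """Sketch of the U-step: U = M N^T from the SVD of X S^T G^T P^T.

    X: (m_v, n_v), S: (n_v, n_v), G: (n, n_v), P: (d, n); assumes m_v >= d.
    """
    A = X @ S.T @ G.T @ P.T                           # (m_v, d)
    M, _, Nt = np.linalg.svd(A, full_matrices=False)  # A = M diag(s) Nt, with Nt = N^T
    return M @ Nt

# Hypothetical toy problem: m_v = 5, n_v = 4, n = 6, d = 3.
rng = np.random.default_rng(0)
X_u = rng.normal(size=(5, 4))
W_u = np.abs(rng.normal(size=(4, 4)))
S_u = (W_u + W_u.T) / 2 + np.eye(4)                   # symmetric S = W + I
G_u = np.zeros((6, 4))
G_u[[0, 2, 3, 5], np.arange(4)] = 1.0                 # samples 0, 2, 3, 5 observed
P_u = rng.random(size=(3, 6))

U = update_U(X_u, S_u, G_u, P_u)
A_u = X_u @ S_u.T @ G_u.T @ P_u.T
```

The returned U satisfies $U^{T}U = I$, and $\mathrm{Tr}(U^{T}A)$ equals the sum of the singular values of A, which is the largest value this trace can take over matrices with orthonormal columns.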

Step 2: solving P. Regarding the variables other than P as known quantities gives the following optimization sub-problem with respect to P:

Problem (8) can be simplified as:

where

It can be seen that the matrix A is a diagonal matrix whose diagonal elements are all positive. According to

problem (9) can be transformed into the following equivalent form:

Problem (10) can be viewed as n independent optimization problems, so P can be optimized column by column. For each column $P_{:,i}$, the optimal solution of the variable P is expressed as follows:

where the function max(a, 0) sets an element a smaller than 0 to 0, and c denotes the number of rows of the matrix P.
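The closed form for each column is not reproduced in this text. Given the constraints $0 \leq P \leq 1$ and $P^{T}\mathbf{1} = \mathbf{1}$ and the max(a, 0) operation mentioned above, a plausible reading is that each column is obtained by a Euclidean projection onto the probability simplex. The sketch below implements the standard sort-based simplex projection under that assumption; it is not claimed to be the patent's exact formula, and the name `project_columns_to_simplex` and the input matrix H (the unconstrained target of each column) are hypothetical.

```python
import numpy as np

def project_columns_to_simplex(H):
    """Project each column of H (c, n) onto {p : p >= 0, sum(p) = 1}.

    Standard sort-based Euclidean projection, applied column by column.
    """
    c, n = H.shape
    srt = -np.sort(-H, axis=0)                    # each column sorted in descending order
    css = np.cumsum(srt, axis=0) - 1.0            # cumulative sums minus the target sum
    ks = np.arange(1, c + 1)[:, None]
    cond = srt - css / ks > 0                     # support test for each prefix length
    rho = c - np.argmax(cond[::-1, :], axis=0)    # largest prefix length passing the test
    theta = css[rho - 1, np.arange(n)] / rho      # per-column shift
    return np.maximum(H - theta, 0.0)             # max(a, 0) clips negative entries

H_demo = np.array([[0.2, 2.0],
                   [0.3, 0.0],
                   [0.5, 0.0]])
P_demo = project_columns_to_simplex(H_demo)
# First column is already a probability vector and is left unchanged;
# second column is mapped to the vertex [1, 0, 0].
```

Because the columns are independent, the projection vectorizes over all n samples, which matches the column-by-column structure of problem (10).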

Step 3: solving Q. With the variables other than Q fixed, the optimization problem reduces to:

Taking the partial derivative of problem (12) with respect to the variable Q and setting it to 0 gives:

$Q = (\mu P + C)(\mathbf{1}\mathbf{1}^{T} + \mu I)^{-1}$ (13)
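Formula (13) contains the inverse of an n × n matrix (for the product to be defined, the all-ones vector here must have length n). Because $\mathbf{1}\mathbf{1}^{T}$ has rank one, the Sherman-Morrison identity gives $(\mu I + \mathbf{1}\mathbf{1}^{T})^{-1} = \frac{1}{\mu}\left(I - \frac{\mathbf{1}\mathbf{1}^{T}}{\mu + n}\right)$, so the update can be evaluated without forming or inverting that matrix. The sketch below, with the hypothetical name `update_Q`, is one way to exploit this; the source does not state how the inverse is computed.

```python
import numpy as np

def update_Q(P, C, mu):
    """Evaluate Q = (mu*P + C)(11^T + mu*I)^{-1} via Sherman-Morrison.

    P, C: (d, n); mu > 0. Cost is O(d*n) instead of O(n^3).
    """
    B = mu * P + C
    n = B.shape[1]
    row_sums = B.sum(axis=1, keepdims=True)       # B @ 1, shape (d, 1)
    # B (mu I + 11^T)^{-1} = B/mu - (B 1) 1^T / (mu (mu + n))
    return B / mu - row_sums / (mu * (mu + n))

rng = np.random.default_rng(1)
P_q = rng.random(size=(3, 7))
C_q = rng.normal(size=(3, 7))
mu_q = 0.5
Q_fast = update_Q(P_q, C_q, mu_q)

# Direct evaluation of formula (13) for comparison.
Q_direct = (mu_q * P_q + C_q) @ np.linalg.inv(np.ones((7, 7)) + mu_q * np.eye(7))
```

For large n this avoids the cubic cost of an explicit inverse, which fits the stated goal of handling large-scale data.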

Step 4: solving α. Let

and fix the variables independent of α; the following optimization problem with respect to α is obtained:

Solving problem (14) gives the optimal solution of the variable α as:

where r is a positive integer not less than 2.
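The closed form for α is not reproduced in this text. If the sub-problem has the usual form of minimizing $\sum_v \alpha_v^{r} h_v$ subject to $\sum_v \alpha_v = 1$ and $\alpha_v \geq 0$, where $h_v > 0$ is the loss of view v (an assumption about the omitted definition), then the Lagrange conditions give $\alpha_v = h_v^{1/(1-r)} / \sum_u h_u^{1/(1-r)}$. The sketch below implements that standard result; `update_alpha` and the sample losses are hypothetical.

```python
import numpy as np

def update_alpha(h, r):
    """Minimizer of sum_v alpha_v^r * h_v on the simplex, assuming h > 0 and r >= 2."""
    h = np.asarray(h, dtype=float)
    w = np.power(h, 1.0 / (1.0 - r))              # views with a larger loss get less weight
    return w / w.sum()

h_demo = [1.0, 4.0]                               # hypothetical per-view losses
alpha_demo = update_alpha(h_demo, r=2)            # proportional to 1/h: [0.8, 0.2]
```

With r = 2 the weights are inversely proportional to the per-view losses; larger r flattens the weights toward uniform.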

Step 5: updating C and μ. The update equations for C and μ are as follows:

where ρ and $\mu_0$ are constants; $\mu_0$ is usually set to a relatively large value such as $10^{8}$, and ρ is usually set to a value greater than 1.
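The update equations themselves are not reproduced in this text. In the standard augmented Lagrangian scheme for the constraint P = Q, which is consistent with the closed form of Q in formula (13), the multiplier is updated as C ← C + μ(P − Q) and the penalty as μ ← min(ρμ, μ₀). The sketch below assumes exactly that form; the function name `update_multipliers` is hypothetical, and the defaults ρ = 1.08 and μ₀ = 10⁸ are the values mentioned in the source.

```python
import numpy as np

def update_multipliers(P, Q, C, mu, rho=1.08, mu_max=1e8):
    """Assumed standard ALM updates for the constraint P = Q."""
    C_new = C + mu * (P - Q)                      # dual ascent on the multiplier
    mu_new = min(rho * mu, mu_max)                # grow the penalty, capped at mu_max
    return C_new, mu_new

P_m = np.ones((2, 3))
Q_m = np.zeros((2, 3))
C_m = np.zeros((2, 3))
C_next, mu_next = update_multipliers(P_m, Q_m, C_m, mu=0.01)
```

Starting from μ = 0.01 with ρ = 1.08, the penalty grows geometrically and is capped at μ₀, which progressively enforces P = Q.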

One specific example of the model optimization procedure is shown in Algorithm 1 below. For LUBA_EIMVC, the label of the i-th sample can be obtained directly by

The complete optimization flowchart of model (1) is shown in Fig. 2, in which the initialization step mainly includes the following operations of Algorithm 1:

$U^{(v)}$ is initialized to an arbitrary random orthogonal matrix, and the parameters are initialized as α = 1, $\mu_0 = 10^{8}$, μ = 0.01 and ρ = 1.08. The convergence criterion in Fig. 2 is $|loss_t - loss_{t-1}| < 10^{-5}$, where $loss_t$ and $loss_{t-1}$ are the target loss values at step t and step t-1, respectively, computed as follows:

Clustering: the optimized optimal shared consistent characterization matrix P is used according to

that is, if the j-th element of the i-th column $P_{:,i}$ is the largest, the i-th sample is assigned to the j-th class. The clustering results of all samples are obtained by finding the position of the maximum element in each column of the characterization matrix P; d denotes the number of rows of the matrix P and can generally be set to the number of cluster classes c.

Since the elements of the matrix P in the present invention represent the probability that each sample belongs to a certain class, the clustering result of the data can be obtained directly according to

that is, if the j-th element of the i-th column $P_{:,i}$ is the largest, the i-th sample is assigned to the j-th class. The clustering results of all samples are obtained by finding the position of the maximum element in each column of the matrix P.
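The label assignment described above is a column-wise argmax, so no additional k-means step is needed. A minimal sketch, with a hypothetical 2 × 4 probability matrix standing in for the optimized P:

```python
import numpy as np

def assign_labels(P):
    """Label of sample i = row index of the largest entry in column i of P (d x n)."""
    return np.argmax(P, axis=0)

# Hypothetical probability matrix: 2 classes, 4 samples, each column sums to 1.
P_lab = np.array([[0.9, 0.2, 0.6, 0.1],
                  [0.1, 0.8, 0.4, 0.9]])
labels = assign_labels(P_lab)                     # array([0, 1, 0, 1])
class_sizes = np.bincount(labels, minlength=P_lab.shape[0])
```

Because the labels are a deterministic function of P, repeated runs on the same P always give the same partition, and `class_sizes` gives a quick check of how balanced the resulting clusters are.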

For the problem of clustering incomplete multi-view data with missing views, which arises in application scenarios across many industries, the method provides a new clustering model that is fast, stable and interpretable. Compared with previous models, it has the following distinguishing characteristics. The model is simple and distinctive: it provides a concise "incomplete multi-view consistent characterization learning term with embedded local structure", which integrates local structure embedding and incomplete multi-view consistent characterization learning into a single optimization term. The model is interpretable: each constraint term of the model has a clear meaning, and each element of the learned shared consistent characterization directly expresses the clustering result, so both the model and its output are interpretable. The model has a unique, stable solution: unlike traditional methods, which need an additional k-means step and therefore yield unstable clustering results, the method obtains a unique clustering result directly from the single output of the model, the "consistent characterization probability matrix P". The method not only achieves higher clustering accuracy but also has the lowest time cost and the highest efficiency among the compared methods.

Embodiment 2 of the invention

The incomplete multi-view clustering system 300 based on local structure and balance perception provided in Embodiment 2 of the present invention can execute the incomplete multi-view clustering method based on local structure and balance perception provided in any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to that method.

FIG. 3 is a schematic structural diagram of the incomplete multi-view clustering system 300 based on local structure and balance perception in Embodiment 2 of the present invention. Referring to FIG. 3, the system 300 may specifically include:

a model establishing unit 310, configured to design, for the clustering task of incomplete multi-view data, an incomplete multi-view consistent clustering characterization learning model with probability characteristics based on local structure and balance perception, the model being specifically:

where $U^{(v)} \in R^{m_v \times d}$ denotes the base matrix of the v-th view, $m_v$ denotes the feature dimension of the v-th view, d denotes the dimension of the consistent characterization space, $P \in R^{d \times n}$ denotes the shared consistent characterization matrix of the incomplete multi-view data, n denotes the total number of samples of the incomplete multi-view data, $\alpha = [\alpha_1, \dots, \alpha_l]$ is a learnable weight vector, $\mathbf{1} \in R^{d}$ denotes the d-dimensional column vector whose elements are all 1, λ is a penalty parameter, l denotes the number of views, $n_v$ denotes the number of samples that are not missing in the v-th view, I is an identity matrix, $I_{i,j}$ denotes the element in row i and column j of the identity matrix, $x_i^{(v)}$ denotes the i-th column vector of the matrix $X^{(v)}$, $g_j^{(v)}$ denotes the j-th column vector of the matrix $G^{(v)}$, $G^{(v)}$ is a binary matrix of 0s and 1s, and r is a positive integer not less than 2;

a data preprocessing unit 320, configured to preprocess the incomplete multi-view data for which the view-missing prior position index matrix Z is given;

a model optimization unit 330, configured to solve, according to the preprocessed data and the designed incomplete multi-view consistent clustering characterization learning model, for the variables $U^{(v)}$, P and α contained in the model, together with the introduced auxiliary variable Q, the Lagrange multiplier C and the positive penalty parameter μ, using an alternating iterative optimization method so as to optimize the model, wherein:

solving the optimization problem of $U^{(v)}$:

the optimal solution of the variable $U^{(v)}$ is obtained as $U^{(v)} = M^{(v)} N^{(v)T}$, where $M^{(v)} \Sigma^{(v)} N^{(v)T}$ is the singular value decomposition of $X^{(v)} S^{(v)T} G^{(v)T} P^{T}$, $S^{(v)} = W^{(v)} + I$, and $W^{(v)}$ is the pre-constructed similarity graph matrix;

solving the optimization problem of P:

the optimal solution for the variable P is obtained as follows:

wherein

μ > 0 is a positive penalty parameter, C is the Lagrange multiplier, and Q is an auxiliary variable with P = Q;

solving the optimization problem of Q:

solving the optimization problem of α:

the optimal solution of the variable α is obtained as:

where

the update equations for C and μ are:

where ρ and $\mu_0$ are constants;

a clustering unit 340, configured to obtain the clustering result of the data from the optimized shared consistent characterization matrix P according to $\arg\max_{j \in \{1,\dots,d\}} P_{j,i}$: if the j-th element of the i-th column $P_{:,i}$ has the largest value, the i-th sample is assigned to the j-th class. The clustering result of all samples is obtained by finding the position of the maximum element in each column of the characterization matrix P.
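The closed-form update of $U^{(v)}$ used by the optimization unit can be sketched in a few lines. The example assumes the shapes $X^{(v)}$: (m_v, n_v), $W^{(v)}$: (n_v, n_v), $G^{(v)}$: (n, n_v) and P: (d, n) with m_v ≥ d. These shapes are inferred from the matrix products and are not stated explicitly in the text, and the function name `update_basis` is hypothetical.

```python
import numpy as np


def update_basis(X, W, G, P):
    """Sketch of U = M N^T, where M Sigma N^T is the SVD of X S^T G^T P^T and S = W + I."""
    S = W + np.eye(W.shape[0])
    A = X @ S.T @ G.T @ P.T                  # shape (m_v, d)
    # NumPy returns N^T directly as the third factor of the thin SVD.
    M, _, Nt = np.linalg.svd(A, full_matrices=False)
    return M @ Nt                            # columns are orthonormal when m_v >= d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m_v, n, n_v, d = 6, 5, 4, 3
    X = rng.standard_normal((m_v, n_v))
    W = (rng.random((n_v, n_v)) > 0.5).astype(float)
    G = np.zeros((n, n_v))
    G[[0, 1, 3, 4], [0, 1, 2, 3]] = 1.0      # sample 2 is missing in this toy view
    P = rng.random((d, n))
    P /= P.sum(axis=0, keepdims=True)        # columns of P sum to 1
    U = update_basis(X, W, G, P)
    print(np.allclose(U.T @ U, np.eye(d)))   # True
```

This has the form of the standard orthogonal Procrustes solution: among all matrices with orthonormal columns, $M N^{T}$ maximizes $\mathrm{Tr}(U^{T} A)$ for $A = X^{(v)} S^{(v)T} G^{(v)T} P^{T}$, which is consistent with the constraint $U^{(v)T} U^{(v)} = I$.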

Further, in the above-mentioned case,

Further, the data preprocessing unit 320 specifically performs:

missing-view deletion: deleting the missing samples of each view according to the view-missing prior position index matrix Z to obtain the non-missing data matrices $X^{(v)}$;

data normalization: normalizing the non-missing data, computed as

where $x_i^{(v)}$ denotes the i-th column vector of the matrix $X^{(v)}$;

local neighbor graph construction: constructing the similarity graph matrix $W^{(v)}$ as

where $x_j^{(v)}$ is one of the k nearest neighbors of the sample $x_i^{(v)}$, and the elements of $W^{(v)}$ corresponding to non-neighbors are set to 0;
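A compact sketch of the three preprocessing steps for a single view is given below. Because the normalization and graph formulas are not reproduced in this text, the sketch makes explicit assumptions: columns are scaled to unit L2 norm, neighbors are found by Euclidean distance, and neighbor entries of $W^{(v)}$ are set to 1. The function name `preprocess_view`, its arguments and the boolean mask `observed` (standing in for one view's entries of Z) are hypothetical.

```python
import numpy as np


def preprocess_view(X_full, observed, k=3):
    """Delete missing samples, normalize columns, build a k-nearest-neighbor graph.

    X_full:   (m_v, n) feature matrix; columns of missing samples may hold anything.
    observed: length-n boolean mask, True where the sample exists in this view.
    """
    observed = np.asarray(observed, dtype=bool)
    # Step 1: keep only the samples (columns) observed in this view.
    X = np.asarray(X_full, dtype=float)[:, observed]
    # Step 2 (assumed form): scale every column to unit L2 norm.
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    X = X / np.maximum(norms, 1e-12)
    # Step 3 (assumed form): binary k-nearest-neighbor graph on squared distances.
    n_v = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(dist, np.inf)           # a sample is not its own neighbor
    k = max(0, min(k, n_v - 1))
    neighbors = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n_v, n_v))
    # W[i, j] = 1 when sample j is among the k nearest neighbors of sample i.
    W[np.repeat(np.arange(n_v), k), neighbors.ravel()] = 1.0
    return X, W


if __name__ == "__main__":
    X_full = np.array([[1.0, 2.0, 0.0, 0.0, 9.0],
                       [0.0, 0.1, 3.0, 3.1, 9.0]])
    X, W = preprocess_view(X_full, [True, True, True, True, False], k=1)
    print(X.shape)
    print(W)
```

The resulting `W` plays the role of $W^{(v)}$ in $S^{(v)} = W^{(v)} + I$. Whether the original graph is symmetrized or weighted cannot be determined from this text, so the sketch leaves it unsymmetrized and binary.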

The system 300 may include other components in addition to the four units described above; since these components are not related to the content of the embodiments of the present disclosure, their illustration and description are omitted here.

For the specific working process of the incomplete multi-view clustering system 300 based on local structure and balance perception, refer to the description of the incomplete multi-view clustering method in Embodiment 1 above; it is not repeated here.

Embodiment 3 of the invention

A system according to an embodiment of the present invention may also be implemented with the computing device architecture shown in FIG. 4. As shown in FIG. 4, the architecture includes a computer system 410, a system bus 430, one or more CPUs 440, input/output components 420, a memory 450, and the like. The memory 450 may store various data or files used in computer processing and/or communication, as well as the program instructions executed by the CPU, including those of the method of Embodiment 1. The architecture shown in FIG. 4 is merely exemplary, and one or more of its components may be adjusted as needed to implement different devices.

Embodiment 4 of the invention

Embodiments of the present invention may also be implemented as a computer-readable storage medium. The computer-readable storage medium according to embodiment 4 has computer-readable instructions stored thereon. The computer readable instructions, when executed by a processor, may perform the incomplete multi-view clustering method based on local structure and balance perception according to embodiment 1 of the present invention described with reference to the above drawings.

For the incomplete multi-view clustering method, system and storage medium based on local structure and balance perception, Embodiments 1 to 4 were used to carry out training and testing. Table 2 reports the average clustering accuracy (%) obtained on the BBCSport, Caltech101 and 3Sources data sets with a view missing rate of 30%, where MIC and DAIMC are current mainstream incomplete multi-view clustering methods:

TABLE 2

| Data set | MIC | DAIMC | The invention |
|---|---|---|---|
| BBCSport | 46.21±4.71 | 63.45±1.97 | 78.79±3.02 |
| Caltech101 | 20.12±0.75 | 25.15±0.31 | 27.63±0.90 |
| 3Sources | 47.69±7.61 | 52.43±6.63 | 71.83±7.37 |

Table 3 reports the execution time (in seconds) on the BBCSport, Caltech101 and 3Sources data sets with a view missing rate of 30%, where MIC and DAIMC are current mainstream incomplete multi-view clustering methods:

TABLE 3

| Data set | MIC (s) | DAIMC (s) | The invention (s) |
|---|---|---|---|
| BBCSport | 3.843 | 148.501 | 2.183 |
| Caltech101 | 1.407×10⁴ | 1.861×10³ | 129.541 |
| 3Sources | 5.912 | 563.780 | 4.967 |

Based on Embodiments 1 to 4 and the performance analysis above, the invention provides LUBA_EIMVC, an incomplete multi-view clustering method and system based on local structure and balance perception. Aimed at efficient learning from incomplete multi-view data, the invention designs an incomplete multi-view consistent clustering characterization learning model with probability characteristics. The model obtains a unique clustering result by learning a consistent characterization with probability characteristics across views, in which each element of a consistent probability characterization vector directly reflects the probability that the corresponding sample belongs to a given class. In addition, the model integrates geometric structure preservation and consistent characterization learning into one very concise formulation; introducing the geometric structure preservation property requires no additional constraint term or penalty parameter, which keeps the model simple and reduces the parameter tuning burden. Furthermore, to avoid assigning an excessive share of the samples to a few classes, a balance perception learning technique is introduced. On the three data sets tested, the method achieves the highest clustering accuracy among the compared methods (Table 2) and the shortest execution time (Table 3), and it yields a unique, stable clustering result.
Specifically, the beneficial effects of the invention include the following. LUBA_EIMVC designs a novel balance-aware, graph-regularized incomplete multi-view orthogonal matrix factorization model, which can not only mine and exploit the local structure information of the views to guide the optimization of the model, but also make full use of the non-missing view information to learn a consistent clustering characterization with probability characteristics. Unlike existing methods that obtain the clustering result with k-means, LUBA_EIMVC directly learns a unique positive probability matrix shared by all views, each element of which can be regarded as the probability that a sample belongs to a given class, thereby avoiding the inaccuracy that k-means introduces into the clustering result. To prevent samples from becoming excessively concentrated in a few classes, or even a single class, during optimization, a balance perception constraint on the probability matrix is introduced, so that a clustering-friendly consistent characterization matrix with probability characteristics is jointly learned, from which the clustering result of the incomplete multi-view data can be obtained directly. Owing to the learned consistent characterization with probability characteristics, the model designed in LUBA_EIMVC is an interpretable and efficient incomplete multi-view clustering model with stable clustering results.

It should be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the spirit of the invention; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. An incomplete multi-view clustering method based on local structure and balance perception, characterized by comprising the following steps:

establishing a model: for the clustering task of incomplete multi-view data, designing an incomplete multi-view consistent clustering characterization learning model with probability characteristics based on local structure and balance perception, the model being specifically:

s.t. $U^{(v)T} U^{(v)} = I$, $1 \ge \alpha_v \ge 0$,

$1 \ge P \ge 0$, $P^{T} \mathbf{1} = \mathbf{1}$

where $U^{(v)} \in R^{m_v \times d}$ denotes the base matrix of the v-th view, $m_v$ denotes the feature dimension of the v-th view, d denotes the dimension of the consistent characterization space, $P \in R^{d \times n}$ denotes the shared consistent characterization matrix of the incomplete multi-view data, n denotes the total number of samples of the incomplete multi-view data, $\alpha = [\alpha_1, \dots, \alpha_l]$ is a learnable weight vector, $\mathbf{1} \in R^{d}$ denotes the d-dimensional column vector whose elements are all 1, $\alpha_v$ denotes the v-th element of the vector α, r is a positive integer not less than 2, $\alpha_v^{r}$ denotes the r-th power of the element $\alpha_v$ of the vector α, λ is a penalty parameter, l denotes the number of views, $n_v$ denotes the number of samples that are not missing in the v-th view, I is an identity matrix, $I_{i,j}$ denotes the element in row i and column j of the identity matrix, $x_i^{(v)}$ denotes the i-th column vector of the matrix $X^{(v)}$, $g_j^{(v)}$ denotes the j-th column vector of the matrix $G^{(v)}$, and $G^{(v)}$ is a binary matrix of 0s and 1s;

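data preprocessing: preprocessing the incomplete multi-view data for which the view-missing prior position index matrix Z is given;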

solving the optimization problem of $U^{(v)}$:

the optimal solution of the variable $U^{(v)}$ is obtained as $U^{(v)} = M^{(v)} N^{(v)T}$, where $M^{(v)} \Sigma^{(v)} N^{(v)T}$ is the singular value decomposition of $X^{(v)} S^{(v)T} G^{(v)T} P^{T}$, $S^{(v)} = W^{(v)} + I$, and $W^{(v)}$ is the pre-constructed similarity graph matrix;

solving the optimization problem of P:

the optimal solution for the variable P is obtained as follows:

wherein

solving the optimization problem of Q:

solving the optimization problem of α:

the optimal solution of the variable α is obtained as:

where

the update equations for C and μ are:

where ρ and $\mu_0$ are constants;

clustering process: obtaining the clustering result of the data from the optimized shared consistent characterization matrix P, specifically according to $\arg\max_{j \in \{1,\dots,d\}} P_{j,i}$.

2. The incomplete multi-view clustering method based on local structure and balance perception according to claim 1,

3. The incomplete multi-view clustering method based on local structure and balance perception according to claim 2, characterized in that preprocessing the incomplete multi-view data for which the view-missing prior position index matrix Z is given specifically comprises:

data normalization: normalizing the non-missing data, computed as

where $x_i^{(v)}$ denotes the i-th column vector of the matrix $X^{(v)}$;

local neighbor graph construction: constructing the similarity graph matrix $W^{(v)}$ as

where $x_j^{(v)}$ is one of the k nearest neighbors of the sample $x_i^{(v)}$, and the elements of $W^{(v)}$ corresponding to non-neighbors are set to 0;

4. An incomplete multi-view clustering system based on local structure and balance perception, the system comprising:

a model establishing unit, configured to design, for the clustering task of incomplete multi-view data, an incomplete multi-view consistent clustering characterization learning model with probability characteristics based on local structure and balance perception, the model being specifically:

s.t. $U^{(v)T} U^{(v)} = I$, $1 \ge \alpha_v \ge 0$,

$1 \ge P \ge 0$, $P^{T} \mathbf{1} = \mathbf{1}$

where $x_i^{(v)}$ denotes the i-th column vector of the matrix $X^{(v)}$, $X^{(v)}$ denotes the matrix formed by the samples that are not missing in the v-th view, $g_j^{(v)}$ denotes the j-th column vector of the matrix $G^{(v)}$, and $G^{(v)}$ is a binary matrix of 0s and 1s;

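a data preprocessing unit, configured to preprocess the incomplete multi-view data for which the view-missing prior position index matrix Z is given;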

solving the optimization problem of $U^{(v)}$:

where $W^{(v)}$ is the pre-constructed similarity graph matrix;

solving the optimization problem of P:

the optimal solution for the variable P is obtained as follows:

wherein

solving the optimization problem of Q:

the optimal solution of the variable Q is obtained as: $Q = (\mu P + C)(\mathbf{1}\mathbf{1}^{T} + \mu I)^{-1}$;

solving the optimization problem of α:

the optimal solution of the variable α is obtained as:

where

the update equations for C and μ are:

where ρ and $\mu_0$ are constants;

a clustering unit, configured to obtain the clustering result of the data from the optimized shared consistent characterization matrix P, specifically according to $\arg\max_{j \in \{1,\dots,d\}} P_{j,i}$: if the j-th element of the i-th column $P_{:,i}$ has the largest value, the i-th sample is assigned to the j-th class, and the clustering result of all samples is obtained by finding the position of the maximum element in each column of the characterization matrix P.

5. The incomplete multi-view clustering system based on local structure and balance perception according to claim 4,

6. The incomplete multi-view clustering system based on local structure and balance perception according to claim 5, wherein the data preprocessing unit specifically performs:

data normalization: normalizing the non-missing data, computed as

where $x_i^{(v)}$ denotes the i-th column vector of the matrix $X^{(v)}$;

local neighbor graph construction: constructing the similarity graph matrix $W^{(v)}$ as

where $x_j^{(v)}$ is one of the k nearest neighbors of the sample $x_i^{(v)}$, and the elements of $W^{(v)}$ corresponding to non-neighbors are set to 0;

7. An incomplete multi-view clustering system based on local structure and balance perception, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the incomplete multi-view clustering method based on local structure and balance perception according to any one of claims 1 to 3.

8. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the incomplete multi-view clustering method based on local structure and balance perception according to any one of claims 1 to 3.