CN111091137A - Sparse subset selection method based on dissimilarity and Laplace regularization - Google Patents


Info

Publication number
CN111091137A
Authority
CN
China
Prior art keywords
dissimilarity
representative
matrix
represent
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811243586.9A
Other languages
Chinese (zh)
Inventor
Yang Chenxi (杨晨曦)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201811243586.9A
Publication of CN111091137A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sparse subset selection method based on dissimilarity and Laplace regularization. Given the pairwise dissimilarities between a source set and a target set, the method considers the problem of finding, in the source set, representatives that effectively represent the target set. It formulates a dissimilarity-based low-rank sparse subset selection model that can be solved efficiently by convex programming. Building on previous work, the structure among the representatives is taken into account, so that fewer representatives are needed and the representation quality is higher. An efficient implementation (Algorithm 1) is provided, and the algorithm can be parallelized to further reduce computation time.

Description

Sparse subset selection method based on dissimilarity and Laplace regularization
Technical Field
The application relates to the field of machine learning and data analysis, in particular to a sparse subset selection method based on dissimilarity and Laplace regularization.
Background
Sparse subset selection: finding a subset of a large number of models or data points that preserves the characteristics of the entire set is an important problem in machine learning and data analysis, with many applications in computer vision, image and natural-language processing, bio/health informatics, recommendation systems, and more. These informative elements are called representatives or exemplars. Data representatives facilitate the summarization and visualization of datasets of text/web documents, images, and videos, and thus improve the interpretability of large-scale datasets for data analysts and domain experts. Model representatives help describe complex phenomena or events efficiently with a small number of models, and can be used for model compression in ensemble models. More importantly, the computation time and memory requirements of learning and inference algorithms, such as the Nearest Neighbor (NN) classifier, improve when they operate on a representative set that contains most of the information of the original set. Selecting a small subset of products to recommend to a customer not only increases retailer revenue but also saves the customer's time. Furthermore, representatives facilitate clustering of datasets and, as prototypical elements, can be used to efficiently synthesize or generate new data points. Finally, and equally important, a high-performance classifier can be obtained by selecting and annotating only a few representative samples from a large pool of unlabeled samples.
Dissimilarity: dissimilarity is a pairwise relation between data points, and working with it has several advantages. First, for high-dimensional datasets, where the dimension of the ambient space is much higher than the cardinality of the dataset, processing pairwise relations is more efficient than working with high-dimensional feature vectors. Second, although some real datasets, such as social-network data or proteomics data, do not live in a vector space, pairwise relations can still be computed efficiently for them.
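For vector data, for instance, such a pairwise dissimilarity matrix can be computed directly. The sketch below is illustrative only; the data, the sizes, and the Euclidean metric are assumptions rather than part of the invention, and any domain-specific pairwise score could be used instead.

import numpy as np
from scipy.spatial.distance import cdist

# A minimal sketch, assuming vector data and a Euclidean metric; any
# domain-specific pairwise dissimilarity (edit distance, histogram
# distance, ...) could fill D instead, which is what makes working
# with pairwise relations attractive for non-vector data.
rng = np.random.default_rng(0)
X = rng.random((50, 10))             # source set: M = 50 points in R^10
Y = rng.random((80, 10))             # target set: N = 80 points in R^10
D = cdist(X, Y, metric="euclidean")  # D[i, j] = d_ij; smaller = better fit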
Laplace regularization: low-rank methods capture a potentially low-dimensional structure in data, and the low-rank representation (LRR) has attracted great interest in the pattern analysis and signal processing communities as a promising modeling tool. In particular, problems related to low-rank matrix estimation have received considerable attention in recent years. LRR has been widely used for subspace segmentation, image denoising, image clustering, and video background/foreground separation. The low-rank regularizers in LRR are deeply connected to recent theoretical advances in Robust Principal Component Analysis (RPCA), which has brought powerful new modeling options to many applications.
Disclosure of Invention
The object of the invention is achieved by the following technical scheme:
let us assume that we have one source set X ═ X1,...,xMY and a target set Y ═ Y1,...,yNThey contain M and N elements, respectively, assuming we have obtained a dissimilarity relationship between X and Y
Figure BDA0001839980660000021
dijDenotes xiRepresents yjThe smaller the value of (A) represents xiThe better the representation of yj. This binary relationship is written in the form of a matrix as follows
Figure BDA0001839980660000022
Our goal is to find a small subset of X that represents the target set Y well, as shown in Fig. 1 (left: the dissimilarity relations between the source set X and the target set Y; right: a subset of the source set X that represents the features of the target set Y well).
Given the dissimilarity matrix D, we need to find a representative subset of the source set X, i.e., the representatives, that effectively represents the target set Y. To this end, we associate with each dissimilarity d_ij an unknown variable z_ij and optimize over these variables. We collect them in the matrix

Z = [z_ij] ∈ R^{M×N}.

The variable z_ij indicates whether x_i represents y_j: when z_ij = 1, x_i represents y_j; otherwise it does not. To ensure that every y_j has a representative, we require

Σ_{i=1}^{M} z_ij = 1  for all j = 1, ..., N.
Selecting elements of X that encode Y well on the basis of dissimilarities involves three goals. First, the representatives should represent Y well: if x_i is selected as a representative, the cost of encoding y_j by x_i is d_ij z_ij ∈ {0, d_ij}, so the total cost of representing Y by a subset of X is

Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij.
Second, we want to select as few representatives as possible to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows. Third, we want the selected representatives to be well structured, i.e., the "distances" between representatives should be as large as possible (a toy numerical check of the feasibility constraint and the encoding cost follows below).
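As that toy check, the following snippet verifies the constraint Σ_i z_ij = 1 and computes the encoding cost for a feasible binary Z; the sizes and random values are illustrative assumptions, not taken from the filing.

import numpy as np

# A toy check with assumed sizes and random values: a feasible binary Z
# satisfying sum_i z_ij = 1 for every j, and its encoding cost.
rng = np.random.default_rng(1)
D_toy = rng.random((3, 4))              # M = 3 source, N = 4 target elements
Z_toy = np.zeros((3, 4), dtype=int)
Z_toy[0, :2] = 1                        # x_1 represents y_1 and y_2
Z_toy[2, 2:] = 1                        # x_3 represents y_3 and y_4
assert (Z_toy.sum(axis=0) == 1).all()   # every y_j has exactly one representative
cost = (D_toy * Z_toy).sum()            # sum_{i,j} d_ij z_ij for the subset {x_1, x_3}
print("encoding cost:", cost)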
Combining these three goals, we obtain the following optimization problem:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} I(‖z^i‖_p) + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ∈ {0, 1},

where ‖·‖_p denotes the l_p norm, z^i denotes the i-th row of Z, and I(·) denotes the indicator function, equal to 1 when its argument is nonzero and 0 otherwise. The first term of the objective measures the encoding quality, the second term the number of representatives, and the third term the structure of the representatives.
Since the problem contains the binary constraint z_ij ∈ {0, 1}, it is non-convex and NP-hard, so we consider its convex relaxation:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} ‖z^i‖_p + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ≥ 0.
In this relaxation we have removed the non-convex part, namely the indicator function I(·). The problem can be written in matrix form as

min_Z  tr(Dᵀ Z) + λ ‖Z‖_{1,p} + γ tr(Zᵀ L Z)
s.t.   1ᵀ Z = 1ᵀ,  Z ≥ 0,

where ‖Z‖_{1,p} = Σ_{i=1}^{M} ‖z^i‖_p, tr denotes the trace of a matrix, 1 denotes the all-ones vector, E is a given matrix built from the pairwise relations among the elements of the source set X, and L = 1·1ᵀ − E.
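As a rough illustration of how this relaxed program can be handed to an off-the-shelf convex solver (this sketch is not the patent's Algorithm 1), the following code uses CVXPY with p = 2. The function name, the parameter values, and the projection of L onto the positive semidefinite cone (needed so the quadratic structure term is convex, since L = 1·1ᵀ − E need not be PSD as given) are all assumptions made for illustration.

import cvxpy as cp
import numpy as np

# A minimal sketch of the relaxed program, assuming CVXPY, p = 2, and a PSD
# projection of L; an illustration of the convex relaxation above, not the
# patent's Algorithm 1.
def select_representatives(D, L, lam=1.0, gamma=0.1):
    """min tr(D^T Z) + lam * sum_i ||z^i||_2 + gamma * tr(Z^T L Z)
       s.t. 1^T Z = 1^T, Z >= 0, for an M x N dissimilarity matrix D."""
    M, N = D.shape
    Z = cp.Variable((M, N), nonneg=True)          # relaxed z_ij >= 0

    encoding = cp.sum(cp.multiply(D, Z))          # tr(D^T Z): encoding quality
    row_sparsity = cp.sum(cp.norm(Z, 2, axis=1))  # sum of row l2 norms

    # tr(Z^T L Z) is convex only for PSD L, so project L onto the PSD cone
    # (an assumption of this sketch) and use ||S Z||_F^2 = tr(Z^T L_+ Z).
    w, V = np.linalg.eigh((L + L.T) / 2.0)
    S = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    structure = cp.sum_squares(S @ Z)

    objective = cp.Minimize(encoding + lam * row_sparsity + gamma * structure)
    constraints = [cp.sum(Z, axis=0) == 1]        # 1^T Z = 1^T (Z >= 0 via nonneg)
    cp.Problem(objective, constraints).solve()
    return Z.value

The sum of row norms is the standard convex surrogate for counting nonzero rows, which is why the relaxation still drives whole rows of Z to zero and thereby selects few representatives.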
The method overcomes the following defects and shortcomings of existing methods:
Finding representatives (i.e., a representative subset) of large data, so that they capture most characteristics of the source set, has important research and application value in problems related to machine learning. Work on finding representatives goes back some time, and existing algorithms can be divided into two categories according to the type of information the representatives should retain.
The first class of algorithms finds representatives of data lying in one or more low-dimensional subspaces. In this case the data are typically embedded in a vector space, and such methods cannot be applied in the general case where the data do not lie in subspaces.
The second class of algorithms uses pairwise similarities/dissimilarities between data points instead of feature vectors. With pairwise relations, models beyond linear subspaces can be considered; however, existing algorithms of this kind suffer from dependence on initialization.
Compared with the prior art, the invention has the advantages and effects that:
The method provided by the invention finds a representative subset of a source set based on the dissimilarity relations between data points and reduces the original problem to a dissimilarity-based low-rank sparse subset selection problem, achieving a better balance between the number of representatives and the representation quality. On this basis, a low-rank condition is introduced so that the selected subset preserves the structure of the data.
Drawings
FIG. 1 is a schematic diagram of finding representatives for the target set according to the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Let us assume that we have a source set X = {x_1, ..., x_M} and a target set Y = {y_1, ..., y_N}, containing M and N elements respectively, and that we have obtained the pairwise dissimilarities between X and Y,

{d_ij : i = 1, ..., M, j = 1, ..., N},

where d_ij indicates how well x_i represents y_j: the smaller d_ij, the better x_i represents y_j. This pairwise relation is written in matrix form as

D = [d_ij] ∈ R^{M×N}.

Our goal is to find a small subset of X that represents the target set Y well, as shown in Fig. 1 (left: the dissimilarity relations between the source set X and the target set Y; right: a subset of the source set X that represents the features of the target set Y well).

Given the dissimilarity matrix D, we need to find a representative subset of the source set X, i.e., the representatives, that effectively represents the target set Y. To this end, we associate with each dissimilarity d_ij an unknown variable z_ij and optimize over these variables. We collect them in the matrix

Z = [z_ij] ∈ R^{M×N}.

The variable z_ij indicates whether x_i represents y_j: when z_ij = 1, x_i represents y_j; otherwise it does not. To ensure that every y_j has a representative, we require

Σ_{i=1}^{M} z_ij = 1  for all j = 1, ..., N.

Selecting elements of X that encode Y well on the basis of dissimilarities involves three goals. First, the representatives should represent Y well: if x_i is selected as a representative, the cost of encoding y_j by x_i is d_ij z_ij ∈ {0, d_ij}, so the total cost of representing Y by a subset of X is

Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij.

Second, we want to select as few representatives as possible to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows. Third, we want the selected representatives to be well structured, i.e., the "distances" between representatives should be as large as possible.

Combining these three goals, we obtain the following optimization problem:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} I(‖z^i‖_p) + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ∈ {0, 1},

where ‖·‖_p denotes the l_p norm, z^i denotes the i-th row of Z, and I(·) denotes the indicator function, equal to 1 when its argument is nonzero and 0 otherwise. The first term of the objective measures the encoding quality, the second term the number of representatives, and the third term the structure of the representatives.

Since the problem contains the binary constraint z_ij ∈ {0, 1}, it is non-convex and NP-hard, so we consider its convex relaxation:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} ‖z^i‖_p + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ≥ 0.

In this relaxation we have removed the non-convex part, namely the indicator function I(·). The problem can be written in matrix form as

min_Z  tr(Dᵀ Z) + λ ‖Z‖_{1,p} + γ tr(Zᵀ L Z)
s.t.   1ᵀ Z = 1ᵀ,  Z ≥ 0,

where ‖Z‖_{1,p} = Σ_{i=1}^{M} ‖z^i‖_p, tr denotes the trace of a matrix, 1 denotes the all-ones vector, E is a given matrix built from the pairwise relations among the elements of the source set X, and L = 1·1ᵀ − E.
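Continuing the illustrative sketches above (X, D, and select_representatives as defined there), one plausible way to form E and L and to read off the selected representatives from the solved Z is shown below; the normalization of E and the row-norm threshold are assumed choices, not specified by the filing.

import numpy as np
from scipy.spatial.distance import cdist

# A hedged usage sketch: E is filled with normalized pairwise dissimilarities
# within the source set X (an assumed construction), L = 1 1^T - E as in the
# matrix form above, and representatives are the rows of Z whose norm is
# non-negligible (the threshold is an illustrative choice).
E = cdist(X, X, metric="euclidean")
E /= E.max()                             # scale pairwise dissimilarities to [0, 1]
L = np.ones_like(E) - E                  # L = 1 1^T - E

Z = select_representatives(D, L, lam=1.0, gamma=0.1)
row_norms = np.linalg.norm(Z, axis=1)
reps = np.where(row_norms > 1e-3 * row_norms.max())[0]
print("indices of selected representatives in X:", reps)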

Claims (1)

1. A sparse subset selection method based on dissimilarity and Laplace regularization, the method comprising:
suppose there is a source set X = {x_1, ..., x_M} and a target set Y = {y_1, ..., y_N}, containing M and N elements respectively, and that the pairwise dissimilarities between X and Y have been obtained,

{d_ij : i = 1, ..., M, j = 1, ..., N},

where d_ij indicates how well x_i represents y_j, the smaller d_ij the better x_i represents y_j; this pairwise relation is written in matrix form as

D = [d_ij] ∈ R^{M×N};

the unknown variables are represented by the matrix

Z = [z_ij] ∈ R^{M×N},

where the variable z_ij indicates whether x_i represents y_j: when z_ij = 1, x_i represents y_j, and otherwise it does not; to ensure that every y_j has a representative, it is required that

Σ_{i=1}^{M} z_ij = 1  for all j = 1, ..., N;

selecting elements of X that encode Y well on the basis of dissimilarities involves three goals: first, the representatives should represent Y well; if x_i is selected as a representative, the cost of encoding y_j by x_i is d_ij z_ij ∈ {0, d_ij}, so the total cost of representing Y by a subset of X is

Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij;

second, as few representatives as possible should be selected to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows; third, the selected representatives should be well structured, i.e., the "distances" between representatives should be as large as possible;

combining these three goals gives the optimization problem

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} I(‖z^i‖_p) + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ∈ {0, 1},

where ‖·‖_p denotes the l_p norm, z^i denotes the i-th row of Z, and I(·) denotes the indicator function; the first term of the objective measures the encoding quality, the second term the number of representatives, and the third term the structure of the representatives;

since the problem contains the binary constraint z_ij ∈ {0, 1}, it is non-convex and NP-hard, and its convex relaxation is therefore considered:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} ‖z^i‖_p + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ≥ 0,

in which the non-convex part, the indicator function I(·), has been removed; the problem is then written in matrix form as

min_Z  tr(Dᵀ Z) + λ ‖Z‖_{1,p} + γ tr(Zᵀ L Z)
s.t.   1ᵀ Z = 1ᵀ,  Z ≥ 0,

where ‖Z‖_{1,p} = Σ_{i=1}^{M} ‖z^i‖_p, tr denotes the trace of a matrix, 1 denotes the all-ones vector, E is a given matrix built from the pairwise relations among the elements of the source set X, and L = 1·1ᵀ − E.
CN201811243586.9A · Priority/Filing Date: 2018-10-24 · Sparse subset selection method based on dissimilarity and Laplace regularization · Status: Withdrawn · Publication: CN111091137A

Priority Applications (1)

Application Number: CN201811243586.9A · Priority Date: 2018-10-24 · Filing Date: 2018-10-24 · Publication: CN111091137A · Title: Sparse subset selection method based on dissimilarity and Laplace regularization

Applications Claiming Priority (1)

Application Number: CN201811243586.9A · Priority Date: 2018-10-24 · Filing Date: 2018-10-24 · Publication: CN111091137A · Title: Sparse subset selection method based on dissimilarity and Laplace regularization

Publications (1)

Publication Number: CN111091137A · Publication Date: 2020-05-01

Family

ID=70391572

Family Applications (1)

Application Number: CN201811243586.9A · Status: Withdrawn · Publication: CN111091137A

Country Status (1)

Country: CN · Document: CN111091137A

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20200302228A1 * | 2019-03-20 | 2020-09-24 | Tata Consultancy Services Limited | System and method for signal pre-processing based on data driven models and data dependent model transformation
US11443136B2 * | 2019-03-20 | 2022-09-13 | Tata Consultancy Services Limited | System and method for signal pre-processing based on data driven models and data dependent model transformation

Legal Events

Code | Title | Description
DD01 | Delivery of document by public notice | Addressee: Yang Chenxi; Document name: Notice of Preliminary Confirmation Opinion on Abnormal Patent Application
PB01 | Publication
WW01 | Invention patent application withdrawn after publication | Application publication date: 2020-05-01