CN111091137A - Sparse subset selection method based on dissimilarity and Laplace regularization - Google Patents


Info

Publication number
CN111091137A
Authority
CN
China
Prior art keywords
dissimilarity
representative
matrix
represent
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811243586.9A
Other languages
Chinese (zh)
Inventor
Yang Chenxi (杨晨曦)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201811243586.9A
Publication of CN111091137A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sparse subset selection method based on dissimilarity and Laplace regularization. Given the pairwise dissimilarities between a source set and a target set, the method considers the problem of finding, in the source set, representatives that effectively represent the target set. It formulates a dissimilarity-based low-rank sparse subset selection model that can be solved efficiently by convex programming. Building on previous work, the structure among the representatives is taken into account, so that fewer representatives are needed and the representation quality is higher. An efficient implementation (Algorithm 1) is provided, and the algorithm can be parallelized to further reduce computation time.

Description

Sparse subset selection method based on dissimilarity and Laplace regularization
Technical Field
The application relates to the field of machine learning and data analysis, in particular to a sparse subset selection method based on dissimilarity and Laplace regularization.
Background
Sparse subset selection: finding a subset of a large number of models or data points that preserves the characteristics of the entire set is an important problem in machine learning and data analysis, with many applications in computer vision, image and natural-language processing, bio/health informatics, recommendation systems, and more. These informative elements are called representatives or exemplars. Data representatives facilitate the summarization and visualization of datasets of text/web documents, images, and videos, and thus improve the interpretability of large-scale datasets for data analysts and domain experts. Model representatives help describe complex phenomena or events efficiently with a small number of models, and can be used for model compression in ensemble models. More importantly, the computation time and memory requirements of learning and inference algorithms, such as the Nearest Neighbor (NN) classifier, improve when they operate on a representative set that contains most of the information of the original set. Selecting a small subset of products to recommend to a customer not only increases retailer revenue but also saves the customer's time. Furthermore, representatives facilitate clustering of datasets and, as prototypical elements, can be used to efficiently synthesize or generate new data points. Finally, and equally important, a high-performance classifier can be obtained by selecting and annotating only a few representative samples from a large pool of unlabeled samples.
Dissimilarity: dissimilarity is a pairwise relation between data points, and working with it has several advantages. First, for high-dimensional datasets, where the dimension of the ambient space is much higher than the cardinality of the dataset, processing pairwise relations is more efficient than working with high-dimensional feature vectors. Second, although some real datasets, such as social-network data or proteomics data, do not live in a vector space, pairwise relations can still be computed efficiently for them.
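For vector data, for instance, such a pairwise dissimilarity matrix can be computed directly. The sketch below is illustrative only; the data, the sizes, and the Euclidean metric are assumptions rather than part of the invention, and any domain-specific pairwise score could be used instead.

import numpy as np
from scipy.spatial.distance import cdist

# A minimal sketch, assuming vector data and a Euclidean metric; any
# domain-specific pairwise dissimilarity (edit distance, histogram
# distance, ...) could fill D instead, which is what makes working
# with pairwise relations attractive for non-vector data.
rng = np.random.default_rng(0)
X = rng.random((50, 10))             # source set: M = 50 points in R^10
Y = rng.random((80, 10))             # target set: N = 80 points in R^10
D = cdist(X, Y, metric="euclidean")  # D[i, j] = d_ij; smaller = better fit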
Laplace regularization: low-rank methods capture a potentially low-dimensional structure in data, and the low-rank representation (LRR) has attracted great interest in the pattern analysis and signal processing communities as a promising modeling tool. In particular, problems related to low-rank matrix estimation have received considerable attention in recent years. LRR has been widely used for subspace segmentation, image denoising, image clustering, and video background/foreground separation. The low-rank regularizers in LRR are deeply connected to recent theoretical advances in Robust Principal Component Analysis (RPCA), which has brought powerful new modeling options to many applications.
Disclosure of Invention
The object of the invention is achieved by the following technical scheme:
let us assume that we have one source set X ═ X1,...,xMY and a target set Y ═ Y1,...,yNThey contain M and N elements, respectively, assuming we have obtained a dissimilarity relationship between X and Y
Figure BDA0001839980660000021
dijDenotes xiRepresents yjThe smaller the value of (A) represents xiThe better the representation of yj. This binary relationship is written in the form of a matrix as follows
Figure BDA0001839980660000022
Our goal is to find a small subset of X that represents the target set Y well, as shown in Fig. 1 (left: the dissimilarity relations between the source set X and the target set Y; right: a subset of the source set X that represents the features of the target set Y well).
Given the dissimilarity matrix D, we need to find a representative subset of the source set X, i.e., the representatives, that effectively represents the target set Y. To this end, we associate with each dissimilarity d_ij an unknown variable z_ij and optimize over these variables. We collect them in the matrix

Z = [z_ij] ∈ R^{M×N}.

The variable z_ij indicates whether x_i represents y_j: when z_ij = 1, x_i represents y_j; otherwise it does not. To ensure that every y_j has a representative, we require

Σ_{i=1}^{M} z_ij = 1  for all j = 1, ..., N.
Selecting elements of X that encode Y well on the basis of dissimilarities involves three goals. First, the representatives should represent Y well: if x_i is selected as a representative, the cost of encoding y_j by x_i is d_ij z_ij ∈ {0, d_ij}, so the total cost of representing Y by a subset of X is

Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij.
Second, we want to select as few representatives as possible to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows. Third, we want the selected representatives to be well structured, i.e., the "distances" between representatives should be as large as possible (a toy numerical check of the feasibility constraint and the encoding cost follows below).
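As that toy check, the following snippet verifies the constraint Σ_i z_ij = 1 and computes the encoding cost for a feasible binary Z; the sizes and random values are illustrative assumptions, not taken from the filing.

import numpy as np

# A toy check with assumed sizes and random values: a feasible binary Z
# satisfying sum_i z_ij = 1 for every j, and its encoding cost.
rng = np.random.default_rng(1)
D_toy = rng.random((3, 4))              # M = 3 source, N = 4 target elements
Z_toy = np.zeros((3, 4), dtype=int)
Z_toy[0, :2] = 1                        # x_1 represents y_1 and y_2
Z_toy[2, 2:] = 1                        # x_3 represents y_3 and y_4
assert (Z_toy.sum(axis=0) == 1).all()   # every y_j has exactly one representative
cost = (D_toy * Z_toy).sum()            # sum_{i,j} d_ij z_ij for the subset {x_1, x_3}
print("encoding cost:", cost)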
Combining these three goals, we obtain the following optimization problem:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} I(‖z^i‖_p) + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ∈ {0, 1},

where ‖·‖_p denotes the l_p norm, z^i denotes the i-th row of Z, and I(·) denotes the indicator function, equal to 1 when its argument is nonzero and 0 otherwise. The first term of the objective measures the encoding quality, the second term the number of representatives, and the third term the structure of the representatives.
Since the problem contains the binary constraint z_ij ∈ {0, 1}, it is non-convex and NP-hard, so we consider its convex relaxation:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} ‖z^i‖_p + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ≥ 0.
In this relaxation we have removed the non-convex part, namely the indicator function I(·). The problem can be written in matrix form as

min_Z  tr(Dᵀ Z) + λ ‖Z‖_{1,p} + γ tr(Zᵀ L Z)
s.t.   1ᵀ Z = 1ᵀ,  Z ≥ 0,

where ‖Z‖_{1,p} = Σ_{i=1}^{M} ‖z^i‖_p, tr denotes the trace of a matrix, 1 denotes the all-ones vector, E is a given matrix built from the pairwise relations among the elements of the source set X, and L = 1·1ᵀ − E.
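As a rough illustration of how this relaxed program can be handed to an off-the-shelf convex solver (this sketch is not the patent's Algorithm 1), the following code uses CVXPY with p = 2. The function name, the parameter values, and the projection of L onto the positive semidefinite cone (needed so the quadratic structure term is convex, since L = 1·1ᵀ − E need not be PSD as given) are all assumptions made for illustration.

import cvxpy as cp
import numpy as np

# A minimal sketch of the relaxed program, assuming CVXPY, p = 2, and a PSD
# projection of L; an illustration of the convex relaxation above, not the
# patent's Algorithm 1.
def select_representatives(D, L, lam=1.0, gamma=0.1):
    """min tr(D^T Z) + lam * sum_i ||z^i||_2 + gamma * tr(Z^T L Z)
       s.t. 1^T Z = 1^T, Z >= 0, for an M x N dissimilarity matrix D."""
    M, N = D.shape
    Z = cp.Variable((M, N), nonneg=True)          # relaxed z_ij >= 0

    encoding = cp.sum(cp.multiply(D, Z))          # tr(D^T Z): encoding quality
    row_sparsity = cp.sum(cp.norm(Z, 2, axis=1))  # sum of row l2 norms

    # tr(Z^T L Z) is convex only for PSD L, so project L onto the PSD cone
    # (an assumption of this sketch) and use ||S Z||_F^2 = tr(Z^T L_+ Z).
    w, V = np.linalg.eigh((L + L.T) / 2.0)
    S = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    structure = cp.sum_squares(S @ Z)

    objective = cp.Minimize(encoding + lam * row_sparsity + gamma * structure)
    constraints = [cp.sum(Z, axis=0) == 1]        # 1^T Z = 1^T (Z >= 0 via nonneg)
    cp.Problem(objective, constraints).solve()
    return Z.value

The sum of row norms is the standard convex surrogate for counting nonzero rows, which is why the relaxation still drives whole rows of Z to zero and thereby selects few representatives.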
The method overcomes the following defects and shortcomings of existing methods:
Finding representatives (i.e., a representative subset) of large data, so that they capture most characteristics of the source set, has important research and application value in problems related to machine learning. Work on finding representatives goes back some time, and existing algorithms can be divided into two categories according to the type of information the representatives should retain.
The first class of algorithms finds representatives of data lying in one or more low-dimensional subspaces. In this case the data are typically embedded in a vector space, and such methods cannot be applied in the general case where the data do not lie in subspaces.
The second class of algorithms uses pairwise similarities/dissimilarities between data points instead of feature vectors. With pairwise relations, models beyond linear subspaces can be considered; however, existing algorithms of this kind suffer from dependence on initialization.
Compared with the prior art, the invention has the advantages and effects that:
The method provided by the invention finds a representative subset of a source set based on the dissimilarity relations between data points and reduces the original problem to a dissimilarity-based low-rank sparse subset selection problem, achieving a better balance between the number of representatives and the representation quality. On this basis, a low-rank condition is introduced so that the selected subset preserves the structure of the data.
Drawings
FIG. 1 is a schematic diagram of finding representatives for the target set according to the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Let us assume that we have a source set X = {x_1, ..., x_M} and a target set Y = {y_1, ..., y_N}, containing M and N elements respectively, and that we have obtained the pairwise dissimilarities between X and Y,

{d_ij : i = 1, ..., M, j = 1, ..., N},

where d_ij indicates how well x_i represents y_j: the smaller d_ij, the better x_i represents y_j. This pairwise relation is written in matrix form as

D = [d_ij] ∈ R^{M×N}.

Our goal is to find a small subset of X that represents the target set Y well, as shown in Fig. 1 (left: the dissimilarity relations between the source set X and the target set Y; right: a subset of the source set X that represents the features of the target set Y well).

Given the dissimilarity matrix D, we need to find a representative subset of the source set X, i.e., the representatives, that effectively represents the target set Y. To this end, we associate with each dissimilarity d_ij an unknown variable z_ij and optimize over these variables. We collect them in the matrix

Z = [z_ij] ∈ R^{M×N}.

The variable z_ij indicates whether x_i represents y_j: when z_ij = 1, x_i represents y_j; otherwise it does not. To ensure that every y_j has a representative, we require

Σ_{i=1}^{M} z_ij = 1  for all j = 1, ..., N.

Selecting elements of X that encode Y well on the basis of dissimilarities involves three goals. First, the representatives should represent Y well: if x_i is selected as a representative, the cost of encoding y_j by x_i is d_ij z_ij ∈ {0, d_ij}, so the total cost of representing Y by a subset of X is

Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij.

Second, we want to select as few representatives as possible to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows. Third, we want the selected representatives to be well structured, i.e., the "distances" between representatives should be as large as possible.

Combining these three goals, we obtain the following optimization problem:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} I(‖z^i‖_p) + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ∈ {0, 1},

where ‖·‖_p denotes the l_p norm, z^i denotes the i-th row of Z, and I(·) denotes the indicator function, equal to 1 when its argument is nonzero and 0 otherwise. The first term of the objective measures the encoding quality, the second term the number of representatives, and the third term the structure of the representatives.

Since the problem contains the binary constraint z_ij ∈ {0, 1}, it is non-convex and NP-hard, so we consider its convex relaxation:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} ‖z^i‖_p + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ≥ 0.

In this relaxation we have removed the non-convex part, namely the indicator function I(·). The problem can be written in matrix form as

min_Z  tr(Dᵀ Z) + λ ‖Z‖_{1,p} + γ tr(Zᵀ L Z)
s.t.   1ᵀ Z = 1ᵀ,  Z ≥ 0,

where ‖Z‖_{1,p} = Σ_{i=1}^{M} ‖z^i‖_p, tr denotes the trace of a matrix, 1 denotes the all-ones vector, E is a given matrix built from the pairwise relations among the elements of the source set X, and L = 1·1ᵀ − E.
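Continuing the illustrative sketches above (X, D, and select_representatives as defined there), one plausible way to form E and L and to read off the selected representatives from the solved Z is shown below; the normalization of E and the row-norm threshold are assumed choices, not specified by the filing.

import numpy as np
from scipy.spatial.distance import cdist

# A hedged usage sketch: E is filled with normalized pairwise dissimilarities
# within the source set X (an assumed construction), L = 1 1^T - E as in the
# matrix form above, and representatives are the rows of Z whose norm is
# non-negligible (the threshold is an illustrative choice).
E = cdist(X, X, metric="euclidean")
E /= E.max()                             # scale pairwise dissimilarities to [0, 1]
L = np.ones_like(E) - E                  # L = 1 1^T - E

Z = select_representatives(D, L, lam=1.0, gamma=0.1)
row_norms = np.linalg.norm(Z, axis=1)
reps = np.where(row_norms > 1e-3 * row_norms.max())[0]
print("indices of selected representatives in X:", reps)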

Claims (1)

1. A sparse subset selection method based on dissimilarity and Laplace regularization, the method comprising:
suppose there is a source set X = {x_1, ..., x_M} and a target set Y = {y_1, ..., y_N}, containing M and N elements respectively, and that the pairwise dissimilarities between X and Y have been obtained,

{d_ij : i = 1, ..., M, j = 1, ..., N},

where d_ij indicates how well x_i represents y_j, the smaller d_ij the better x_i represents y_j; this pairwise relation is written in matrix form as

D = [d_ij] ∈ R^{M×N};

the unknown variables are represented by the matrix

Z = [z_ij] ∈ R^{M×N},

where the variable z_ij indicates whether x_i represents y_j: when z_ij = 1, x_i represents y_j, and otherwise it does not; to ensure that every y_j has a representative, it is required that

Σ_{i=1}^{M} z_ij = 1  for all j = 1, ..., N;

selecting elements of X that encode Y well on the basis of dissimilarities involves three goals: first, the representatives should represent Y well; if x_i is selected as a representative, the cost of encoding y_j by x_i is d_ij z_ij ∈ {0, d_ij}, so the total cost of representing Y by a subset of X is

Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij;

second, as few representatives as possible should be selected to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows; third, the selected representatives should be well structured, i.e., the "distances" between representatives should be as large as possible;

combining these three goals gives the optimization problem

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} I(‖z^i‖_p) + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ∈ {0, 1},

where ‖·‖_p denotes the l_p norm, z^i denotes the i-th row of Z, and I(·) denotes the indicator function; the first term of the objective measures the encoding quality, the second term the number of representatives, and the third term the structure of the representatives;

since the problem contains the binary constraint z_ij ∈ {0, 1}, it is non-convex and NP-hard, and its convex relaxation is therefore considered:

min_Z  Σ_{j=1}^{N} Σ_{i=1}^{M} d_ij z_ij + λ Σ_{i=1}^{M} ‖z^i‖_p + γ tr(Zᵀ L Z)
s.t.   Σ_{i=1}^{M} z_ij = 1 for all j,  z_ij ≥ 0,

in which the non-convex part, the indicator function I(·), has been removed; the problem is then written in matrix form as

min_Z  tr(Dᵀ Z) + λ ‖Z‖_{1,p} + γ tr(Zᵀ L Z)
s.t.   1ᵀ Z = 1ᵀ,  Z ≥ 0,

where ‖Z‖_{1,p} = Σ_{i=1}^{M} ‖z^i‖_p, tr denotes the trace of a matrix, 1 denotes the all-ones vector, E is a given matrix built from the pairwise relations among the elements of the source set X, and L = 1·1ᵀ − E.
CN201811243586.9A · Priority/Filing Date: 2018-10-24 · Sparse subset selection method based on dissimilarity and Laplace regularization · Status: Withdrawn · Publication: CN111091137A

Priority Applications (1)

Application Number: CN201811243586.9A · Priority Date: 2018-10-24 · Filing Date: 2018-10-24 · Publication: CN111091137A · Title: Sparse subset selection method based on dissimilarity and Laplace regularization

Applications Claiming Priority (1)

Application Number: CN201811243586.9A · Priority Date: 2018-10-24 · Filing Date: 2018-10-24 · Publication: CN111091137A · Title: Sparse subset selection method based on dissimilarity and Laplace regularization

Publications (1)

Publication Number: CN111091137A · Publication Date: 2020-05-01

Family

ID=70391572

Family Applications (1)

Application Number: CN201811243586.9A · Status: Withdrawn · Publication: CN111091137A

Country Status (1)

Country: CN · Document: CN111091137A

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20200302228A1 * | 2019-03-20 | 2020-09-24 | Tata Consultancy Services Limited | System and method for signal pre-processing based on data driven models and data dependent model transformation
US11443136B2 * | 2019-03-20 | 2022-09-13 | Tata Consultancy Services Limited | System and method for signal pre-processing based on data driven models and data dependent model transformation

Legal Events

Code | Title | Description
DD01 | Delivery of document by public notice | Addressee: Yang Chenxi; Document name: Notice of Preliminary Confirmation Opinion on Abnormal Patent Application
PB01 | Publication
WW01 | Invention patent application withdrawn after publication | Application publication date: 2020-05-01