CN111091137A - Sparse subset selection method based on dissimilarity and Laplace regularization - Google Patents
- Publication number: CN111091137A (application CN201811243586.9A)
- Authority: CN (China)
- Legal status: Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a sparse subset selection method based on dissimilarity and Laplace regularization. Given the pairwise dissimilarity relations between a source set and a target set, the method considers the problem of finding representatives in the source set that effectively represent the target set, and provides a dissimilarity-based low-rank sparse subset selection model that can be solved efficiently by convex programming. Building on past work, the structure among the representatives is taken into account, so that fewer representatives are needed and the representation quality is higher. Algorithm 1 gives an efficient implementation, and the algorithm can be further parallelized, which further reduces computation time.
Description
Technical Field
The application relates to the field of machine learning and data analysis, in particular to a sparse subset selection method based on dissimilarity and Laplace regularization.
Background
Sparse subset selection: finding a small subset of models or data points that retains the characteristics of the entire set is an important problem in machine learning and data analysis, with many applications in computer vision, image and natural language processing, bio/health informatics, recommendation systems, and more. The selected elements are referred to as representatives or exemplars. Data representatives facilitate the summarization and visualization of datasets of text/web documents, images, and videos, improving the interpretability of large-scale datasets for data analysts and domain experts. Model representatives help to describe complex phenomena or events efficiently with a small number of models, and can be used for model compression in ensembles. More importantly, the computation time and memory requirements of learning and inference algorithms, such as nearest-neighbor (NN) classifiers, are improved by working with representatives that contain most of the information of the original set. Selecting a small subset of products to recommend to a customer not only increases retailer revenue but also saves the customer time. Furthermore, representatives facilitate clustering of datasets and, as the most prototypical elements, can be used to synthesize/generate new data points efficiently. Finally, and equally importantly, a high-performance classifier can be obtained by selecting and annotating only a few representative samples from a large pool of unlabeled samples.
Dissimilarity: dissimilarity is a pairwise relation between data points, which has several advantages. First, for high-dimensional datasets, where the ambient space dimension is much higher than the cardinality of the dataset, working with pairwise relations is more efficient than working with high-dimensional feature vectors.
Second, although some real datasets do not live in a vector space, such as social network data or proteomics data, pairwise relations can still be computed efficiently for them.
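As a concrete illustration of working with pairwise relations (a sketch added here, not part of the patent text; `dissimilarity_matrix` and the sample points are hypothetical), Euclidean distances between a small source set and target set give an M x N dissimilarity matrix:

```python
import numpy as np

def dissimilarity_matrix(source, target):
    """Pairwise Euclidean dissimilarities: D[i, j] = ||source_i - target_j||.

    source: (M, d) array of source points, target: (N, d) array of targets.
    Only the M x N matrix D is needed downstream; the ambient dimension d
    never appears again, which is the efficiency argument made above for
    high-dimensional data.
    """
    diff = source[:, None, :] - target[None, :, :]   # (M, N, d) differences
    return np.linalg.norm(diff, axis=2)              # (M, N) distances

# Tiny example: 3 source points and 2 target points in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
D = dissimilarity_matrix(X, Y)
print(D.shape)  # (3, 2)
```

Here D[1, 0] = 0 because the second source point coincides with the first target point, so it would be a perfect (zero-cost) representative for it.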
Laplace regularization: low-rank representation (LRR) methods capture the potentially low-dimensional structure of data and have attracted great interest in the pattern analysis and signal processing communities as a promising data structure. In particular, problems related to low-rank matrix estimation have received considerable attention in recent years. LRR has been widely used for subspace segmentation, image denoising, image clustering, and video background/foreground separation. The low-rank regularizer in LRR is deeply connected to recent theoretical advances in robust principal component analysis (RPCA), which has brought powerful new modeling options to many applications.
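To make the low-rank idea concrete, the following is a hedged sketch of singular value thresholding, the proximal operator of the nuclear norm that underlies many LRR/RPCA solvers. It illustrates the background material only, not the patent's own algorithm; `svd_threshold` and the test matrix are constructions for this example:

```python
import numpy as np

def svd_threshold(A, tau):
    """Singular value thresholding: the proximal operator of tau * ||A||_*
    (nuclear norm). Shrinking the singular values toward zero produces a
    low-rank approximation of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

# A rank-1 matrix plus small noise: thresholding recovers the low rank.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(20, 1)), rng.normal(size=(1, 30))
A = u @ v + 0.01 * rng.normal(size=(20, 30))
A_lr = svd_threshold(A, tau=0.5)
print(np.linalg.matrix_rank(A_lr, tol=1e-6))  # 1
```

The noise contributes singular values far below the threshold 0.5, so only the dominant rank-1 component survives the shrinkage.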
Disclosure of Invention
The object of the invention is achieved by the following technical scheme.
Let us assume that we have a source set X = {x_1, ..., x_M} and a target set Y = {y_1, ..., y_N}, containing M and N elements, respectively, and that we have obtained pairwise dissimilarities {d_ij} between X and Y, where d_ij denotes the cost of x_i representing y_j: the smaller d_ij, the better x_i represents y_j. These pairwise relations are collected in the matrix

D = [d_ij] ∈ R^{M×N}.
Our goal is to find a small subset of X that represents the target set Y well, as shown in Fig. 1. Left side of Fig. 1: the dissimilarity relation between the source set X and the target set Y; right side: a subset of the source set X is found that represents the features of the target set Y well.
Given the dissimilarity matrix D, we need to find a representative subset of the source set X, i.e., the representatives, that effectively represents the target set Y. To this end, we associate with each dissimilarity d_ij an unknown variable z_ij and optimize over these variables, which we collect in the matrix

Z = [z_ij] ∈ {0, 1}^{M×N}.
The variable z_ij indicates whether x_i represents y_j: z_ij = 1 means that x_i represents y_j, and z_ij = 0 means that it does not. To ensure that every y_j has a representative, we impose

Σ_{i=1}^{M} z_ij = 1, for all j = 1, ..., N.
Selecting elements of X that encode Y well based on dissimilarities involves three goals. First, the representatives should represent each y_j well: if x_i is selected as a representative of y_j, the cost of encoding y_j is d_ij z_ij ∈ {0, d_ij}, so the total cost of representing Y by a subset of X is Σ_j Σ_i d_ij z_ij = tr(D^T Z). Second, we want to select as few representatives as possible to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows. Third, we want the selected representatives to be well structured, i.e., the "distances" between the representatives should be as large as possible.
Combining these three objectives, we obtain the following optimization problem

min_Z  tr(D^T Z) + λ Σ_{i=1}^{M} I(||z^i||_p) + γ tr(Z^T L Z)
s.t.  1^T Z = 1^T,  z_ij ∈ {0, 1},

where z^i denotes the i-th row of Z and λ, γ > 0 are trade-off parameters.
Here ||·||_p denotes the l_p norm and I(·) denotes the indicator function, with I(t) = 1 if t ≠ 0 and I(0) = 0. The first term in the objective measures the quality of the encoding, the second term counts the number of representatives, and the third term controls the structure of the representatives.
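To see how the three terms interact, the sketch below evaluates each term for a hypothetical assignment matrix Z, using p = 2 for the row norms and a small illustrative stand-in for L; `objective_terms` and the sample matrices are assumptions for this example, not taken from the patent:

```python
import numpy as np

def objective_terms(D, Z, L):
    """The three ingredients of the objective described above:
    encoding cost tr(D^T Z), number of representatives (nonzero rows of Z),
    and the structure term tr(Z^T L Z)."""
    encoding_cost = np.trace(D.T @ Z)
    num_representatives = int(np.sum(np.linalg.norm(Z, axis=1) > 1e-12))
    structure = np.trace(Z.T @ L @ Z)
    return encoding_cost, num_representatives, structure

# Hypothetical 3 x 2 instance: source element 0 represents both targets.
D = np.array([[0.2, 0.3],
              [1.0, 0.8],
              [0.9, 1.1]])
Z = np.array([[1.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])       # each column sums to 1, as required
L = np.ones((3, 3)) - np.eye(3)  # illustrative stand-in only
cost, k, s = objective_terms(D, Z, L)
print(cost, k)  # 0.5 1
```

With a single nonzero row, the row-sparsity term is at its minimum; a solver trades this off against the encoding cost and the structure term.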
Since it contains the binary constraint z_ij ∈ {0, 1}, this problem is non-convex and NP-hard, so we consider its convex relaxation:
In this relaxation we have removed the non-convex part, the indicator function I(·), and relaxed the binary constraint to Z ≥ 0. The problem can be written in matrix form as

min_Z  tr(D^T Z) + λ Σ_{i=1}^{M} ||z^i||_p + γ tr(Z^T L Z)
s.t.  1^T Z = 1^T,  Z ≥ 0,

where tr(·) denotes the trace of a matrix and L = 11^T − E.
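As an illustration of how the relaxed problem could be solved numerically (the patent states that convex programming applies but does not fix a solver here; the projected subgradient method, the step size, p = 2, and the toy matrices below are all assumptions of this sketch), each column of Z can be kept feasible by Euclidean projection onto the probability simplex:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_relaxation(D, L, lam=0.1, gamma=0.01, steps=500, lr=0.05):
    """Projected subgradient descent on
        tr(D^T Z) + lam * sum_i ||z^i||_2 + gamma * tr(Z^T L Z)
        s.t. 1^T Z = 1^T, Z >= 0  (columns stay on the simplex)."""
    M, N = D.shape
    Z = np.full((M, N), 1.0 / M)          # feasible uniform start
    for _ in range(steps):
        row_norms = np.linalg.norm(Z, axis=1, keepdims=True)
        grad = (D                                   # encoding-cost gradient
                + lam * Z / np.maximum(row_norms, 1e-12)  # row-norm subgradient
                + gamma * (L + L.T) @ Z)            # structure-term gradient
        Z = np.apply_along_axis(project_simplex, 0, Z - lr * grad)
    return Z

# Toy instance: source element 0 is the cheapest encoder of both targets.
D = np.array([[0.1, 0.2],
              [1.0, 0.9],
              [0.8, 1.1]])
E = np.array([[0.0, 2.0, 2.0],
              [2.0, 0.0, 2.0],
              [2.0, 2.0, 0.0]])   # illustrative pairwise matrix for L = 11^T - E
L = np.ones((3, 3)) - E
Z = solve_relaxation(D, L)
print(Z.round(2))
```

With these toy numbers the mass of every column of Z concentrates on row 0, i.e., the single source element with the smallest dissimilarities is selected as the representative of both targets.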
The method overcomes the following defects and shortcomings of existing methods.
searching a representative element (namely, a representative subset) of the big data so that the representative element can represent most characteristics of the source set has important research and application values in the problems related to machine learning. The related work of finding representative elements has been done for some time, and related research algorithms can be divided into two categories according to the type of information that the representative should retain.
The first category finds representatives of data lying in one or more low-dimensional subspaces; in this case the data is typically embedded in a vector space, and such methods cannot be applied in the general case where the data does not lie in a subspace.
The second category uses pairwise similarities/dissimilarities between data points instead of feature vectors. Using pairwise relations, models outside linear subspaces can be handled; however, existing algorithms of this type suffer from a dependence on initialization.
Compared with the prior art, the invention has the advantages and effects that:
The method provided by the invention finds a representative subset of a source set based on the dissimilarity relations between data points. It reduces the original problem to a dissimilarity-based low-rank sparse subset selection problem, achieving a better balance between the number of representatives and the representation quality. On this basis, a low-rank condition is introduced so that the selected subset preserves the structure.
Drawings
FIG. 1 is a schematic diagram of target set finding according to the present invention.
DETAILED DESCRIPTION OF EMBODIMENT (S) OF INVENTION
Let us assume that we have a source set X = {x_1, ..., x_M} and a target set Y = {y_1, ..., y_N}, containing M and N elements, respectively, and that we have obtained pairwise dissimilarities {d_ij} between X and Y, where d_ij denotes the cost of x_i representing y_j: the smaller d_ij, the better x_i represents y_j. These pairwise relations are collected in the matrix

D = [d_ij] ∈ R^{M×N}.

Our goal is to find a small subset of X that represents the target set Y well, as shown in Fig. 1. Left side of Fig. 1: the dissimilarity relation between the source set X and the target set Y; right side: a subset of the source set X is found that represents the features of the target set Y well.

Given the dissimilarity matrix D, we need to find a representative subset of the source set X, i.e., the representatives, that effectively represents the target set Y. To this end, we associate with each dissimilarity d_ij an unknown variable z_ij and optimize over these variables, which we collect in the matrix

Z = [z_ij] ∈ {0, 1}^{M×N}.

The variable z_ij indicates whether x_i represents y_j: z_ij = 1 means that x_i represents y_j, and z_ij = 0 means that it does not. To ensure that every y_j has a representative, we impose

Σ_{i=1}^{M} z_ij = 1, for all j = 1, ..., N.

Selecting elements of X that encode Y well based on dissimilarities involves three goals. First, the representatives should represent each y_j well: if x_i is selected as a representative of y_j, the cost of encoding y_j is d_ij z_ij ∈ {0, d_ij}, so the total cost of representing Y by a subset of X is Σ_j Σ_i d_ij z_ij = tr(D^T Z). Second, we want to select as few representatives as possible to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows. Third, we want the selected representatives to be well structured, i.e., the "distances" between the representatives should be as large as possible.

Combining these three objectives, we obtain the following optimization problem

min_Z  tr(D^T Z) + λ Σ_{i=1}^{M} I(||z^i||_p) + γ tr(Z^T L Z)
s.t.  1^T Z = 1^T,  z_ij ∈ {0, 1},

where z^i denotes the i-th row of Z, ||·||_p denotes the l_p norm, and I(·) denotes the indicator function. The first term in the objective measures the quality of the encoding, the second term counts the number of representatives, and the third term controls the structure of the representatives.

Since the problem contains the binary constraint z_ij ∈ {0, 1}, it is non-convex and NP-hard, so we consider its convex relaxation: removing the non-convex indicator function I(·) and relaxing the binary constraint yields the matrix form

min_Z  tr(D^T Z) + λ Σ_{i=1}^{M} ||z^i||_p + γ tr(Z^T L Z)
s.t.  1^T Z = 1^T,  Z ≥ 0.
Claims (1)
1. A sparse subset selection method based on dissimilarity and Laplace regularization, the method comprising:
suppose there is a source set X = {x_1, ..., x_M} and a target set Y = {y_1, ..., y_N}, containing M and N elements, respectively; suppose pairwise dissimilarities {d_ij} between X and Y have been obtained, where d_ij denotes the cost of x_i representing y_j, and the smaller d_ij, the better x_i represents y_j; these pairwise relations are written in matrix form as

D = [d_ij] ∈ R^{M×N};

the unknown variables are represented by the matrix

Z = [z_ij] ∈ {0, 1}^{M×N};

the variable z_ij denotes whether x_i represents y_j: z_ij = 1 means x_i represents y_j, otherwise it does not; to ensure each y_j has a representative, stipulate

Σ_{i=1}^{M} z_ij = 1, for all j;

selecting elements of X that encode Y well based on dissimilarities involves three goals: first, each y_j should be represented well, so that if x_i is selected as a representative, the cost of encoding y_j is d_ij z_ij ∈ {0, d_ij}, and the cost of representing Y by a subset of X is Σ_j Σ_i d_ij z_ij = tr(D^T Z); second, as few representatives as possible should be selected to represent the target set Y, which is equivalent to the matrix Z having few nonzero rows; third, the selected representatives should be well structured, i.e., the "distances" between the representatives should be as large as possible;

combining these three objectives gives the optimization problem

min_Z  tr(D^T Z) + λ Σ_{i=1}^{M} I(||z^i||_p) + γ tr(Z^T L Z)
s.t.  1^T Z = 1^T,  z_ij ∈ {0, 1},

where ||·||_p denotes the l_p norm, I(·) denotes the indicator function, the first term measures the quality of the encoding, the second term counts the representatives, and the third term controls the structure of the representatives;

since the problem contains the binary constraint z_ij ∈ {0, 1}, it is non-convex, i.e., NP-hard, so its convex relaxation is considered: removing the non-convex indicator function I(·) and relaxing the binary constraint yields the matrix form

min_Z  tr(D^T Z) + λ Σ_{i=1}^{M} ||z^i||_p + γ tr(Z^T L Z)
s.t.  1^T Z = 1^T,  Z ≥ 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811243586.9A CN111091137A (en) | 2018-10-24 | 2018-10-24 | Sparse subset selection method based on dissimilarity and Laplace regularization |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111091137A (en) | 2020-05-01 |
Family
ID=70391572
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200302228A1 (en) * | 2019-03-20 | 2020-09-24 | Tata Consultancy Services Limited | System and method for signal pre-processing based on data driven models and data dependent model transformation |
US11443136B2 (en) * | 2019-03-20 | 2022-09-13 | Tata Consultancy Services Limited | System and method for signal pre-processing based on data driven models and data dependent model transformation |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| DD01 | Delivery of document by public notice | Addressee: Yang Chenxi. Document name: Notice of Preliminary Confirmation Opinion on Abnormal Patent Application |
| PB01 | Publication | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 2020-05-01 |