CN106570173B - Spark-based high-dimensional sparse text data clustering method - Google Patents


Info

Publication number
CN106570173B
CN106570173B (application CN201610988558.4A)
Authority
CN
China
Prior art keywords
matrix
data
data set
similarity
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610988558.4A
Other languages
Chinese (zh)
Other versions
CN106570173A (en
Inventor
王进
黄超
莫倩雯
陈乔松
邓欣
欧阳卫华
胡峰
李智星
雷大江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yami Technology Guangzhou Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201610988558.4A priority Critical patent/CN106570173B/en
Publication of CN106570173A publication Critical patent/CN106570173A/en
Application granted granted Critical
Publication of CN106570173B publication Critical patent/CN106570173B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Spark-based high-dimensional sparse text data clustering method comprising the following steps: 1. read in the data set using an RDD; 2. design a distributed sparse vector set with the RDD interface; 3. compute the similarity between each distributed sparse vector set and the complete data set on its node, and organize the results into a similarity matrix by index; 4. symmetrize the stored similarity matrix and solve for its normalized Laplacian form; 5. decompose the normalized Laplacian matrix by SVD and construct a new matrix; 6. input the constructed matrix as samples into a K-means model for training, then cluster the test set with the trained model. The invention improves the runtime performance of the traditional spectral clustering algorithm on large data sets.

Description

Spark-based high-dimensional sparse text data clustering method
Technical Field
The invention relates to the fields of text data clustering, machine learning, and distributed computing, and in particular to a Spark-based high-dimensional sparse text data clustering method.
Background
With the advent of the big data age, the Internet has accumulated more and more network data, and the volume has reached the limit of what ordinary computers can handle. To cope with this increasingly difficult data processing problem, many industries have turned to Spark-based distributed processing platforms and parallel sparse data set storage technologies.
Spark is a big data distributed programming framework similar to Hadoop, but differences between the two make Spark superior for some workloads: Spark provides in-memory distributed data sets that support interactive queries and also optimize iterative workloads. The Spark big data platform integrates batch processing, real-time stream processing, interactive queries, and graph computation, avoiding the resource waste of deploying a separate cluster for each kind of job. Spark provides programmers with the Resilient Distributed Dataset (RDD), a data structure distributed across the machines of a cluster with an efficient fault-tolerance mechanism. Building on the RDD, many traditional machine learning algorithms have extended their computational performance and data processing capacity.
Sparse vector set storage targets matrices in which most elements are 0. In fact, large-scale data sets in practical problems are mostly sparse, often with sparsity above 90% or even 99%, so efficient sparse storage formats and computation methods are needed. Combined with the Resilient Distributed Dataset (RDD) provided by the Spark big data platform, a sparse data set can be stored in parallel across different computing nodes. This mainly addresses the storage and computation of large data sets; parallel sparse storage methods are widely adopted for high-dimensional sparse text data sets and in computer vision.
Clustering is a very important method in machine learning and data mining tasks. Spectral clustering is a graph-theoretic method that divides a weighted undirected graph into two or more optimal subgraphs. Compared with traditional methods, spectral clustering is more effective at capturing the similarity among data samples, and it has therefore been widely used in fields such as information retrieval and computer vision. Unfortunately, when the number of data samples n increases dramatically, spectral clustering faces a computational bottleneck. For example, computing the similarity matrix M (n × n) between samples, or storing a matrix of that size, may become intractable when n is very large. The traditional spectral clustering algorithm also needs a large amount of storage space and running time to compute the K eigenvectors of the Laplacian matrix. These problems make spectral clustering increasingly unable to meet today's computational requirements as data volumes grow sharply.
To address this situation, the invention provides a spectral clustering method based on the Spark big data processing platform.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a Spark-based high-dimensional sparse text data clustering method. The technical scheme of the invention is as follows:
a Spark-based high-dimensional sparse text data clustering method comprises the following steps:
Step 1: read in the data set samples to be processed through the Resilient Distributed Dataset (RDD) provided by the Spark big data platform, and design a distributed sparse vector set for storing the high-dimensional sparse data set using the RDD interface;
Step 2: calculate a similarity matrix M among the data set samples to be processed and store it as a parallel sparse vector set; similarity is measured by Euclidean distance;
Step 3: symmetrize the similarity matrix M stored in step 2 as a parallel sparse vector set, and solve for the normalized Laplacian matrix;
Step 4: decompose the normalized Laplacian matrix by SVD (singular value decomposition), obtain the K nearest-neighbor eigenvectors, and assemble them into a nearest-neighbor matrix;
Step 5: input the constructed nearest-neighbor matrix as samples into a K-means model for training, completing the clustering.
Further, step 1 reads the data set samples to be processed into a Resilient Distributed Dataset (RDD) provided by the Spark big data platform and loads them into a distributed sparse vector set;
the data source to be processed (from the UCI data platform) is read into the RDD, loaded into high-dimensional distributed vector set data P, and divided into a training set A1 and a test set A2.
Further, the data set to be processed is the RCV1 data set, whose form is {decision label, condition attribute 1, condition attribute 2, condition attribute 3, ..., condition attribute n}; the dimensionality of the data set is >30000.
Further, the division into training set A1 and test set A2 randomly selects, without repetition, 90% of the samples in the data set as training set A1 and uses the remaining 10% as test set A2.
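The 90/10 split without repetition can be sketched as follows. This is a minimal local NumPy version (the patent performs the split over Spark RDDs; `split_dataset` is an illustrative name, not from the source):

```python
import numpy as np

def split_dataset(samples, train_frac=0.9, seed=0):
    """Randomly select train_frac of the samples, without repetition,
    as training set A1; the remainder becomes test set A2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))      # every index appears exactly once
    cut = int(len(samples) * train_frac)
    a1 = [samples[i] for i in idx[:cut]]     # training set A1 (90%)
    a2 = [samples[i] for i in idx[cut:]]     # test set A2 (10%)
    return a1, a2

a1, a2 = split_dataset(list(range(100)))
```

Because a permutation is used, every sample lands in exactly one of A1 or A2, matching the "without repetition" requirement.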
Further, the step of loading the high-dimensional distributed vector set data P includes:
A1, read in the high-dimensional sparse text data set using a Resilient Distributed Dataset (RDD);
A2, record each sample in the data set, using sparse storage, as A;
A3, randomly sample A and divide it into small-sample data blocks B, marking each data block with an index;
A4, use the mapPartitionsWithIndex programming interface provided by the Spark platform to distribute the small-sample data blocks B to the cluster nodes according to their index numbers.
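Steps A1–A4 can be illustrated locally. The sketch below mimics what Spark's `mapPartitionsWithIndex` would see (index, block) without requiring a cluster; `make_indexed_blocks` and the block size are illustrative assumptions, not the patent's code:

```python
from scipy.sparse import random as sparse_random

def make_indexed_blocks(matrix, block_rows):
    """Split a sparse sample matrix A into small row blocks B, each tagged
    with a partition index, as mapPartitionsWithIndex would enumerate them."""
    blocks = []
    for idx, start in enumerate(range(0, matrix.shape[0], block_rows)):
        blocks.append((idx, matrix[start:start + block_rows]))  # (index, block B)
    return blocks

# Sparse storage of the data set A (CSR format), then blocking with indices
A = sparse_random(10, 6, density=0.1, format="csr", random_state=0)
blocks = make_indexed_blocks(A, block_rows=4)
```

On a real cluster, each (index, block) pair would then be routed to a node keyed by its index number.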
Further, step 3 specifically includes the following steps:
B1, compute the Euclidean distance between the data block B on each computing node and the sparse vector set P to obtain a distance matrix representing similarity, and obtain an upper triangular matrix U by the parallel upper-triangle method;
B2, store the distance matrix obtained in step B1 in point-coordinate (COO) form, recording the point set as CO;
B3, interchange the row and column coordinates of the points in CO, recording the result as CO';
B4, merge the point sets of steps B2 and B3 to form a symmetric matrix S;
B5, set the elements Sij = 0 where i = j in the symmetric matrix S, and record the diagonal matrix as D;
then calculate the Laplacian matrix and normalize it:
L = D - M; (2)
in formula (2), L is the Laplacian matrix, D is the diagonal matrix, and M is the similarity matrix;
normalize L:
L1 = D^(-1/2) L D^(-1/2) = I - D^(-1/2) M D^(-1/2); (3)
where I is the identity matrix and L1 is the normalized Laplacian matrix.
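Formulas (2) and (3) can be checked with a small NumPy sketch (a local stand-in; the patent computes this over distributed sparse matrices):

```python
import numpy as np

# Formula (2): L = D - M, and formula (3): L1 = D^(-1/2) L D^(-1/2)
M = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])          # similarity matrix with zeroed diagonal
D = np.diag(M.sum(axis=1))            # diagonal matrix of row sums
L = D - M                             # formula (2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L1 = D_inv_sqrt @ L @ D_inv_sqrt      # formula (3)
```

Note that L1 has unit diagonal, consistent with L1 = I - D^(-1/2) M D^(-1/2).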
Further, step 4 comprises: applying a parallel SVD (singular value decomposition) method to factorize the normalized Laplacian matrix L1 and selecting the first K eigenvectors θi = (θ1i, θ2i, ..., θNi)', i = 1, 2, 3, ..., K. The first K eigenvectors form an N x K eigenvector matrix θ whose rows represent the cluster samples, i.e., the corresponding decision labels in the high-dimensional sparse text data set.
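A minimal sketch of this eigenvector extraction, with `np.linalg.svd` standing in for Spark's parallel SVD (an assumption; which K vectors to keep follows the patent's "first K" wording):

```python
import numpy as np

def top_k_eigenvectors(L1, k):
    """Return the N x K matrix theta of the first K eigenvectors of L1."""
    # L1 is symmetric, so its SVD coincides with an eigendecomposition
    U, s, _ = np.linalg.svd(L1)
    return U[:, :k]                   # columns = leading eigenvectors

L1 = np.array([[1.0, -0.5],
               [-0.5, 1.0]])          # toy normalized Laplacian
theta = top_k_eigenvectors(L1, 1)     # N x K feature matrix (here 2 x 1)
```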
Further, in step 5 the constructed nearest-neighbor matrix is input as samples into a K-means model for training; the test set A2 is then clustered with the model, and the procedure ends.
The invention has the following advantages and beneficial effects:
the embodiment of the invention discloses a Spark-based high-dimensional sparse text data clustering method, wherein a high-dimensional sparse text data set (such as RCV1) comprises the following steps: the dimension is high (n is greater than 10000, n is attribute dimension), the space complexity is high, and the storage and the calculation are not convenient. The invention mainly solves the problem of difficult storage and calculation of the high-dimensional sparse text data set, and has the following specific advantages and beneficial effects: 1. acquiring a high-dimensional sparse text data set (such as RCV1) from a UCI data platform; 2. reading in a high-dimensional sparse text data set by using a distributed elastic data set (RDD), selecting a sparse storage strategy (such as CSR) to store data samples, randomly sampling and dividing the data samples into a few sample data blocks, using an index mark for each data block, and finally distributing the data blocks with few samples to each node of a cluster according to an index number, wherein the method has the advantages of fully utilizing the memory resources of each node in the cluster and storing data with higher dimensionality; 3. splitting the data set to verify the quality of model training; 4. on the basis of a distributed sparse vector set, a set of distributed computing mode (such as parallel triangular matrix taking and parallel symmetric matrix transformation) is designed according to the reality, and aims to solve the problem of computing efficiency caused by high space complexity of a high-dimensional sparse text data set and accelerate the computing efficiency; 5. and solving and normalizing the Laplace matrix, aiming at mapping the data set to a high-dimensional space, conveniently extracting the characteristic vectors of the data set, and selecting the first K characteristic vectors to construct an N x K characteristic matrix. 6. 
And training the K-means model by using the feature matrix and testing.
Drawings
FIG. 1 is a block diagram of a flow chart of a Spark-based spectral clustering method according to a preferred embodiment of the present invention;
FIG. 2 is a flow chart of a parallel triangle-fetching method;
FIG. 3 is a flow chart of parallel acquisition of symmetric matrices;
FIG. 4 is a flow chart of distributed sparse vector set generation.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme of the invention is as follows:
FIG. 1 is a block diagram of the process of the present invention, which includes the following steps:
1. The data loading phase, as shown in FIG. 2;
In this stage, the data source to be processed (from the UCI data platform) is read into a Resilient Distributed Dataset (RDD), loaded into high-dimensional distributed vector set data P, and divided into a training set A1 and a test set A2.
The RCV1 data set is downloaded from the UCI experimental data platform (website: http://archive.ics.uci.edu/ml/). The form of the data set is {decision label, condition attribute 1, condition attribute 2, condition attribute 3, ..., condition attribute n}, and the data set is characterized by high dimensionality (>30000), sparsity, and high time complexity.
2. Loading into a high-dimensional distributed vector set;
Step 1, read in the high-dimensional sparse text data set using a Resilient Distributed Dataset (RDD);
Step 2, since the high-dimensional sparse text data set is sparse, record each sample in the data set as A;
Step 3, randomly sample A and divide it into small-sample data blocks B, marking each data block with an index;
Step 4, use the mapPartitionsWithIndex programming interface provided by the Spark platform to distribute the small-sample data blocks B to the cluster nodes according to the index numbers.
3. Splitting the data set;
Splitting the data set means dividing it into a training set and a test set: 90% of the samples are randomly selected without repetition as training set A1, and the remaining 10% form test set A2.
4. Calculating an adjacency matrix;
This step computes the similarity between individual samples, measured by Euclidean distance. The similarity values represent the correlation between samples; treating the samples as vertices (Vertex) and the similarities as edges (Edge) yields the familiar concept of a graph. Given the similarity matrix, the adjacency matrix is computed according to graph theory.
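The pairwise Euclidean distance matrix can be sketched as follows (a local NumPy version of the patent's distributed computation; smaller distance means more similar samples):

```python
import numpy as np

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distances between the rows (samples) of X."""
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))   # clip tiny negatives from rounding

X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
M = euclidean_distance_matrix(X)          # 3 x 3 distance/similarity matrix
```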
5. Taking the upper triangle of the matrix in parallel, as shown in FIG. 3;
Step 1, load the similarity matrix M between samples in the high-dimensional distributed vector set, assuming each block contains n samples (n less than the attribute dimension d), so that the number of partitions is idn = d/n, with d % n == 0;
Step 2, split the data of each node's data set by columns into idn n × n matrices;
Step 3, find the n × n matrix at the position corresponding to the partition number id, take its upper triangle, and keep all columns to its right unchanged;
Step 4, if the last matrix is n × n, take its upper triangle directly and end the operation; otherwise, discard it.
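A local sketch of this blockwise upper-triangle step. Each partition id keeps the upper triangle of its diagonal n × n block and leaves the columns to its right unchanged; zeroing the blocks left of the diagonal is my reading of the method (an assumption), under which the blocks reassemble into the full upper triangle:

```python
import numpy as np

def upper_triangle_by_blocks(M, n):
    """Blockwise upper triangle: one loop iteration per partition id
    (a serial stand-in for the patent's parallel Spark partitions)."""
    d = M.shape[1]
    assert d % n == 0                        # the method assumes d % n == 0
    out = M.copy()
    for pid in range(d // n):
        blk = out[pid*n:(pid+1)*n, pid*n:(pid+1)*n]
        out[pid*n:(pid+1)*n, pid*n:(pid+1)*n] = np.triu(blk)  # triangle of diag block
        out[pid*n:(pid+1)*n, :pid*n] = 0.0   # assumed: drop blocks left of diagonal
    return out

M = np.arange(16, dtype=float).reshape(4, 4)
U = upper_triangle_by_blocks(M, 2)           # equals the full upper triangle of M
```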
6. Parallel symmetric matrix transformation, as shown in FIG. 4;
Step 1, compute the Euclidean distance between the data block B on each computing node and the sparse vector set P to represent similarity, and obtain an upper triangular matrix U by the parallel upper-triangle method above;
Step 2, store the distance matrix obtained in step 1 in point-coordinate form (COO), recording the point set as CO;
Step 3, interchange the row and column coordinates of the points in CO, recording the result as CO';
Step 4, merge the point sets of steps 2 and 3 to form a symmetric matrix S.
Step 5, set the elements Sij = 0 where i = j in the symmetric matrix S, and record the diagonal matrix as D.
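The COO symmetrization of steps 2–4 can be sketched with SciPy (a local stand-in for the distributed point-set merge; the toy values are illustrative):

```python
from scipy.sparse import coo_matrix

# Upper triangular matrix U stored as coordinate points (COO)
U = coo_matrix(([5.0, 3.0], ([0, 0], [1, 2])), shape=(3, 3))
co_rows, co_cols, co_vals = U.row, U.col, U.data     # point set CO
# CO': same values with row and column coordinates interchanged,
# then CO and CO' are merged into the symmetric matrix S
rows = list(co_rows) + list(co_cols)
cols = list(co_cols) + list(co_rows)
vals = list(co_vals) + list(co_vals)
S = coo_matrix((vals, (rows, cols)), shape=U.shape).toarray()
```

Because U has an empty diagonal, the merged S already satisfies Sij = 0 for i = j.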
7. Calculating a Laplace matrix and normalizing;
L = D - M; (2)
in formula (2), L is the Laplacian matrix, D is the diagonal matrix, and M is the similarity matrix.
Normalize L:
L1 = D^(-1/2) L D^(-1/2) = I - D^(-1/2) M D^(-1/2); (3)
where I is the identity matrix and L1 is the normalized Laplacian matrix.
8. Decompose the normalized Laplacian matrix by SVD;
SVD is a typical eigendecomposition method. The normalized Laplacian matrix L1 obtained in step (6) is factorized using the parallel SVD method provided by the Spark big data platform. The first K eigenvectors θi = (θ1i, θ2i, ..., θNi)', i = 1, 2, 3, ..., K are selected and form an N x K eigenvector matrix θ. The rows of the matrix represent the cluster samples, i.e., the corresponding decision labels in the high-dimensional sparse text data set.
9. Training a K-means model;
and (4) generating a N x K feature matrix, wherein each row of the matrix represents a clustering sample and represents a decision label in the high-dimensional sparse text data set. And training a K-means model by using the feature matrix.
10. Testing a K-means model;
The 10% of the data set split off in step (1) is the test set A2, which is used to test the model trained in step (8).
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (2)

1. A Spark-based high-dimensional sparse text data clustering method is characterized by comprising the following steps:
step 1: reading in a data set sample to be processed through an elastic distributed data set RDD provided by a Spark big data platform, and designing a distributed sparse vector set for storing a high-dimensional sparse data set by using an RDD interface;
the step of designing a distributed sparse vector set suitable for storing a high-dimensional sparse data set by using an RDD interface comprises the following steps:
a1, reading in a high-dimensional sparse text data set by using a distributed elastic data set (RDD);
a2, adopting sparse storage to record each sample in the data set as A;
a3, randomly sampling the sample of A and dividing the sample into a few sample data blocks B, wherein each data block is marked by an index;
A4, using the mapPartitionsWithIndex programming interface provided by the Spark platform to distribute the small-sample data blocks B to the cluster nodes according to their index numbers;
step 2: calculating a similarity matrix M between the data set samples to be processed and storing the similarity matrix M in a sparse vector set mode, wherein the similarity is measured in an Euclidean distance mode;
Step 3: symmetrizing the similarity matrix M stored in step 2 in sparse vector set form and solving for the normalized Laplacian matrix, wherein the parallel symmetric matrix transformation specifically comprises the following steps:
the step 3 comprises the following steps:
b1, solving Euclidean distance between the data block B in each calculation node and the sparse vector set P to obtain a distance matrix for representing similarity, and designing a parallel upper triangular method to obtain an upper triangular matrix U;
the parallel upper triangle taking step of the matrix comprises the following steps:
step 1, loading the similarity matrix M between samples in the high-dimensional distributed vector set, assuming each block contains n samples (n less than the attribute dimension d), so that the number of partitions is idn = d/n, with d % n == 0;
step 2, splitting the data of each node's data set by columns into idn n × n matrices;
step 3, finding the n × n matrix at the position corresponding to the partition number id, taking its upper triangle, and keeping all columns to its right unchanged;
step 4, if the last matrix is n × n, taking its upper triangle directly and ending the operation; otherwise, discarding it;
b2, constructing a symmetric matrix S by the distance matrix obtained in the step B1;
B3, setting the elements Sij = 0 where i = j in the symmetric matrix S, recording the diagonal matrix as D, and calculating and normalizing the Laplacian matrix;
Step 4: decomposing the normalized Laplacian matrix by SVD (singular value decomposition), obtaining the K nearest-neighbor eigenvectors, and assembling them into a nearest-neighbor matrix;
Step 5: inputting the constructed nearest-neighbor matrix as samples into a K-means model for training, completing the clustering.
2. The Spark-based high-dimensional sparse text data clustering method according to claim 1, wherein a method for calculating the similarity between samples and constructing the symmetric matrix S is designed and implemented as follows:
d1, solving Euclidean distance between the data block B in each calculation node and a sparse vector set P to represent similarity, and obtaining an upper triangular matrix U by using the parallel upper triangular method;
d2, storing the distance matrix obtained in the step D1 by adopting a point coordinate mode COO, and recording a point set as: CO;
d3, interchanging the row coordinate and the column coordinate of the CO midpoint coordinate, and recording as CO';
d4, combining the point sets of the step D2 and the step D3 together to form a symmetric matrix S.
CN201610988558.4A 2016-11-09 2016-11-09 Spark-based high-dimensional sparse text data clustering method Active CN106570173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610988558.4A CN106570173B (en) 2016-11-09 2016-11-09 Spark-based high-dimensional sparse text data clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610988558.4A CN106570173B (en) 2016-11-09 2016-11-09 Spark-based high-dimensional sparse text data clustering method

Publications (2)

Publication Number Publication Date
CN106570173A CN106570173A (en) 2017-04-19
CN106570173B true CN106570173B (en) 2020-09-29

Family

ID=58540842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610988558.4A Active CN106570173B (en) 2016-11-09 2016-11-09 Spark-based high-dimensional sparse text data clustering method

Country Status (1)

Country Link
CN (1) CN106570173B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115543991A (en) * 2022-12-02 2022-12-30 湖南工商大学 Data restoration method and device based on feature sampling and related equipment

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197272A (en) * 2018-01-05 2018-06-22 北京搜狐新媒体信息技术有限公司 A kind of update method and device of distributed association rules increment
CN108805174B (en) * 2018-05-18 2022-03-29 广东惠禾科技发展有限公司 Clustering method and device
CN111767941B (en) * 2020-05-15 2022-11-18 上海大学 Improved spectral clustering and parallelization method based on symmetric nonnegative matrix factorization
CN112988693A (en) * 2021-03-26 2021-06-18 武汉大学 Spectral clustering algorithm parallelization method and system in abnormal data detection

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903258A (en) * 2014-02-27 2014-07-02 西安电子科技大学 Method for detecting changes of remote sensing image based on order statistic spectral clustering
CN105354243A (en) * 2015-10-15 2016-02-24 东南大学 Merge clustering-based parallel frequent probability subgraph searching method
CN105808581A (en) * 2014-12-30 2016-07-27 Tcl集团股份有限公司 Data clustering method and device and Spark big data platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10496642B2 (en) * 2014-10-03 2019-12-03 The Regents Of The University Of Michigan Querying input data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903258A (en) * 2014-02-27 2014-07-02 西安电子科技大学 Method for detecting changes of remote sensing image based on order statistic spectral clustering
CN105808581A (en) * 2014-12-30 2016-07-27 Tcl集团股份有限公司 Data clustering method and device and Spark big data platform
CN105354243A (en) * 2015-10-15 2016-02-24 东南大学 Merge clustering-based parallel frequent probability subgraph searching method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
梁彦, "Research on the Parallelization of Data Mining Algorithms Based on the Distributed Platforms Spark and YARN", China Master's Theses Full-text Database, Information Science and Technology, 2015-01-15, No. 1, pp. I138-744 *
吴哲夫 et al., "Improvement and Parallel Implementation of the K-means Clustering Algorithm Based on the Spark Platform", 《互联网天地》 (China Internet), 2016, pp. 44-50 *
张吉文, "Research on Text Clustering Algorithms Based on Spectral Clustering", China Master's Theses Full-text Database, Information Science and Technology, 2016, pp. I138-7927 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115543991A (en) * 2022-12-02 2022-12-30 湖南工商大学 Data restoration method and device based on feature sampling and related equipment
CN115543991B (en) * 2022-12-02 2023-03-10 湖南工商大学 Data restoration method and device based on feature sampling and related equipment

Also Published As

Publication number Publication date
CN106570173A (en) 2017-04-19

Similar Documents

Publication Publication Date Title
Wang et al. A survey on learning to hash
Shen et al. Deep asymmetric pairwise hashing
Li et al. A deeper look at facial expression dataset bias
Zhu et al. Exploring auxiliary context: discrete semantic transfer hashing for scalable image retrieval
Cakir et al. Hashing with mutual information
CN106570173B (en) Spark-based high-dimensional sparse text data clustering method
Iscen et al. Memory vectors for similarity search in high-dimensional spaces
Wu et al. Online multi-modal distance metric learning with application to image retrieval
Chen et al. Parallel spectral clustering in distributed systems
Wang et al. Fast approximate k-means via cluster closures
Liu et al. Supervised hashing with kernels
Mu et al. Weakly-supervised hashing in kernel space
WO2013129580A1 (en) Approximate nearest neighbor search device, approximate nearest neighbor search method, and program
Qian et al. Unsupervised feature selection for multi-view clustering on text-image web news data
Qin et al. Fast action retrieval from videos via feature disaggregation
CN109784405B (en) Cross-modal retrieval method and system based on pseudo-tag learning and semantic consistency
Babenko et al. Similarity metrics for categorization: from monolithic to category specific
US20100299379A1 (en) Non-Negative Matrix Factorization as a Feature Selection Tool for Maximum Margin Classifiers
CN114299362A (en) Small sample image classification method based on k-means clustering
Li et al. Sub-selective quantization for learning binary codes in large-scale image search
Zhang et al. Dataset-driven unsupervised object discovery for region-based instance image retrieval
Mithun et al. Generating diverse image datasets with limited labeling
Duan et al. Minimizing reconstruction bias hashing via joint projection learning and quantization
Magliani et al. An efficient approximate kNN graph method for diffusion on image retrieval
CN110209895B (en) Vector retrieval method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230224

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Patentee after: Yami Technology (Guangzhou) Co.,Ltd.

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS