CN115129713A - Data retrieval method, data retrieval device, computer equipment and storage medium

Data retrieval method, data retrieval device, computer equipment and storage medium

Info

Publication number
CN115129713A
Authority
CN
China
Prior art keywords
matrix
target
data
similarity
sample
Legal status
Pending
Application number
CN202210679144.9A
Other languages
Chinese (zh)
Inventor
赵文哲
林庆泓
蒋杰
郭春超
王红法
刘威
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210679144.9A
Publication of CN115129713A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2237 Vectors, bitmaps or matrices
    • G06F 16/2255 Hash tables
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • G06F 16/285 Clustering or classification


Abstract

The present application relates to a data retrieval method, apparatus, computer device, computer-readable storage medium and computer program product. The method comprises the following steps: acquiring a target feature matrix of target data, and acquiring a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, both obtained by hash coding training on a sample feature matrix of the sample data, the anchor points being the clustering centers of the sample data; calculating a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix; performing score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data; generating a target hash code matched with the target data according to the target score matrix; and performing data retrieval through the target hash code to determine a retrieval result of the target data. By adopting the method of the embodiments of the present application, the performance of data retrieval can be improved.

Description

Data retrieval method, data retrieval device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data retrieval method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product.
Background
With the rapid development of the internet, multimedia data such as images, texts and videos have grown rapidly, and large-scale data retrieval has become a research hotspot. In the face of massive data, nearest neighbor retrieval offers broader applicability than exact retrieval and has therefore become a key technology in information retrieval. Hash techniques have received increasing attention due to their low storage cost and high query efficiency and are widely used in data retrieval.
The hash algorithm may encode high-dimensional data into a low-dimensional compact binary hash code. In the conventional technology, because the cosine similarity of two samples tends to be distributed in the range greater than 0 and less than 1, the $[0,+1]$ cosine similarity of high-dimensional vectors can be reconstructed from the cosine similarity of $\{0,+1\}$ hash codes, which reduces the quantization error of the hash codes and thereby improves data retrieval performance. However, the cosine similarity of some samples may still fall anywhere within $[-1,+1]$; in this case, the conventional method has certain limitations, resulting in poor data retrieval performance.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a data retrieval method, an apparatus, a computer device, a computer-readable storage medium and a computer program product that can improve the performance of data retrieval.
In a first aspect, the present application provides a data retrieval method. The method comprises the following steps:
acquiring a target feature matrix of target data, and acquiring a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, both obtained by hash coding training on a sample feature matrix of the sample data; the anchor points are the clustering centers of the sample data;
calculating a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix;
performing score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data;
generating a target hash code matched with the target data according to the target score matrix;
and performing data retrieval through the target hash code to determine a retrieval result of the target data.
In a second aspect, the present application further provides a data retrieval device. The device comprises:
a data acquisition module, configured to acquire a target feature matrix of target data, and acquire a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, both obtained by hash coding training on a sample feature matrix of the sample data; the anchor points are the clustering centers of the sample data;
a similarity calculation module, configured to calculate a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix;
a score calculation module, configured to perform score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data;
a data coding module, configured to generate a target hash code matched with the target data according to the target score matrix;
and a result determining module, configured to perform data retrieval through the target hash code and determine a retrieval result of the target data.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor that implements the following steps when executing the computer program:
acquiring a target feature matrix of target data, and acquiring a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, both obtained by hash coding training on a sample feature matrix of the sample data; the anchor points are the clustering centers of the sample data;
calculating a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix;
performing score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data;
generating a target hash code matched with the target data according to the target score matrix;
and performing data retrieval through the target hash code to determine a retrieval result of the target data.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
acquiring a target feature matrix of target data, and acquiring a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, both obtained by hash coding training on a sample feature matrix of the sample data; the anchor points are the clustering centers of the sample data;
calculating a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix;
performing score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data;
generating a target hash code matched with the target data according to the target score matrix;
and performing data retrieval through the target hash code to determine a retrieval result of the target data.
In a fifth aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the following steps:
acquiring a target feature matrix of target data, and acquiring a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, both obtained by hash coding training on a sample feature matrix of the sample data; the anchor points are the clustering centers of the sample data;
calculating a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix;
performing score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data;
generating a target hash code matched with the target data according to the target score matrix;
and performing data retrieval through the target hash code to determine a retrieval result of the target data.
According to the data retrieval method, the data retrieval device, the computer device, the computer-readable storage medium and the computer program product, hash coding training is performed based on the sample feature matrix of the sample data to obtain the similarity parameter matrix between anchor points and the parameter relation matrix between the anchor points and the sample data, the anchor points being the clustering centers of the sample data. These parameters can therefore be obtained and used directly when hash coding the target data and generating the target hash code of the target data, which improves the processing efficiency of hash coding the target data. The target feature matrix of the target data is acquired, and the target similarity matrix matched with the target data is calculated according to the similarity parameter matrix obtained by hash coding training and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix. By introducing the kernel similarity, the kernel similarity values of arbitrarily distributed target data can be made to fall within a preset value range; by introducing the anchor points, the similarity between data items is approximated by their similarities to the anchor points, so that the dimension of the target feature matrix of the target data is expanded and the subsequently obtained target hash code can be longer. That is, the target hash code can break through the dimension limitation of the target feature matrix, which facilitates subsequent data retrieval. Furthermore, score calculation is performed on the target similarity matrix based on the parameter relation matrix to obtain the target score matrix of the target data, and the target hash code matched with the target data is generated according to the target score matrix, which improves the accuracy of the generated target hash code. Finally, data retrieval can be performed through the target hash code to determine the retrieval result of the target data, improving data retrieval efficiency and performance.
Drawings
FIG. 1 is a diagram of an exemplary data retrieval system;
FIG. 2 is a schematic diagram illustrating a similarity processing method of a conventional data retrieval method according to an embodiment;
FIG. 3 is a diagram illustrating a similarity determination method in a conventional data retrieval method according to an embodiment;
FIG. 4 is a schematic diagram illustrating similarity distribution of a conventional data retrieval method according to an embodiment;
FIG. 5 is a flow diagram illustrating a method for data retrieval in one embodiment;
FIG. 6 is a diagram illustrating a Gaussian kernel function distribution of data for a data retrieval method according to an embodiment;
FIG. 7 is a diagram illustrating the similarity between data and an anchor point for a data retrieval method in accordance with one embodiment;
FIG. 8 is a flow chart illustrating a method for retrieving data in an exemplary embodiment;
FIG. 9 is a graph showing the search performance of the data search method in one embodiment;
FIG. 10 is a block diagram showing the structure of a data retrieval device according to an embodiment;
FIG. 11 is a diagram of the internal structure of a computer device in one embodiment;
FIG. 12 is a diagram of the internal structure of a computer device in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that the data referred to in the present application, including but not limited to sample data, target data, candidate data, etc. for analysis, are data that are fully authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant countries and regions.
In one embodiment, the data retrieval method provided by the present application can be applied to an application environment as shown in fig. 1, where the application environment relates to the terminal 102 and the server 104. In some embodiments, the terminal 106 is also involved. The terminals 102 and 106 communicate with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be placed on the cloud or other server.
Server 104 may obtain sample data from terminal 102 and/or terminal 106 and determine the clustering centers of the sample data as anchor points. The sample data may be data stored in the terminal 102 or the terminal 106, or may be obtained from a public data set. The server 104 performs hash coding training based on the sample feature matrix of the sample data to obtain a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and the sample data. Then, the server 104 may obtain target data from the terminal 102 and/or the terminal 106, where the target data is independent of the sample data, and determine a target feature matrix of the target data. A target similarity matrix matched with the target data is calculated according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix; score calculation is performed on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data; and a target hash code matched with the target data is generated according to the target score matrix. Therefore, the server 104 can perform data retrieval through the target hash code, determine the retrieval result of the target data, and send the retrieval result to the terminal 102 and/or the terminal 106 for display, so as to implement data display or data recommendation.
In one embodiment, the application environment may only involve the terminal 102 in case the data processing capabilities of the terminal 102 meet the data processing requirements. Specifically, the terminal 102 acquires sample data, determines a clustering center of the sample data as an anchor point, and performs hash coding training based on a sample feature matrix of the sample data to obtain a similarity parameter matrix between anchor points and a parameter relationship matrix between the anchor points and the sample data. Then, the terminal 102 acquires a target feature matrix of the target data, finally generates a target hash code of the target data, performs data retrieval through the target hash code, and determines and displays a retrieval result of the target data to realize data display or data recommendation.
The terminals 102 and 106 may be, but are not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart car-mounted devices, and the like. The portable wearable device may be a smart watch or bracelet, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
With the rapid development of the internet, multimedia data such as images, texts and videos have grown rapidly, and large-scale data retrieval has become a research hotspot. In the face of massive data, Approximate Nearest Neighbor (ANN) retrieval offers broader applicability than exact retrieval and has therefore become a key technology in information retrieval. Hash techniques have received increasing attention due to their low storage cost and high query efficiency and are widely applied to data retrieval.
The hash algorithm may encode high-dimensional data into a low-dimensional compact binary hash code. Specifically, given a d-dimensional sample feature vector $x_i \in \mathbb{R}^{1 \times d}$, the hashing technique aims to learn a mapping

$$h: \mathbb{R}^{1 \times d} \rightarrow \{+1, -1\}^{1 \times r}$$

that encodes the sample feature vector into an r-dimensional binary hash code $b_i \in \{+1, -1\}^{1 \times r}$, where $r \ll d$. After hash coding, the Euclidean distance between two sample feature vectors can be approximated by the Hamming distance, and the computation of the Hamming distance is supported by the bitwise exclusive-or (XOR) operation of a computer, thereby accelerating the distance measurement.
Referring to fig. 2, the sample images include a first image and a second image; the first image feature of the first image is represented as $(x_1, x_2, \ldots, x_n)$ and the second image feature of the second image as $(y_1, y_2, \ldots, y_n)$, and the Euclidean distance between the sample images is expressed as

$$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$

After hash encoding, the first image feature is encoded as $[1, 0, 1, \ldots, 1]$ and the second image feature as $[0, 1, 1, \ldots, 0]$; the Euclidean distance can then be approximated by the Hamming distance, and the distance metric can be accelerated by bit counting.
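For illustration only (this sketch is not part of the original disclosure), the XOR-and-bit-counting computation of the Hamming distance described above can be written as follows; packing the codes into Python integers is an assumption of the sketch:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two equal-length binary hash codes packed
    into integers: XOR marks the differing bits, then count them."""
    return bin(code_a ^ code_b).count("1")

# 10101 vs 01101 differ in the first two bit positions:
assert hamming_distance(0b10101, 0b01101) == 2
```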
The hash algorithm learns hash codes by preserving the similarity of the original vectors in the Hamming space. Referring to fig. 3, common similarity measures include the Euclidean distance and the cosine distance. The Euclidean distance better represents the absolute numerical difference between two vectors, so most hashing methods are designed based on Euclidean-distance similarity. However, the Euclidean distance has the disadvantage of an excessively large range for high-dimensional vectors; in this case, the cosine distance, which represents the relative difference in direction and has a stable range, has better applicability. In practical scenarios, owing to the popularity of deep learning, high-dimensional vectors are a common representation; therefore, the hash algorithm needs to be able to fully mine the cosine similarity information of samples.
In the conventional technology, hash-based retrieval algorithms can be divided into two categories overall: data-independent hash algorithms and data-dependent hash algorithms. The hash function of a data-independent hash algorithm, usually obtained by manual construction or random projection, is independent of the specific training data. For example, Locality-Sensitive Hashing (LSH) is a classical data-independent hashing method that maps original data into binary codes through random projection; after encoding, neighboring samples are similar in Hamming space with high probability. Data-independent hash algorithms tend to require a larger number of bits to obtain adequate performance, which means more storage cost.
Data-dependent hash algorithms train hash functions on given data to obtain more compact hash codes and can be divided into two categories: supervised hashing and unsupervised hashing. Supervised hashing methods use class label information to learn the hash function and hash codes, achieving excellent retrieval performance. However, in practical application scenarios class label information is expensive, and unsupervised hashing has wider applicability because it does not rely on class label information to hash-encode data.
The Iterative Quantization (ITQ) algorithm uses Principal Component Analysis (PCA) to map original data into low-dimensional real-valued features, and then reduces the quantization error caused by mapping the low-dimensional real-valued features onto the Hamming space through orthogonal rotation. The Spectral Hashing (SH) algorithm is a hashing method based on manifold learning: an adjacency matrix $W \in \mathbb{R}^{n \times n}$ is constructed from the original features, and the eigenvectors of the graph Laplacian matrix are then solved and quantized to obtain the binary codes. Although the spectral hashing algorithm achieves good performance by exploring the local structure of the data, the cost of constructing the adjacency matrix $W$ becomes very high as the number of samples $n$ increases.
To overcome this problem, the Anchor Graph Hashing (AGH) algorithm was proposed. It first clusters the training samples to obtain $m$ clustering centers, called anchors. An approximate adjacency matrix is constructed by calculating the similarity between samples and anchors, which significantly reduces the computational complexity of constructing the similarity.
The above hash algorithms mainly learn hash codes by keeping the similarity consistent before and after hash coding in the Euclidean space. In addition, some researchers have focused on the cosine distance information of samples. For example, the Angular Quantization-based Binary Codes (AQBC) algorithm proposes the following loss function:

$$\max_{\{b_i\},\, R} \; \sum_i \frac{b_i^\top R^\top x_i}{\|b_i\|_2}, \quad \text{s.t. } b_i \in \{0,1\}^r,\; R^\top R = I,$$

where $x_i$ is a sample feature whose values are non-negative, $b_i \in \{0,1\}^r$ is the hash code corresponding to the sample characterized by $x_i$, and $R$ is an orthogonal projection matrix learned by minimizing the angle between each sample feature and its own hash code. However, the angular quantization hash code algorithm assumes that the values of the sample vector $x_i$ are non-negative, whereas the values in actual sample feature vectors may be positive or negative, so the algorithm has great limitations in practical application. Moreover, the angular quantization hash code algorithm learns the hash code based on single samples and to some extent ignores the local structure information of the data.
After extensive observation, it was determined that for a sample feature matrix $X \in [-1,+1]^{n \times d}$ after L2 regularization, the cosine similarity $XX^\top$ is ideally distributed in $[-1,+1]$ but actually exhibits the asymmetric distribution shown in fig. 4; that is, most values of the cosine similarity tend to be distributed in the range greater than 0 and less than 1. Therefore, the Angular Quantization (AQ) algorithm was proposed, which aims to use the cosine similarity of the $\{0,+1\}$ hash code matrix $B$,

$$\hat{B}\hat{B}^\top \quad \text{with} \quad \hat{b}_i = \frac{b_i}{\|b_i\|_2},$$

to reconstruct the cosine similarity $XX^\top \in [0,+1]^{n \times n}$ of the high-dimensional vectors and thereby reduce the quantization error. The objective function is as follows:

$$\min_{B \in \{0,1\}^{n \times r}} \left\| XX^\top - \hat{B}\hat{B}^\top \right\|_F^2.$$

Since the cosine similarity of sample data tends to be distributed in the range greater than 0 and less than 1, reconstructing the $[0,+1]$ cosine similarity of the high-dimensional vectors with the cosine similarity of the $\{0,+1\}$ hash codes achieves good performance in most cases. However, the cosine similarity of some sample data may still fall anywhere within $[-1,+1]$; in this case, the angular quantization algorithm still has certain limitations.
Therefore, the embodiments of the present application aim to extend the angular quantization algorithm and provide a Gaussian-kernel-based Angular Quantization (AGQ) algorithm, so that samples with arbitrarily distributed cosine similarity can be approximated by $\{0,+1\}$ hash codes with a smaller quantization error; meanwhile, the hash codes can break through the dimension limitation of the original features, so that longer hash codes can be achieved. This effectively improves the accuracy of determining the hash codes corresponding to the data and achieves better retrieval performance during data retrieval.
In one embodiment, as shown in fig. 5, a data retrieval method is provided, which is described by taking the method as an example applied to the server 104 in fig. 1, and includes:
Step S202, a target feature matrix of target data is obtained, and a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, obtained by hash coding training on the sample feature matrix of the sample data, are acquired; the anchor points are the clustering centers of the sample data.
Sample data refers to data used by the hash coding training process. The target data refers to data that needs to be subjected to hash coding, and the data types of the target data and the sample data can be, but are not limited to, images, texts, videos and the like. The characteristics of the sample data are called sample characteristic vectors, and the sample characteristic matrix refers to a characteristic matrix formed by the sample characteristic vectors of the sample data. The feature of the target data is called a target feature vector, and the target feature matrix refers to a feature matrix composed of the target feature vectors of the target data. The hash code training refers to a process of training a loss function of the hash code based on a sample feature matrix of sample data, and the hash code corresponding to the sample data can be determined through the hash code training.
Specifically, when data retrieval or feature quantization is required, the target data may be obtained first, and feature extraction processing may be performed on the target data to obtain the target feature matrix of the target data. The feature extraction processing mode may be set according to the data type of the target data and actual technical needs. For example, when the target data is text data, a pre-trained text model, such as a natural language processing model, may be used for feature extraction; when the target data is image data, feature extraction can be performed using a histogram of oriented gradients (HOG) algorithm or a scale-invariant feature transform (SIFT) algorithm. It can be understood that the processing mode of obtaining the target feature matrix of the target data is consistent with the processing mode of obtaining the sample feature matrix of the sample data during hash coding training.
It should be noted that, for one sample data or one target data, each feature of the sample data or the target data can be represented as a feature vector of dimension 1 × d, so that an n × d-dimensional feature matrix composed of feature vectors can be obtained, that is, the feature vectors are row vectors of the feature matrix. It will be understood that when the number of rows and columns of the feature matrix are interchanged, the feature vector may also be a column vector of the feature matrix. In this embodiment, to distinguish the expression form of the feature, the feature vector is represented by a lower case letter, for example, the feature vector is represented by X, and the feature matrix formed by the feature vectors is represented by an upper case letter corresponding to the lower case letter, for example, the feature matrix formed by the feature vector X is represented by X.
The anchor points refer to determined positioning mark points, and the number of the anchor points is multiple. In this embodiment, the anchor point may determine the manner according to the training data. Specifically, a sample feature matrix of the sample data is obtained, the sample data matrix is clustered to obtain a cluster center of the sample data, and the cluster center of the sample data is determined as an anchor point, namely the anchor point is the cluster center of the sample data. The clustering process may be to divide the sample data into a plurality of clusters based on the similarity or distance between the sample feature matrices, where the similarity of the sample data in each cluster is high, the similarity of the sample data in different clusters is low, and the clustering center is the center of the cluster. The algorithm of the clustering process may be any one of a partition method, a hierarchy method, a density algorithm, a mesh algorithm, and the like.
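For illustration (not part of the original disclosure), anchor selection with a plain k-means clustering, one of the partition methods named above, might look as follows; the number of anchors m and the iteration count are assumed parameters of the sketch:

```python
import numpy as np

def compute_anchors(X_s: np.ndarray, m: int, iters: int = 50,
                    seed: int = 0) -> np.ndarray:
    """Cluster the sample feature matrix X_s (n, d) and return the m
    cluster centers as the anchor feature matrix M (m, d)."""
    X_s = np.asarray(X_s, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X_s[rng.choice(len(X_s), size=m, replace=False)].copy()
    for _ in range(iters):
        # assign each sample to its nearest center
        d = ((X_s[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned samples
        for j in range(m):
            pts = X_s[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers
```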
The similarity parameter matrix is a matrix determined based on the similarity between anchor points in the hash coding training process. The parameter relation matrix is a matrix obtained after parameter operation is performed based on the similarity between the anchor point and the sample data in the hash coding training process, and the number of the parameter relation matrixes is more than one. The parameter relation matrix can reduce quantization errors in the process of hash coding processing of the target characteristic matrix.
Specifically, after the hash coding training is performed on the sample feature matrix based on the sample data, a similarity parameter matrix between anchor points and a parameter relationship matrix between the anchor points and the sample data can be obtained. Therefore, the similarity parameter matrix and the parameter relation matrix can be directly used in the subsequent processing process to carry out hash coding on the target data and determine the hash coding corresponding to the target data.
It should be noted that the calculation method of the similarity between the anchor point and the anchor point needs to be consistent with the calculation method of the similarity between the anchor point and the sample data, so as to ensure the consistency and accuracy of the obtained matrix.
And step S204, calculating a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix.
The anchor features of an anchor are called anchor feature vectors, which may constitute an anchor feature matrix. The kernel function refers to a function that can realize a transformation from a low-dimensional space to a high-dimensional space, and includes, but is not limited to, a linear kernel function, a polynomial kernel function, a gaussian kernel function, and the like. The kernel similarity refers to a similarity parameter between the anchor point feature matrix and the target feature matrix, which is obtained by calculating through a kernel function. The target similarity matrix is a matrix matched with target data obtained by performing parameter operation according to the kernel similarity and the similarity parameter matrix.
The type of kernel function can be chosen according to actual technical needs. In this embodiment, the similarity of the original data, distributed in $[-1,+1]$, is mapped to a new space by means of a kernel function so that its value range becomes $[0,+1]$. The mapped similarity can then be fed into the Angular Quantization (AQ) algorithm to reduce the quantization error the AQ algorithm introduces. For ease of understanding, the Gaussian kernel function is selected in this embodiment. The Gaussian kernel function measures the similarity of any point $x_i$ in space to a certain center point $x_j$, and its calculation formula is expressed as:

$$K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),$$

where $x_i$ and $x_j$ respectively represent feature vectors of the data, $\|x_i - x_j\|$ represents the Euclidean distance between $x_i$ and $x_j$ (as the distance between the two vectors increases, the value of the Gaussian kernel function decreases monotonically), and $\sigma^2$ represents a hyper-parameter.
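For illustration (not part of the original disclosure), the Gaussian kernel similarities between a feature matrix and the anchor feature matrix can be computed in vectorized form as in this minimal sketch; sigma is the hyper-parameter mentioned above:

```python
import numpy as np

def gaussian_kernel(X: np.ndarray, M: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Pairwise Gaussian kernel similarities between the rows of X (data
    features, (n, d)) and the rows of M (anchor features, (m, d));
    the returned (n, m) entries lie in (0, 1]."""
    # squared Euclidean distances via ||x - m||^2 = ||x||^2 - 2 x.m + ||m||^2
    sq = (np.sum(X**2, axis=1, keepdims=True)
          - 2.0 * X @ M.T
          + np.sum(M**2, axis=1))
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))
```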
Referring to the Gaussian kernel distribution diagram of fig. 6, it can be seen that for arbitrarily distributed data $x_i$ and $x_j$, the Gaussian kernel similarity $K(x_i, x_j)$ obtained through the Gaussian kernel function falls within the range $[0,+1]$; it can therefore subsequently be further quantized using an angular quantization algorithm.
In this embodiment, the Anchor Graph idea is introduced: the Gaussian kernel similarity is combined with the anchor points, and the similarity between data points is described by their similarities to the anchor points. The similarity distribution between data and anchor points is shown in fig. 7. Suppose there are arbitrarily distributed data with feature vectors $x_i$ and $x_j$, and the anchor points are $m_1$, $m_2$ and $m_3$; the Gaussian kernel similarity $K(x_i, x_j)$ may then be approximated by the similarities $K(x_i, m_1)$, $K(x_i, m_2)$, $K(x_j, m_2)$ and $K(x_j, m_3)$ between the data and the anchor points. By introducing the anchor points, the dimension of the target feature matrix is expanded, so the length of the finally obtained hash code of the target data can break through the dimension limit of the target feature matrix; the hash code can be longer, its representation is more accurate, and the performance of data retrieval is effectively improved.
It should be noted that, in the foregoing embodiment, the kernel function is taken as a gaussian kernel function as an example, if other types of kernel functions are selected, the kernel similarity between the anchor point feature matrix of the anchor point and the target feature matrix may be determined according to a specific calculation formula of the kernel function, and the target similarity matrix matched with the target data is finally obtained.
And step S206, performing score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data.
The score calculation refers to the operation of the target similarity matrix based on the parameter relation matrix. The target score matrix is obtained by performing score calculation based on the parameter relation matrix and the target similarity matrix. The row vectors of the target score matrix are referred to as score vectors.
In particular, the way of score calculation may be set according to actual technical needs. In this embodiment, the score calculation refers to performing matrix multiplication operation on the parameter relationship matrix and the target similarity matrix, and the obtained target score matrix is a matrix multiplication operation result of the parameter relationship matrix and the target similarity matrix.
And step S208, generating a target hash code matched with the target data according to the target score matrix.
The target hash code is the hash code matched with the target data; the hash code consists of 0 and +1. After the target score matrix is determined, a determined hash code exists correspondingly for each score vector in the target score matrix, and after the hash codes corresponding to all score vectors in the target score matrix are determined, the target hash code of the target data can be determined.
Specifically, an association relationship between the score vector and the hash code may be preset, and the hash code corresponding to each score vector in the target score matrix is determined according to the association relationship. Or, based on the numerical value of each score vector in the target score matrix, performing hash code assignment on each score vector, determining a hash code corresponding to the score vector, and finally generating a target hash code matched with the target data.
And step S210, performing data retrieval through the target hash code, and determining a retrieval result of the target data.
Data retrieval refers to the manner in which data associated with target data is retrieved from a large amount of data. The search result refers to specific data type and data content of data associated with the target data determined by data search, and the data type of the data associated with the target data may be the same as or different from the data type of the target data, which is not limited herein.
The data retrieval is performed through the target hash code, which may be to calculate the similarity between the target hash code and the hash codes of other data, and determine the retrieval result of the target data according to the similarity. The hash codes of other data and the target hash code need to be the same in length, so as to improve the accuracy of the retrieval result. After the retrieval result of the target data is determined, the candidate data corresponding to the retrieval result can be recalled, and then data display or data recommendation and the like can be performed according to actual technical requirements.
Specifically, the similarity between hash codes can be determined using the Hamming distance, which is the number of differing bits between two hash codes of the same length. For example, 1011101 and 1001001 differ at two positions, so the Hamming distance is 2. Data retrieval based on hash codes has low time complexity, space complexity and storage overhead; it can improve data retrieval efficiency and accuracy while saving storage space.
According to the above data retrieval method, hash coding training is performed based on the sample feature matrix of the sample data to obtain the similarity parameter matrix between anchor points and the parameter relation matrix between the anchor points and the sample data, the anchor points being the clustering centers of the sample data. These parameters can therefore be obtained and used directly when hash coding the target data and generating the target hash code of the target data, which improves the processing efficiency of hash coding the target data. The target feature matrix of the target data is acquired, and the target similarity matrix matched with the target data is calculated according to the similarity parameter matrix obtained by hash coding training and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix. By introducing the kernel similarity, the kernel similarity values of arbitrarily distributed target data can be made to fall within a preset value range; by introducing the anchor points, the similarity between data items is approximated by their similarities to the anchor points, so that the dimension of the target feature matrix of the target data is expanded and the subsequently obtained target hash code can be longer. That is, the target hash code can break through the dimension limitation of the target feature matrix, which facilitates subsequent data retrieval. Furthermore, score calculation is performed on the target similarity matrix based on the parameter relation matrix to obtain the target score matrix of the target data, and the target hash code matched with the target data is generated according to the target score matrix, which improves the accuracy of the generated target hash code. Finally, data retrieval can be performed through the target hash code to determine the retrieval result of the target data, improving data retrieval efficiency and performance.
In one embodiment, when the similarity calculation is performed through the kernel function, the adopted feature vectors of the data can be subjected to regularization processing, so that overfitting can be effectively prevented, and generalization capability is improved. Therefore, the regularization process is required to be performed when the target feature matrix of the target data is acquired. Specifically, obtaining a target feature matrix of target data includes: performing feature extraction processing on the target data to obtain an initial feature matrix of the target data; and carrying out regularization processing on the initial characteristic matrix to obtain a target characteristic matrix of the target data.
The initial feature matrix is a feature matrix directly obtained after feature extraction processing is performed on target data. Regularization is a process that effectively prevents over-fitting. The target feature matrix is the initial feature matrix after the regularization process.
The method of extracting features from the target data may be selected according to the data type of the target data, and may specifically use a pre-trained neural network model or a classical algorithm. The regularization process may be determined according to the operation requirements of the kernel function; when the kernel function is a Gaussian kernel function, the regularization may specifically be L2 regularization, also referred to as L2 normalization, of the initial feature matrix, yielding the target feature matrix of the target data, denoted $X_Q = \{x_i\}$.
It will be appreciated that the particular manner of regularization may vary when the kernel is other types of kernels. When the kernel function is adaptable to regularization in various manners, regularization of the initial feature matrix may be L1 regularization or normalization, for example.
In this embodiment, the initial feature matrix of the target data is regularized to obtain a target feature matrix, and the regularized target feature matrix is used in a subsequent operation process, so that overfitting can be effectively prevented, generalization capability is improved, and accuracy of performing hash coding on the target data subsequently is improved.
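For illustration (not part of the original disclosure), a minimal sketch of the L2 normalization of the initial feature matrix described above:

```python
import numpy as np

def l2_normalize_rows(X: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalize each row (feature vector) of the initial feature
    matrix, yielding the target feature matrix X_Q used in the kernel
    computations; eps guards against all-zero rows (an assumption)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)
```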
In one embodiment, taking the case where the kernel function is a Gaussian kernel function, calculating the target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix includes: calculating the Gaussian kernel similarity between the anchor point feature matrix of the anchor points and the target feature matrix through the Gaussian kernel function to obtain a Gaussian kernel similarity matrix between the anchor points and the target data; and performing a matrix product operation on the Gaussian kernel similarity matrix and the similarity parameter matrix to obtain the target similarity matrix matched with the target data.
The Gaussian kernel similarity refers to kernel similarity parameters obtained by performing operation through a Gaussian kernel function, and the Gaussian kernel similarity matrix refers to a matrix formed on the basis of the Gaussian kernel similarity between the anchor point and the target data obtained through calculation. The similarity parameter matrix is obtained based on Hash coding training and can be directly obtained and used when data processing is carried out on target data. Matrix multiplication refers to multiplying a plurality of matrices.
In particular, the anchor points are denoted $m_i$ and the anchor feature matrix is expressed as $M = \{m_i\}$. The Gaussian kernel similarity between the anchor feature matrix and the target feature matrix is calculated through the Gaussian kernel function and expressed as $K(X, M)$. For each target feature vector in the target feature matrix, the corresponding Gaussian kernel similarity is expressed as $K(x_i, M)$. The similarity parameter matrix is expressed as $C$; performing the matrix product operation on the Gaussian kernel similarity matrix and the similarity parameter matrix yields the target similarity matrix matched with the target data, expressed as $F = K(X, M)C$, and each target similarity vector in the target similarity matrix is expressed as $f_i = K(x_i, M)C$.
It should be noted that two matrices can be multiplied when the number of columns of one equals the number of rows of the other; therefore, the positional relationship of the two matrices in the above matrix product operation can be adaptively adjusted according to the actual matrix dimensions.
In this embodiment, a Gaussian kernel function is used to obtain the Gaussian kernel similarity matrix between the anchor points and the target data, so that for arbitrarily distributed data the similarity between the data falls within the range $[0,+1]$. The similarity between data items is approximated by their similarity to the anchor points, and the matrix product of the Gaussian kernel similarity matrix and the similarity parameter matrix yields the target similarity matrix matched with the target data. The dimension of the target feature matrix of the target data is thereby expanded, so the finally obtained hash code length of the target data can break through the limitation of the target feature matrix dimension: the hash code can be longer, its representation is more accurate, and the performance of data retrieval is effectively improved.
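For illustration (not part of the original disclosure), a minimal sketch of the step F = K(X, M)C, assuming the similarity parameter matrix C was produced by hash coding training:

```python
import numpy as np
from scipy.spatial.distance import cdist

def target_similarity(X_q: np.ndarray, M: np.ndarray, C: np.ndarray,
                      sigma: float = 1.0) -> np.ndarray:
    """Gaussian kernel similarities K(X_q, M) between the target features
    and the anchors, multiplied by the trained parameter matrix C."""
    K_xm = np.exp(-cdist(X_q, M, "sqeuclidean") / (2.0 * sigma**2))
    return K_xm @ C  # target similarity matrix F
```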
In one embodiment, the target hash code matched with the target data can be obtained by calculating the target score matrix corresponding to the target similarity matrix and then performing hash code assignment according to the target score matrix. Specifically, performing score calculation on the target similarity matrix based on the parameter relation matrix to obtain the target score matrix of the target data includes: performing score calculation on the target similarity matrix based on the parameter relation matrix to determine an initial score matrix of the target data; and sorting the elements of each score vector in the initial score matrix respectively to obtain the target score matrix of the target data.
The parameter relation matrix is obtained based on Hash coding training and can be directly obtained and used when data processing is carried out on target data. The initial score matrix is a matrix determined directly based on the operation of the parameter relationship matrix and the target similarity matrix. The sorting process means sorting by the size of the element. The target score matrix is obtained by sorting the elements of each score vector in the initial score matrix.
Specifically, through hash coding training, the determined parameter relation matrices comprise a projection matrix and a rotation matrix; the projection matrix is denoted $\Lambda_r$ and the rotation matrix is denoted $R$. The initial score matrix is expressed as $Y$, and each score vector in the initial score matrix as $y_i = f_i \Lambda_r R$. The sorting process may sort by descending numerical value of the elements, so that the elements in each score vector remain in descending order, finally yielding the target score matrix $Y'$ of the target data.
In this embodiment, since the maximum value of the first k score vectors needs to be determined in the subsequent target hash coding process, the elements of each score vector in the initial score matrix are sorted respectively, so that each element keeps descending order, and the target score matrix of the target data is finally obtained.
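For illustration (not part of the original disclosure), a minimal sketch of this score step, assuming the trained projection matrix Λ_r and rotation matrix R are given:

```python
import numpy as np

def score_matrix(F: np.ndarray, Lam_r: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Initial scores Y = F @ Lam_r @ R (i.e. y_i = f_i Λ_r R), then each
    score vector sorted in descending order of its elements."""
    Y = F @ Lam_r @ R            # initial score matrix
    return -np.sort(-Y, axis=1)  # target score matrix, rows in descending order
```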
In one embodiment, after the target score matrix is determined, the target hash code corresponding to the target data may be determined according to the target score matrix. Specifically, generating the target hash code matched with the target data according to the target score matrix includes: comparing, according to the position of each score vector in the target score matrix, the value of each score vector with the values of the score vectors preceding it to obtain a comparison result for each score vector; performing hash code assignment based on the comparison result of each score vector to generate the hash code corresponding to each score vector; and combining the hash codes of the score vectors in score vector order to obtain the target hash code matched with the target data.
The position of the score vector refers to the position of the score vector in the target score matrix, and since the score vector is a row vector of the target score matrix, the position of the score vector can be defined according to the number of rows in which the score vector is located. The hash code assignment refers to assigning the hash code corresponding to the score vector. The score vector order refers to the order of the positions of the score vectors in the target score matrix. For example, the number of rows of the target score matrix is three, that is, three score vectors are included, the hash code corresponding to the score vector in the first row is 1, the hash code corresponding to the score vector in the second row is 0, and the hash code corresponding to the score vector in the third row is 1, so that the target hash code is 101.
Specifically, the hash code corresponding to a score vector is related to the magnitude of its components. Since the hash code in this embodiment consists of 0 and +1, assume that the number of +1 entries in the target hash code is $k$; a component is then assigned

$$b_i = \begin{cases} +1, & \text{if the component value } y_i' \text{ is the maximum among the first } k \text{ score components,} \\ 0, & \text{otherwise.} \end{cases}$$

That is, it is necessary to determine from the comparison result of each score component whether it is the maximum value among the first $k$ score components; the corresponding hash code is +1 when it is the maximum value, and 0 otherwise.
For example, assume that the target score matrix corresponding to the target data is $Y' = [y_1', y_2', \ldots, y_n']$; more specifically, $Y' = [1, 0, 2, 4, 3]$. The component 1 is the maximum value among the first component, so the corresponding hash code is 1; the component 0 is not the maximum value among the first two components, so the corresponding hash code is 0; the component 2 is the maximum value among the first three components, so the corresponding hash code is 1; the component 4 is the maximum value among the first four components, so the corresponding hash code is 1; the component 3 is not the maximum value among the first five components, so the corresponding hash code is 0. Therefore, the hash code value corresponding to each component can be determined as $b_1 = 1$, $b_2 = 0$, $b_3 = 1$, $b_4 = 1$, $b_5 = 0$, so that the target hash code corresponding to the target data can be determined to be 10110.
In this embodiment, the hash code corresponding to each score vector is generated according to the position and the value of each score vector in the target score matrix, and then the hash codes are combined according to the score vector sequence, so as to finally obtain the target hash code matched with the target data, thereby enabling the obtained target hash code to be more accurate.
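For illustration (not part of the original disclosure), a minimal sketch of the assignment rule as exercised by the worked example above, where each component is compared with the maximum of the components up to and including its own position:

```python
import numpy as np

def binarize_scores(y: np.ndarray) -> np.ndarray:
    """Assign +1 to a component exactly when it equals the running maximum
    of the score vector up to its position, and 0 otherwise."""
    running_max = np.maximum.accumulate(y)  # max of y[0..j] at position j
    return (y == running_max).astype(int)

print(binarize_scores(np.array([1, 0, 2, 4, 3])))  # -> [1 0 1 1 0], i.e. 10110
```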
In one embodiment, after the target hash code matched with the target data is determined, data retrieval may be performed based on the target hash code. Specifically, performing data retrieval through the target hash code and determining the retrieval result of the target data includes: according to the target hash code, calculating the similarity between the target hash code and the candidate hash codes corresponding to the candidate data of the target data to obtain a similarity result; based on the similarity result, screening out similar hash codes meeting a similarity condition from the candidate hash codes; and taking the candidate data corresponding to the similar hash codes as the retrieval result of the target data.
Candidate data refers to data that may be associated with the target data, and may be data in a search database. The candidate hash code refers to a hash code matched with the candidate data. The similarity condition refers to a condition that needs to be satisfied by candidate hash codes of candidate data similar to the target data, and the similarity condition may be determined according to a similarity calculation manner, for example, when cosine similarity is calculated, the cosine similarity may be set to be greater than a set cosine similarity threshold, when hamming distance is calculated, the hamming distance may be set to be smaller than a set hamming distance threshold, and both the cosine similarity threshold and the hamming distance threshold may be set according to actual technical requirements. The similar hash codes refer to candidate hash codes which can satisfy a similarity condition.
Specifically, the similarity between the target hash code and the candidate hash codes corresponding to the candidate data of the target data may be calculated by determining a hamming distance between the target hash code and the candidate hash codes to obtain a similarity result. Therefore, similar hash codes meeting the similarity condition can be screened out from the candidate hash codes, candidate data corresponding to the similar hash codes are determined, and the candidate data corresponding to the similar hash codes are used as the retrieval result of the target data.
It should be noted that, in order to save processing time and effectively improve the efficiency of data retrieval in this embodiment, candidate data for the target data may be set in advance according to the target data, so that it is not necessary to determine the similarity between large-scale data and the target data. It is understood that when the data processing capacity of the computer device is sufficient and the retrievable data amount is limited, the similarity between the target data and all other data may be calculated directly without determining candidate data.
In this embodiment, by setting candidate data for the target data and performing data retrieval according to the target hash code and the candidate hash code of the candidate data, processing time can be saved and data retrieval efficiency can be improved.
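As an illustration of this retrieval step, a minimal Python/NumPy sketch follows, assuming the hash codes are stored as 0/1 arrays and using a hand-chosen Hamming distance threshold as the similarity condition; all names are illustrative:

```python
import numpy as np

def retrieve(target_code: np.ndarray, candidate_codes: np.ndarray,
             hamming_threshold: int) -> np.ndarray:
    """Return indices of candidates whose Hamming distance to the
    target hash code is below the threshold (the similarity condition)."""
    # Hamming distance = number of differing bit positions.
    distances = np.count_nonzero(candidate_codes != target_code, axis=1)
    return np.flatnonzero(distances < hamming_threshold)

codes = np.array([[1, 0, 1, 1, 0],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 0, 1]])
print(retrieve(np.array([1, 0, 1, 1, 0]), codes, hamming_threshold=2))  # -> [0 1]
```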
In one embodiment, in the hash coding training process, a similarity parameter matrix between anchor points may be determined based on sample data, and may be directly obtained for use when data processing is performed on target data. Specifically, the method further comprises: calculating the Gaussian kernel similarity between anchor points through a Gaussian kernel function to obtain a Gaussian kernel similarity matrix between the anchor points; and carrying out matrix decomposition processing on the inverse matrix corresponding to the Gaussian kernel similarity matrix between the anchor points to obtain a similarity parameter matrix between the anchor points.
The matrix decomposition processing refers to splitting a matrix into a product of several matrices; the decomposition may be one of triangular (LU) decomposition, QR decomposition, or singular value decomposition, and may be selected according to actual technical requirements. The similarity parameter matrix is related to the anchor points, and the anchor points are the clustering centers of the sample data, i.e., they are determined by clustering the sample data; therefore, once the sample data is determined, the similarity parameter matrix can be determined during the hash coding training process.
Specifically, the Gaussian kernel similarity between anchor points is calculated through a Gaussian kernel function; the Gaussian kernel similarity matrix between anchor points is denoted K(M, M), and its inverse is denoted K^{-1}(M, M), which may be determined by matrix operations. Performing matrix decomposition on the inverse matrix determines K^{-1}(M, M) = C C^T, which yields the similarity parameter matrix C between anchor points.
In this embodiment, the gaussian kernel similarity between anchor points is calculated through a gaussian kernel function, and then a similarity parameter matrix between anchor points is obtained through further processing, so that the similarity parameter matrix is associated with the anchor points.
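The following sketch illustrates this step, assuming a standard Gaussian (RBF) kernel, a small jitter term for numerical stability, and a symmetric eigen-factorization of the inverse kernel matrix; these implementation choices are assumptions, and a Cholesky factorization of the inverse would serve equally:

```python
import numpy as np

def gaussian_kernel(A: np.ndarray, B: np.ndarray, sigma2: float) -> np.ndarray:
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma2))

def similarity_parameter_matrix(M: np.ndarray, sigma2: float) -> np.ndarray:
    """Factor the inverse anchor kernel: K^{-1}(M, M) = C C^T."""
    K_MM = gaussian_kernel(M, M, sigma2)
    K_inv = np.linalg.inv(K_MM + 1e-8 * np.eye(len(M)))  # jitter for stability
    # Symmetric eigen-factorization: K_inv = V diag(w) V^T, so C = V sqrt(w).
    w, V = np.linalg.eigh(K_inv)
    C = V * np.sqrt(np.clip(w, 0.0, None))
    return C  # C @ C.T reproduces K_inv
```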
In one embodiment, for ease of understanding, the following description is made with respect to a hash code training process. In the process of Hash coding training, a parameter relation matrix between an anchor point and sample data can be determined based on the sample data, and the anchor point and the sample data can be directly obtained and used when data processing is carried out on target data.
Specifically, the method further comprises: calculating a sample similarity matrix matched with the sample data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the sample feature matrix; performing matrix decomposition processing according to the sample similarity matrix to obtain an initial parameter relation matrix between the anchor points and the sample data; based on the sample similarity matrix and the initial parameter relation matrix, performing loss function calculation on the hash codes matched with the sample features to obtain a hash coding loss function of the sample data; and updating the data of the initial parameter relation matrix according to the hash coding loss function to obtain a parameter relation matrix between the anchor points and the sample data.
The kernel similarity between the anchor point feature matrix and the sample feature matrix is likewise determined by a kernel function, and the kernel function in this embodiment is consistent with the kernel function used when processing the target data. The sample similarity matrix is the matrix, matched with the sample data, obtained by a parameter operation between this kernel similarity and the similarity parameter matrix calculated from the sample data. The initial parameter relation matrix is the matrix obtained by directly performing matrix decomposition on the sample similarity matrix. The hash coding loss function refers to the loss function of the hash coding training process, determined with reference to the angular quantization hash (AQ) algorithm and the parameter calculation manners of this embodiment. The parameter relation matrix is determined when the hash coding training ends, and comprises a projection matrix Λ_r and a rotation matrix R.
Specifically, the sample feature matrix is denoted X ∈ R^{n×d}, the anchor points are denoted M ∈ R^{m×d}, and the hash code length is denoted r. Referring to FIG. 7, the Gaussian kernel similarity between sample data can be approximated through the Gaussian kernel similarity between the sample data and the anchor points, and the relationship can be expressed in the following form:

XX^T ≈ K(X, M) K^{-1}(M, M) K(M, X)^T

Performing matrix decomposition on the inverse matrix K^{-1}(M, M) corresponding to the Gaussian kernel similarity matrix between anchor points gives:

K^{-1}(M, M) = C C^T

where C is the similarity parameter matrix between anchor points. Thus, the following expression holds:

XX^T ≈ K(X, M) K^{-1}(M, M) K(M, X)^T = K(X, M) C C^T K(M, X)^T
According to the above expression, a sample similarity matrix matched with the sample data can be defined as F = K(X, M)C. The dimension of the sample similarity matrix F is n × m, where m is the number of anchor points. The original hash coding loss function (of the angular quantization hash algorithm) is expressed as:

max_{B,R} Σ_{i=1}^{n} b_i^T R^T x_i / ||b_i||_2,  s.t. b_i ∈ {0,1}^r, R^T R = I

Based on the sample similarity matrix F, the original hash coding loss function can be approximated by replacing each sample feature x_i with the corresponding row f_i of F; that is, the hash coding loss function in this embodiment is expressed as:

max_{B,R} Σ_{i=1}^{n} b_i^T R^T f_i / ||b_i||_2

where B = [b_1, …, b_n]^T denotes the hash codes of the sample data, and the b_i ∈ {0,1}^r are discrete variables. For convenience of solution, a continuous matrix Y is introduced as an intermediate variable, and the hash coding loss function of this embodiment is converted into a two-term objective: a first term fitting Y to the sample similarity matrix, FF^T ≈ YY^T, and a second term quantizing the rows of Y into the discrete hash codes:

max_B Σ_{i=1}^{n} b_i^T y_i / ||b_i||_2
in the above hash coding loss function, an initial parameter relationship matrix may be determined according to the intermediate variable Y, and then, by performing operation on the hash coding loss function, that is, by performing data update on the initial parameter relationship matrix, a parameter relationship matrix between the anchor point and the sample data may be finally obtained.
In this embodiment, the original hash coding loss function is combined with each parameter calculation method in this embodiment to obtain the hash coding loss function, and the hash coding loss function is solved to finally obtain a parameter relationship matrix between the anchor point and the sample data, so that the accuracy of the obtained parameter relationship matrix can be improved.
In one embodiment, solving the initial parameter relation matrix between the anchor points and the sample data according to the hash coding loss function can be regarded as solving the first term of the hash coding loss function, i.e., the term fitting FF^T ≈ YY^T, to obtain the initial parameter relation matrix, namely an initial projection matrix Λ_r and an initial rotation matrix R.
Specifically, the matrix decomposition processing is performed according to the sample similarity matrix to obtain an initial parameter relationship matrix between the anchor point and the sample data, and the method includes: calculating the product of the transposition of the sample similarity matrix and the sample similarity matrix to obtain a matrix product result; and carrying out matrix decomposition processing on the matrix product result to obtain an initial parameter relation matrix between the anchor point and the sample data.
Specifically, the transpose F^T of the sample similarity matrix is multiplied by the sample similarity matrix F to obtain the matrix product result F^T F, which is used to solve the first term of the hash coding loss function. Performing matrix decomposition on the matrix product result gives the following expression:

F^T F = H Λ H^T

where Λ is the diagonal matrix of eigenvalues and H is the eigenvector matrix. Taking the first r columns yields Λ_r ∈ R^{m×r}, where r is the hash code length; that is, the initial projection matrix Λ_r is determined. Because the dimension of F is n × m, the value of r can be larger than d; in other words, the introduction of the anchor points means that the finally obtained hash code breaks through the limitation of the feature dimension. Then, using the properties of the matrix decomposition, it can be determined that:

FF^T ≈ (FΛ_r R)(FΛ_r R)^T

where R is an orthogonal matrix, i.e., the initial rotation matrix R. The continuous variable introduced in the above embodiment can be defined as Y = (FΛ_r R) ∈ R^{n×r}, to facilitate the subsequent solution of the hash coding loss function.
In this embodiment, by processing the first item in the hash coding loss function, the solution of the hash coding loss function is more targeted, and the speed of the hash coding training is increased.
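A short sketch of this construction, under the assumption (one reading of the notation above) that the decomposition is taken over F^T F and that the projection keeps the eigenvectors of the r largest eigenvalues:

```python
import numpy as np

def initial_projection(F: np.ndarray, r: int) -> np.ndarray:
    """Eigendecompose F^T F = H Lambda H^T and keep the top-r eigenvectors
    as the initial projection Lambda_r in R^{m x r}."""
    w, H = np.linalg.eigh(F.T @ F)    # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:r]   # indices of the r largest eigenvalues
    return H[:, order]                # m x r

# With Y = F @ Lambda_r @ R for any r x r orthogonal R, the product
# Y @ Y.T approximates F @ F.T up to the rank-r spectral truncation.
```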
In one embodiment, after the initial parameter relation matrix is determined, solving the parameter relation matrix between the anchor points and the sample data can be regarded as solving the second term of the hash coding loss function, max Σ_i b_i^T y_i / ||b_i||_2. This can be solved iteratively by alternating optimization: the parameters are first initialized, and then one variable is fixed while the other is updated.
Specifically, according to the hash coding loss function, performing data update on the initial parameter relationship matrix to obtain a parameter relationship matrix between the anchor point and the sample data, including: taking the numerical minimization of the Hash coding loss function as a target, taking the initial parameter relation matrix as a fixed value, and updating the Hash coding matrix matched with the sample data to obtain an updated coding matrix; and updating the initial parameter relation matrix by taking the updated coding matrix as a fixed value, and obtaining the parameter relation matrix between the anchor point and the sample data when the data update reaches the update finishing condition.
The updated coding matrix refers to the hash coding matrix used in the hash coding training process, i.e., the discrete variable B in the above embodiment. The update ending condition refers to the condition corresponding to the end of the hash coding training and can be set according to actual technical requirements; for example, it may be set as the number of iterations reaching a set iteration count, or as convergence of the hash coding loss function.
Specifically, the expression Y = (FΛ_r R) ∈ R^{n×r} of the intermediate variable has been determined in the above embodiment. Thus, the second term of the hash coding loss function can be expressed as follows:

max_{B,R} Σ_{i=1}^{n} b_i^T y_i / ||b_i||_2

where R can be understood as rotating FΛ_r so as to reduce the quantization error. Considering that b_i/||b_i||_2 is L2-normalized, i.e., || b_i/||b_i||_2 ||_2 = 1, FΛ_r is taken as a whole and likewise L2-normalized in order to avoid the influence of the norm range. The solution objective of the second term can then be converted into solving:

max_{B,R} Σ_{i=1}^{n} ⟨ y_i, b_i/||b_i||_2 ⟩

where ⟨·,·⟩ denotes the inner product between the rows of Y and the normalized codes b_i/||b_i||_2; F and Λ_r are known variables, while R and B are the variables to be solved. First, R is initialized randomly; then one variable is fixed and the other is updated, that is, R is fixed to update B, and B is fixed to update R. When the data update reaches the update ending condition, the parameter relation matrix between the anchor points and the sample data is finally obtained.
Specifically, when B is updated with R fixed, the solution objective of the second term can be converted into solving, for each row vector y_i of Y in turn:

max_{b_i} b_i^T y_i / ||b_i||_2

Because b_i ∈ {0,1}^r, the number of +1s in b_i is uncertain, but its range is known to be [1, r]. If the number of +1s in b_i is assumed to be k, then the positions of the +1s in b_i must correspond to the first k maxima of y_i, so that the term b_i^T y_i is maximized; in that case ||b_i||_2 = √k. Thus, k ∈ [1, r] is traversed, the corresponding value score_k = (b^(k))^T y_i / √k is recorded for each k, and finally k* = arg max_k(score_k) is used to obtain b_i, where arg max is the function returning the argument that maximizes the function. The calculation process of the above embodiment can be described as follows:
Algorithm 1: solution of b
Input: y = f Λ_r R;
Define score = [0, …, 0]_{1×r};
Sort the elements of y so that they are in descending order;
For k = 1, …, r:
  Define b^(k), whose element at each position is b_i^(k) = 1 if and only if y_i is among the first k maxima, and b_i^(k) = 0 otherwise;
  Compute score_k = (b^(k))^T y / √k;
Output: b^(k), where k = arg max(score_k).

Here, score_k denotes the k-th element of the score vector.
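A runnable rendering of Algorithm 1 as a Python/NumPy sketch (illustrative only; ties among equal score values are broken by sort order here, which the pseudocode leaves unspecified):

```python
import numpy as np

def algorithm1(y: np.ndarray) -> np.ndarray:
    """Solve max_b (b^T y) / ||b||_2 over b in {0,1}^r by scanning k."""
    r = y.shape[0]
    order = np.argsort(y)[::-1]        # positions sorted by descending value
    best_score, best_b = -np.inf, None
    for k in range(1, r + 1):
        b = np.zeros(r)
        b[order[:k]] = 1.0              # +1 on the k largest entries of y
        score_k = (b @ y) / np.sqrt(k)  # b^T y / ||b||_2, since ||b||_2 = sqrt(k)
        if score_k > best_score:
            best_score, best_b = score_k, b
    return best_b

print(algorithm1(np.array([0.9, -0.2, 0.4, 0.1])))  # -> [1. 0. 1. 0.] (k = 2 wins)
```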
In particular, when R is updated with B fixed, FΛ_r is treated as a whole, and singular value decomposition is performed on (FΛ_r)^T B, yielding:

(FΛ_r)^T B = S Ω S̃^T

Thus, the rotation matrix R can be expressed as R = S S̃^T.
In the hash coding training process, the update ending condition of the data update is set as reaching a set iteration count T. Performing one fixed-R update of B and one fixed-B update of R constitutes one iteration; the iteration is repeated until the set iteration count is reached, yielding the parameter relation matrix between the anchor points and the sample data, including the projection matrix Λ_r and the rotation matrix R. The hash coding training process can be described as follows:
Algorithm 2: hash coding training
Input: sample feature matrix X ∈ R^{n×d}, anchor points M ∈ R^{m×d}, hash code length r, iteration count T;
Calculate K(X, M) and K(M, M) through the Gaussian kernel function k(·,·);
Perform matrix decomposition on K^{-1}(M, M) to obtain C ∈ R^{m×m};
Calculate F = K(X, M)C;
Perform matrix decomposition on F^T F to obtain HΛH^T, and define Y = (FΛ_r R);
Randomly initialize R ∈ R^{r×r};
For t = 1, …, T:
  For i = 1, …, n:
    Fix R and generate the hash code b_i corresponding to x_i through Algorithm 1;
  Fix B ∈ {0,1}^{n×r}, perform singular value decomposition on (FΛ_r)^T B, and update R = S S̃^T;
Output: similarity parameter matrix C, projection matrix Λ_r, and rotation matrix R.
In this embodiment, the second term in the hash coding loss function is processed; specifically, the data to be solved are updated iteratively by alternating optimization, and the final parameter relation matrix, namely the projection matrix Λ_r and the rotation matrix R, is obtained when training ends. This can increase the speed of hash coding training and also improve the accuracy of the determined parameter relation.
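Putting the pieces together, a compact sketch of the alternating-optimization loop of Algorithm 2 follows; it reuses the hypothetical helpers sketched above (gaussian_kernel, similarity_parameter_matrix, initial_projection, algorithm1) and assumes the F^T F reading of the decomposition step and row-wise L2 normalization of FΛ_r:

```python
import numpy as np

def train_hash(X, M, r, T, sigma2):
    """Sketch of Algorithm 2: returns (C, Lambda_r, R)."""
    C = similarity_parameter_matrix(M, sigma2)       # K^{-1}(M,M) = C C^T
    F = gaussian_kernel(X, M, sigma2) @ C            # n x m sample similarity matrix
    Lam_r = initial_projection(F, r)                 # m x r projection
    R = np.linalg.qr(np.random.randn(r, r))[0]       # random orthogonal init
    V = F @ Lam_r                                    # F Lambda_r, treated as a whole
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)  # row-wise L2 norm
    for _ in range(T):
        Y = V @ R
        B = np.stack([algorithm1(y) for y in Y])     # fix R, update B (Algorithm 1)
        U, _, Wt = np.linalg.svd(V.T @ B)            # fix B, update R via SVD
        R = U @ Wt                                   # orthogonal Procrustes solution
    return C, Lam_r, R
```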
In one embodiment, after the hash coding training based on the sample feature matrix of the sample data ends, the similarity parameter matrix C between anchor points, the projection matrix Λ_r between the anchor points and the sample data, and the rotation matrix R can be obtained. When target data requiring hash coding is given, the kernel similarity between the anchor point feature matrix and the target feature matrix is first determined using the same Gaussian kernel function as in the hash coding training process, and Algorithm 1 is then used to generate the target hash code matched with the target data. The algorithm of this process can be described as follows:
Algorithm 3: hash coding of target data
Input: target feature matrix X_Q ∈ R^{m×d}, anchor points M ∈ R^{m×d}, similarity parameter matrix C, projection matrix Λ_r, rotation matrix R;
For i = 1, …, m:
  Calculate f_i = K(x_i, M)C;
  Generate the hash code b_i corresponding to f_i through Algorithm 1;
Output: hash codes B_Q ∈ {0,1}^{m×r}.
In this embodiment, after the training of the hash codes is finished, the target hash codes matched with the target data are determined by directly obtaining various parameters obtained by the training of the hash codes and based on a mode corresponding to the training process, so that the accuracy of the obtained target hash codes can be improved, and the processing efficiency of the hash code processing is improved.
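A corresponding sketch of Algorithm 3, reusing the hypothetical helpers above; note that Algorithm 1 is invariant to positive rescaling of y, so the row normalization used during training does not change the resulting codes:

```python
import numpy as np

def encode_queries(X_Q, M, C, Lam_r, R, sigma2):
    """Sketch of Algorithm 3: hash-encode target data with trained parameters."""
    F_Q = gaussian_kernel(X_Q, M, sigma2) @ C    # f_i = K(x_i, M) C, row per query
    Y_Q = F_Q @ Lam_r @ R                        # project and rotate
    return np.stack([algorithm1(y) for y in Y_Q]).astype(np.uint8)  # {0,1} codes
```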
The present application will be described in further detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In a specific embodiment, taking the case that the kernel function is a gaussian kernel function, the method of this embodiment is also referred to as an angle quantization algorithm based on a gaussian kernel. The following describes a specific process of training hash codes of sample data, generating target hash codes matched with target data, and performing data retrieval according to the target hash codes.
Step S801, a sample feature matrix of the sample data is obtained.
Specifically, the sample feature matrix is a matrix subjected to L2 regularization, denoted X ∈ R^{n×d}.
And S802, clustering the sample feature matrix to obtain a sample data clustering center, and determining the sample data clustering center as an anchor point.
Specifically, the clustering algorithm may be any one of a partitioning method, a hierarchical method, a density-based algorithm, a grid-based algorithm, and the like. By clustering the sample feature matrix, the clustering centers of the sample data can be obtained; the clustering centers of the sample data can thus be determined as the anchor points, denoted M ∈ R^{m×d}.
Step S803, calculating the Gaussian kernel similarity between anchor points through a Gaussian kernel function to obtain a Gaussian kernel similarity matrix between the anchor points; and carrying out matrix decomposition processing on the inverse matrix corresponding to the Gaussian kernel similarity matrix between the anchor points to obtain a similarity parameter matrix between the anchor points.
Specifically, the gaussian kernel similarity between the sample data may be approximated by the gaussian kernel similarity between the sample data and the anchor point, and the relationship may be expressed as follows:
XX^T ≈ K(X, M) K^{-1}(M, M) K(M, X)^T

Performing matrix decomposition on the inverse matrix K^{-1}(M, M) corresponding to the Gaussian kernel similarity matrix between anchor points gives:

K^{-1}(M, M) = C C^T

where C is the similarity parameter matrix between anchor points. Thus, the following expression holds:

XX^T ≈ K(X, M) K^{-1}(M, M) K(M, X)^T = K(X, M) C C^T K(M, X)^T
step S804, a sample similarity matrix matched with the sample data is calculated according to the similarity parameter matrix and the Gaussian kernel similarity between the anchor point feature matrix of the anchor point and the sample feature matrix.
Specifically, according to the above expression, a sample similarity matrix matched with the sample data can be defined, denoted F = K(X, M)C.
Step S805, calculating the product of the transpose of the sample similarity matrix and the sample similarity matrix to obtain a matrix product result; and carrying out matrix decomposition processing on the matrix product result to obtain an initial parameter relation matrix between the anchor point and the sample data.
Step S806, based on the sample similarity matrix and the initial parameter relation matrix, performing loss function calculation on the hash codes matched with the sample characteristics to obtain hash code loss functions of the sample data.
Specifically, the original hash coding loss function is expressed as:

max_{B,R} Σ_{i=1}^{n} b_i^T R^T x_i / ||b_i||_2,  s.t. b_i ∈ {0,1}^r, R^T R = I

Based on the sample similarity matrix F, the original hash coding loss function can be approximated by replacing each sample feature x_i with the corresponding row f_i of F; that is, the hash coding loss function in this embodiment is expressed as:

max_{B,R} Σ_{i=1}^{n} b_i^T R^T f_i / ||b_i||_2

where B = [b_1, …, b_n]^T denotes the hash codes of the sample data, and the b_i ∈ {0,1}^r are discrete variables. For convenience of solution, a continuous matrix Y is introduced as an intermediate variable, and the hash coding loss function of this embodiment is converted into a two-term objective: a first term fitting Y to the sample similarity matrix, FF^T ≈ YY^T, and a second term quantizing the rows of Y into the discrete hash codes:

max_B Σ_{i=1}^{n} b_i^T y_i / ||b_i||_2
The transpose F^T of the sample similarity matrix is multiplied by the sample similarity matrix F to obtain the matrix product result F^T F, which is used to solve the first term of the hash coding loss function. Performing matrix decomposition on the matrix product result gives the following expression:

F^T F = H Λ H^T

where Λ is the diagonal matrix of eigenvalues and H is the eigenvector matrix. Taking the first r columns yields Λ_r ∈ R^{m×r}, where r is the hash code length; that is, the initial projection matrix Λ_r in the initial parameter relation matrix is determined. Then, using the properties of the matrix decomposition, it can be determined that:

FF^T ≈ (FΛ_r R)(FΛ_r R)^T

where R is an orthogonal matrix, i.e., the initial rotation matrix R in the initial parameter relation matrix is determined. The continuous variable introduced in the above embodiment can be defined as Y = (FΛ_r R) ∈ R^{n×r}, to facilitate the subsequent solution of the hash coding loss function.
Step S807, taking the numerical minimization of the Hash coding loss function as a target, taking the initial parameter relation matrix as a fixed value, and updating the Hash coding matrix matched with the sample data to obtain an updated coding matrix; and updating the initial parameter relation matrix by taking the updated coding matrix as a fixed value, and obtaining the parameter relation matrix between the anchor point and the sample data when the data update reaches the update finishing condition.
Specifically, the expression Y = (FΛ_r R) ∈ R^{n×r} of the intermediate variable has been determined in the above embodiment. Thus, the second term of the hash coding loss function can be expressed as follows:

max_{B,R} Σ_{i=1}^{n} b_i^T y_i / ||b_i||_2

where R can be understood as rotating FΛ_r so as to reduce the quantization error. Considering that b_i/||b_i||_2 is L2-normalized, i.e., || b_i/||b_i||_2 ||_2 = 1, FΛ_r is taken as a whole and likewise L2-normalized in order to avoid the influence of the norm range. The solution objective of the second term can then be converted into solving:

max_{B,R} Σ_{i=1}^{n} ⟨ y_i, b_i/||b_i||_2 ⟩

where ⟨·,·⟩ denotes the inner product between the rows of Y and the normalized codes b_i/||b_i||_2; F and Λ_r are known variables, while R and B are the variables to be solved. First, R is initialized randomly; then one variable is fixed and the other is updated, that is, R is fixed to update B, and B is fixed to update R. When the data update reaches the update ending condition, the parameter relation matrix between the anchor points and the sample data is finally obtained.
Specifically, when B is updated with R fixed, the solution objective of the second term can be converted into solving, for each row vector y_i of Y in turn:

max_{b_i} b_i^T y_i / ||b_i||_2

Because b_i ∈ {0,1}^r, the number of +1s in b_i is uncertain, but its range is known to be [1, r]. If the number of +1s in b_i is assumed to be k, then the positions of the +1s in b_i must correspond to the first k maxima of y_i, so that the term b_i^T y_i is maximized; in that case ||b_i||_2 = √k. Thus, k ∈ [1, r] is traversed, the corresponding value score_k = (b^(k))^T y_i / √k is recorded for each k, and finally k* = arg max_k(score_k) is used to obtain b_i, where arg max is the function returning the argument that maximizes the function. The calculation process of the above embodiment can be described as follows:
Algorithm 1: solution of b
Input: y = f Λ_r R;
Define score = [0, …, 0]_{1×r};
Sort the elements of y so that they are in descending order;
For k = 1, …, r:
  Define b^(k), whose element at each position is b_i^(k) = 1 if and only if y_i is among the first k maxima, and b_i^(k) = 0 otherwise;
  Compute score_k = (b^(k))^T y / √k;
Output: b^(k), where k = arg max(score_k).
In particular, when R is updated with B fixed, FΛ_r is treated as a whole, and singular value decomposition is performed on (FΛ_r)^T B, yielding:

(FΛ_r)^T B = S Ω S̃^T

Thus, the rotation matrix R can be expressed as R = S S̃^T.
In the hash coding training process, the update ending condition of the data update is set as reaching a set iteration count T. Performing one fixed-R update of B and one fixed-B update of R constitutes one iteration; the iteration is repeated until the set iteration count is reached, yielding the parameter relation matrix between the anchor points and the sample data, including the projection matrix Λ_r and the rotation matrix R. The hash coding training process can be described as follows:
Algorithm 2: hash coding training
Input: sample feature matrix X ∈ R^{n×d}, anchor points M ∈ R^{m×d}, hash code length r, iteration count T;
Calculate K(X, M) and K(M, M) through the Gaussian kernel function k(·,·);
Perform matrix decomposition on K^{-1}(M, M) to obtain C ∈ R^{m×m};
Calculate F = K(X, M)C;
Perform matrix decomposition on F^T F to obtain HΛH^T, and define Y = (FΛ_r R);
Randomly initialize R ∈ R^{r×r};
For t = 1, …, T:
  For i = 1, …, n:
    Fix R and generate the hash code b_i corresponding to x_i through Algorithm 1;
  Fix B ∈ {0,1}^{n×r}, perform singular value decomposition on (FΛ_r)^T B, and update R = S S̃^T;
Output: similarity parameter matrix C, projection matrix Λ_r, and rotation matrix R.
Step S808, a target feature matrix of the target data is obtained.
Specifically, the target feature matrix is a matrix subjected to L2 regularization, denoted X_Q ∈ R^{m×d}.
Step S809, calculating the Gaussian kernel similarity between the anchor point feature matrix of the anchor point and the target feature matrix through the Gaussian kernel function to obtain a Gaussian kernel similarity matrix between the anchor point and the target data; and performing matrix product operation on the Gaussian kernel similarity matrix and the similarity parameter matrix to obtain a target similarity matrix matched with the target data.
Step S810, calculating the score of the target similarity matrix based on the parameter relation matrix, and determining an initial score matrix of the target data; and respectively sequencing elements of each score vector in the initial score matrix to obtain a target score matrix of the target data.
Step S811, comparing the value of each score vector with the value of the score vector positioned before the score vector according to the position of each score vector in the target score matrix to obtain the comparison result of each score vector; performing hash code assignment based on the comparison result of each score vector to generate a hash code corresponding to each score vector; and combining the Hash codes of each score vector according to the score vector sequence to obtain a target Hash code matched with the target data.
Specifically, in the above embodiment, when target data that needs to be subjected to hash coding is given, a gaussian kernel function that is the same as a hash coding training process is first used to determine a kernel similarity between an anchor point feature matrix and a target feature matrix, and then an algorithm 1 is used to generate a target hash code matched with the target data, where the algorithm in the process may be described as follows:
Algorithm 3: hash coding of the target data
Input: target feature matrix X_Q ∈ R^{m×d}, anchor points M ∈ R^{m×d}, similarity parameter matrix C, projection matrix Λ_r, rotation matrix R;
For i = 1, …, m:
  Calculate f_i = K(x_i, M)C;
  Generate the hash code b_i corresponding to f_i through Algorithm 1;
Output: hash codes B_Q ∈ {0,1}^{m×r}.
Step S812, calculating the similarity between the target hash code and the candidate hash codes corresponding to the candidate data of the target data according to the target hash code to obtain a similarity result; based on the similarity result, screening out similar hash codes meeting the similarity condition from the candidate hash codes; and taking the candidate data corresponding to the similar hash codes as the retrieval result of the target data.
Specifically, the similarity result may be obtained by calculating a hamming distance between the target hash code and the candidate hash codes corresponding to the candidate data of the target data according to the target hash code. Then, based on the similarity result, similar hash codes meeting the similarity condition are screened out from the candidate hash codes; and taking the candidate data corresponding to the similar hash codes as the retrieval result of the target data, so that the candidate data corresponding to the retrieval result can be recalled to realize data display or data recommendation.
In order to verify the effectiveness of the method of this embodiment for data retrieval in practical application scenarios, experiments were performed on a public image data set. The 512-dimensional floating-point features corresponding to 3783 images were randomly selected as the training set and the retrieval database, and the features of 1000 images were randomly selected from the remaining data as the test set. The performance of hash-coding-based data retrieval is verified using the evaluation index MAP@K.
Specifically, the evaluation index MAP@K is defined for c given query samples {q_j}, j = 1, …, c, with the number of returned samples fixed to the top K; the mean average retrieval precision is defined as:

MAP@K = (1/c) Σ_{j=1}^{c} (1/g_j) Σ_{i=1}^{K} P(i) σ(i)

where g_j is the number of samples relevant to the query q_j among the returned results, P(i) is defined as the precision of the first i retrieval results, and σ(i) is an indicator function that equals 1 if the i-th retrieved sample is a correct prediction and 0 otherwise. The larger the value of MAP@K, the better the reflected retrieval performance.
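For concreteness, a small sketch of this metric under the reconstruction above (the exact normalization term g_j in the patent's lost formula is an assumption here):

```python
import numpy as np

def map_at_k(relevance: np.ndarray, K: int) -> float:
    """relevance: c x N 0/1 matrix; row j marks correct results for query j,
    already ordered by retrieval rank. Returns MAP@K."""
    aps = []
    for row in relevance[:, :K]:
        hits = np.cumsum(row)
        precision_at_i = hits / np.arange(1, K + 1)   # P(i) over the first i results
        g = max(row.sum(), 1)                         # relevant count among top K
        aps.append((precision_at_i * row).sum() / g)  # sum_i P(i) * sigma(i) / g_j
    return float(np.mean(aps))
```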
Using the method of this embodiment, 2048 samples are randomly selected from the training set as anchor points, and the parameter σ² in the Gaussian kernel function is set as:

σ² = β · μ

where μ is the mean Euclidean distance ||m_i − m_j||_2 between anchor points, and β is set to 0.5. An iterative quantization (ITQ) algorithm and an angular quantization hash coding (AQ) algorithm are used as comparison methods; the retrieval performance is compared under hash code lengths of 64 bits, 128 bits, 256 bits and 512 bits, and the retrieval performance corresponding to the original 512-dimensional features is listed as a reference.
A comparison of the retrieval performance is shown in fig. 9. Under the different hash code length settings, the method of this embodiment shows a clear retrieval performance improvement over both the iterative quantization algorithm and the angular quantization hash coding algorithm: at 64 bits, it improves by 7.40% over the iterative quantization algorithm and by 5.75% over the angular quantization hash coding algorithm; at 512 bits, it improves by 2.86% over the iterative quantization algorithm and by 1.83% over the angular quantization hash coding algorithm. The clear improvement over the angular quantization hash coding algorithm comes from the Gaussian kernel's transformation of the similarity value domain, which verifies the effectiveness of the method in data retrieval. Moreover, as the hash code length increases, the retrieval performance of the hash codes also improves. At 512 bits, the performance of the quantized binary hash codes of the method of this embodiment drops by only 1.35% compared with floating-point features of the same dimension. Therefore, the method of this embodiment is effective for data retrieval in practical application scenarios.
It should be understood that, although the steps in the flowcharts of the embodiments described above are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the embodiments described above may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a data retrieval device for implementing the above-mentioned data retrieval method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so specific limitations in one or more embodiments of the data retrieval device provided below can be referred to the limitations of the data retrieval method in the foregoing, and are not described herein again.
In one embodiment, as shown in fig. 10, there is provided a data retrieval apparatus including: a data acquisition module 10, a similarity calculation module 20, a score calculation module 30, a data encoding module 40, and a result determination module 50, wherein:
The data acquisition module 10 is configured to acquire a target feature matrix of target data, and to acquire a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, where the similarity parameter matrix and the parameter relation matrix are obtained by performing hash coding training based on the sample feature matrix of the sample data; the anchor points are the clustering centers of the sample data.
And a similarity calculation module 20, configured to calculate a target similarity matrix matched to the target data according to the similarity parameter matrix and a kernel similarity between the anchor point feature matrix of the anchor point and the target feature matrix.
And the score calculation module 30 is configured to perform score calculation on the target similarity matrix based on the parameter relationship matrix to obtain a target score matrix of the target data.
And the data encoding module 40 is configured to generate a target hash code matched with the target data according to the target score matrix.
And a result determining module 50, configured to perform data retrieval through the target hash code, and determine a retrieval result of the target data.
In one embodiment, the data acquisition module 10 includes:
and the matrix acquisition unit is used for carrying out feature extraction processing on the target data to obtain an initial feature matrix of the target data.
And the matrix processing unit is used for carrying out regularization processing on the initial characteristic matrix to obtain a target characteristic matrix of the target data.
In one embodiment, the similarity calculation module 20 includes:
and the similarity matrix calculation unit is used for calculating the Gaussian kernel similarity between the anchor point feature matrix of the anchor point and the target feature matrix through a Gaussian kernel function to obtain the Gaussian kernel similarity between the anchor point and the target data.
And the matrix product calculation unit is used for performing matrix product operation on the Gaussian kernel similarity matrix and the similarity parameter matrix to obtain a target similarity matrix matched with the target data.
In one embodiment, the score calculating module 30 includes:
and the score calculation unit is used for calculating the score of the target similarity matrix based on the parameter relation matrix and determining an initial score matrix of the target data.
And the element sorting unit is used for respectively sorting the elements of each score vector in the initial score matrix to obtain a target score matrix of the target data.
In one embodiment, the data encoding module 40 includes:
and the numerical value comparison unit is used for comparing the numerical value of each score vector with the numerical value of the score vector positioned before the score vector according to the position of each score vector in the target score matrix to obtain the comparison result of each score vector.
And the code assignment unit is used for carrying out hash code assignment based on the comparison result of each score vector to generate the hash code corresponding to each score vector.
And the code combination unit is used for combining the Hash codes of each score vector according to the score vector sequence to obtain the target Hash codes matched with the target data.
In one embodiment, the result determination module 50 includes:
and the similarity calculation unit is used for calculating the similarity between the target hash code and the candidate hash codes corresponding to the candidate data of the target data according to the target hash code to obtain a similarity result.
And the similarity screening unit is used for screening out similar hash codes meeting a similarity condition from each candidate hash code based on the similarity result.
And the retrieval result determining unit is used for taking the candidate data corresponding to the similar hash codes as the retrieval result of the target data.
In one embodiment, the apparatus further comprises a data determination module.
In one embodiment, the data determination module includes:
and the Gaussian kernel similarity matrix calculation unit is used for calculating the Gaussian kernel similarity between the anchor points through a Gaussian kernel function to obtain the Gaussian kernel similarity matrix between the anchor points.
And the similarity parameter matrix determining unit is used for performing matrix decomposition processing on the inverse matrix corresponding to the Gaussian kernel similarity matrix between the anchor points to obtain the similarity parameter matrix between the anchor points.
In one embodiment, the apparatus further comprises a data update module.
In one embodiment, the data update module includes:
and the sample similarity matrix calculation unit is used for calculating the sample similarity matrix matched with the sample data according to the similarity parameter matrix and the nuclear similarity between the anchor point feature matrix of the anchor point and the sample feature matrix.
And the initial parameter relation matrix determining unit is used for performing matrix decomposition processing according to the sample similarity matrix to obtain an initial parameter relation matrix between the anchor point and the sample data.
And the hash code loss function calculation unit is used for performing loss function calculation on the hash codes matched with the sample characteristics based on the sample similarity matrix and the initial parameter relation matrix to obtain the hash code loss function of the sample data.
And the data updating unit is used for updating the data of the initial parameter relation matrix according to the Hash coding loss function to obtain a parameter relation matrix between the anchor point and the sample data.
In one embodiment, the initial parameter relationship matrix determining unit further includes:
and the product calculation unit is used for calculating the product of the transpose of the sample similarity matrix and the sample similarity matrix to obtain a matrix product result.
And the matrix determining unit is used for performing matrix decomposition processing on the matrix multiplication result to obtain an initial parameter relation matrix between the anchor point and the sample data.
In one embodiment, the hash coding loss function calculation unit further includes:
and the first data updating unit is used for updating the hash coding matrix matched with the sample data by taking the numerical minimization of the hash coding loss function as a target and the initial parameter relation matrix as a fixed value to obtain an updated coding matrix.
And the second data updating unit is used for updating the initial parameter relation matrix by taking the updated coding matrix as a fixed value, and obtaining the parameter relation matrix between the anchor point and the sample data when the data updating reaches an updating ending condition.
The modules in the data retrieval device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, an Input/Output interface (I/O for short), and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing various data related to the data retrieval method. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a data retrieval method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 12. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for communicating with an external terminal in a wired or wireless manner; the wireless manner can be realized through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program is executed by the processor to implement a data retrieval method. The display unit of the computer device is used for forming a visual picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the computer device, or may be an external keyboard, touchpad, mouse, or the like.
Those skilled in the art will appreciate that the configurations shown in fig. 11 and 12 are merely block diagrams of portions of configurations related to aspects of the present application, and do not constitute limitations on the computing devices to which aspects of the present application may be applied, as particular computing devices may include more or less components than shown, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory in which a computer program is stored and a processor, which when executing the computer program performs the steps of the data retrieval method described above.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the data retrieval method described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of the data retrieval method described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high-density embedded nonvolatile Memory, resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene Memory, and the like. Volatile Memory can include Random Access Memory (RAM), external cache Memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a block chain based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing based data processing logic devices, etc., without limitation.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application should be subject to the appended claims.

Claims (14)

1. A method for data retrieval, the method comprising:
acquiring a target characteristic matrix of target data, and acquiring a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and the sample data, wherein the similarity parameter matrix between the anchor points and the parameter relation matrix are obtained by performing Hash coding training on a sample characteristic matrix based on the sample data; the anchor point is a clustering center of the sample data;
calculating a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor point and the target feature matrix;
performing score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data;
generating a target hash code matched with the target data according to the target score matrix;
and performing data retrieval through the target hash code, and determining a retrieval result of the target data.
2. The method of claim 1, wherein obtaining the target feature matrix of the target data comprises:
performing feature extraction processing on target data to obtain an initial feature matrix of the target data;
and carrying out regularization processing on the initial characteristic matrix to obtain a target characteristic matrix of the target data.
3. The method of claim 1, wherein the calculating a target similarity matrix to which the target data is matched according to the similarity parameter matrix and a kernel similarity between the anchor point feature matrix of the anchor point and the target feature matrix comprises:
calculating the Gaussian kernel similarity between the anchor point feature matrix of the anchor point and the target feature matrix through a Gaussian kernel function to obtain a Gaussian kernel similarity matrix between the anchor point and the target data;
and performing matrix product operation on the Gaussian kernel similarity matrix and the similarity parameter matrix to obtain a target similarity matrix matched with the target data.
4. The method according to claim 1, wherein the performing score calculation on the target similarity matrix based on the parametric relationship matrix to obtain a target score matrix of the target data comprises:
calculating the score of the target similarity matrix based on the parameter relation matrix, and determining an initial score matrix of the target data;
and sequencing elements of each score vector in the initial score matrix respectively to obtain a target score matrix of the target data.
5. The method of claim 4, wherein generating the target hash code matched with the target data according to the target score matrix comprises:
comparing, according to the position of each score vector in the target score matrix, the value of each score vector with the value of the score vector preceding it, to obtain a comparison result for each score vector;
performing hash code assignment based on the comparison result of each score vector to generate a hash code corresponding to each score vector;
and combining the hash codes of the score vectors in order to obtain the target hash code matched with the target data.
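One plausible reading of claim 5 is a rank-comparison coding: each entry is compared with the entry immediately preceding it and the result is mapped to a bit. The "greater than" rule and the resulting b-1 bits per b scores are assumptions, and how this interacts with the sorting of claim 4 is ambiguous in the claim language, so the sketch operates on a generic score matrix Z.

```python
import numpy as np

def rank_hash(Z):
    # Z: (n, b) score matrix. Bit is 1 where an entry exceeds its
    # predecessor; bits are concatenated in positional order.
    return (Z[:, 1:] > Z[:, :-1]).astype(np.uint8)   # (n, b-1) code matrix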
6. The method according to any one of claims 1 to 5, wherein performing data retrieval through the target hash code and determining the retrieval result of the target data comprises:
calculating, according to the target hash code, the similarity between the target hash code and candidate hash codes corresponding to candidate data of the target data to obtain a similarity result;
based on the similarity result, screening out similar hash codes meeting a similarity condition from the candidate hash codes;
and taking the candidate data corresponding to the similar hash codes as the retrieval result of the target data.
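A sketch of claim 6, using Hamming distance over binary codes as the similarity and a hypothetical distance threshold as the similarity condition.

```python
import numpy as np

def search(code, db_codes, max_dist=2):
    # code: (b,) target hash code; db_codes: (N, b) candidate hash codes.
    dists = (code[None, :] ^ db_codes).sum(axis=1)   # Hamming distances
    return np.flatnonzero(dists <= max_dist)         # indices of similar data
```

The returned indices select the candidate data that form the retrieval result.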
7. The method of claim 1, further comprising:
calculating the Gaussian kernel similarity between the anchor points through a Gaussian kernel function to obtain a Gaussian kernel similarity matrix between the anchor points;
and performing matrix decomposition on the inverse of the Gaussian kernel similarity matrix between the anchor points to obtain the similarity parameter matrix between the anchor points.
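A sketch of claim 7; the claim does not name the decomposition, so a Cholesky factor of the inverse kernel matrix is assumed here, with a hypothetical jitter term added for numerical stability.

```python
import numpy as np

def similarity_parameter_matrix(F_anchor, gamma=0.5, jitter=1e-6):
    # Gaussian kernel similarities between the anchors, then a
    # decomposition of the inverse kernel matrix.
    m = len(F_anchor)
    sq = ((F_anchor[:, None, :] - F_anchor[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq) + jitter * np.eye(m)     # anchor kernel matrix
    return np.linalg.cholesky(np.linalg.inv(K))      # similarity parameter matrix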
8. The method of claim 1, further comprising:
calculating a sample similarity matrix matched with the sample data according to the similarity parameter matrix and the kernel similarity between the anchor point feature matrix of the anchor points and the sample feature matrix;
performing matrix decomposition according to the sample similarity matrix to obtain an initial parameter relation matrix between the anchor points and the sample data;
calculating, based on the sample similarity matrix and the initial parameter relation matrix, a loss function for the hash codes matched with the sample features to obtain a hash coding loss function of the sample data;
and updating the initial parameter relation matrix according to the hash coding loss function to obtain the parameter relation matrix between the anchor points and the sample data.
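The training flow of claim 8 splits into the pieces sketched here and under claims 9 and 10 below. The squared-error form of the loss is an assumption, since the claim names a hash coding loss function without defining it.

```python
import numpy as np

def sample_similarity(F_sample, F_anchor, P, gamma=0.5):
    # Same construction as the target similarity matrix of claim 3,
    # applied to the sample feature matrix.
    sq = ((F_sample[:, None, :] - F_anchor[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq) @ P

def hash_loss(S, W, B):
    # Hypothetical quantization loss: distance between the real-valued
    # scores S @ W and the binary code matrix B.
    return float(np.linalg.norm(S @ W - B) ** 2)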
9. The method of claim 8, wherein performing matrix decomposition according to the sample similarity matrix to obtain the initial parameter relation matrix between the anchor points and the sample data comprises:
calculating the product of the transpose of the sample similarity matrix and the sample similarity matrix to obtain a matrix product result;
and performing matrix decomposition on the matrix product result to obtain the initial parameter relation matrix between the anchor points and the sample data.
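A sketch of claim 9, assuming the unspecified decomposition is an eigendecomposition of the matrix product result, with the top eigenvectors giving the initial relation matrix.

```python
import numpy as np

def init_relation_matrix(S, bits):
    # S: (n, m) sample similarity matrix.
    M = S.T @ S                        # matrix product result (m, m)
    vals, vecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return vecs[:, -bits:]             # eigenvectors of the largest eigenvalues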
10. The method of claim 8, wherein updating the initial parameter relation matrix according to the hash coding loss function to obtain the parameter relation matrix between the anchor points and the sample data comprises:
with the goal of minimizing the value of the hash coding loss function, holding the initial parameter relation matrix fixed and updating the hash code matrix matched with the sample data to obtain an updated code matrix;
and holding the updated code matrix fixed while updating the initial parameter relation matrix, and obtaining the parameter relation matrix between the anchor points and the sample data when the updating reaches an update end condition.
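A sketch of the alternating scheme in claim 10. The sign step, the Procrustes-style SVD update, and the fixed iteration count standing in for the update end condition are all assumptions in place of the unspecified update rules.

```python
import numpy as np

def alternate_update(S, W, iters=10):
    # S: (n, m) sample similarity matrix; W: (m, b) initial relation matrix.
    for _ in range(iters):
        B = np.sign(S @ W)             # update code matrix with W held fixed
        U, _, Vt = np.linalg.svd(S.T @ B, full_matrices=False)
        W = U @ Vt                     # update W with the codes held fixed
    return W
```

Each half-step minimizes the quantization loss in one variable while the other is fixed, so the loss value is non-increasing across iterations.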
11. A data retrieval device, the device comprising:
the data acquisition module is configured to acquire a target feature matrix of target data, and to acquire a similarity parameter matrix between anchor points and a parameter relation matrix between the anchor points and sample data, wherein the similarity parameter matrix and the parameter relation matrix are obtained by performing hash coding training on a sample feature matrix of the sample data; the anchor points are clustering centers of the sample data;
the similarity calculation module is configured to calculate a target similarity matrix matched with the target data according to the similarity parameter matrix and the kernel similarity between an anchor point feature matrix of the anchor points and the target feature matrix;
the score calculation module is configured to perform score calculation on the target similarity matrix based on the parameter relation matrix to obtain a target score matrix of the target data;
the data coding module is configured to generate a target hash code matched with the target data according to the target score matrix;
and the result determining module is configured to perform data retrieval through the target hash code and determine the retrieval result of the target data.
12. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when executing the computer program.
13. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 10.
14. A computer program product comprising a computer program, characterized in that the computer program implements the steps of the method of any one of claims 1 to 10 when executed by a processor.
CN202210679144.9A 2022-06-16 2022-06-16 Data retrieval method, data retrieval device, computer equipment and storage medium Pending CN115129713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210679144.9A CN115129713A (en) 2022-06-16 2022-06-16 Data retrieval method, data retrieval device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210679144.9A CN115129713A (en) 2022-06-16 2022-06-16 Data retrieval method, data retrieval device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115129713A true CN115129713A (en) 2022-09-30

Family

ID=83378792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210679144.9A Pending CN115129713A (en) 2022-06-16 2022-06-16 Data retrieval method, data retrieval device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115129713A (en)

Similar Documents

Publication Publication Date Title
CN108334574B (en) Cross-modal retrieval method based on collaborative matrix decomposition
Van Der Maaten Barnes-Hut-SNE
Zhou et al. Deep forest hashing for image retrieval
US10846588B2 (en) Scalable and compressive neural network data storage system
Gu et al. Clustering-driven unsupervised deep hashing for image retrieval
US8428397B1 (en) Systems and methods for large scale, high-dimensional searches
Hu et al. Semi-supervised metric learning-based anchor graph hashing for large-scale image retrieval
WO2022105117A1 (en) Method and device for image quality assessment, computer device, and storage medium
Wei et al. Projected residual vector quantization for ANN search
US11714921B2 (en) Image processing method with ash code on local feature vectors, image processing device and storage medium
CN106033426A (en) A latent semantic min-Hash-based image retrieval method
Pan et al. Product quantization with dual codebooks for approximate nearest neighbor search
Deng et al. Adaptive multi-bit quantization for hashing
Hu et al. Cosine metric supervised deep hashing with balanced similarity
Zhang et al. Deep unsupervised self-evolutionary hashing for image retrieval
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN115795065A (en) Multimedia data cross-modal retrieval method and system based on weighted hash code
US20230222762A1 (en) Adversarially robust visual fingerprinting and image provenance models
Guo et al. Improved image clustering with deep semantic embedding
Weng et al. A fast online spherical hashing method based on data sampling for large scale image retrieval
Hemati et al. A non-alternating graph hashing algorithm for large-scale image search
CN107480273B (en) Picture hash code generation method and device and picture retrieval method and device
Guo et al. Parametric and nonparametric residual vector quantization optimizations for ANN search
CN116186708A (en) Class identification model generation method, device, computer equipment and storage medium
Tian et al. Learning decorrelated hashing codes with label relaxation for multimodal retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination