CN114791966A - Index construction method and device, vector search method and retrieval system - Google Patents

Info

Publication number
CN114791966A
Authority
CN
China
Prior art keywords
cluster set
representative point
distance
vector
cluster
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210406518.XA
Other languages
Chinese (zh)
Inventor
谢超
许维芷
程倩雅
易小萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xuyu Intelligent Technology Co ltd
Original Assignee
Shanghai Xuyu Intelligent Technology Co ltd
Application filed by Shanghai Xuyu Intelligent Technology Co ltd
Priority to CN202210406518.XA
Publication of CN114791966A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G06F16/53 Querying
    • G06F16/55 Clustering; Classification
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61 Indexing; Data structures therefor; Storage structures
    • G06F16/63 Querying
    • G06F16/65 Clustering; Classification
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G06F16/73 Querying
    • G06F16/75 Clustering; Classification
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/906 Clustering; Classification
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Abstract

The present application relates to the technical field of data retrieval and discloses an index construction method and device, a vector search method, and a retrieval system. The index construction method comprises: determining a target vector and a first cluster set in which the target vector is located, wherein the first cluster set has a corresponding first representative point; determining at least one second representative point meeting a preset condition for the first cluster set; and establishing index associations between the target vector and each of the first representative point and the second representative point. With this scheme, in a subsequent vector retrieval process, the distances between the query vector and the first representative points and between the query vector and the second representative points are calculated to find the corresponding target cluster set, so that the target vector corresponding to the query vector can be obtained more accurately. This avoids the low search accuracy that arises when the central point of a cluster set cannot fully represent all the feature vectors in that cluster set.

Description

Index construction method and device, vector search method and retrieval system
Technical Field
The present application relates to the field of data retrieval technologies, and in particular to an index construction method, an index construction device, a vector search method, and a retrieval system.
Background
With the rapid growth of data, data retrieval is widely applied in fields such as image, video, voice, and protein molecular structure retrieval. Since various kinds of data, such as picture data, can be abstracted into high-dimensional feature vectors, the similarity between data items can be quantified as the distance between their feature vectors in a vector space. For example, the closer two feature vectors are, the more similar the original data corresponding to them. Data retrieval can therefore be converted into vector search in a vector space: the process of searching the database for data similar to the data to be queried becomes the process of searching the database for the feature vectors closest to the query vector corresponding to the data to be queried.
Currently, some retrieval systems build an inverted index for the database to facilitate retrieval. To construct the inverted index, the feature vectors corresponding to the data in the database are first clustered, for example by k-means clustering, which divides the whole vector space into a plurality of cluster sets. Each cluster set has a corresponding representative point, and each feature vector is assigned to the cluster set whose representative point is closest to it. When a query vector is retrieved, the system determines the representative point closest to the query vector according to the distances between the query vector and the representative points, and searches all the feature vectors in the cluster set to which that representative point belongs. The feature vectors in that cluster set that are closest to the query vector are then taken as the search results.
However, because the retrieval system determines the representative point closest to the query vector and then takes only feature vectors from the cluster set of that representative point as the search results, feature vectors that are even closer to the query vector may exist in the cluster sets of other representative points. In that case a more accurate target vector for the query vector is not obtained, which lowers the accuracy of the retrieval result.
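The baseline inverted index described in this background, and the way it can miss a closer vector, might be sketched as follows. This is only an illustrative sketch: the function names (`nearest`, `build_inverted_index`, `baseline_search`), the fixed representative points, and the toy data are assumptions made for the example, and Euclidean distance is used for simplicity.

```python
import math

def nearest(point, representatives):
    """Return the index of the representative point closest to `point`."""
    return min(range(len(representatives)),
               key=lambda i: math.dist(point, representatives[i]))

def build_inverted_index(vectors, representatives):
    """Assign every feature vector to the cluster of its nearest representative."""
    index = {i: [] for i in range(len(representatives))}
    for v in vectors:
        index[nearest(v, representatives)].append(v)
    return index

def baseline_search(query, representatives, index, k=1):
    """Scan only the single cluster whose representative is nearest the query."""
    cluster = index[nearest(query, representatives)]
    return sorted(cluster, key=lambda v: math.dist(query, v))[:k]

# Toy data (hypothetical): two clusters with fixed representative points.
representatives = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (4.9, 0.0), (9.0, 0.0)]
index = build_inverted_index(vectors, representatives)
result = baseline_search((5.2, 0.0), representatives, index)
```

In this toy data the query (5.2, 0) is routed to the second cluster and returns (9, 0), although the edge vector (4.9, 0) stored in the first cluster is closer to the query.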
Disclosure of Invention
To solve the problem of low accuracy of the retrieval results of vector retrieval methods, embodiments of the present application provide an index construction method, an index construction device, a vector search method, and a retrieval system.
In a first aspect, an embodiment of the present application provides an index building method, including:
determining a target vector and a first cluster set in which the target vector is located, wherein the first cluster set has a corresponding first representative point;
determining at least one second representative point meeting a preset condition for the first cluster set;
establishing index associations between the target vector and the first representative point and the second representative point, respectively.
It can be understood that, in addition to determining the first representative point, the index construction method provided in the embodiment of the present application can determine, for each cluster set, at least one second representative point, that is, an edge point that is relatively far from the first representative point of that cluster set. During retrieval, the distance between the query vector and the first representative point and the distance between the query vector and the second representative point can therefore both be compared; in other words, the distance between the query vector and edge points far from the central point is also taken into account. This solves the prior-art problem that, when the target vector is an edge point, a more accurate target vector cannot be found because only the distance to the central point is considered. In the query process, the distances between the query vector and the first representative points and between the query vector and the second representative points are calculated to find the corresponding target cluster set, so that the target vector corresponding to the query vector can be obtained more accurately. This avoids the low search accuracy caused by the representative point of a cluster set being unable to fully represent all the feature vectors in that cluster set.
It is to be understood that the target vector in the first cluster set mentioned in the embodiment of the present application may be any vector in the first cluster set.
It is to be understood that the edge point may be an edge feature vector in the first cluster set, that is, a feature vector farther from the first representative point of the first cluster set.
In a possible implementation of the first aspect, determining at least one second representative point meeting a preset condition for the first cluster set includes:
determining a first distance between each feature vector in the first cluster set and the first representative point;
and according to the first distance, determining a first sequence from the feature vectors, and taking a preset number of feature vectors in the first sequence as the second representative points.
It can be understood that in the embodiment of the present application, the second representative point may be determined according to the distance between each feature vector in the first cluster set and the first representative point, so that a feature vector farther from the first representative point can be used as the second representative point, and the selection of the second representative point is more accurate.
In a possible implementation of the first aspect, determining a first sequence from the feature vectors according to the first distance, and taking a preset number of feature vectors in the first sequence as the second representative points includes:
sorting the feature vectors in descending order of the first distance to obtain the first sequence;
and taking the first preset number of feature vectors in the first sequence as the second representative points.
It can be understood that, in the embodiment of the present application, the feature vectors may be sorted from far to near according to the distance between each feature vector in the first cluster set and the first representative point, which makes it convenient to select the second representative points according to the set number. For example, if the set number of second representative points is five, the first five feature vectors in the sorted order, that is, in the first sequence, may be used directly as the second representative points.
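As a minimal sketch of this selection rule, assuming Euclidean distance and a hypothetical helper name `farthest_points`, the cluster members can be sorted by their first distance in descending order and the leading entries taken as second representative points:

```python
import math

def farthest_points(cluster, first_rep, count):
    """Sort cluster members by distance to the first representative point,
    largest first, and return the leading `count` members."""
    first_sequence = sorted(cluster,
                            key=lambda v: math.dist(v, first_rep),
                            reverse=True)
    return first_sequence[:count]

# Toy cluster (hypothetical) whose first representative point is the origin.
cluster = [(0.0, 1.0), (3.0, 0.0), (0.0, -2.0), (0.5, 0.5)]
second_reps = farthest_points(cluster, first_rep=(0.0, 0.0), count=2)
```

Here the two members farthest from the origin, (3, 0) and (0, -2), are selected.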
In a possible implementation of the first aspect, determining a first sequence from the feature vectors according to the first distance, and taking a preset number of feature vectors in the first sequence as the second representative points includes:
taking the feature vectors whose first distance is greater than a set distance as the first sequence;
taking a preset number of feature vectors in the first sequence as the second representative points;
or, sorting the feature vectors in the first sequence from far to near according to the first distance, and taking the first preset number of feature vectors in the sorted order as the second representative points.
It can be understood that, in the embodiment of the present application, the first sequence may be formed from the feature vectors in the first cluster set whose distance from the first representative point is greater than the set distance; that is, every feature vector in the first sequence is farther from the first representative point than the set distance. This makes the selection of the second representative points more uniform and simpler. For example, the second representative points can be obtained merely by setting a corresponding distance threshold parameter in the algorithm that selects them.
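The threshold variant could look like the sketch below. The helper name `edge_points_by_threshold` and the parameter names `min_distance` and `count` are hypothetical, Euclidean distance is assumed, and the sketch implements the second alternative (filter by the set distance, then sort from far to near).

```python
import math

def edge_points_by_threshold(cluster, first_rep, min_distance, count):
    """Keep members farther than `min_distance` from the first representative
    point, then return the `count` farthest of them."""
    first_sequence = [v for v in cluster
                      if math.dist(v, first_rep) > min_distance]
    first_sequence.sort(key=lambda v: math.dist(v, first_rep), reverse=True)
    return first_sequence[:count]

# Toy cluster (hypothetical) whose first representative point is the origin.
cluster = [(0.0, 1.0), (3.0, 0.0), (0.0, -2.0), (0.5, 0.5)]
second_reps = edge_points_by_threshold(cluster, (0.0, 0.0),
                                       min_distance=1.5, count=5)
```

With a threshold of 1.5 only two members qualify, so fewer than `count` points are returned; the threshold alone bounds how many second representative points a compact cluster contributes.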
In a possible implementation of the first aspect, determining at least one second representative point meeting a preset condition for the first cluster set includes:
determining at least one second cluster set adjacent to the first cluster set in other cluster sets except the first cluster set;
acquiring a first distance between each feature vector in the first cluster set and the first representative point of the first cluster set, and a second distance between each feature vector and the first representative point of each second cluster set;
and obtaining, from the feature vectors, the first sequence corresponding to each second cluster set according to the first distance and the second distance, and taking a preset number of feature vectors in the first sequence as the second representative points.
It can be understood that the larger the first distance between a feature vector in the first cluster set and the first representative point of the first cluster set, the farther the feature vector is from the center of the first cluster set and the more likely it is to be an edge point of the first cluster set. The second cluster set is a cluster set close to the first cluster set. The second distance between the feature vector and at least one adjacent second cluster set is also obtained; the smaller the second distance, the closer the feature vector is to the second cluster set, that is, the closer it is to the edge of the first cluster set. The second representative point can therefore be determined more accurately and reasonably according to the first distance and the second distance.
In a possible implementation of the first aspect, obtaining the first sequence corresponding to each second cluster set according to the first distance and the second distance, and taking a preset number of feature vectors in the first sequence as the second representative points includes:
sorting the feature vectors in descending order of the sum of the first distance and the second distance to obtain the first sequence, and taking a preset number of feature vectors in the first sequence as the second representative points.
It can be understood that the first sequence is obtained by sorting the feature vectors in descending order of the sum of the first distance and the second distance. A feature vector with a large distance sum is far from the first representative point of the first cluster set and also far from the first representative point of the second cluster set; that is, it may be an edge point lying on the side of the first cluster set's first representative point that faces away from the second cluster set. Once the first sequences corresponding to the second cluster sets in each direction around the first cluster set are obtained, the edge points in each direction of the first cluster set can be obtained. Using the edge points in all directions of the first cluster set as the second representative points can therefore effectively improve search accuracy.
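The distance-sum ranking might be sketched as follows, assuming Euclidean distance. The helper name `edge_points_by_distance_sum`, the de-duplication of points chosen for several neighbors, and the toy coordinates are assumptions made for illustration only.

```python
import math

def edge_points_by_distance_sum(cluster, first_rep, neighbor_reps, count):
    """For each neighboring representative, rank members by (first distance +
    second distance), largest first, and collect the leading `count` members."""
    selected = []
    for neighbor_rep in neighbor_reps:
        first_sequence = sorted(
            cluster,
            key=lambda v: math.dist(v, first_rep) + math.dist(v, neighbor_rep),
            reverse=True,
        )
        for v in first_sequence[:count]:
            if v not in selected:  # a point may be picked for several neighbors
                selected.append(v)
    return selected

# Toy first cluster around the origin with two adjacent cluster centers.
cluster = [(-2.0, 0.0), (2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
neighbors = [(6.0, 0.0), (0.0, 6.0)]
second_reps = edge_points_by_distance_sum(cluster, (0.0, 0.0), neighbors, count=1)
```

For the neighbor at (6, 0) the largest sum belongs to (-2, 0), and for the neighbor at (0, 6) it belongs to (0, -2), so the points on the side facing away from each neighbor are chosen.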
In a possible implementation of the first aspect, determining at least one second representative point meeting a preset condition for the first cluster set includes:
determining at least one second cluster set adjacent to the first cluster set in other cluster sets except the first cluster set;
acquiring a second distance between each feature vector in the first cluster set and the first representative point of each second cluster set;
and obtaining, from the feature vectors, a first sequence corresponding to each second cluster set according to the second distance, and taking a preset number of feature vectors in the first sequence as the second representative points.
It is understood that the smaller the second distance, the closer the feature vector is to the second cluster set, i.e. to the edge of the first cluster set, so that the second representative point, i.e. the edge point of the first cluster set, can be determined more accurately according to the second distance.
In a possible implementation of the first aspect, obtaining the first sequence corresponding to each second cluster set according to the second distance, and taking a preset number of feature vectors in the first sequence as the second representative points includes:
sorting the feature vectors in ascending order of the second distance to obtain the first sequence corresponding to each second cluster set, and taking a preset number of feature vectors in the first sequence as the second representative points.
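A sketch of this second-distance ranking, again assuming Euclidean distance and using the hypothetical helper name `edge_points_toward_neighbors`, selects for each adjacent cluster set the members of the first cluster set that lie closest to that neighbor's first representative point:

```python
import math

def edge_points_toward_neighbors(cluster, neighbor_reps, count):
    """For each neighboring representative, rank members by their distance to
    it, smallest first, and collect the leading `count` members."""
    selected = []
    for neighbor_rep in neighbor_reps:
        first_sequence = sorted(cluster,
                                key=lambda v: math.dist(v, neighbor_rep))
        for v in first_sequence[:count]:
            if v not in selected:  # avoid listing the same edge point twice
                selected.append(v)
    return selected

# Toy first cluster around the origin with two adjacent cluster centers.
cluster = [(-2.0, 0.0), (2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
neighbors = [(6.0, 0.0), (0.0, 6.0)]
second_reps = edge_points_toward_neighbors(cluster, neighbors, count=1)
```

In contrast to the distance-sum ranking, this picks the edge points facing each neighbor: (2, 0) for the neighbor at (6, 0) and (0, 2) for the neighbor at (0, 6).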
In a possible implementation of the first aspect, determining at least one second cluster set adjacent to the first cluster set in other cluster sets except the first cluster set includes:
obtaining the distance between the representative point corresponding to each other cluster set and the first representative point corresponding to the first cluster set;
and according to the distance, determining a second sequence from other cluster sets except the first cluster set, and determining a preset number of cluster sets in the second sequence as a second cluster set.
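One way to read this step is sketched below. The text does not state the sort order of the second sequence; the sketch assumes that "adjacent" means the cluster sets whose representative points are nearest to the first representative point, so it sorts in ascending order. The helper name `adjacent_clusters` and the cluster labels are hypothetical.

```python
import math

def adjacent_clusters(first_id, representatives, count):
    """Rank the other clusters by the distance between their representative
    point and that of cluster `first_id`; return the `count` nearest ids."""
    others = [cid for cid in representatives if cid != first_id]
    # Assumption: ascending order, so the nearest clusters come first.
    second_sequence = sorted(
        others,
        key=lambda cid: math.dist(representatives[cid],
                                  representatives[first_id]),
    )
    return second_sequence[:count]

# Toy representative points (hypothetical) for four cluster sets.
representatives = {"S1": (0.0, 0.0), "S2": (3.0, 0.0),
                   "S3": (0.0, 4.0), "S4": (10.0, 10.0)}
neighbors = adjacent_clusters("S1", representatives, count=2)
```

With these toy coordinates the two cluster sets adjacent to S1 are S2 and S3, while the distant S4 is excluded.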
In a possible implementation of the foregoing first aspect, determining a first cluster set in which the target vector is located includes:
obtaining the distances between the target vector and the first representative points corresponding to all the cluster sets;
and taking, among all the cluster sets, the cluster set whose first representative point is closest to the target vector as the first cluster set in which the target vector is located.
It can be understood that, in the embodiment of the present application, the cluster set whose representative point is closest to the target vector, according to the distances between the target vector and the first representative points of all the cluster sets, is used as the first cluster set in which the target vector is located. This makes it convenient, during retrieval, to determine the cluster set in which the target vector is located from the distance between the representative points and the query vector, so that the target vector is found more efficiently.
In a possible implementation of the first aspect, the distance includes at least one of: Euclidean distance, inner product distance, and Hamming distance.
It is understood that the distances listed above are only examples in the embodiment of the present application, and any other practicable distance may be used.
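The three distances named above, and the assignment of a target vector to its first cluster set, can be illustrated with the following sketch. The function names are hypothetical, and treating the negated inner product as a distance is an assumption (a common convention so that a smaller value means a closer match); the text does not define the inner product distance precisely.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.dist(a, b)

def inner_product_distance(a, b):
    # Assumption: negate the inner product so that smaller means more similar.
    return -sum(x * y for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def first_cluster(target, representatives, distance=euclidean):
    """Return the id of the cluster whose first representative point is nearest."""
    return min(representatives,
               key=lambda cid: distance(target, representatives[cid]))

# Toy first representative points (hypothetical) for two cluster sets.
representatives = {"S1": (0.0, 0.0), "S2": (4.0, 0.0)}
assigned = first_cluster((3.0, 1.0), representatives)
```

Passing `distance=inner_product_distance` or `distance=hamming` to `first_cluster` swaps the metric without changing the assignment logic.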
In a second aspect, an index building apparatus is provided in an embodiment of the present application, including:
a first determining unit, configured to determine a target vector and a first cluster set in which the target vector is located, where the first cluster set has a corresponding first representative point;
a second determining unit, configured to determine, for the first cluster set, at least one second representative point that meets a preset condition;
and an association unit, configured to establish index associations between the target vector and the first representative point and the second representative point, respectively.
In a third aspect, an embodiment of the present application provides a vector search method, including:
acquiring a query vector;
acquiring a first distance between the query vector and a first representative point and a second representative point corresponding to each cluster set in the database;
determining at least one representative point corresponding to the query vector according to the first distance, and using a cluster set corresponding to the at least one representative point as a target cluster set;
acquiring a second distance between each feature vector in the target cluster set and the query vector;
and determining a target vector corresponding to the query vector according to the second distance.
According to the vector search method provided by the embodiment of the present application, during retrieval the distance between the query vector and the first representative point and the distance between the query vector and the second representative point can both be compared; in other words, the distance between the query vector and edge points far from the central point is also taken into account. This solves the prior-art problem that, when the target vector is an edge point, a more accurate target vector cannot be found because only the distance to the central point is considered. In the query process, the edge points serve as the second representative points, and their distances to the query vector are calculated together with those of the first representative points to find the corresponding target cluster set. The target vector corresponding to the query vector can thus be obtained more accurately, which avoids the low search accuracy caused by the representative point of a cluster set being unable to fully represent all the feature vectors in that cluster set.
In a possible implementation of the third aspect, determining a target vector corresponding to the query vector according to the second distance includes:
taking the feature vector whose second distance is within a set range as the target vector corresponding to the query vector.
It can be understood that taking, as the target vector, a feature vector in the target cluster set whose distance from the query vector is within the set range ensures the similarity between the target vector and the query vector.
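The search steps of the third aspect might be sketched as below, assuming Euclidean distance. The function name `search` and the parameters `nprobe` (how many target cluster sets to scan), `k`, and `max_distance` (the set range) are hypothetical names introduced for this example, as is the toy data.

```python
import math

def search(query, rep_items, inverted_files, nprobe=1, k=1, max_distance=None):
    """rep_items: list of (point, cluster_id) pairs covering both first and
    second representative points. inverted_files: cluster_id -> vectors."""
    # Rank every representative point by its distance to the query.
    ranked = sorted(rep_items, key=lambda item: math.dist(query, item[0]))
    # The nearest representative points select the target cluster set(s).
    targets = []
    for _, cluster_id in ranked:
        if cluster_id not in targets:
            targets.append(cluster_id)
        if len(targets) == nprobe:
            break
    # Compare the query with every feature vector in the target cluster sets.
    candidates = [v for cid in targets for v in inverted_files[cid]]
    if max_distance is not None:  # optional "set range" filter
        candidates = [v for v in candidates
                      if math.dist(query, v) <= max_distance]
    return sorted(candidates, key=lambda v: math.dist(query, v))[:k]

inverted_files = {0: [(1.0, 0.0), (4.9, 0.0)], 1: [(9.0, 0.0)]}
rep_items = [((0.0, 0.0), 0), ((10.0, 0.0), 1),  # first representative points
             ((4.9, 0.0), 0), ((9.0, 0.0), 1)]   # second representative points
result = search((5.2, 0.0), rep_items, inverted_files)
```

In this toy data the edge point (4.9, 0) is the representative nearest to the query (5.2, 0), so its cluster set becomes the target and (4.9, 0) is returned, whereas comparing the query with the two first representative points alone would have selected the other cluster set.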
In a fourth aspect, an embodiment of the present application provides a search apparatus applied to a database, including:
a first obtaining unit, configured to obtain a query vector;
a second obtaining unit, configured to obtain a distance between the query vector and a first representative point and a second representative point that correspond to each cluster set in the database;
a first determining unit, configured to determine the representative point closest to the query vector, and take the cluster set corresponding to that representative point as a target cluster set;
a third obtaining unit, configured to obtain a distance between each feature vector in the target cluster set and the query vector;
and a second determining unit, configured to determine a target vector corresponding to the query vector according to the distance.
In a fifth aspect, an embodiment of the present application provides a retrieval system, which includes the above-mentioned index building apparatus, and/or the above-mentioned search apparatus.
In a sixth aspect, an embodiment of the present application provides an index structure, including a representative point item and an inverted file item; the representative point item comprises a first representative point and at least one second representative point corresponding to each cluster set in a database, and the inverted file item comprises inverted files corresponding to each cluster set;
the inverted file comprises each feature vector in the cluster set corresponding to the inverted file.
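The index structure of the sixth aspect could be modeled roughly as follows. The class names (`RepresentativeItem`, `IndexStructure`), field names, and methods are hypothetical choices for this sketch; the text only specifies that there is a representative point item (one first and at least one second representative point per cluster set) and an inverted file item (one inverted file per cluster set).

```python
from dataclasses import dataclass, field

@dataclass
class RepresentativeItem:
    """First and second representative points of one cluster set."""
    cluster_id: str
    first_point: tuple
    second_points: list = field(default_factory=list)

@dataclass
class IndexStructure:
    representative_items: dict = field(default_factory=dict)  # id -> item
    inverted_files: dict = field(default_factory=dict)        # id -> vectors

    def add_cluster(self, cluster_id, first_point, second_points, vectors):
        self.representative_items[cluster_id] = RepresentativeItem(
            cluster_id, first_point, list(second_points))
        self.inverted_files[cluster_id] = list(vectors)

    def all_representatives(self):
        """Yield (point, cluster_id) for every first and second representative."""
        for item in self.representative_items.values():
            yield item.first_point, item.cluster_id
            for point in item.second_points:
                yield point, item.cluster_id

index = IndexStructure()
index.add_cluster("S1", (0.0, 0.0), [(4.9, 0.0)], [(1.0, 0.0), (4.9, 0.0)])
pairs = list(index.all_representatives())
```

Because every representative point, first or second, maps back to the same cluster identifier, a match on either kind of point leads to the same inverted file.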
In a seventh aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions for execution by one or more processors of an electronic device, and a processor, which is one of the one or more processors of the electronic device, for performing the index building method of any of the above first aspects or the vector searching method of any of the above third aspects.
In an eighth aspect, an embodiment of the present application provides a readable medium, where the readable medium has instructions stored thereon, and the instructions, when executed on an electronic device, cause the electronic device to perform the index building method according to any one of the above first aspects or the vector search method according to any one of the above third aspects.
In a ninth aspect, an embodiment of the present application provides a computer program product, which includes instructions for implementing the index building method described in any one of the above first aspects or the vector search method described in any one of the above third aspects.
Based on the above scheme, the present application has the following beneficial effects:
The index construction method provided by the present application offers multiple ways of accurately selecting the second representative points, makes it possible to select edge points in all directions of the first cluster set as second representative points, and establishes index associations between each feature vector and its corresponding first and second representative points. During retrieval, the distances between the query vector and the first representative points and between the query vector and the second representative points can be compared, and the target cluster set is determined from the comparison result. The target vector corresponding to the query vector can therefore be obtained more accurately. That is, the method avoids searching only the cluster set whose first representative point is close to the query vector and thereby missing target vectors in other cluster sets whose first representative points are farther from the query vector, which effectively improves search accuracy.
Drawings
FIG. 1a shows a schematic diagram of a picture database S, according to some embodiments of the present application;
FIG. 1b illustrates a schematic diagram of an inverted index construction for a picture database S, according to some embodiments of the present application;
FIG. 2 illustrates a schematic diagram of determining a second set of clusters of cluster set S1, according to some embodiments of the present application;
FIG. 3a illustrates a schematic diagram of a feature vector X1 obtaining a distance from a first representative point of a second cluster set, according to some embodiments of the present application;
FIG. 3b illustrates a schematic diagram of a feature vector X1 obtaining distances from a first representative point of a second set of clusters, according to some embodiments of the present application;
FIG. 4 illustrates a schematic diagram of determining a second set of clusters of cluster set S1, according to some embodiments of the present application;
FIG. 5a illustrates a schematic representation of representative points of a database S, according to some embodiments of the present application;
FIG. 5b illustrates a diagram of an index structure corresponding to a database S, according to some embodiments of the present application;
FIG. 6a illustrates a diagram of representative points in a query database S, according to some embodiments of the present application;
FIG. 6b illustrates a schematic diagram of obtaining a target feature vector in a database S, according to some embodiments of the present application;
FIG. 7 illustrates a flow diagram of a method of index construction, according to some embodiments of the present application;
FIG. 8 illustrates a flow diagram of a search method, according to some embodiments of the present application;
FIG. 9 illustrates a schematic diagram of an index building apparatus, according to some embodiments of the present application;
FIG. 10 illustrates a schematic diagram of a search apparatus, according to some embodiments of the present application;
FIG. 11 illustrates a schematic diagram of a retrieval system, according to some embodiments of the present application;
FIG. 12 illustrates a block diagram of an electronic device, according to some embodiments of the present application.
Detailed Description
The illustrative embodiments of the present application include, but are not limited to, an index building method, apparatus, data system, and search method.
As mentioned above, the accuracy of the search result obtained by constructing the inverted index in the database is low.
For example, fig. 1a is a schematic diagram of a picture database S in a search system, where the database S includes the feature vectors corresponding to the pictures. Fig. 1b is a schematic diagram of inverted index construction for the picture database S. As shown in fig. 1b, the feature vectors in the database S are first assigned through clustering to four cluster sets, namely cluster set S1, cluster set S2, cluster set S3, and cluster set S4, where cluster set S1 has a corresponding representative point C1, cluster set S2 has a corresponding representative point C2, cluster set S3 has a corresponding representative point C3, and cluster set S4 has a corresponding representative point C4.
It is to be understood that the representative point mentioned in the embodiments of the present application may be the central point of the cluster set, that is, the arithmetic average of all points (feature vectors) in the cluster set. In practice, the central point may also be a point whose distances to the points in the cluster set differ from one another by no more than a preset range, or a point determined based on other rules.
When performing a vector search on the query vector A corresponding to a query picture based on the database S, the search system determines the distance between each representative point and the query vector A. For example, as shown in fig. 1b, the search system determines that the distance between the representative point C1 and the query vector A is d1, the distance between the representative point C2 and the query vector A is d2, the distance between the representative point C3 and the query vector A is d3, and the distance between the representative point C4 and the query vector A is d4. Assuming that d4 > d3 > d1 > d2, the system determines that the representative point closest to the query vector A among the four representative points is the representative point C2, and then searches each feature vector in the cluster set S2 to which the representative point C2 belongs; searching a feature vector here means obtaining the distance between that feature vector and the query vector A. The feature vector Y1 and the feature vector Y2 in the cluster set S2, which are closest to the query vector A or within a set distance range, are taken as the target vectors of the query vector A, and the original picture data corresponding to the target vectors is then output to the client.
However, as shown in fig. 1b, the feature vector X1 in the cluster set S1 is in fact the feature vector closest to the query vector A in the database S. Because the representative point C1 of the cluster set S1 is not the representative point closest to the query vector A, the system does not search the feature vectors in the cluster set S1, so the accurate target vector of the query vector A is not obtained; the best-matching search result is missed and search accuracy suffers.
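The problem described above can be reproduced with a minimal sketch. The two-dimensional coordinates below are hypothetical and are not taken from fig. 1b; the function name `nearest_centroid_search` is likewise only illustrative.

```python
import math

def nearest_centroid_search(query, centroids, clusters):
    """Search only the cluster set whose representative point is closest to the query."""
    best = min(centroids, key=lambda name: math.dist(query, centroids[name]))
    hit = min(clusters[best], key=lambda v: math.dist(query, v))
    return best, hit

# Hypothetical toy data: each representative point is the mean of its cluster set.
centroids = {"S1": (0.0, 0.0), "S2": (10.0, 0.0)}
clusters = {
    "S1": [(-4.8, 0.0), (4.8, 0.0)],   # (4.8, 0.0) is an edge point of S1
    "S2": [(8.0, 0.0), (12.0, 0.0)],
}
query = (5.5, 0.0)

searched, found = nearest_centroid_search(query, centroids, clusters)

# Exhaustive search over the whole database, for comparison.
true_nearest = min((v for members in clusters.values() for v in members),
                   key=lambda v: math.dist(query, v))
```

Here the query is closer to the representative point of S2, so only S2 is searched and (8.0, 0.0) is returned, although the edge point (4.8, 0.0) of S1 is the closest vector in the database.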
In order to solve the above problem, an embodiment of the present application provides an index construction method, which specifically includes: acquiring the feature vector corresponding to each piece of data in a database for which an index is to be created; clustering the feature vectors to obtain a plurality of cluster sets, each cluster set having a corresponding first representative point; assigning each feature vector to the first cluster set corresponding to the first representative point closest to that feature vector; determining, for the first cluster set, at least one second representative point meeting a preset condition; and, for each feature vector in the cluster set, establishing index association between that feature vector and the first representative point and the second representative point of the cluster set.
It is to be understood that the first representative point may be a center point of the corresponding cluster set, and the second representative point may be a point farther from the center point in the cluster set, or an edge point in the cluster set.
It is understood that the clustering method may be K-means clustering. In K-means clustering, a number of cluster sets and the first representative point corresponding to each cluster set are first preset; the distance between each feature vector in the database and the first representative point of each cluster set is then obtained, and each feature vector is assigned to the cluster set corresponding to the closest first representative point.
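The assignment step can be sketched as follows. This is only an illustration under assumptions: the starting points, the fixed number of refinement rounds, and the function names are hypothetical, and the representative point is taken to be the arithmetic average of the members.

```python
import math

def assign_to_cluster_sets(vectors, first_reps):
    """Assign each feature vector to the cluster set of its closest first representative point."""
    cluster_sets = [[] for _ in first_reps]
    for v in vectors:
        nearest = min(range(len(first_reps)),
                      key=lambda i: math.dist(v, first_reps[i]))
        cluster_sets[nearest].append(v)
    return cluster_sets

def mean_point(members):
    """Arithmetic average of the members, used as the updated first representative point."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))

vectors = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
first_reps = [(1.0, 0.0), (8.0, 3.0)]    # preset starting points (arbitrary)
for _ in range(5):                        # a few refinement rounds (assumed count)
    cluster_sets = assign_to_cluster_sets(vectors, first_reps)
    # Keep the old representative point if a cluster set ends up empty.
    first_reps = [mean_point(m) if m else r
                  for m, r in zip(cluster_sets, first_reps)]
```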
There may be various ways for determining the at least one second representative point meeting the preset condition for any cluster set, for example, the first cluster set, and the following examples describe some of them:
in a first practical solution, the manner of determining at least one second representative point meeting the preset condition may be: acquiring the first distance between the first representative point of the first cluster set and each feature vector in the first cluster set; sorting the feature vectors from far to near according to the first distance to obtain a first sequence corresponding to the first cluster set; and determining the first preset number of feature vectors in the first sequence as the second representative points of the cluster set.
It can be understood that, in the embodiment of the present application, the second representative point may be determined according to the distance between each feature vector in the first cluster set and the first representative point, so that a feature vector farther from the first representative point can be used as the second representative point, and the selection of the second representative point is more accurate.
In addition, in the embodiment of the application, the feature vectors are sorted from far to near according to the distance between each feature vector in the first cluster set and the first representative point, which makes it convenient to select the set number of second representative points. For example, if the set number of second representative points is five, the first five feature vectors in the ranking, i.e., the first sequence, may be directly used as the second representative points.
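A minimal sketch of this first solution, assuming Euclidean distance and hypothetical toy coordinates; the function name `farthest_members` is illustrative only.

```python
import math

def farthest_members(members, first_rep, preset_number):
    """Sort the members from far to near by their distance to the first
    representative point and keep the first `preset_number` of them."""
    first_sequence = sorted(members,
                            key=lambda v: math.dist(v, first_rep),
                            reverse=True)
    return first_sequence[:preset_number]

members = [(1.0, 0.0), (0.0, 3.0), (-2.0, 0.0), (0.0, -0.5)]
second_reps = farthest_members(members, first_rep=(0.0, 0.0), preset_number=2)
```

With these values the two members farthest from the first representative point, (0.0, 3.0) and (-2.0, 0.0), are selected.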
In a second practical solution, the manner of determining at least one second representative point meeting the preset condition may be: acquiring the first distance between the first representative point of the first cluster set and each feature vector in the first cluster set; taking the feature vectors whose first distance is larger than a set distance as the feature vectors of a first sequence; and then either taking a preset number of feature vectors in the first sequence as second representative points, or sorting the feature vectors in the first sequence from far to near according to the first distance and taking the first preset number of feature vectors in the ranking as second representative points.
It can be understood that, in the embodiment of the present application, the first sequence may be formed from the feature vectors in the first cluster set whose distance from the first representative point is greater than the set distance, that is, the feature vectors in the first sequence are those lying farther than the set distance from the first representative point, which makes the selection of the second representative points more standardized and simpler. For example, the second representative points can be obtained merely by setting a corresponding distance threshold parameter in the algorithm that selects them.
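The threshold-based variant might look like the sketch below. The parameter names `set_distance` and `preset_number` and the toy coordinates are assumptions made for illustration.

```python
import math

def members_beyond(members, first_rep, set_distance, preset_number=None):
    """Keep the members farther than `set_distance` from the first representative
    point, sorted from far to near; optionally keep only the first few."""
    first_sequence = [v for v in members if math.dist(v, first_rep) > set_distance]
    first_sequence.sort(key=lambda v: math.dist(v, first_rep), reverse=True)
    if preset_number is None:
        return first_sequence
    return first_sequence[:preset_number]

members = [(1.0, 0.0), (0.0, 3.0), (-2.0, 0.0), (0.0, -0.5)]
beyond = members_beyond(members, (0.0, 0.0), set_distance=1.5)
top_one = members_beyond(members, (0.0, 0.0), set_distance=1.5, preset_number=1)
```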
In a third possible implementation, the manner of determining at least one second representative point meeting the preset condition may be: determining, among the cluster sets other than the first cluster set, at least one second cluster set adjacent to the first cluster set; acquiring the distance between each feature vector in the first cluster set and the first representative point of each second cluster set; sorting the feature vectors from near to far according to that distance to obtain a first sequence corresponding to each second cluster set; and taking the first preset number of feature vectors in each first sequence as second representative points.
It will be appreciated that the smaller the distance between a feature vector and the first representative point of a second cluster set, the closer the feature vector is to that second cluster set, i.e., the closer it is to the edge of the first cluster set. Obtaining the second representative points from the distances between the feature vectors in the first cluster set and the adjacent second cluster sets therefore makes it convenient to obtain the edge points of the first cluster set in each direction and to take them as second representative points. During retrieval, the representative point closest to the query vector can then be determined from the distances between the query vector and both the first and the second representative points, and the cluster set corresponding to that representative point is searched. For query vectors that are far from the first representative point of every cluster set but close to a second representative point, i.e., an edge point, the probability of finding the target vector is increased, which effectively improves search accuracy.
The manner of obtaining at least one second cluster set adjacent to the first cluster set may be: sorting the cluster sets other than the first cluster set from near to far according to the distance between the first representative point of each of them and the first representative point of the first cluster set, to obtain a second sequence corresponding to the first cluster set; and taking the first preset number of cluster sets in the second sequence as the second cluster sets adjacent to the first cluster set.
The manner of obtaining the second representative point in the first cluster set S1 is described below by taking the first cluster set S1 in fig. 1b as an example:
at least one second cluster set adjacent to the first cluster set S1 may be obtained first. The manner of obtaining it may be: acquiring the distances between the first representative points of the three cluster sets other than the cluster set S1 and the first representative point C1 of the first cluster set S1. For example, as shown in fig. 2, the distance between the first representative point C2 of the cluster set S2 and the first representative point C1 of the first cluster set S1 is d5, the distance between the first representative point C3 of the cluster set S3 and the first representative point C1 is d6, and the distance between the first representative point C4 of the cluster set S4 and the first representative point C1 is d7. The three cluster sets are then sorted from near to far according to these distances to obtain the second sequence corresponding to the first representative point C1; if d7 < d5 < d6, the second sequence corresponding to the first representative point C1 is cluster set S4, cluster set S2, cluster set S3. The first representative point C1 is then associated in the index with the first preset number of cluster sets in the second sequence. Assuming that the preset number is 2, the cluster set S4 and the cluster set S2 may be regarded as the second cluster sets adjacent to the first cluster set S1.
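The selection of adjacent cluster sets in this example can be sketched as follows. The coordinates are hypothetical and are chosen only so that d7 < d5 < d6 holds; the function name is illustrative.

```python
import math

def adjacent_cluster_sets(first_reps, target, preset_number):
    """Sort the other cluster sets from near to far by the distance between their
    first representative point and that of `target`; keep the first few."""
    others = [name for name in first_reps if name != target]
    second_sequence = sorted(
        others, key=lambda name: math.dist(first_reps[name], first_reps[target]))
    return second_sequence[:preset_number]

# Hypothetical coordinates giving d5 = 5, d6 = 8, d7 = 3.
first_reps = {"S1": (0.0, 0.0), "S2": (5.0, 0.0), "S3": (0.0, 8.0), "S4": (-3.0, 0.0)}
second_sequence = adjacent_cluster_sets(first_reps, "S1", preset_number=3)
neighbours = adjacent_cluster_sets(first_reps, "S1", preset_number=2)
```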
Then, the distance between each feature vector in the first cluster set S1 and the first representative point C2 of the second cluster set S2, and the distance between each feature vector in the first cluster set S1 and the first representative point C4 of the second cluster set S4, are obtained. As shown in fig. 3a, for the feature vector X1, the distance d2_1 between the feature vector X1 in the first cluster set S1 and the first representative point C2 of the second cluster set S2 is obtained; the distances between the other feature vectors and the first representative point C2 are obtained in the same way and are not described again here. The feature vectors are sorted from near to far according to their distance from the first representative point C2; assume that the first sequence obtained by sorting the 16 feature vectors in the first cluster set S1 from near to far is X1-X3-X5-...-X16. Assuming that the preset number of second representative points obtained for each adjacent cluster set is one, the first feature vector in the first sequence corresponding to the second cluster set S2, i.e., the feature vector X1, may be used as a second representative point.
Likewise, the distance between each feature vector in the first cluster set S1 and the first representative point C4 of the second cluster set S4 is obtained. As shown in fig. 3a, for the feature vector X1, the distance d4_1 between the feature vector X1 in the first cluster set S1 and the first representative point C4 of the second cluster set S4 is obtained; the distances between the other feature vectors and the first representative point C4 are not described again here. The feature vectors are sorted from near to far according to their distance from the first representative point C4; assume that the first sequence obtained by sorting the 16 feature vectors in the first cluster set S1 from near to far is X7-X8-X1-...-X14. Assuming that the preset number of second representative points obtained for each adjacent cluster set is one, the first feature vector in the first sequence corresponding to the second cluster set S4, i.e., the feature vector X7, may be used as a second representative point.
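The third solution can be sketched in a few lines. This is an illustration under assumptions: the member coordinates and the names `edge_points_toward_neighbours` and `per_neighbour` are hypothetical, and duplicates are skipped when the same member is closest to more than one adjacent cluster set.

```python
import math

def edge_points_toward_neighbours(members, neighbour_reps, per_neighbour=1):
    """For each adjacent cluster set, sort the members of the first cluster set from
    near to far by distance to that set's first representative point and keep the
    first `per_neighbour` members as second representative points."""
    second_reps = []
    for rep in neighbour_reps:
        first_sequence = sorted(members, key=lambda v: math.dist(v, rep))
        for v in first_sequence[:per_neighbour]:
            if v not in second_reps:
                second_reps.append(v)
    return second_reps

members = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]   # cluster set S1
neighbour_reps = [(5.0, 0.0), (-3.0, 0.0)]                    # C2 and C4
second_reps = edge_points_toward_neighbours(members, neighbour_reps)
```

The selected points are the member facing C2 and the member facing C4, i.e., the edge points of the first cluster set in the directions of its two neighbours.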
It is to be understood that the above manner of obtaining the neighboring cluster sets of the first cluster set, i.e. the second cluster set, by sorting according to the distance between the representative point of each cluster set and the first representative point of the first cluster set is only exemplary. In this embodiment, the method for obtaining the neighboring cluster set of the first cluster set, that is, the second cluster set, may also be implemented in other manners.
The second cluster set may also be obtained as follows: the cluster sets, other than the first cluster set, whose first representative points are less than a set distance from the first representative point of the first cluster set form the cluster sets of a second sequence; then either any preset number of cluster sets in the second sequence are taken as second cluster sets, or the cluster sets in the second sequence are sorted from near to far according to the distance between their first representative points and the first representative point of the first cluster set, and the first preset number of cluster sets in the ranking are taken as second cluster sets.
For example, as shown in fig. 2, the second cluster set adjacent to the cluster set S1 may be determined as follows. The distances between the first representative points of the cluster sets other than the cluster set S1 and the first representative point of the cluster set S1 are first obtained: the distance between the first representative point C2 of the cluster set S2 and the first representative point C1 of the first cluster set S1 is d5, the distance between the first representative point C3 of the cluster set S3 and the first representative point C1 is d6, and the distance between the first representative point C4 of the cluster set S4 and the first representative point C1 is d7. Then d5, d6 and d7 are compared with the set distance D; if d5 and d7 are smaller than the set distance D, the cluster set S2 and the cluster set S4 form the cluster sets of the second sequence. The cluster sets S2 and S4 in the second sequence are then sorted from near to far according to the distances d5 and d7 between their first representative points and the first representative point of the cluster set S1; if d7 is smaller than d5, the ranking is cluster set S4, cluster set S2. If the preset number of second cluster sets is one, the cluster set S4 is taken as the second cluster set.
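A sketch of this threshold-based selection follows, with hypothetical coordinates that give d5 = 5, d6 = 8 and d7 = 3 and an assumed set distance D = 6; the function name is illustrative.

```python
import math

def adjacent_within(first_reps, target, set_distance, preset_number=None):
    """Keep the cluster sets whose first representative point lies closer than
    `set_distance` to that of `target`, sorted from near to far."""
    def gap(name):
        return math.dist(first_reps[name], first_reps[target])
    second_sequence = sorted(
        (n for n in first_reps if n != target and gap(n) < set_distance), key=gap)
    if preset_number is None:
        return second_sequence
    return second_sequence[:preset_number]

first_reps = {"S1": (0.0, 0.0), "S2": (5.0, 0.0), "S3": (0.0, 8.0), "S4": (-3.0, 0.0)}
within = adjacent_within(first_reps, "S1", set_distance=6.0)
chosen = adjacent_within(first_reps, "S1", set_distance=6.0, preset_number=1)
```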
For another example, the second cluster set may also be obtained as follows: estimating the approximate radius distance of each cluster set other than the first cluster set; obtaining, for each of those cluster sets, the difference between the distance from its first representative point to the first representative point of the first cluster set and its approximate radius distance; sorting the cluster sets from near to far according to the differences, i.e., from the smallest difference to the largest, to obtain a second sequence; and taking a preset number of cluster sets in the second sequence as the second cluster sets.
It will be appreciated that in some embodiments, the approximate radius distance of each cluster set other than the first cluster set in which the target feature vector itself is located may be estimated as follows: the distances between all the feature vectors in the cluster set and the representative point of that cluster set are obtained, and the maximum of these distances is taken as the approximate radius distance of the cluster set.
For example, as shown in fig. 3b, the second cluster set adjacent to the cluster set S1 may be determined as follows. First, the cluster sets S2, S3 and S4 other than the cluster set S1 are determined. The farthest distance between the feature vectors in the cluster set S2 and the first representative point C2 of the cluster set S2 is taken as the approximate radius distance of the cluster set S2; assuming that the farthest distance is the distance r1 between the feature vector Y3 and the first representative point C2, r1 is taken as the approximate radius distance of the cluster set S2. The farthest distance between the feature vectors in the cluster set S3 and the first representative point C3 is taken as the approximate radius distance of the cluster set S3; assuming that the farthest distance is the distance r2 between the feature vector Z1 and the first representative point C3, r2 is taken as the approximate radius distance of the cluster set S3. The farthest distance between the feature vectors in the cluster set S4 and the first representative point C4 is taken as the approximate radius distance of the cluster set S4; assuming that the farthest distance is the distance r3 between the feature vector W5 and the first representative point C4, r3 is taken as the approximate radius distance of the cluster set S4.
Then, the difference between the distance d5, from the first representative point C2 of the cluster set S2 to the first representative point C1 of the cluster set S1, and the approximate radius distance r1 of the cluster set S2 is determined as d5-r1; the difference between the distance d6, from the first representative point C3 of the cluster set S3 to the first representative point C1, and the approximate radius distance r2 of the cluster set S3 is determined as d6-r2; and the difference between the distance d7, from the first representative point C4 of the cluster set S4 to the first representative point C1, and the approximate radius distance r3 of the cluster set S4 is determined as d7-r3. The cluster sets S2, S3 and S4 are sorted from near to far according to these differences to obtain the second sequence. Assuming that the first cluster set in the second sequence is taken as the second cluster set, the second cluster set is the cluster set with the smallest difference.
In some embodiments, the approximate radius of each cluster set may be estimated in any other practical manner, such as by a neural network model or a correlation algorithm.
It is to be understood that the difference between the approximate radius and the distance between the representative point of any one cluster set and the first representative point of the first cluster set may approximately represent the distance between the edge feature vector in the cluster set and the first representative point of the first cluster set. When the distance between the edge feature vector and the first representative point of the first cluster set is closer, there may be a case that the cluster set where the edge feature vector is located is a neighboring set of the first cluster set.
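The radius-based ranking can be sketched as below, using the maximum member distance as the approximate radius distance. All coordinates and function names are hypothetical. With these values the large cluster set S3 ranks ahead of S2 even though its first representative point is farther from C1, because its edge reaches closer to the first cluster set.

```python
import math

def approximate_radius(members, rep):
    """Largest distance between any member and the representative point."""
    return max(math.dist(v, rep) for v in members)

def adjacent_by_edge_gap(first_reps, clusters, target, preset_number):
    """Rank the other cluster sets by (distance between representative points)
    minus (approximate radius distance), from the smallest difference upward."""
    differences = {}
    for name, rep in first_reps.items():
        if name == target:
            continue
        centre_distance = math.dist(rep, first_reps[target])
        differences[name] = centre_distance - approximate_radius(clusters[name], rep)
    second_sequence = sorted(differences, key=differences.get)
    return second_sequence[:preset_number]

first_reps = {"S1": (0.0, 0.0), "S2": (5.0, 0.0), "S3": (0.0, 8.0), "S4": (-3.0, 0.0)}
clusters = {
    "S1": [(1.0, 0.0), (-1.0, 0.0)],
    "S2": [(4.5, 0.0), (5.5, 0.0)],      # radius 0.5, difference 4.5
    "S3": [(0.0, 4.2), (0.0, 11.8)],     # radius about 3.8, difference about 4.2
    "S4": [(-3.0, 0.5), (-3.0, -0.5)],   # radius 0.5, difference 2.5
}
neighbours = adjacent_by_edge_gap(first_reps, clusters, "S1", preset_number=2)
```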
In a fourth possible implementation, the manner of determining at least one second representative point meeting the preset condition may be: determining, among the cluster sets other than the first cluster set, at least one second cluster set adjacent to the first cluster set; acquiring a first distance between each feature vector in the first cluster set and the first representative point of the first cluster set, and a second distance between each feature vector in the first cluster set and the first representative point of each second cluster set, and computing the sum of the first distance and the second distance; sorting the feature vectors from far to near according to the sum of distances to obtain the first sequence corresponding to each second cluster set; and taking the first preset number of feature vectors in the first sequence as second representative points. It is to be understood that the manner of obtaining at least one second cluster set adjacent to the first cluster set in the fourth scheme is the same as the manner of obtaining the second cluster set described in the third scheme, and is not described again here.
It can be understood that the first sequence is obtained by sorting the sums of the first distance and the second distance from large to small. A feature vector with a large sum of distances is far from the first representative point of the first cluster set and also far from the first representative point of the second cluster set; that is, it may be an edge point of the first cluster set lying on the side opposite to the direction from the first representative point of the first cluster set toward the second cluster set. Once the first sequences corresponding to the second cluster sets in each direction of the first cluster set are obtained, the edge points in each direction of the first cluster set can be obtained. Taking the edge points in all directions of the first cluster set as second representative points can therefore effectively improve search accuracy.
The manner of obtaining the second representative point in the fourth scheme of the present embodiment is described below by taking as an example the obtaining of at least one second representative point of the cluster set S1 in fig. 4:
as shown in fig. 4, assume that the second cluster sets adjacent to the first cluster set S1 are the cluster set S2 and the cluster set S4 described above with reference to fig. 3a. For the adjacent second cluster set S2, the sum of the first distance between each feature vector in the first cluster set S1 and the first representative point C1 of the first cluster set S1 and the second distance between that feature vector and the first representative point C2 of the second cluster set S2 is obtained.
For example, as shown in fig. 4, the sum of the first distance d1_1 between the feature vector X1 in the first cluster set S1 and the first representative point C1 and the second distance d2_1 between the feature vector X1 and the first representative point C2 is d_S2; the sums of distances of the other 15 feature vectors are obtained in the same way and are not described again here. Assume that the first sequence corresponding to the cluster set S2, obtained by sorting the 16 feature vectors in the first cluster set S1 from far to near according to the sum of distances, is X2-X3-X1-...-X14. Assuming that the preset number of second representative points obtained for each adjacent cluster set is one, the first feature vector in the first sequence corresponding to the second cluster set S2, i.e., the feature vector X2, may be used as a second representative point.
Likewise, for the adjacent second cluster set S4, the sum of the first distance between each feature vector in the first cluster set S1 and the first representative point C1 of the first cluster set S1 and the second distance between that feature vector and the first representative point C4 of the second cluster set S4 is obtained. For example, in fig. 4, the sum of the first distance d1_1 between the feature vector X1 in the first cluster set S1 and the first representative point C1 and the second distance d4_1 between the feature vector X1 and the first representative point C4 is d_S4; the sums of distances of the other 15 feature vectors are obtained in the same way and are not described again here. Assume that the first sequence obtained by sorting the 16 feature vectors in the first cluster set S1 from far to near according to the sum of distances is X4-X1-X3-...-X15. Assuming that the preset number of second representative points obtained for each adjacent cluster set is one, the first feature vector in the first sequence corresponding to the second cluster set S4, i.e., the feature vector X4, may be used as a second representative point.
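The fourth solution differs from the third only in the sort key and the sort direction, as the following sketch suggests. The coordinates and names are hypothetical; duplicates are skipped when one member tops more than one first sequence.

```python
import math

def edge_points_away_from_neighbours(members, own_rep, neighbour_reps, per_neighbour=1):
    """For each adjacent cluster set, sort the members from far to near by the sum of
    (distance to own first representative point) and (distance to the neighbour's
    first representative point); keep the first `per_neighbour` members."""
    second_reps = []
    for rep in neighbour_reps:
        first_sequence = sorted(
            members,
            key=lambda v: math.dist(v, own_rep) + math.dist(v, rep),
            reverse=True)
        for v in first_sequence[:per_neighbour]:
            if v not in second_reps:
                second_reps.append(v)
    return second_reps

members = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]   # cluster set S1
own_rep = (0.0, 0.0)                                          # C1
neighbour_reps = [(5.0, 0.0), (-3.0, 0.0)]                    # C2 and C4
second_reps = edge_points_away_from_neighbours(members, own_rep, neighbour_reps)
```

For the neighbour at (5.0, 0.0) the largest sum belongs to (-2.0, 0.0), the edge point on the far side of the first cluster set, and for the neighbour at (-3.0, 0.0) it belongs to (2.0, 0.0).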
Based on the index construction method, index association can be established between each feature vector in each cluster set and the first representative point and the second representative points of that cluster set. In this way, during a query, both the distance between the query vector and the first representative point, i.e., the central point, of a cluster set and the distance between the query vector and the second representative points, i.e., the edge points, of that cluster set can be obtained. During retrieval, the distances between the query vector and the first representative points and between the query vector and the second representative points can therefore be compared, so that the distance between the query vector and edge points far from the central point is also taken into account. This solves the problem in the prior art that, when the target vector is an edge point, a more accurate target vector cannot be found because only the distance between the query vector and the central point is considered. Because the edge points serve as second representative points and their distances to the query vector are calculated together with those of the first representative points in order to determine the target cluster set, the target vectors corresponding to the query vector can be obtained more accurately. The low search accuracy caused by the representative point of a cluster set being unable to fully represent all the feature vectors in it is effectively avoided, and search accuracy is effectively improved.
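A minimal sketch of a search over such an index is given below. The list-of-pairs index layout, the `nprobe` parameter (how many target cluster sets to scan), and the toy coordinates are assumptions for illustration and are not prescribed by the application.

```python
import math

def build_index(clusters, first_reps, second_reps):
    """Associate every representative point (first or second) with its cluster set."""
    index = []
    for name in clusters:
        index.append((first_reps[name], name))
        for rep in second_reps.get(name, []):
            index.append((rep, name))
    return index

def search(query, index, clusters, nprobe=1):
    """Rank all representative points by distance to the query, take the cluster sets
    of the closest ones as target cluster sets, and scan only those sets."""
    ranked = sorted(index, key=lambda entry: math.dist(query, entry[0]))
    targets = []
    for _, name in ranked:
        if name not in targets:
            targets.append(name)
        if len(targets) == nprobe:
            break
    candidates = [v for name in targets for v in clusters[name]]
    return targets, min(candidates, key=lambda v: math.dist(query, v))

clusters = {"S1": [(-4.8, 0.0), (4.8, 0.0)], "S2": [(8.0, 0.0), (12.0, 0.0)]}
first_reps = {"S1": (0.0, 0.0), "S2": (10.0, 0.0)}
second_reps = {"S1": [(4.8, 0.0)], "S2": [(8.0, 0.0)]}   # edge points

index = build_index(clusters, first_reps, second_reps)
targets, found = search((5.5, 0.0), index, clusters)
```

The query (5.5, 0.0) is closer to the first representative point of S2 than to that of S1, but the edge point (4.8, 0.0) of S1 is the closest representative point overall, so S1 becomes the target cluster set and the true nearest vector is returned.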
It is understood that, when the second representative point of each cluster set of the whole database is obtained, the preset number of the second representative points of each cluster set can be determined according to the following method:
in a first practical solution, the method for determining the preset number of second representative points of each cluster set is: determining the preset number of second representative points of each cluster set according to the preset total number of second representative points in the database and the ratio of the numbers, or of the radii, of the feature vectors in the cluster sets. It is to be understood that, in some embodiments, when the calculated preset number of second representative points of any cluster set is not an integer, the final preset number may be obtained according to a rounding rule. For example, the rounding rule may be to truncate the fractional part of the calculated value and add 1 to the remaining integer part, and to use the result as the final preset number.
For example, suppose the preset total number of second representative points is 4 and, as shown for the cluster sets of the database S in fig. 1b, there are 16 feature vectors in the cluster set S1, 14 in the cluster set S2, 10 in the cluster set S3 and 17 in the cluster set S4. The numbers of second representative points to be determined in the cluster sets S1, S2, S3 and S4 are then in the ratio 16:14:10:17. Dividing the preset total number 4 in this ratio (57 feature vectors in total) gives approximately 1.12, 0.98, 0.70 and 1.19 second representative points for the cluster sets S1, S2, S3 and S4. According to the above rounding rule, the preset numbers of second representative points to be determined in the cluster sets S1, S2, S3 and S4 are 2, 1, 1 and 2, respectively.
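The proportional allocation with the round-up rule can be sketched as follows. This is only an illustrative sketch: the function name `allocate_proportional` and its parameters are hypothetical and do not appear in the application, and integer ceiling division is used as one possible way to apply the rounding rule.

```python
def allocate_proportional(total, sizes):
    """Split `total` second representative points across cluster sets in
    proportion to `sizes`, rounding every fractional share up.

    `sizes` may hold feature-vector counts or radii scaled to integers
    (hypothetical helper, not taken from the application).
    """
    whole = sum(sizes)
    # Integer ceiling division: ceil(total * size / whole) without floats.
    return [-(-(total * size) // whole) for size in sizes]


# Cluster sets S1..S4 with 16, 14, 10 and 17 feature vectors, total of 4.
counts = allocate_proportional(4, [16, 14, 10, 17])
print(counts)
```

Because every fractional share is rounded up, the per-cluster numbers in this example add up to 6 rather than 4, as in the figures given in the text.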
It can be understood that determining the preset number of second representative points of each cluster set in proportion to its number of feature vectors or its radius makes the distribution of the second representative points more uniform. This avoids wasting resources by determining many second representative points for small cluster sets, and avoids the drop in search accuracy that occurs when larger cluster sets are given so few second representative points that these points cannot fully represent the cluster set.
In a second practical solution, the preset number of second representative points of each cluster set is determined as follows: according to the preset total number of second representative points in the database, a set number of second representative points is selected from each cluster set in descending order of the number of feature vectors (or the radius) of the cluster sets, until the preset total number has been selected.
For example, suppose the preset total number of second representative points is 4 and the order is determined by the number of feature vectors in each cluster set of the database S shown in fig. 1b. With 16 feature vectors in the cluster set S1, 14 in the cluster set S2, 10 in the cluster set S3 and 17 in the cluster set S4, arranging the cluster sets in descending order of their numbers of feature vectors gives the order cluster set S4, cluster set S1, cluster set S2, cluster set S3.
When the set number of second representative points selected in each cluster set is 2, then following this order the preset number of second representative points to be determined in each of the cluster sets S4 and S1 is 2. Since the preset total number of second representative points has been reached at this point, the subsequent cluster sets S2 and S3 no longer pick second representative points; that is, the preset number for the cluster sets S2 and S3, which are arranged after the cluster set S1, is 0.
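The second solution can be sketched in the same way. The name `allocate_by_rank` is hypothetical, and capping the last allocation at the remaining budget is an assumption added for the case where the preset total is not a multiple of the set number; the text does not cover that case.

```python
def allocate_by_rank(total, sizes, per_cluster):
    """Hand out `per_cluster` second representative points to cluster sets
    in descending order of size until `total` points have been assigned."""
    counts = [0] * len(sizes)
    remaining = total
    # Indices of the cluster sets, largest first.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for i in order:
        if remaining <= 0:
            break
        # Assumption: never exceed the remaining budget.
        counts[i] = min(per_cluster, remaining)
        remaining -= counts[i]
    return counts


# S1..S4 hold 16, 14, 10 and 17 feature vectors; total 4, two per cluster set.
ranked_counts = allocate_by_rank(4, [16, 14, 10, 17], 2)
print(ranked_counts)
```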
It can be understood that, by selecting the set number of second representative points from each cluster set according to the preset total number of the second representative points and sorting the number or radius of the feature vectors in each cluster set from large to small, the second representative points with a smaller number can be determined in a smaller cluster set, and the second representative points with a larger number can be determined in other larger cluster sets, so that the determined representative points can fully represent the corresponding cluster sets, and the search accuracy is effectively improved.
It is understood that after the second representative point of each cluster set is obtained, for each feature vector in the cluster set, an index association between the feature vector and the first representative point and the second representative point in the cluster set is respectively established.
For example, fig. 5a shows the first and second representative points for the database S shown in fig. 1a, and fig. 5b shows a schematic diagram of the corresponding index structure. As shown in fig. 5a, the representative points of the database S are as follows: the cluster set S1 has the first representative point C1 and the second representative points X1 and X7; the cluster set S2 has the first representative point C2 and the second representative points Y1 and Y2; the cluster set S3 has the first representative point C3 and the second representative point Z1; and the cluster set S4 has the first representative point C4 and the second representative points W1, W2, W3 and W4. As shown in fig. 5b, the index structure of the database S includes a representative point item and an inverted file item. The representative point item contains the first and second representative points of the cluster sets S1, S2, S3 and S4 listed above.
The first representative point C1 and the second representative points X1 and X7 of the cluster set S1 correspond to the inverted file D1, and the inverted file D1 includes each feature vector in the cluster set S1 corresponding to the representative point C1.
The first representative point C2 and the second representative points Y1 and Y2 of the cluster set S2 correspond to the inverted file D2, and the inverted file D2 includes each feature vector in the cluster set S2 corresponding to the representative point C2.
The first representative point C3 and the second representative point Z1 of the cluster set S3 correspond to the inverted file D3, and the inverted file D3 includes each feature vector in the cluster set S3 corresponding to the representative point C3.
The first representative point C4 and the second representative points W1, W2, W3 and W4 of the cluster set S4 correspond to the inverted file D4, and the inverted file D4 includes each feature vector in the cluster set S4 corresponding to the representative point C4.
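One possible in-memory layout for such an index structure is sketched below. The function `build_index`, the dictionary keys and the toy two-dimensional vectors are all hypothetical; the application does not prescribe a concrete data layout.

```python
def build_index(clusters):
    """Build a representative point item and an inverted file item.

    clusters maps a cluster id to a dict with the keys "first" (center point),
    "second" (list of edge points) and "members" (all feature vectors).
    """
    rep_item = []        # (representative point, cluster id) pairs
    inverted_item = {}   # cluster id -> feature vectors of that cluster set
    for cid, c in clusters.items():
        rep_item.append((c["first"], cid))
        rep_item.extend((p, cid) for p in c["second"])
        inverted_item[cid] = list(c["members"])
    return rep_item, inverted_item


# Toy data: two cluster sets on a line.
clusters = {
    "S1": {"first": (0.0, 0.0), "second": [(2.0, 0.0), (-2.0, 0.0)],
           "members": [(-2.0, 0.0), (0.0, 0.0), (2.0, 0.0)]},
    "S2": {"first": (6.0, 0.0), "second": [(5.0, 0.0)],
           "members": [(5.0, 0.0), (6.0, 0.0), (7.0, 0.0)]},
}
rep_item, inverted_item = build_index(clusters)
print(len(rep_item), sorted(inverted_item))
```

Every representative point of a cluster set, whether first or second, points to the same inverted file, which mirrors the relationship between C1, X1, X7 and D1 described above.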
Corresponding to the above index construction method, when performing a vector search the retrieval system may: first obtain a query vector; obtain the distances between the query vector and the first and second representative points of each cluster set in the database; determine the preset number of representative points closest to the query vector, and take the cluster sets corresponding to these representative points as target cluster sets; obtain the distance between each feature vector in the target cluster sets and the query vector; and determine the target vector corresponding to the query vector according to these distances.
It is to be understood that when several of the selected representative points belong to the same cluster set, so that the determined target cluster sets contain duplicates, only one copy of each duplicated cluster set may be retained.
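The search flow just described can be sketched as follows, assuming Euclidean distance and a flat list of representative points. The function `search`, the parameters `n_probe` and `top_k`, and the toy data are illustrative assumptions rather than the application's own interface.

```python
import math


def search(query, rep_item, inverted_item, n_probe, top_k):
    """Probe the cluster sets of the n_probe nearest representative points."""
    # 1. Rank all representative points (first and second) by distance.
    ranked = sorted(rep_item, key=lambda entry: math.dist(query, entry[0]))
    # 2. Collect their cluster sets, keeping each cluster set only once.
    targets = []
    for _, cid in ranked[:n_probe]:
        if cid not in targets:
            targets.append(cid)
    # 3. Scan only the inverted files of the target cluster sets.
    candidates = [v for cid in targets for v in inverted_item[cid]]
    candidates.sort(key=lambda v: math.dist(query, v))
    return targets, candidates[:top_k]


rep_item = [((0.0, 0.0), "S1"), ((2.0, 0.0), "S1"), ((-2.0, 0.0), "S1"),
            ((6.0, 0.0), "S2"), ((5.0, 0.0), "S2")]
inverted_item = {
    "S1": [(-2.0, 0.0), (0.0, 0.0), (2.0, 0.0)],
    "S2": [(5.0, 0.0), (6.0, 0.0), (7.0, 0.0)],
}
targets, hits = search((3.2, 0.0), rep_item, inverted_item, n_probe=3, top_k=2)
print(targets, hits)
```

In this toy data the query (3.2, 0) is closer to the center of S2 (distance 2.8) than to the center of S1 (distance 3.2), yet its nearest feature vector (2, 0) lies on the edge of S1. With the edge points included as representative points, S1 is probed even when `n_probe` is 1.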
It can be understood that the index construction method provided by the embodiment of the application can determine, for each cluster set, at least one second representative point that is far from the first representative point of that cluster set. During a search, the distances between the query vector and the first representative points and the distances between the query vector and the second representative points can both be compared, and the preset number of representative points closest to the query vector can be obtained. Because edge points are used as representative points when the index is constructed, the distance between the query vector and edge points far from the center point is taken into account during retrieval. This addresses the problem in the prior art that, when the target vector is an edge point, considering only the distance between the query vector and the center point fails to find the more accurate target vector. In the query process, the distances between the query vector and the first and second representative points of each cluster set are obtained and compared in order to select the target cluster sets, so that the vectors nearest to the query vector can be obtained more accurately. The low search accuracy caused by the representative point of a cluster set failing to represent all feature vectors in that cluster set can thereby be effectively addressed, and search accuracy is effectively improved.
For example, an index is constructed for the picture database S shown in fig. 1. The system first assigns each feature vector in the database S to one of four cluster sets through the k-means clustering process, obtaining the four cluster sets shown in fig. 1a: the cluster set S1, the cluster set S2, the cluster set S3 and the cluster set S4. Each of the four cluster sets has a corresponding first representative point and corresponding second representative points, and the second representative points of each cluster set are determined in one of the manners of determining at least one second representative point meeting the preset conditions. For example, as shown in fig. 5a, the cluster set S1 has the first representative point C1 and the second representative points X1 and X7; the cluster set S2 has the first representative point C2 and the second representative points Y1 and Y2; the cluster set S3 has the first representative point C3 and the second representative point Z1; and the cluster set S4 has the first representative point C4 and the second representative points W1, W2, W3 and W4. At this time, an index structure as shown in fig. 5b is established by the inverted file method from the first and second representative points of each cluster set and the feature vectors in the cluster set to which these representative points belong.
When a vector search for the query vector a is performed on the database S based on the index structure shown in fig. 5b established as above, the retrieval system determines the distance between the query vector a and the first and second representative points of each cluster set in the database S. For example, as shown in fig. 6a, the retrieval system determines the distances between the query vector a and 13 representative points: the first representative point C1 and the second representative points X1 and X7 of the cluster set S1; the first representative point C2 and the second representative points Y1 and Y2 of the cluster set S2; the first representative point C3 and the second representative point Z1 of the cluster set S3; and the first representative point C4 and the second representative points W1, W2, W3 and W4 of the cluster set S4. The two of the 13 representative points closest to the query vector a are then determined; suppose they are the second representative point X1 of the cluster set S1 and the second representative point Y1 of the cluster set S2. According to the index structure, the retrieval system then searches each feature vector in the cluster set S1, where the representative point X1 is located, and each feature vector in the cluster set S2, where the representative point Y1 is located; that is, it obtains the distance between the query vector a and each feature vector in the cluster sets S1 and S2.
In some embodiments, a feature vector whose distance from the query vector a is less than a set value may be taken as a target vector. Assuming that the distances between the query vector a and the feature vector X1 in the cluster set S1 and the feature vectors Y1 and Y2 in the cluster set S2 are less than the set value, then, as shown in fig. 6b, the feature vector X1 and the feature vectors Y1 and Y2 are used as target vectors.
In other embodiments, the feature vectors in the target cluster sets may also be sorted from near to far according to their distance from the query vector, and the first set number of feature vectors in the sorted order may be selected as the target vectors of the query vector. Assume that the feature vector X1 in the cluster set S1 and the feature vectors Y1 and Y2 in the cluster set S2 are the first 3 feature vectors in the sorted order. Then, as shown in fig. 6b, the feature vector X1 and the feature vectors Y1 and Y2 are used as target vectors.
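The two ways of choosing target vectors, by a distance threshold or by taking the nearest few, can be contrasted in a short sketch. Euclidean distance and the helper names `select_by_threshold` and `select_top_k` are assumptions made for illustration.

```python
import math


def select_by_threshold(query, candidates, set_value):
    """Keep every candidate whose distance to the query is below set_value."""
    return [v for v in candidates if math.dist(query, v) < set_value]


def select_top_k(query, candidates, k):
    """Keep the k candidates nearest to the query, nearest first."""
    return sorted(candidates, key=lambda v: math.dist(query, v))[:k]


query = (0.0, 0.0)
candidates = [(5.0, 0.0), (1.0, 0.0), (0.0, 0.0), (3.0, 0.0)]
within = select_by_threshold(query, candidates, 2.0)
nearest = select_top_k(query, candidates, 3)
print(within, nearest)
```

The threshold variant can return any number of vectors, including none, whereas the top-k variant always returns a fixed number as long as enough candidates exist.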
It can be understood that applying the index construction method to a picture database in the embodiment of the present application is only an example. The index construction method provided in the embodiment of the present application may also be applied to databases of video, voice, protein molecular structures and the like; that is, it may be widely applied to fields such as image, video, voice and protein molecular structure retrieval.
It is understood that the retrieval system mentioned in the embodiment of the present application may include at least one database, for example the above-mentioned picture database, and may further include a video database, a document database, and the like. The picture database comprises a plurality of feature vectors corresponding to original picture data, the video database comprises a plurality of feature vectors corresponding to original video data, and the document database comprises a plurality of feature vectors corresponding to original document data. In practice, each database can be indexed by the index construction method.
The index construction method provided by the embodiment of the present application is described in detail below. It can be understood that the index construction method provided in the embodiment of the present application may be executed by the retrieval system, and may also be executed by other electronic devices, that is, the other electronic devices construct the index for the database, and then deploy the database that has been constructed and indexed into the retrieval system.
Fig. 7 shows a flowchart of the index construction method in the embodiment of the present application.
As shown in fig. 7, the index building method in the embodiment of the present application may include:
701: Acquire the feature vector corresponding to each piece of data in the database for which the index is to be created.
It will be appreciated that the database may contain one or more types of data, such as pictures, video, voice and protein molecular structures. Because all of these kinds of data can be converted into high-dimensional feature vectors, the retrieval system can first convert each piece of original data in the database into a corresponding feature vector; it can be understood that the feature vectors obtained in this way are the target vectors. It will be appreciated that the original data in the database may still be retained and stored in the database.
For example, for the above-mentioned picture database, the retrieval system may first convert each picture data in the picture database into a corresponding feature vector.
702: Cluster the feature vectors to obtain a plurality of cluster sets, each of which has a corresponding first representative point.
It is to be understood that the first representative point may be a center point of the corresponding cluster set.
It can be understood that, in the embodiment of the present application, the feature vectors may be clustered by k-means clustering, which specifically includes:
Acquiring K initial cluster representative points and K preset cluster sets, where each cluster set corresponds to one initial first representative point. It is understood that at this time no feature vector has been assigned to any cluster set; each cluster set is a virtual set with a corresponding initial representative point. The value of K, i.e. the number of cluster sets, can be set manually according to actual requirements.
Obtaining the distance between each feature vector in the database and the first representative point of each cluster set, and assigning each feature vector to the cluster set whose first representative point is closest. A loss function is then calculated from the feature vectors assigned to each cluster set, and the above steps are iterated until the loss function reaches a preset threshold.
It is understood that the distances mentioned in the embodiments of the present application may be euclidean distances, inner product distances, hamming distances, and other distances.
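For reference, the three distances named above can be written as small functions. Negating the inner product so that a larger inner product yields a smaller distance is one common convention and is an assumption here; the application does not define the inner product distance.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.dist(a, b)


def inner_product_distance(a, b):
    # Assumption: a larger inner product means "closer", so negate it.
    return -sum(x * y for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


print(euclidean((0, 0), (3, 4)))
print(inner_product_distance((1, 2), (3, 4)))
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))
```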
In other embodiments, the first representative point in each cluster set may be re-determined. For example, the average vector of all feature vectors in the cluster set may be used as the first representative point.
For example, for any target vector, the distance between the target vector and the first representative point of each cluster set may be obtained, and the target vector is assigned to the first cluster set corresponding to the closest first representative point.
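A compact sketch of the k-means procedure described above is given below. Several details are assumptions not stated in the text: the loss is taken to be the sum of squared Euclidean distances to the assigned first representative points, "reaches a preset threshold" is read as "falls to or below `loss_threshold`", and `max_iter` is an added safeguard against endless iteration.

```python
import math


def kmeans(vectors, centers, loss_threshold=0.0, max_iter=50):
    """Assign vectors to their nearest center, then move each center to
    the mean of its cluster set; repeat until the loss is small enough."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment: each vector joins the cluster set of its nearest center.
        clusters = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(v, centers[i]))
            clusters[nearest].append(v)
        # Update: the mean vector becomes the new first representative point.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        # Loss (assumed): sum of squared distances to the assigned centers.
        loss = sum(math.dist(v, centers[j]) ** 2
                   for j, members in enumerate(clusters) for v in members)
        if loss <= loss_threshold:
            break
    return centers, clusters


data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, clusters = kmeans(data, [(0.0, 0.0), (10.0, 0.0)], loss_threshold=4.0)
print(centers)
```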
It is understood that the above-mentioned clustering method for each feature vector may be other clustering methods.
703: Determine, for each cluster set, a preset number of second representative points meeting preset conditions.
It can be understood that the preset number of the second representative points corresponding to each cluster set is determined according to the preset total number of the second representative points in the database, and when the preset total number of the second representative points is less, there may be some cluster sets in which the preset number of the second representative points is 0, and it is not necessary to determine the second representative points of the cluster sets in which the preset number is 0.
It is to be understood that the second representative point may be an edge point of the corresponding cluster set.
The following describes a manner of determining a second representative point in the embodiment of the present application by taking an example of determining at least one second representative point meeting a preset condition for a first cluster set where any target vector is located.
In some implementations, the at least one second representative point meeting the preset condition may be determined as follows: acquiring a first distance between the first representative point of each first cluster set and every feature vector in that cluster set; sorting the feature vectors from far to near according to the first distance to obtain a first sequence corresponding to each first cluster set; and determining the first preset number of feature vectors in the first sequence as the second representative points of the cluster set.
In the embodiment of the application, the second representative point is determined according to the distance between each feature vector in the first cluster set and the first representative point, so that the feature vector farther away from the first representative point can be used as the second representative point, and the selection of the second representative point is more accurate.
In the embodiment of the application, the feature vectors in the first cluster set are sorted from far to near according to their distance from the first representative point, so that the corresponding second representative points can be selected conveniently according to the set number. For example, if the set number of second representative points is five, the first five feature vectors in the sorted order, i.e. in the first sequence, may be used directly as the second representative points.
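Selecting the farthest feature vectors as second representative points can be sketched as follows, assuming Euclidean distance. The name `farthest_points` and the toy vectors are illustrative only.

```python
import math


def farthest_points(members, first_rep, n):
    """Return the n feature vectors farthest from the first representative point."""
    # First sequence: members sorted from far to near.
    first_sequence = sorted(members,
                            key=lambda v: math.dist(v, first_rep),
                            reverse=True)
    return first_sequence[:n]


members = [(0.0, 0.0), (1.0, 0.0), (-3.0, 0.0), (0.0, 2.0)]
second_reps = farthest_points(members, (0.0, 0.0), 2)
print(second_reps)
```

A set-distance variant, as in the next implementation, would simply filter the members by a distance threshold before (or instead of) taking the head of the sorted sequence.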
In some implementations, the at least one second representative point meeting the preset condition may be determined as follows: acquiring a first distance between the first representative point of each first cluster set and every feature vector in that cluster set; taking the feature vectors whose first distance is greater than a set distance as the feature vectors of a first sequence; and then either taking all feature vectors in the first sequence as second representative points, or sorting the feature vectors in the first sequence from far to near according to the first distance and taking the feature vectors at preset positions in the sorted order as second representative points.
In the embodiment of the present application, the first sequence consists of the feature vectors in the first cluster set whose distance from the first representative point is greater than the set distance, that is, vectors far from the first representative point. This makes the manner of selecting the second representative points more standardized and simpler. For example, the second representative points can be obtained merely by setting a corresponding distance threshold parameter in the algorithm for selecting second representative points.
In some implementations, the determining at least one second representative point meeting the preset condition may be: determining at least one second cluster set adjacent to the first cluster set in other cluster sets except the first cluster set; acquiring the distance between each feature vector in the first clustering set and the first representative point of each second clustering set; sorting the feature vectors from near to far according to the distance to obtain a first sequence corresponding to each second clustering set; and taking the characteristic vectors of the preset number in the first sequence as second representative points.
In the embodiment of the application, the second representative point is determined according to the distance between each feature vector in the first cluster set and the first representative point of at least one adjacent second cluster set. It is understood that the smaller this distance, the closer the feature vector is to the second cluster set, i.e. the closer it is to the edge of the first cluster set, so that the second representative point can be determined more accurately according to the distance between the feature vector and the adjacent second cluster set.
The manner of obtaining at least one second cluster set adjacent to the first cluster set may be:
sorting all cluster sets except the first cluster set in the plurality of cluster sets from near to far according to the distance between the first representative point of each cluster set and the first representative point of the first cluster set, and acquiring a second sequence corresponding to the first cluster set; and combining the first preset number of clusters in the second sequence into a second cluster set.
For example, for the cluster sets shown in fig. 2, taking the first cluster set S1 as an example, the distances d5, d6 and d7 between the first representative point C1 of the first cluster set S1 and the first representative points C2, C3 and C4 of the other cluster sets are obtained and sorted from near to far. If d7 < d5 < d6, the second sequence corresponding to the first representative point C1 is cluster set S4, cluster set S2, cluster set S3. At this time, the first representative point C1 is associated with the first preset number of cluster sets in the second sequence. Assuming that the preset number is 2, the cluster set S4 and the cluster set S2 may be regarded as the neighboring second cluster sets of the first cluster set S1.
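The neighbor selection in this example can be sketched as follows. The coordinates are made up so that d7 < d5 < d6 holds, and `neighbor_clusters` is a hypothetical helper assuming Euclidean distance between first representative points.

```python
import math


def neighbor_clusters(first_id, first_reps, n):
    """Return the ids of the n cluster sets whose first representative
    points are nearest to that of the cluster set `first_id`."""
    others = [cid for cid in first_reps if cid != first_id]
    # Second sequence: other cluster sets sorted from near to far.
    second_sequence = sorted(
        others,
        key=lambda cid: math.dist(first_reps[cid], first_reps[first_id]))
    return second_sequence[:n]


# Toy centers chosen so that d7 (S4) = 3 < d5 (S2) = 4 < d6 (S3).
first_reps = {"S1": (0.0, 0.0), "S2": (4.0, 0.0),
              "S3": (6.0, 6.0), "S4": (0.0, 3.0)}
neighbors = neighbor_clusters("S1", first_reps, 2)
print(neighbors)
```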
After the second cluster sets are determined, for example as shown in fig. 3a, the distance between each feature vector in the first cluster set S1 and the first representative point C2 of the second cluster set S2 is obtained. Taking the feature vector X1 as an example, the distance d2_1 between the feature vector X1 in the first cluster set S1 and the first representative point C2 of the second cluster set S2 is obtained; the distances between the other feature vectors and the first representative point C2 are not described here again. The feature vectors are sorted from near to far according to their distance from the first representative point C2. Assuming that the first sequence obtained by sorting the 16 feature vectors in the first cluster set S1 from near to far is X1-X3-X5-...-X16 and the preset number is 1, the second representative point corresponding to the second cluster set S2 is X1. Meanwhile, the distance between each feature vector in the first cluster set S1 and the first representative point C4 of the second cluster set S4 is obtained. For the feature vector X1, the distance d4_1 between the feature vector X1 in the first cluster set S1 and the first representative point C4 of the second cluster set S4 is obtained; the distances between the other feature vectors and the first representative point C4 are not described here again. The feature vectors are sorted from near to far according to their distance from the first representative point C4. Assuming that the first sequence obtained by sorting the 16 feature vectors in the first cluster set S1 from near to far is X7-X8-X1-...-X14 and the preset number is 1, the second representative point corresponding to the second cluster set S4 is X7.
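Picking, for each neighboring second cluster set, the feature vectors of the first cluster set that lie nearest to that neighbor's first representative point can be sketched as below. The helper name, the toy coordinates and the rule of skipping a vector that was already chosen for another neighbor are assumptions for illustration.

```python
import math


def edge_points_toward(members, neighbor_reps, n):
    """For each neighboring first representative point, take the n members
    of the first cluster set that are nearest to it."""
    chosen = []
    for rep in neighbor_reps:
        # First sequence for this neighbor: members sorted from near to far.
        first_sequence = sorted(members, key=lambda v: math.dist(v, rep))
        for v in first_sequence[:n]:
            if v not in chosen:  # assumption: do not list a vector twice
                chosen.append(v)
    return chosen


members = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
# Neighboring first representative points, e.g. C2 to the right and C4 above.
edges = edge_points_toward(members, [(4.0, 0.0), (0.0, 3.0)], 1)
print(edges)
```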
It is to be understood that the above manner of obtaining neighboring cluster sets of the first cluster set by sorting according to the distance between the representative point of each cluster set and the first representative point of the first cluster set, i.e. the second cluster set, is only an example. In this embodiment, the manner of obtaining the second cluster set may also be other manners.
For example, the manner of obtaining the second cluster set may also be: forming cluster sets of which the distances between the first representative points of the cluster sets except the first cluster set and the first representative points of the first cluster set are smaller than a set distance into cluster sets in a second sequence; then, any preset number of cluster sets in the second sequence are used as a second cluster set, or the distances between the first representative points of the cluster sets in the second sequence and the first representative points of the first cluster sets are sorted from near to far, and the cluster sets in the preset sequence in the sorting are used as the second cluster set.
For example, as shown in fig. 2, the second cluster set adjacent to the cluster set S1 may be determined as follows. The distance between the first representative point of each cluster set other than the cluster set S1 and the first representative point of the cluster set S1 is first obtained. For example, the distance between the first representative point C2 of the cluster set S2 and the first representative point C1 of the first cluster set S1 is d5, the distance between the first representative point C3 of the cluster set S3 and the first representative point C1 of the first cluster set S1 is d6, and the distance between the first representative point C4 of the cluster set S4 and the first representative point C1 of the first cluster set S1 is d7. Then d5, d6 and d7 are compared with the set distance D. If d5 and d7 are smaller than the set distance D, the cluster set S2 and the cluster set S4 form the cluster sets in the second sequence. The cluster sets S2 and S4 in the second sequence are then sorted from near to far by the distances d5 and d7 between their first representative points and the first representative point of the cluster set S1; if d7 is smaller than d5, the sorted order is cluster set S4, cluster set S2. If the preset number of second cluster sets is one, the cluster set S4 is taken as the second cluster set.
For another example, the second cluster set may also be obtained as follows: estimating the approximate radius distance of each cluster set other than the first cluster set; obtaining, for each of these cluster sets, the difference between the distance from its first representative point to the first representative point of the first cluster set and its approximate radius distance; sorting these cluster sets from near to far according to the difference to obtain a second sequence; and taking the first preset number of cluster sets in the second sequence as the second cluster set.
It will be appreciated that, in some embodiments, the approximate radius distance of each cluster set other than the first cluster set may be estimated as follows: acquiring the distance between every feature vector in the cluster set and the representative point of that cluster set, and taking the maximum of these distances as the approximate radius distance of the cluster set.
For example, as shown in fig. 3b, the second cluster set adjacent to the cluster set S1 may be determined as follows. First, the cluster sets other than the cluster set S1, namely S2, S3 and S4, are determined. The farthest distance between any feature vector in the cluster set S2 and the first representative point C2 of the cluster set S2 is taken as the approximate radius distance of the cluster set S2; assuming that this farthest distance is the distance r1 between the feature vector Y3 and the first representative point C2, r1 is taken as the approximate radius distance of the cluster set S2. The farthest distance between any feature vector in the cluster set S3 and the first representative point C3 is taken as the approximate radius distance of the cluster set S3; assuming that this farthest distance is the distance r2 between the feature vector Z1 and the first representative point C3, r2 is taken as the approximate radius distance of the cluster set S3. The farthest distance between any feature vector in the cluster set S4 and the first representative point C4 is taken as the approximate radius distance of the cluster set S4; assuming that this farthest distance is the distance r3 between the feature vector W5 and the first representative point C4, r3 is taken as the approximate radius distance of the cluster set S4.
Then the difference between the distance d5 (between the first representative point C2 of the cluster set S2 and the first representative point C1 of the cluster set S1) and the radius distance r1 of the cluster set S2 is determined as d5-r1; the difference between the distance d6 (between the first representative point C3 of the cluster set S3 and the first representative point C1 of the cluster set S1) and the radius distance r2 of the cluster set S3 is determined as d6-r2; and the difference between the distance d7 (between the first representative point C4 of the cluster set S4 and the first representative point C1 of the cluster set S1) and the radius distance r3 of the cluster set S4 is determined as d7-r3. The cluster sets S2, S3 and S4 are sorted from near to far according to these differences to obtain the second sequence. Assuming that d7-r3 < d5-r1 < d6-r2, the second sequence is cluster set S4, cluster set S2, cluster set S3; assuming that the first 1 cluster set in the second sequence is taken as the second cluster set, the second cluster set is the cluster set S4.
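The radius-adjusted ranking can be sketched as follows, assuming Euclidean distance and the maximum-distance radius estimate described above. The helper names and toy coordinates are hypothetical.

```python
import math


def approx_radius(members, rep):
    """Largest distance from any member to the cluster set's representative point."""
    return max(math.dist(v, rep) for v in members)


def neighbors_by_gap(first_rep, others, n):
    """Rank other cluster sets by (center distance - approximate radius).

    others maps a cluster id to a (first representative point, members) pair.
    """
    gap = {cid: math.dist(rep, first_rep) - approx_radius(members, rep)
           for cid, (rep, members) in others.items()}
    # Second sequence: smallest difference first.
    return sorted(gap, key=gap.get)[:n]


others = {
    "S2": ((4.0, 0.0), [(3.0, 0.0), (5.0, 0.0)]),   # d5 - r1 = 4 - 1
    "S3": ((6.0, 6.0), [(5.0, 5.0), (7.0, 7.0)]),   # d6 - r2 is about 7.07
    "S4": ((0.0, 3.0), [(0.0, 2.0), (0.0, 4.0)]),   # d7 - r3 = 3 - 1
}
nearest_sets = neighbors_by_gap((0.0, 0.0), others, 2)
print(nearest_sets)
```

Subtracting the radius estimates how close the boundary of each cluster set, rather than its center, comes to the first cluster set, so a large cluster set with a distant center can still rank as a neighbor.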
In some embodiments, the approximate radius of each cluster set may be estimated in any other practical manner, such as by a neural network model or a correlation algorithm.
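The adjacency test described above (distance between first representative points minus the approximate radius, sorted from near to far) can be sketched as follows. This is a minimal illustration only: the function names `approx_radius` and `adjacent_cluster_sets`, the dictionary layout and the 2-D coordinates are assumptions made for the example and do not come from the application.

```python
import math

def approx_radius(center, vectors):
    """Approximate radius: farthest distance from the first representative point to a member."""
    return max(math.dist(center, v) for v in vectors)

def adjacent_cluster_sets(first_id, centers, members, count):
    """Rank the other cluster sets by (center-to-center distance - approximate radius)
    and return the ids of the `count` nearest ones (hypothetical helper)."""
    gaps = []
    for cid, center in centers.items():
        if cid == first_id:
            continue
        gap = math.dist(centers[first_id], center) - approx_radius(center, members[cid])
        gaps.append((gap, cid))
    gaps.sort()  # near to far
    return [cid for _, cid in gaps[:count]]

# Made-up 2-D data, not taken from the figures.
centers = {"S1": (0.0, 0.0), "S2": (4.0, 0.0), "S3": (0.0, 10.0), "S4": (-5.0, 0.0)}
members = {
    "S1": [(1.0, 0.0), (0.0, 1.0)],
    "S2": [(3.0, 0.0), (5.0, 0.0)],    # radius 1, gap 4 - 1 = 3
    "S3": [(0.0, 8.0), (0.0, 11.0)],   # radius 2, gap 10 - 2 = 8
    "S4": [(-2.0, 0.0), (-6.0, 0.0)],  # radius 3, gap 5 - 3 = 2
}
print(adjacent_cluster_sets("S1", centers, members, 2))  # ['S4', 'S2']
```

Subtracting the radius lets a large cluster set count as adjacent even when its first representative point is comparatively far away, as S4 does in this toy data.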
In other implementations (the fourth scheme), the at least one second representative point meeting the preset condition may be determined as follows: determine at least one second cluster set adjacent to the first cluster set among the cluster sets other than the first cluster set; for each feature vector in the first cluster set, obtain a first distance to the first representative point of the first cluster set and a second distance to the first representative point of each second cluster set, and compute the sum of the first distance and the second distance; sort the feature vectors from far to near according to the sum of distances to obtain the first sequence corresponding to each second cluster set; and take the first preset number of feature vectors in each first sequence as the second representative points.
It is to be understood that the manner of obtaining at least one second cluster set adjacent to the first cluster set in the fourth scheme is the same as the manner of obtaining the second cluster set described in the third scheme, and is not described herein again.
It is understood that a feature vector with a larger sum of the first distance and the second distance is farther from the first representative point of the first cluster set and also farther from the first representative point of the second cluster set; that is, the feature vector is likely to be an edge point of the first cluster set lying on the side of the first representative point that faces away from the second cluster set. When the first sequences corresponding to the second cluster sets in every direction around the first cluster set are obtained, edge points of the first cluster set in every direction can be obtained. Using these edge points as the second representative points can therefore effectively improve search accuracy.
As shown in the foregoing fig. 4, in the fourth scheme the second representative points of the cluster set S1 may be obtained as follows:
Assume that the second cluster sets adjacent to the first cluster set S1 are the cluster set S2 and the cluster set S4 described in fig. 3a above. For the adjacent second cluster set S2 determined above, the sum of a first distance, between each feature vector in the first cluster set S1 and the first representative point C1 of S1, and a second distance, between that feature vector and the first representative point C2 of the second cluster set S2, is obtained.
For example, as shown in fig. 4, the sum of the first distance d1_1 between the feature vector X1 in the first cluster set S1 and the first representative point C1 and the second distance d2_1 between X1 and the first representative point C2 is d_S2; the sums of distances of the other 15 feature vectors are obtained in the same way and are not described again here. Assume that sorting the 16 feature vectors in the first cluster set S1 from far to near by the sum of distances gives the first sequence X2-X3-X1-...-X14 corresponding to the cluster set S2, and that the preset number of second representative points obtained for each adjacent cluster set is one; the first feature vector in the first sequence corresponding to the second cluster set S2, namely the feature vector X2, can then be used as a second representative point. Meanwhile, for the adjacent second cluster set S4 determined above, the sum of the first distance, between each feature vector in the first cluster set S1 and the first representative point C1 of S1, and the second distance, between that feature vector and the first representative point C4 of the second cluster set S4, is obtained. For example, in fig. 4, the sum of the first distance d1_1 between the feature vector X1 and the first representative point C1 and the second distance d4_1 between X1 and the first representative point C4 is d_S4; the sums of distances of the other 15 feature vectors are obtained in the same way and are not described again here. Assume that sorting the 16 feature vectors in the first cluster set S1 from far to near by this sum of distances gives the first sequence X4-X1-X3-...-X15, and that the preset number of second representative points obtained for each adjacent cluster set is one; the first feature vector in the first sequence corresponding to the second cluster set S4, namely the feature vector X4, can then be used as a second representative point.
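The selection rule of the fourth scheme can be sketched as below: for each adjacent cluster set, the members of the first cluster set are ranked by the sum of their distances to the two first representative points, and the farthest ones are kept. The function name `edge_points`, the `per_neighbor` parameter, the de-duplication of points chosen for more than one neighbor, and the coordinates are all assumptions made for this illustration.

```python
import math

def edge_points(first_center, neighbor_centers, vectors, per_neighbor=1):
    """For each adjacent cluster set, pick the member(s) of the first cluster set with the
    largest (distance to own center + distance to neighbor center)."""
    chosen = []
    for neighbor in neighbor_centers:
        # Far-to-near ordering by the sum of the first and second distances.
        ranked = sorted(
            vectors,
            key=lambda v: math.dist(v, first_center) + math.dist(v, neighbor),
            reverse=True,
        )
        for v in ranked[:per_neighbor]:
            if v not in chosen:  # assumption: keep each edge point only once
                chosen.append(v)
    return chosen

# Made-up 2-D data: one cluster set centered at the origin with two neighbors.
c1 = (0.0, 0.0)
neighbors = [(10.0, 0.0), (0.0, 10.0)]
s1 = [(-2.0, 0.0), (1.0, 0.0), (0.0, -3.0), (0.0, 1.0)]
print(edge_points(c1, neighbors, s1))  # [(-2.0, 0.0), (0.0, -3.0)]
```

In this toy data each selected point sits on the side of the cluster set facing away from the corresponding neighbor, which is the behaviour the description attributes to a large sum of distances.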
It is understood that, when the second representative point of each cluster set of the whole database is obtained, the preset number of the second representative points of each cluster set can be determined according to the following method:
In a first implementable scheme, the preset number of second representative points of each cluster set is determined from the preset total number of second representative points in the database and from the number of feature vectors in, or the radius of, each cluster set: the preset total number is distributed among the cluster sets in proportion to their numbers of feature vectors or their radii. It is understood that, in some embodiments, when the calculated preset number of second representative points of a cluster set is not an integer, the final preset number may be obtained according to a rounding rule. For example, the rounding rule may be to discard the fractional part of the calculated value and add 1 to the integer part.
For example, assume that the preset total number of second representative points is 4 and that, as shown for the cluster sets of the database S in fig. 1b, there are 16 feature vectors in the cluster set S1, 14 in the cluster set S2, 10 in the cluster set S3 and 17 in the cluster set S4. The numbers of second representative points to be determined in the cluster sets S1, S2, S3 and S4 are then in the ratio 16:14:10:17. Distributing the preset total number of 4 in this ratio gives approximately 1.12, 0.98, 0.70 and 1.19 second representative points for the cluster sets S1, S2, S3 and S4, respectively. According to the above rounding rule, the preset numbers of second representative points to be determined in the cluster sets S1, S2, S3 and S4 are 2, 1, 1 and 2, respectively.
It can be understood that determining the preset number of second representative points of each cluster set in proportion to its number of feature vectors or its radius makes the distribution of second representative points more uniform. This avoids two problems: a small cluster set being given many second representative points, which wastes resources, and a larger cluster set being given few second representative points, so that its representative points cannot fully represent it and search accuracy decreases.
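The proportional allocation with the "integer part plus one" rounding rule can be checked with a short sketch. The function name `allocate_proportional` is hypothetical, and `math.ceil` is used here as an equivalent of the described rule for positive values (non-integers are rounded up, integers are kept).

```python
import math

def allocate_proportional(total, sizes):
    """Distribute `total` second representative points in proportion to cluster-set size,
    rounding every non-integer share up."""
    whole = sum(sizes.values())
    return {cid: math.ceil(total * size / whole) for cid, size in sizes.items()}

# Feature-vector counts from the example in the text.
sizes = {"S1": 16, "S2": 14, "S3": 10, "S4": 17}
print(allocate_proportional(4, sizes))  # {'S1': 2, 'S2': 1, 'S3': 1, 'S4': 2}
```

Because every fractional share is rounded up, the per-set numbers can add up to more than the preset total (6 rather than 4 in this example), so the preset total acts as a guide for the proportions rather than a hard cap.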
In a second implementable scheme, the preset number of second representative points of each cluster set is determined as follows: according to the preset total number of second representative points in the database, a set number of second representative points is selected from each cluster set in turn, in descending order of the number of feature vectors in, or the radius of, each cluster set, until the preset total number has been selected.
For example, the preset total number of the second representative points is 4, as determined by the number of feature vectors in each cluster set in the cluster set corresponding to the database S shown in fig. 1b, 16 feature vectors in the cluster set S1, 14 feature vectors in the cluster set S2, 10 feature vectors in the cluster set S3, and 17 feature vectors in the cluster set S4, the order obtained by arranging the number of feature vectors in each cluster set from large to small is cluster set S4-cluster set S1-cluster set S2-cluster set S3;
When the set number of second representative points selected in each cluster set is 2, then following the order cluster set S4-cluster set S1-cluster set S2-cluster set S3, the preset number of second representative points to be determined in each of the cluster set S4 and the cluster set S1 is two. Since the preset total number of second representative points has been reached at this point, no second representative points are selected from the subsequent cluster sets S2 and S3; that is, the preset number for the cluster sets S2 and S3, which are arranged after the cluster set S1, is 0.
It can be understood that, by selecting the set number of second representative points from each cluster set according to the preset total number of the second representative points and sorting the number or radius of the feature vectors in each cluster set from large to small, the second representative points with a smaller number can be determined in a smaller cluster set, and the second representative points with a larger number can be determined in other larger cluster sets, so that the determined representative points can fully represent the corresponding cluster sets, and the search accuracy is effectively improved.
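The second scheme amounts to a greedy pass over the cluster sets in descending order of size. The sketch below reproduces the example from the text; the function name `allocate_by_size` and its parameter names are assumptions made for illustration.

```python
def allocate_by_size(total, sizes, per_set):
    """Give `per_set` second representative points to each cluster set, largest first,
    until `total` points have been handed out."""
    remaining = total
    result = {cid: 0 for cid in sizes}
    for cid in sorted(sizes, key=sizes.get, reverse=True):
        if remaining == 0:
            break
        take = min(per_set, remaining)
        result[cid] = take
        remaining -= take
    return result

# Feature-vector counts from the example in the text: order is S4, S1, S2, S3.
sizes = {"S1": 16, "S2": 14, "S3": 10, "S4": 17}
print(allocate_by_size(4, sizes, per_set=2))  # {'S1': 2, 'S2': 0, 'S3': 0, 'S4': 2}
```

Unlike the proportional scheme, this variant never exceeds the preset total, at the cost of leaving the smallest cluster sets with no second representative point.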
704: Establish index association between each feature vector in each cluster set and the first representative point and the second representative point of the corresponding cluster set, respectively.
Index association is established between the feature vectors in each first cluster set, which are the target vectors, and the first representative point and the second representative points of that cluster set. In this embodiment of the present application, after at least one second representative point meeting the preset condition has been determined for the first cluster set, index association may be established between each target vector in the first cluster set and the first representative point and the second representative points of that cluster set, respectively.
It can be understood that, in addition to determining the first representative point, the index construction method provided in the embodiment of the present application can determine at least one second representative point, that is, an edge point, which is relatively far from the first representative point of each cluster set. During retrieval, the distances between the query vector and both the first representative points and the second representative points can therefore be compared; that is, the distance between the query vector and edge points far from the center point is also taken into account. This addresses the problem in the prior art that, because only the distance between the query vector and the center point is considered, a more accurate target vector cannot be found when the target vector is an edge point. In the query process, the edge points serve as second representative points, and the distances from the query vector to the edge points and to the first representative points are calculated to find the corresponding target cluster set. The target vector corresponding to the query vector can thus be obtained more accurately, the problem of low search accuracy caused by representative points that cannot fully represent all the feature vectors in a cluster set is effectively mitigated, and search accuracy is effectively improved.
The following describes an index structure constructed by using the index construction method in the embodiment of the present application. The index structure may include two parts: a representative point item and an inverted file item. The representative point item may include the first representative point and the second representative points corresponding to each cluster set, and the inverted file item includes the inverted file corresponding to each representative point. Each inverted file contains the feature vector information of the cluster set to which the corresponding representative point belongs. It can be understood that, since the first representative point and the second representative points belonging to the same cluster set correspond to the same feature vectors, the inverted file contents corresponding to them are the same.
For example, fig. 5a shows the representative points of the database S shown in fig. 1a, and fig. 5b shows a schematic diagram of the index structure. As shown in fig. 5a, the representative points of the database S are: for the cluster set S1, the first representative point C1 and the second representative points X1 and X7; for the cluster set S2, the first representative point C2 and the second representative points Y1 and Y2; for the cluster set S3, the first representative point C3 and the second representative point Z1; and for the cluster set S4, the first representative point C4 and the second representative points W1, W2, W3 and W4. As shown in fig. 5b, the index structure of the database S includes a representative point item and an inverted file item. The representative point item contains these same representative points for the cluster sets S1, S2, S3 and S4.
The first representative point C1 and the second representative points X1 and X7 of the cluster set S1 share the corresponding inverted file D1, and the inverted file D1 includes each feature vector in the cluster set S1 to which the representative point C1 belongs.
The first representative point C2 and the second representative points Y1 and Y2 of the cluster set S2 share the corresponding inverted file D2, and the inverted file D2 includes each feature vector in the cluster set S2 to which the representative point C2 belongs.
The first representative point C3 and the second representative point Z1 of the cluster set S3 share the corresponding inverted file D3, and the inverted file D3 includes each feature vector in the cluster set S3 to which the representative point C3 belongs.
The first representative point C4 and the second representative points W1, W2, W3 and W4 of the cluster set S4 share the corresponding inverted file D4, and the inverted file D4 includes each feature vector in the cluster set S4 to which the representative point C4 belongs.
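One possible in-memory layout for the two-part index structure is sketched below: a flat list of representative point items, each tagged with its cluster set, and one inverted file per cluster set that all of that set's representative points share. The function name `build_index`, the dictionary keys and the use of string labels instead of actual vectors are assumptions made to keep the example short.

```python
def build_index(cluster_sets):
    """Return (representative point items, inverted files).

    representative point items: list of (representative point, cluster set id)
    inverted files: cluster set id -> feature vectors of that cluster set
    """
    representative_items = []
    inverted_files = {}
    for cid, cs in cluster_sets.items():
        inverted_files[cid] = list(cs["vectors"])
        # The first and second representative points all point at the same inverted file.
        for point in [cs["first"], *cs["second"]]:
            representative_items.append((point, cid))
    return representative_items, inverted_files

# Labels follow the S1 and S3 entries in the text; the member lists are made up.
cluster_sets = {
    "S1": {"first": "C1", "second": ["X1", "X7"], "vectors": ["X1", "X2", "X7"]},
    "S3": {"first": "C3", "second": ["Z1"], "vectors": ["Z1", "Z2"]},
}
items, files = build_index(cluster_sets)
lookup = dict(items)
print(files[lookup["C1"]] is files[lookup["X7"]])  # True: C1 and X7 share one inverted file
```

Storing the inverted file once per cluster set, rather than once per representative point, keeps the extra memory cost of the second representative points limited to the representative point items themselves.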
The embodiment of the application further provides a database whose index is constructed by the above index construction method. The database comprises: a plurality of cluster sets, each cluster set having at least one feature vector therein; wherein each cluster set has a corresponding first representative point or second representative point, and the feature vectors in each cluster set have index association with the first representative point or the second representative point of the cluster set.
Fig. 8 shows a flowchart of a search method in an embodiment of the present application. The search method may be applied to various databases having the above index structure and may be executed by a retrieval system including such databases. As shown in fig. 8, the search method in the embodiment of the present application may include:
801: Acquire a query vector.
It can be understood that, when performing information retrieval, the user may input query data in a search window or search box of the retrieval system; after acquiring the query data, the retrieval system may convert it into a corresponding query vector. It will be appreciated that the query data entered by the user may be in any format.
For example, the query data input by the user may be in a picture format, and after the retrieval system acquires the picture, the picture may be converted into a corresponding query vector.
For example, the query data input by the user may be in a text format, and after the retrieval system obtains the text, the text may be converted into a corresponding query vector.
For example, the query data input by the user may be in a video format, and after the retrieval system acquires the video, the video may be converted into a corresponding query vector.
802: Acquire a first distance between the query vector and each of the first representative point and the second representative point corresponding to each cluster set in the database.
It is to be understood that, in the embodiment of the present application, the first distance may be a Euclidean distance or an inner product distance.
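The two distance options can be written as small helper functions. The Euclidean case is standard; for the inner product case, the sketch assumes that the negated inner product is used so that a smaller value still means "closer". That convention is an assumption of this example and is not stated in the application.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.dist(a, b)

def inner_product_distance(a, b):
    """Assumed convention: negate the inner product so that smaller means more similar."""
    return -sum(x * y for x, y in zip(a, b))

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))      # 5.0
print(inner_product_distance((1.0, 2.0), (3.0, 4.0)))  # -11.0
```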
In some implementations, after obtaining the query vector, the retrieval system determines the first distance between the query vector and each first representative point and each second representative point of every cluster set in the database.
For example, as shown in FIG. 6a, the retrieval system may determine a first distance from query vector A for 13 representative points of first representative point C1, second representative points X1, X7 of cluster set S1, first representative point C2, second representative point Y1, and second representative point Y2 of cluster set S2, first representative point C3, second representative point Z1 of cluster set S3, and first representative point C4, second representative point W1, W2, W3, and W4 of cluster set S4.
803: Determine at least one representative point corresponding to the query vector according to the obtained first distance, and take the cluster set corresponding to the at least one representative point as a target cluster set.
It can be understood that the number of representative points in the at least one representative point corresponding to the query vector may be set by the user as required, or may be set according to the total number of feature vectors in the database or the number of cluster sets obtained after the database is clustered. For example, the number may be set by ranges of the total number of feature vectors in the database: assuming that the total number of feature vectors in the database is 0-1000, the number of determined representative points may be set to 2, and assuming that the total number of feature vectors in the database is 1000-. It should be understood that the above manner of setting the number of representative points is only an example, and the present application does not exclude any other implementable manner.
In some implementations, the cluster sets corresponding to the representative points whose first distances from the query vector A are smaller than a set value may be taken as the target cluster sets.
In other embodiments, the representative points may be sorted from near to far according to the first distance, and the cluster sets corresponding to the first set number of representative points may be selected as the target cluster sets of the query vector.
For example, with the set number of representative points being two, the representative points are sorted according to the first distances from the 13 representative points to the query vector A obtained in the above step; if the first two representative points in the sorting are the second representative point X1 of the cluster set S1 and the second representative point Y1 of the cluster set S2, the cluster set S1 and the cluster set S2 are selected as the target cluster sets to be searched.
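Steps 802 and 803 can be sketched as a ranking over all representative points, first and second alike, followed by a lookup of the cluster sets they belong to. The function name `target_cluster_sets`, the tuple layout `(point, cluster set id)` and the coordinates are assumptions made for this example.

```python
import math

def target_cluster_sets(query, representative_points, count):
    """Rank every representative point by its distance to the query (near to far) and
    return the cluster sets of the `count` nearest ones, without duplicates."""
    ranked = sorted(representative_points, key=lambda rp: math.dist(query, rp[0]))
    targets = []
    for _, cid in ranked[:count]:
        if cid not in targets:
            targets.append(cid)
    return targets

# Made-up data: each cluster set has a center and one edge point.
representative_points = [
    ((0.0, 0.0), "S1"), ((3.0, 0.0), "S1"),   # first and second representative point of S1
    ((10.0, 0.0), "S2"), ((7.0, 0.0), "S2"),  # first and second representative point of S2
    ((0.0, 10.0), "S3"),
]
query = (4.5, 0.0)
print(target_cluster_sets(query, representative_points, 2))  # ['S1', 'S2']
```

In this toy data the two nearest representative points are both edge points, so S2 becomes a target cluster set even though its center is farther from the query than the center of S1.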
804: Acquire a second distance between each feature vector in the target cluster set and the query vector.
In some implementations, a second distance between each feature vector in the target cluster set and the query vector is obtained according to the established index association.
For example, a second distance between each feature vector in the set of clusters S1, S2 and the query vector A may be obtained.
805: Determine a target vector corresponding to the query vector according to the second distance.
It is to be understood that, in the embodiment of the present application, the second distance may be a Euclidean distance or an inner product distance.
In some implementations, a feature vector whose second distance from the query vector A is smaller than a set value may be taken as a target vector.
For example, assuming that the second distances between the query vector A and the feature vector X1 in the cluster set S1 and the feature vectors Y1 and Y2 in the cluster set S2 are smaller than the set value, then, as shown in fig. 6b, the feature vector X1 and the feature vectors Y1 and Y2 are used as target vectors.
In other embodiments, the feature vectors may also be sorted according to the second distance, and the first set number of feature vectors in the sorting are selected as target vectors of the query vector.
For example, the preset number is 3; after the distance between the query vector A and each feature vector in the cluster set S1 and the cluster set S2 is obtained, the feature vectors are sorted from near to far according to the distance. As shown in fig. 6b, assuming that the feature vectors Y1 and Y2 in the cluster set S2 and the feature vector X1 in the cluster set S1 are the first three feature vectors in the sorting, the feature vectors Y1 and Y2 in the cluster set S2 and the feature vector X1 in the cluster set S1 are used as the target vectors corresponding to the query vector.
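Steps 804 and 805 can be sketched as an exhaustive scan of the inverted files of the target cluster sets, followed by either a threshold test or a top-k cut. The function name `search_targets`, the optional `top_k` and `max_distance` parameters and the coordinates are assumptions made for illustration; Euclidean distance is used for the second distance.

```python
import math

def search_targets(query, target_ids, inverted_files, top_k=None, max_distance=None):
    """Scan the feature vectors of the target cluster sets and return them from near to far,
    optionally keeping only those closer than `max_distance` and/or the first `top_k`."""
    candidates = [v for cid in target_ids for v in inverted_files[cid]]
    ranked = sorted(candidates, key=lambda v: math.dist(query, v))
    if max_distance is not None:
        ranked = [v for v in ranked if math.dist(query, v) < max_distance]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked

# Made-up inverted files for two target cluster sets.
inverted_files = {
    "S1": [(1.0, 0.0), (3.0, 0.0), (0.0, 1.0)],
    "S2": [(7.0, 0.0), (9.0, 0.0), (10.0, 1.0)],
}
query = (4.5, 0.0)
print(search_targets(query, ["S1", "S2"], inverted_files, top_k=3))
# [(3.0, 0.0), (7.0, 0.0), (1.0, 0.0)]
print(search_targets(query, ["S1", "S2"], inverted_files, max_distance=3.0))
# [(3.0, 0.0), (7.0, 0.0)]
```

Only the vectors in the target cluster sets are scanned, so the cost of this step depends on how many cluster sets step 803 selects rather than on the size of the whole database.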
It can be understood that the index construction method provided by the embodiment of the application can determine at least one second representative point that is relatively far from the first representative point of each cluster set. During searching, the distances between the query vector and both the first representative points and the second representative points can therefore be compared, and the preset number of representative points closest to the query vector can be obtained. Because edge points are used as representative points when the index is constructed, the distance between the query vector and edge points far from the center point is taken into account during retrieval. This addresses the problem in the prior art that, because only the distance between the query vector and the center point is considered, a more accurate target vector cannot be found when the target vector is an edge point. In the query process, the distances between the query vector and the first and second representative points of each cluster set are obtained and compared to find the corresponding target cluster set, so that the target vector nearest to the query vector can be obtained more accurately. The problem of low search accuracy caused by representative points that cannot fully represent all the feature vectors in a cluster set is thereby effectively mitigated, and search accuracy is effectively improved.
It can be understood that, in the embodiment of the present application, after the retrieval system acquires the target vector, the original data corresponding to the target vector may be output to the client.
For example, when the database is a picture database, the retrieval system may output the original picture data corresponding to the target vector to the client.
It will be appreciated that the above search method may be used to search one database in a retrieval system, or may be used to search multiple databases in a retrieval system.
For example, when the retrieval system includes only one database, and when the search method is used to search the database in the retrieval system, the retrieval system may perform the search method shown in fig. 8 on the database according to the query vector, and obtain the search result output corresponding to the database. It will be appreciated that the search results may be raw data corresponding to a target vector of the query vectors retrieved in the database.
When a plurality of databases are included in the retrieval system, the retrieval system may perform the search method shown in fig. 8 described above on each of the plurality of databases based on the query vector, and obtain one search result corresponding to each database, and then output all the search results.
It will be appreciated that the format of the input data obtained by the retrieval system and the format of the output data determined by the retrieval system may be the same or different. The output data format depends on the data formats contained in the searched database; for example, when the data formats in the searched database include a text format, a picture format or a video format, the output data may be in a text format, a picture format or a video format.
For example, when the input data is in a picture format, the retrieval system includes a database, and the data formats in the database are all pictures, the output data is also in a picture format.
For example, when the retrieval system includes a database, and the data format in the database includes a text format, a picture format, or a video format, the data format output by the retrieval system may be a text format, a picture format, or a video format.
For example, when the retrieval system includes a first database, a second database, and a third database, the data format in the first database includes a text format, the data format in the second database includes a picture format, and the data format in the third database includes a video format, the data format output by the retrieval system may be the text format, the picture format, or the video format.
Fig. 9 is a schematic diagram illustrating an index building apparatus according to an embodiment of the present application, and as shown in fig. 9, the index building apparatus includes:
a first determining unit, configured to determine a target vector and a first cluster set in which the target vector is located, the first cluster set having a corresponding first representative point;
a second determining unit, configured to determine, for the first cluster set, at least one second representative point that meets a preset condition;
an association unit, configured to establish index association between the target vector and the first representative point and the second representative point, respectively.
Fig. 10 is a schematic diagram of a search apparatus according to an embodiment of the present application, and as shown in fig. 10, the search apparatus includes:
a first obtaining unit, configured to obtain a query vector;
a second obtaining unit, configured to obtain a first distance between the query vector and a first representative point and a second representative point that correspond to each cluster set in the database;
a first determining unit, configured to determine, according to the first distance, at least one representative point corresponding to the query vector, and use a cluster set corresponding to the at least one representative point as a target cluster set;
a third obtaining unit, configured to obtain a second distance between each feature vector in the target cluster set and the query vector;
a second determining unit, configured to determine a target vector corresponding to the query vector according to the second distance.
The embodiment of the application also comprises a retrieval system which can comprise the index building device and/or at least one database and/or search device. Fig. 11 is a schematic diagram of a retrieval system according to an embodiment of the present application, where the retrieval system shown in fig. 11 includes an index building apparatus, a database, and a search apparatus.
Fig. 12 shows a block diagram of an electronic device according to an embodiment of the present application. In one embodiment, electronic device 1400 may include one or more processors 1404, system control logic 1408 coupled to at least one of processors 1404, system memory 1412 coupled to system control logic 1408, non-volatile memory (NVM)1416 coupled to system control logic 1408, and a network interface 1420 coupled to system control logic 1408.
In some embodiments, processor 1404 may include one or more single-core or multi-core processors. In some embodiments, processor 1404 may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In embodiments where electronic device 1400 employs an eNB (evolved Node B) 101 or a RAN (Radio Access Network) controller 102, processor 1404 may be configured to perform the corresponding embodiments, e.g., one or more of the embodiments shown in fig. 7 or 8.
In some embodiments, system control logic 1408 may include any suitable interface controllers to provide any suitable interface to at least one of processors 1404 and/or to any suitable device or component in communication with system control logic 1408.
In some embodiments, system control logic 1408 may include one or more memory controllers to provide an interface to system memory 1412. System memory 1412 may be used to load and store data and/or instructions. In some embodiments, system memory 1412 of electronic device 1400 may include any suitable volatile memory, such as suitable Dynamic Random Access Memory (DRAM).
NVM/storage 1416 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM/storage 1416 may include any suitable non-volatile memory such as flash memory and/or any suitable non-volatile storage device such as at least one of an HDD (Hard Disk Drive), a CD (Compact Disc) drive, and a DVD (Digital Versatile Disc) drive.
The NVM/storage 1416 may comprise a portion of the storage resources on the device on which the electronic device 1400 is installed, or it may be accessible by, but not necessarily a part of, the device. For example, the NVM/storage 1416 may be accessed over a network via the network interface 1420.
In particular, system memory 1412 and NVM/storage 1416 may each include temporary copies and permanent copies of instructions 1424. Instructions 1424 may include instructions that, when executed by at least one of the processors 1404, cause the electronic device 1400 to implement a method as illustrated in fig. 7 or 8. In some embodiments, instructions 1424, or hardware, firmware, and/or software components thereof, may additionally/alternatively be located in system control logic 1408, network interface 1420, and/or processor 1404.
The network interface 1420 may include a transceiver to provide a radio interface for the electronic device 1400 to communicate with any other suitable devices (e.g., front end modules, antennas, etc.) over one or more networks. In some embodiments, the network interface 1420 may be integrated with other components of the electronic device 1400. For example, network interface 1420 may be integrated with at least one of processor 1404, system memory 1412, NVM/storage 1416, and a firmware device (not shown) having instructions that, when executed by at least one of processors 1404, cause the electronic device 1400 to implement the methods shown in fig. 7 or 8.
Network interface 1420 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 1420 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In one embodiment, at least one of processors 1404 may be packaged together with logic for one or more controllers of system control logic 1408 to form a System In Package (SiP). In one embodiment, at least one of processors 1404 may be integrated on the same die with logic for one or more controllers of system control logic 1408 to form a system on a chip (SoC).
The electronic device 1400 may further include input/output (I/O) devices 1432. The I/O devices 1432 may include a user interface that enables a user to interact with the electronic device 1400, and a peripheral component interface designed to enable peripheral components to interact with the electronic device 1400. In some embodiments, the electronic device 1400 further comprises sensors for determining at least one of environmental conditions and location information related to the electronic device 1400.
In some embodiments, the user interface may include, but is not limited to, a display (e.g., a liquid crystal display, a touch screen display, etc.), a speaker, a microphone, one or more cameras (e.g., still image cameras and/or video cameras), a flashlight (e.g., a light emitting diode flash), and a keyboard.
In some embodiments, the peripheral component interfaces may include, but are not limited to, a non-volatile memory port, an audio jack, and a power interface.
In some embodiments, the sensors may include, but are not limited to, a gyroscope sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit. The positioning unit may also be part of the network interface 1420 or interact with the network interface 1420 to communicate with components of a positioning network, such as Global Positioning System (GPS) satellites.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in this application are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or a tangible machine-readable memory used to transmit information over the Internet via electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
An embodiment of the present application provides a computer program product, which includes instructions for implementing the index building method or the vector search method.
In the drawings, some features of structures or methods may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the figures. In addition, the inclusion of a structural or methodological feature in a particular figure is not meant to imply that such feature is required in all embodiments, and in some embodiments may not be included or may be combined with other features.
It should be noted that, in each device embodiment of the present application, each unit/module is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, a part of one physical unit/module, or a combination of multiple physical units/modules. The physical implementation of a logical unit/module is not the most important aspect; the combination of the functions implemented by the logical units/modules is the key to solving the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above device embodiments do not introduce units/modules that are not closely related to solving the technical problem addressed by the present application, which does not mean that no other units/modules exist in the above device embodiments.
It is noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the application.

Claims (20)

1. An index building method, comprising:
determining a target vector and a first cluster set in which the target vector is located, wherein the first cluster set has a corresponding first representative point;
determining at least one second representative point meeting a preset condition for the first cluster set;
establishing index associations between the target vector and each of the first representative point and the second representative point.
2. The index building method according to claim 1, wherein determining at least one second representative point meeting a preset condition for the first cluster set comprises:
determining a first distance between each feature vector in the first cluster set and the first representative point;
and determining a first sequence from the feature vectors according to the first distance, and taking a preset number of feature vectors in the first sequence as the second representative points.
3. The index building method according to claim 2, wherein determining a first sequence from the feature vectors according to the first distance, and taking a preset number of feature vectors in the first sequence as the second representative points comprises:
sorting the feature vectors in descending order of the first distance to obtain the first sequence;
and taking the first preset number of feature vectors in the first sequence as the second representative points.
4. The index building method according to claim 2, wherein determining a first sequence from the feature vectors according to the first distance, and using a preset number of feature vectors in the first sequence as the second representative point comprises:
taking the feature vectors of which the first distance is greater than a set distance as a first sequence;
taking a preset number of feature vectors in the first sequence as the second representative point;
or, sorting the feature vectors in the first sequence from far to near according to the first distance, and determining the top-ranked preset number of feature vectors as the second representative points.
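For illustration only, the selection described in claims 2 to 4 can be sketched as follows. This is a minimal sketch, not the applicant's implementation: the function names `euclidean` and `farthest_points`, the optional `min_distance` parameter (standing in for the "set distance" of claim 4), and the sample data are all assumptions introduced here, and Euclidean distance is used as one of the metrics listed in claim 11.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_points(cluster, first_rep, preset_number, min_distance=None):
    """Pick second representative points for one cluster set (hypothetical helper).

    Computes the first distance of every feature vector to the first
    representative point, optionally keeps only vectors farther than
    min_distance, sorts them from far to near, and returns the first
    preset_number vectors.
    """
    scored = [(euclidean(v, first_rep), v) for v in cluster]
    if min_distance is not None:
        scored = [(d, v) for d, v in scored if d > min_distance]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [v for _, v in scored[:preset_number]]


# Made-up sample data: four 2-D feature vectors and a centroid at the origin.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (4.0, 0.0)]
first_rep = (0.0, 0.0)
second_reps = farthest_points(cluster, first_rep, preset_number=2)
print(second_reps)
```

With the sample data, the two vectors farthest from the first representative point are selected, which matches the intuition that points near the edge of a cluster set are the ones a single centroid represents least well.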
5. The index building method according to claim 1, wherein determining at least one second representative point meeting a preset condition for the first cluster set comprises:
determining at least one second cluster set adjacent to the first cluster set in other cluster sets except the first cluster set;
acquiring a first distance between each feature vector in the first cluster set and the first representative point of the first cluster set, and a second distance between each feature vector and the first representative point of each second cluster set;
and for each feature vector, acquiring the first sequence corresponding to each second cluster set according to the first distance and the second distance, and taking a preset number of feature vectors in the first sequence as the second representative points.
6. The index construction method according to claim 5, wherein for each feature vector, obtaining the first sequence corresponding to each second cluster set according to the first distance and the second distance, and taking a preset number of feature vectors in the first sequence as second representative points comprises:
and sorting the feature vectors in descending order of the sum of the first distance and the second distance to obtain the first sequence, and taking a preset number of feature vectors in the first sequence as the second representative points.
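As a hedged illustration of claims 5 and 6, the ranking by the sum of the first and second distances might look like the sketch below. The function name `reps_by_distance_sum`, the dictionary layout of `neighbor_reps`, and the sample coordinates are assumptions made for this example; the claims do not prescribe a data layout, and taking the leading entries of the sorted sequence is one possible reading of "a preset number of feature vectors in the first sequence".

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reps_by_distance_sum(cluster, first_rep, neighbor_reps, preset_number):
    """Rank feature vectors per adjacent cluster set (hypothetical helper).

    For each second cluster set, the feature vectors of the first cluster
    set are sorted in descending order of
    (distance to own first representative point)
    + (distance to the neighbor's first representative point),
    and the leading preset_number vectors are kept.
    """
    result = {}
    for neighbor_id, neighbor_rep in neighbor_reps.items():
        ranked = sorted(
            cluster,
            key=lambda v: euclidean(v, first_rep) + euclidean(v, neighbor_rep),
            reverse=True,
        )
        result[neighbor_id] = ranked[:preset_number]
    return result


# Made-up sample data: one first cluster set and one adjacent cluster set "B".
cluster = [(0.0, 0.0), (0.0, 1.0), (-2.0, 0.0)]
first_rep = (0.0, 0.0)
neighbor_reps = {"B": (4.0, 0.0)}
selected = reps_by_distance_sum(cluster, first_rep, neighbor_reps, preset_number=2)
print(selected)
```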
7. The method of claim 1, wherein determining at least one second representative point meeting a predetermined condition for the first cluster set comprises:
determining at least one second cluster set adjacent to the first cluster set in other cluster sets except the first cluster set;
acquiring a second distance between each feature vector in the first cluster set and the first representative point of each second cluster set;
and for each feature vector, acquiring a first sequence corresponding to each second cluster set according to the second distance, and taking a preset number of feature vectors in the first sequence as the second representative points.
8. The index construction method according to claim 7, wherein for each feature vector, obtaining the first sequence corresponding to each second cluster set according to the second distance, and using a preset number of feature vectors in the first sequence as second representative points includes:
and sorting the feature vectors in ascending order of the second distance to obtain a first sequence corresponding to each second cluster set, and taking a preset number of feature vectors in the first sequence as the second representative points.
9. The index construction method according to any one of claims 5 to 8, wherein determining at least one second cluster set adjacent to the first cluster set among other cluster sets except the first cluster set comprises:
obtaining the distance between the representative point corresponding to each other cluster set and the first representative point corresponding to the first cluster set;
and according to the distance, determining a second sequence from other cluster sets except the first cluster set, and determining a preset number of cluster sets in the second sequence as the second cluster set.
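The combination of claims 7 to 9 can be sketched roughly as follows, for illustration only. Two assumptions are made here and are not stated in the claims: the "second sequence" of claim 9 is taken to be the other cluster sets sorted from nearest to farthest representative point, and the leading entries of each sorted sequence are the ones kept. The names `adjacent_clusters` and `boundary_points` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adjacent_clusters(first_id, reps, preset_number):
    """Return the ids of the cluster sets whose first representative points
    are closest to that of cluster set first_id (assumed reading of claim 9)."""
    others = [
        (euclidean(rep, reps[first_id]), cid)
        for cid, rep in reps.items()
        if cid != first_id
    ]
    others.sort()  # nearest representative point first
    return [cid for _, cid in others[:preset_number]]


def boundary_points(cluster, neighbor_rep, preset_number):
    """Return the feature vectors of cluster closest to a neighbor's first
    representative point (ascending second distance, as in claim 8)."""
    return sorted(cluster, key=lambda v: euclidean(v, neighbor_rep))[:preset_number]


# Made-up sample data: three cluster sets identified by letters.
reps = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 10.0)}
cluster_a = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]

neighbors = adjacent_clusters("A", reps, preset_number=1)
second_reps = {cid: boundary_points(cluster_a, reps[cid], 1) for cid in neighbors}
print(neighbors, second_reps)
```

In this sample, cluster set "B" is the only adjacent cluster set kept, and the vector of "A" that lies closest to "B" becomes a second representative point of "A".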
10. The index building method according to claim 1, wherein determining the first cluster set in which the target vector is located comprises:
obtaining the distance between the target vector and the first representative point corresponding to each of the cluster sets;
and taking, among all the cluster sets, the cluster set whose corresponding first representative point is closest to the target vector as the first cluster set in which the target vector is located.
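The assignment step of claim 10 amounts to a nearest-representative lookup. A minimal sketch, assuming Euclidean distance and a dictionary from cluster identifier to first representative point (both choices, and the name `nearest_cluster`, are assumptions made for this example):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_cluster(target, first_reps):
    """Return the id of the cluster set whose first representative point is
    closest to the target vector (hypothetical helper)."""
    return min(first_reps, key=lambda cid: euclidean(target, first_reps[cid]))


# Made-up sample data: two cluster sets and one target vector.
first_reps = {"A": (0.0, 0.0), "B": (5.0, 5.0)}
assigned = nearest_cluster((1.0, 1.0), first_reps)
print(assigned)
```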
11. The method of any one of claims 1-8 or 10, wherein the distance comprises at least one of: Euclidean distance, inner product distance, and Hamming distance.
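The three metrics named in claim 11 could be written as below. This is only a sketch: the claims do not define "inner product distance", so the negated inner product used here (a larger inner product means a smaller distance) is an assumption based on one common convention, and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product_distance(a, b):
    """Negated inner product, so that more similar vectors score lower.
    The sign convention is an assumption, not taken from the claims."""
    return -sum(x * y for x, y in zip(a, b))


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


print(euclidean_distance((0, 0), (3, 4)))
print(inner_product_distance((1, 2), (3, 4)))
print(hamming_distance((1, 0, 1, 1), (1, 1, 0, 1)))
```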
12. An index building apparatus, comprising:
a first determining unit, configured to determine a target vector and a first cluster set in which the target vector is located, where the first cluster set has a corresponding first representative point;
a second determining unit, configured to determine, for the first cluster set, at least one second representative point that meets a preset condition;
and an association unit, configured to establish index associations between the target vector and each of the first representative point and the second representative point.
13. A vector search method, comprising:
acquiring a query vector;
acquiring a first distance between the query vector and a first representative point and a second representative point corresponding to each cluster set in the database;
determining at least one representative point corresponding to the query vector according to the first distance, and taking a cluster set corresponding to the at least one representative point as a target cluster set;
obtaining a second distance between each feature vector in the target cluster set and the query vector;
and determining a target vector corresponding to the query vector according to the second distance.
14. The method of claim 13, further comprising determining a target vector corresponding to the query vector according to the first distance, which comprises:
taking a feature vector whose distance is within a set range as the target vector corresponding to the query vector.
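To make the flow of claims 13 and 14 concrete, the following sketch ranks all representative points (first and second) against the query, maps the closest ones back to their cluster sets, and then scans only those cluster sets. It is an illustration under assumptions, not the claimed implementation: the parameter names `n_reps` and `max_distance`, the list-of-pairs layout of `rep_points`, and the sample data are all introduced here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query, rep_points, inverted_files, n_reps, max_distance):
    """Two-stage search over a clustered index (hypothetical helper).

    rep_points:     list of (point, cluster_id) pairs holding both first and
                    second representative points
    inverted_files: dict mapping cluster_id to its list of feature vectors
    """
    # Stage 1: distance from the query to every representative point.
    ranked = sorted(rep_points, key=lambda item: euclidean(query, item[0]))
    target_clusters = []
    for _, cid in ranked[:n_reps]:
        if cid not in target_clusters:  # several points may share one cluster set
            target_clusters.append(cid)

    # Stage 2: distance from the query to each vector in the target cluster sets.
    hits = []
    for cid in target_clusters:
        for vector in inverted_files[cid]:
            d = euclidean(query, vector)
            if d <= max_distance:  # "within a set range"
                hits.append((d, vector))
    hits.sort(key=lambda h: h[0])
    return [vector for _, vector in hits]


# Made-up sample data: cluster set "A" has centroid (1, 0) and one second
# representative point (2, 0); cluster set "B" has centroid (4.5, 0).
inverted_files = {
    "A": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "B": [(4.0, 0.0), (5.0, 0.0)],
}
rep_points = [((1.0, 0.0), "A"), ((2.0, 0.0), "A"), ((4.5, 0.0), "B")]

result = search((2.9, 0.0), rep_points, inverted_files, n_reps=1, max_distance=1.0)
print(result)
```

In the sample, the query is closer to the centroid of "B" than to the centroid of "A", yet its nearest feature vector lives in "A". The second representative point of "A" is what routes the query to the right cluster set when only one representative point is probed.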
15. A search apparatus, applied to a database, comprising:
a first obtaining unit, configured to obtain a query vector;
a second obtaining unit, configured to obtain a first distance between the query vector and a first representative point and a second representative point corresponding to each cluster set in the database;
a first determining unit, configured to determine at least one representative point corresponding to the query vector according to the first distance, and use a cluster set corresponding to the at least one representative point as a target cluster set;
a third obtaining unit, configured to obtain a second distance between each feature vector in the target cluster set and the query vector;
and a second determining unit, configured to determine a target vector corresponding to the query vector according to the second distance.
16. A retrieval system comprising the index building apparatus of claim 12 and/or the search apparatus of claim 15.
17. An index structure comprising a representative point item and an inverted file item;
the representative point item comprises a first representative point and at least one second representative point corresponding to each cluster set in a database, and the inverted file item comprises inverted files corresponding to each cluster set;
the inverted file comprises each feature vector in the cluster set corresponding to the inverted file.
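One possible in-memory shape for the index structure of claim 17 is sketched below. The class name `ClusterIndex`, its field names, and the convention of storing the first representative point at position 0 of each list are assumptions made for illustration; the claim only requires a representative point item and an inverted file item.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterIndex:
    """Hypothetical container mirroring the two items of claim 17."""

    # Representative point item: cluster id -> [first point, second points...]
    representative_points: dict = field(default_factory=dict)
    # Inverted file item: cluster id -> feature vectors of that cluster set
    inverted_files: dict = field(default_factory=dict)

    def add_cluster(self, cluster_id, first_rep, second_reps, vectors):
        """Register one cluster set with its representative points and vectors."""
        self.representative_points[cluster_id] = [first_rep, *second_reps]
        self.inverted_files[cluster_id] = list(vectors)

    def all_representative_points(self):
        """Flatten to (point, cluster_id) pairs, e.g. for ranking at query time."""
        return [
            (point, cid)
            for cid, points in self.representative_points.items()
            for point in points
        ]


# Made-up sample data for a single cluster set.
index = ClusterIndex()
index.add_cluster(
    "A",
    first_rep=(1.0, 0.0),
    second_reps=[(2.0, 0.0)],
    vectors=[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
)
print(index.all_representative_points())
```

Because every representative point carries the identifier of its cluster set, a match on either a first or a second representative point resolves to the same inverted file.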
18. An electronic device, comprising: a memory for storing instructions for execution by one or more processors of an electronic device, and a processor, which is one of the one or more processors of the electronic device, for performing the index building method of any one of claims 1 to 11 or the vector searching method of any one of claims 13 to 14.
19. A readable medium having stored thereon instructions that, when executed on an electronic device, cause a machine to perform the index building method of any one of claims 1 to 11 or the search method of any one of claims 13 to 14.
20. A computer program product comprising instructions for implementing the index building method of any one of claims 1 to 11 or the search method of any one of claims 13 to 14.
CN202210406518.XA 2022-04-18 2022-04-18 Index construction method and device, vector search method and retrieval system Pending CN114791966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210406518.XA CN114791966A (en) 2022-04-18 2022-04-18 Index construction method and device, vector search method and retrieval system

Publications (1)

Publication Number Publication Date
CN114791966A true CN114791966A (en) 2022-07-26

Family

ID=82461694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210406518.XA Pending CN114791966A (en) 2022-04-18 2022-04-18 Index construction method and device, vector search method and retrieval system

Country Status (1)

Country Link
CN (1) CN114791966A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination