CN115934724A - Method for constructing database index, retrieval method, device, equipment and medium - Google Patents


Info

Publication number
CN115934724A
CN115934724A
Authority
CN
China
Prior art keywords
cluster
vector
candidate
clusters
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211633718.5A
Other languages
Chinese (zh)
Inventor
付琰
许顺楠
陈亮辉
范斌
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211633718.5A
Publication of CN115934724A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a method for constructing a database index, a retrieval method, an apparatus, a device, and a medium, and relates to the field of computer technology, in particular to big data, artificial intelligence, and data retrieval. The specific implementation scheme is as follows: acquiring a vector to be stored; screening out, from a database, the vector cluster with the highest similarity to the vector to be stored, to obtain a cluster to be added; adding the vector to be stored to the inverted index of the cluster to be added; screening out at least one supplementary cluster from candidate vector clusters other than the cluster to be added, in the case that the similarity between the cluster to be added and the vector to be stored is lower than a first threshold; and adding the vector to be stored to the inverted index of each supplementary cluster respectively. The approach provided by the embodiments of the disclosure improves the vector recall rate and thereby retrieval efficiency.

Description

Method for constructing database index, retrieval method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computer technology, and more particularly, to the field of big data, artificial intelligence, data retrieval, and the like.
Background
With the rapid growth of data, data retrieval is widely applied in fields such as image search, video search, and recommendation. Text, pictures, and similar data can be abstracted into high-dimensional vectors, and the similarity between data items can be quantified as the distance between their vectors in vector space. The closer two vectors are, the more similar the original data they represent. Data retrieval can therefore be translated into a vector search in vector space: given the query vector corresponding to a query statement, the database is searched for the vectors closest to the query vector.
With the continuous expansion of vectors in databases, the retrieval recall rate needs to be continuously improved.
Disclosure of Invention
The disclosure provides a method for constructing a database index, a retrieval method, an apparatus, a device, and a medium.
According to an aspect of the present disclosure, there is provided a method of constructing a database index, including:
acquiring a vector to be stored;
screening out, from a database, the vector cluster with the highest similarity to the vector to be stored, to obtain a cluster to be added;
adding the vector to be stored to the inverted index of the cluster to be added;
screening out at least one supplementary cluster from candidate vector clusters other than the cluster to be added, in the case that the similarity between the cluster to be added and the vector to be stored is lower than a first threshold; and
adding the vector to be stored to the inverted index of each supplementary cluster respectively.
According to another aspect of the present disclosure, there is provided a retrieval method applied to an index constructed by the method for constructing a database index in any embodiment, including:
acquiring a query vector;
screening out a first specified number of vector clusters from the database as clusters to be queried, based on the similarity between the query vector and each vector cluster;
determining vectors contained in each cluster to be queried based on the inverted index of each cluster to be queried;
de-duplicating the vectors contained in the clusters to be queried to obtain a set of vectors to be queried; and
and screening out vectors matched with the query vectors from the vector set to be queried.
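The retrieval flow above (probe the most similar clusters, merge and de-duplicate their inverted indexes, then score the survivors) can be sketched as follows. The function and parameter names, the data layout, and the similarity callback are illustrative assumptions, not taken from the disclosure:

```python
def search(query, clusters, n_probe, top_k, similarity):
    """Multi-probe retrieval over an inverted-file index.

    clusters: list of (representative_vector, inverted_index) pairs,
    where inverted_index is a list of (vector_id, vector) entries.
    """
    # Rank clusters by similarity of their representative to the query
    # and keep the n_probe most similar ones as clusters to be queried.
    ranked = sorted(clusters, key=lambda c: similarity(query, c[0]), reverse=True)
    probed = ranked[:n_probe]

    # Collect candidate vectors, de-duplicating by vector id: an edge
    # vector stored in several clusters must be scored only once.
    candidates = {}
    for _, inverted_index in probed:
        for vec_id, vec in inverted_index:
            candidates[vec_id] = vec

    # Exhaustively score the de-duplicated candidate set.
    scored = sorted(candidates.items(),
                    key=lambda kv: similarity(query, kv[1]), reverse=True)
    return [vec_id for vec_id, _ in scored[:top_k]]
```

Note that the dictionary keyed by vector id is what implements the de-duplication step: a vector stored in both a cluster to be added and a supplementary cluster contributes a single candidate.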
According to another aspect of the present disclosure, there is provided an apparatus for constructing a database index, including:
a first acquisition module configured to acquire a vector to be stored;
a first screening module configured to screen out, from a database, the vector cluster with the highest similarity to the vector to be stored, to obtain a cluster to be added;
a first adding module configured to add the vector to be stored to the inverted index of the cluster to be added;
a second screening module configured to screen out at least one supplementary cluster from candidate vector clusters other than the cluster to be added, in the case that the similarity between the cluster to be added and the vector to be stored is lower than a first threshold; and
a second adding module configured to add the vector to be stored to the inverted index of each supplementary cluster respectively.
According to another aspect of the present disclosure, there is provided a retrieval apparatus applied to an index constructed by the apparatus for constructing a database index in any embodiment, including:
a second acquisition module configured to acquire a query vector;
a third screening module configured to screen out a first specified number of vector clusters from the database as clusters to be queried, based on the similarity between the query vector and each vector cluster;
a determining module configured to determine the vectors contained in each cluster to be queried based on the inverted index of each cluster to be queried;
a de-duplication module configured to de-duplicate the vectors contained in the clusters to be queried to obtain a set of vectors to be queried; and
a matching module configured to screen out, from the set of vectors to be queried, the vectors matching the query vector.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
In the embodiments of the present disclosure, an edge vector can be identified based on the similarity between the vector to be stored and its closest vector cluster. The edge vector is then added to additional supplementary clusters, so that multiple clusters all contain the vector to be stored. At recall time, recalling either the cluster to be added or any supplementary cluster recalls the edge vector, which improves the recall rate of edge vectors and thereby retrieval efficiency.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 (a) is a schematic diagram of failing to recall an edge vector in an embodiment in accordance with the present disclosure;
FIG. 1 (b) is a schematic diagram of adding a vector to be stored to a plurality of vector clusters according to an embodiment of the present disclosure;
FIG. 1 (c) is another schematic diagram of adding a vector to be stored to a plurality of vector clusters according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating a method for constructing a database index according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of deriving a supplemental cluster using preset conditions according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of adjusting a cluster radius according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a method for splitting a vector cluster according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart diagram of a retrieval method according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an apparatus for constructing a database index according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a retrieval apparatus provided according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a method of building a database index and/or a retrieval method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, it will be recognized by those of ordinary skill in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The number of vectors in a database used for retrieval is enormous. Taking pictures as an example, each picture can yield a corresponding vector through feature extraction, and that vector is stored in the database. When a query vector is obtained, comparing it against every known vector requires traversing the entire database, which is obviously inefficient and wastes computational resources.
To improve retrieval efficiency, the vectors in the database may be classified. All vectors in the database can be divided into a limited number of categories by a clustering method, each category corresponding to one vector cluster, and each vector cluster corresponding to a cluster representative vector. As the name implies, the cluster representative vector is a vector capable of representing the vector cluster; it can be selected from the vectors of the cluster, or obtained by averaging the vectors in the cluster.
During retrieval, the query vector is first compared with the cluster representative vectors to screen out a subset of vector clusters. The vectors close to the query vector are then found by traversing the selected clusters, and the query result is returned.
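As a toy illustration of this clustering step, the following sketch partitions vectors with a plain k-means loop, using each cluster's mean as its representative vector. The function name and parameters are illustrative assumptions; a real system would use an optimized library implementation:

```python
import random

def build_clusters(vectors, n_clusters, n_iters=10, seed=0):
    """Partition vectors into clusters; each cluster's representative
    vector is the mean of its members (its 'centroid')."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(n_iters):
        # Assignment step: each vector joins its nearest centroid.
        members = [[] for _ in centroids]
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            members[dists.index(min(dists))].append(v)
        # Update step: each centroid moves to the mean of its members.
        for k, group in enumerate(members):
            if group:
                dim = len(group[0])
                centroids[k] = [sum(v[d] for v in group) / len(group)
                                for d in range(dim)]
    return centroids, members
```

With the clusters built, a query only needs to compare against the centroids first, then traverse the few closest clusters instead of the whole database.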
Although clustering improves search efficiency to some extent, the recall rate for edge vectors still needs to be improved. For example, as shown in fig. 1 (a), assume the database includes vector cluster A1, vector cluster A2, and vector cluster A3, with a1, a2, and a3 as their respective cluster representative vectors. The distance between the query vector B and the cluster representative vector a2 of vector cluster A2 is the smallest, so vectors similar to query vector B are retrieved within vector cluster A2. However, it may happen that, although the cluster representative vector a2 has low similarity to query vector B, the edge vector a4 in vector cluster A1 is actually the vector in the database closest to query vector B. With the plain vector-cluster approach, therefore, the retrieval recall rate for edge vectors remains to be improved.
An edge vector can thus be understood as a vector that has low similarity to every vector cluster, so that no single cluster contains it well; equivalently, it lies at the edge of its vector cluster (i.e., far from the cluster representative vector). To improve the retrieval recall rate, the embodiments of the present disclosure provide a method for constructing a database index, in which an edge vector is stored in a plurality of vector clusters so that it can be recalled through any of them, improving the recall rate of edge vectors and thereby retrieval efficiency.
The method provided by the embodiments of the disclosure is suitable for fields such as image retrieval, video retrieval, text retrieval, voice retrieval, and protein-structure retrieval. In implementation, any application scenario requiring vector retrieval can benefit from the embodiments of the present disclosure.
FIG. 2 is a schematic flow chart of a method for constructing a database index, which includes the following steps:
s201, obtaining a vector to be put in storage.
For example, when a picture library is expanded, feature extraction may be performed on the newly added picture to obtain its corresponding vector, which is used as a vector to be stored so that it can be added to the database for retrieval.
For another example, when a protein-structure library is expanded, feature extraction may be performed on the newly added protein structure to obtain its structure vector, which is used as a vector to be stored so that it can be added to the database for retrieval.
For another example, when a text library is expanded, a text vector may be extracted from the new text and used as a vector to be stored so that it can be added to the database for retrieval.
S202, screening out, from the database, the vector cluster with the highest similarity to the vector to be stored, to obtain the cluster to be added.
S203, adding the vector to be stored to the inverted index of the cluster to be added.
Continuing with fig. 1 (b), assume vector B is a vector to be stored; the vector cluster A2 most similar to vector B can be screened out of the database as its cluster to be added, and vector B is added to vector cluster A2. The vector cluster A2 has an inverted index in which the index value of each vector contained in the cluster is recorded. Adding vector B to the inverted index of vector cluster A2 expands the cluster, so that vector B can be recalled through vector cluster A2 at retrieval time.
S204, in the case that the similarity between the cluster to be added and the vector to be stored is lower than a first threshold, screening out at least one supplementary cluster from candidate vector clusters other than the cluster to be added.
In the embodiments of the present disclosure, when the similarity between the cluster to be added and the vector to be stored is lower than the first threshold, the vector is determined to be an edge vector that no single cluster can contain well. To increase the recall rate, the vector to be stored may be added to more vector clusters: when supplementary clusters are screened out, S205 can be performed to improve the recall rate of the vector.
For example, as shown in fig. 1 (b), the vector B to be stored may also be added to the vector cluster A1.
S205, adding the vector to be stored to the inverted index of each supplementary cluster respectively.
Adding the vector to be stored to a supplementary cluster expands that cluster. The manner of adding it to the supplementary cluster's inverted index is similar to that set forth above and will not be described further herein.
In summary, in the embodiments of the present disclosure, the closest cluster to be added is first screened out for the vector to be stored, and the vector is added to it. Based on the similarity between the vector to be stored and this closest vector cluster, an edge vector can be identified; the edge vector is then supplemented into more clusters, so that multiple clusters all contain the vector to be stored. At recall time, recalling either the cluster to be added or any supplementary cluster recalls the edge vector, improving the recall rate of edge vectors and thereby retrieval efficiency.
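Steps S201–S205 can be sketched as follows. The data layout, the threshold values, and the rule for picking supplementary clusters (here simply the next-most-similar candidates) are illustrative assumptions; the disclosure also describes radius-based and redundancy-based conditions for this choice:

```python
def insert_vector(vec, clusters, first_threshold, n_supplement, similarity):
    """Add a vector to the index following steps S201-S205.

    clusters: list of dicts with keys 'rep' (representative vector)
    and 'index' (inverted index, a list of stored vectors).
    """
    # S202: the most similar cluster becomes the cluster to be added.
    sims = [similarity(vec, c['rep']) for c in clusters]
    best = sims.index(max(sims))
    # S203: append the vector to that cluster's inverted index.
    clusters[best]['index'].append(vec)
    # S204: if even the best similarity is below the first threshold,
    # the vector is an edge vector; pick supplementary clusters from
    # the remaining candidates (here: the next-most-similar ones).
    if sims[best] < first_threshold:
        order = sorted((s, k) for k, s in enumerate(sims) if k != best)
        order.reverse()
        for _, k in order[:n_supplement]:
            # S205: append to each supplementary cluster's index too.
            clusters[k]['index'].append(vec)
```

A vector well inside its nearest cluster takes the fast path (one index entry); only edge vectors pay the cost of extra entries.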
In one possible implementation, vector clusters with high similarity to the cluster to be added can be used as supplementary clusters. For example, on the basis of fig. 1 (a), taking fig. 1 (c) as an example, assume the vector to be stored is a4 and vector cluster A1 has the highest similarity to it; clusters similar to vector cluster A1 may then be selected as supplementary clusters. If the similarity between vector cluster A2 and vector cluster A1 is high, the vector a4 is also added to vector cluster A2. Then, when the query vector B later arrives, the edge vector a4 is recalled whether vector cluster A1 or vector cluster A2 is recalled, so the vector a4 closest to query vector B is found and retrieval precision improves.
In practice, candidate vector clusters whose similarity to the cluster to be added (e.g., vector cluster A2) exceeds a similarity threshold may be selected as supplementary clusters; alternatively, a specified number of the candidate vector clusters most similar to the cluster to be added may be selected, ranked by similarity.
In another embodiment, screening out at least one supplementary cluster from the candidate vector clusters other than the cluster to be added can also be implemented as shown in fig. 3:
candidate vector clusters satisfying preset conditions are screened out as supplementary clusters, wherein the preset conditions include at least one of the following:
the method comprises the following steps that 1, the similarity between a vector to be put in storage and a candidate vector cluster is larger than a second threshold; the second threshold is greater than the first threshold. Namely, the candidate vector cluster which is closer to the vector to be warehoused is screened out to be used as a supplementary cluster, so that the vector to be warehoused is added into the supplementary cluster. As shown in fig. 3, the candidate vector cluster a has a higher similarity to the vector to be binned, and therefore serves as a complementary cluster.
Condition 2: the vector to be stored falls within the cluster radius of the candidate vector cluster. That is, candidate vector clusters whose radius can cover the vector to be stored are screened out as supplementary clusters. As in fig. 3, the vector to be stored is within the cluster radius of candidate vector cluster B, so candidate vector cluster B serves as a supplementary cluster.
Thus, in the embodiments of the present disclosure, the preset conditions screen out vector clusters that are either similar to the vector to be stored or whose cluster radius covers it, so that suitable supplementary clusters are found and the vector to be stored is added to appropriate vector clusters, improving the recall rate.
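Conditions 1 and 2 can be read together as a simple acceptance predicate per candidate cluster. The data layout and names below are illustrative assumptions:

```python
import math

def is_supplementary(vec, candidate, second_threshold, similarity):
    """Accept a candidate cluster as a supplementary cluster if either
    preset condition holds.

    candidate: dict with 'rep' (representative vector) and 'radius'.
    """
    # Condition 1: similarity to the candidate exceeds the second threshold.
    if similarity(vec, candidate['rep']) > second_threshold:
        return True
    # Condition 2: the vector lies within the candidate's cluster radius.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, candidate['rep'])))
    return dist <= candidate['radius']
```

Either condition alone suffices, matching the "at least one of the following" wording above.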
In implementation, the similarity between the vector to be stored and any vector cluster can be represented by the distance between the vector and the cluster representative vector. The similarity between the vector to be stored and a candidate vector cluster can be determined as follows:
Step A1: obtaining the cluster representative vector of the candidate vector cluster.
As described with fig. 1 (a), the cluster representative vector is a vector capable of representing the corresponding cluster; it may be a vector in the cluster or the mean of the vectors in the cluster.
Step A2: determining the similarity between the vector to be stored and the cluster representative vector, and using it as the similarity between the vector to be stored and the candidate vector cluster.
The similarity between vectors can be measured by, for example, cosine similarity or Euclidean distance.
In this way, the difference between the vector to be stored and a vector cluster is expressed by the distance between vectors, which makes it convenient to find a suitable cluster to be added or supplementary cluster for the vector to be stored, improving the recall rate of edge vectors.
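Step A2's similarity can be computed with either measure mentioned; a minimal sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean_distance(u, v):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine similarity grows with similarity while Euclidean distance shrinks, so a system must pick one convention (e.g., negate the distance) before comparing against thresholds.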
In some embodiments, the cluster radius may be determined from the similarities between all vectors in the cluster and the cluster representative vector: the mean of these similarities may be taken as the cluster radius, or the maximum may be selected. To determine the cluster radius more reasonably, in the embodiments of the present disclosure the cluster radius may be generated automatically by a neural network model, as follows:
and step B1, determining candidate radiuses of the candidate vector clusters based on learnable parameters of the radius learning network.
In implementation, in the mapping relation expressed by the radius learning network, the learnable parameter serves as the independent variable and the candidate radius as the dependent variable, so that the candidate radius is determined.
The mapping relation may be any of various non-linear relations; any mapping from a learnable parameter to the candidate radius that allows the radius to be learned is applicable to the embodiments of the present disclosure.
Step B2: determining a loss value based on the candidate radius, the vectors in the candidate vector cluster, and the cluster representative vector of the candidate vector cluster.
Step B3: adjusting the learnable parameters based on the loss value, and determining the cluster radius of the candidate vector cluster from the adjusted learnable parameters once the radius learning network satisfies the training convergence condition.
In some embodiments, the radius learning network satisfies the training convergence condition when the loss value stabilizes or the number of iterations reaches a specified threshold. Once the convergence condition is met, learning stops, and the radius determined by the final learnable parameter is output as the cluster radius of the candidate vector cluster.
For example, the radius learning network may be a softplus activation function, which expresses the candidate radius as shown in formula (1):

Δ_k = ln(1 + exp(θ_k))        (1)

In formula (1), Δ_k is the candidate radius and θ_k is the learnable parameter. By formula (1), the value of Δ_k is always greater than 0, so the softplus activation function satisfies the physical property of a cluster radius, namely that it must be larger than zero.
Beyond the softplus activation function, the radius learning network is not limited to the form shown in formula (1): any learning network that is differentiable with respect to the learnable parameter θ_k and keeps the cluster radius greater than 0 is suitable for the embodiments of the present disclosure.
In some embodiments, the learnable parameters are adjusted based on the loss value. The loss value is calculated from a loss function, and the target loss function for learning the cluster radius in the embodiments of the present disclosure may satisfy the following condition: for any vector in the candidate vector cluster, when the distance between the vector and the cluster representative vector is greater than the candidate radius, the loss value determined by the target loss function increases the candidate radius; when the distance is less than or equal to the candidate radius, the loss value decreases the candidate radius.
An alternative objective loss function is a boundary loss function, whose expression is shown in equation (2):
L = (1/N) · Σ_{i=1}^{N} [ δ_i · (||z_i − c_{y_i}||_2 − Δ_{y_i}) + (1 − δ_i) · (Δ_{y_i} − ||z_i − c_{y_i}||_2) ]        (2)

In formula (2), z_i is the i-th vector in the candidate vector cluster, c_{y_i} is the cluster representative vector of the candidate vector cluster, ||z_i − c_{y_i}||_2 is the distance between the i-th vector and the cluster representative vector, N is the total number of vectors in the candidate vector cluster, and Δ_{y_i} is the candidate radius. The indicator δ_i is obtained as shown in formula (3):

δ_i = 1, if ||z_i − c_{y_i}||_2 > Δ_{y_i};  δ_i = 0, otherwise        (3)
in some embodiments, in the case that the distance between the vector in the candidate vector cluster and the cluster representative vector is greater than the candidate radius, it indicates that the vector is outside the candidate radius, which means that the current cluster radius is smaller and the cluster radius needs to be increased, so δ i The value is 1 in order to increase the cluster radius; under the condition that the distance between the vector and the cluster representative vector is not larger than the cluster radius, the fact that the vector is inside the cluster radius is shown, the current cluster radius is larger, and the cluster radius needs to be reduced, so that the delta i The value is 0 in order to reduce the cluster radius by the target loss function.
Each vector in candidate vector cluster A is input into the radius learning network, and by adjusting the network's learnable parameter the cluster radius of candidate vector cluster A is obtained; different candidate vector clusters obtain learnable parameters for their own cluster radii. For example, as shown in part a of fig. 4, the cluster radius of a vector cluster may be increased appropriately, and as shown in part b of fig. 4, it may be decreased appropriately.
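Under the assumption that the boundary loss behaves as described (a vector outside the candidate radius pulls it up, a vector inside pushes it down), radius learning can be sketched with the softplus of formula (1) and plain gradient descent. The function names, learning rate, and step count are illustrative, not the disclosure's training setup:

```python
import math

def softplus(theta):
    # Formula (1): the candidate radius is always positive.
    return math.log(1.0 + math.exp(theta))

def learn_cluster_radius(distances, lr=0.1, n_steps=500):
    """Learn a cluster radius with the boundary-loss behavior described
    in the text: each vector outside the current radius raises it
    (delta_i = 1 branch), each vector inside lowers it (delta_i = 0).

    distances: precomputed ||z_i - c||_2 for every vector in the cluster.
    """
    theta = 0.0  # learnable parameter
    for _ in range(n_steps):
        radius = softplus(theta)
        # d loss / d radius: +1 per inside vector, -1 per outside vector
        # (the gradient of (1/N) * sum_i of the delta-switched terms).
        grad_r = sum(1.0 if d <= radius else -1.0 for d in distances) / len(distances)
        # Chain rule through softplus: d radius / d theta = sigmoid(theta).
        grad_theta = grad_r * (1.0 / (1.0 + math.exp(-theta)))
        theta -= lr * grad_theta
    return softplus(theta)
```

With this loss the radius settles where the counts of inside and outside vectors balance, and the softplus guarantees it stays positive throughout training.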
Thus, in the embodiments of the present disclosure, learning the cluster radius yields an accurate radius for each candidate vector cluster, so that suitable supplementary clusters can be screened out and the retrieval recall rate improved.
In the embodiments of the present disclosure, adding the same vector to be stored to a plurality of vector clusters results in vector redundancy in the database. To reduce redundancy and save storage resources, the preset conditions may further include Condition 3: the similarity between the candidate vector cluster and any designated cluster is not higher than a third threshold, where the designated clusters include the cluster to be added and the supplementary clusters that have already satisfied the preset conditions.
For example, in fig. 3, the similarity between any two of the cluster to be added and the supplementary clusters is required to be not higher than the third threshold.
For another example, continuing with fig. 1 (c): the vector a4 to be stored is added to its closest vector cluster A1. Suppose the similarity between vector cluster A2 and the vector a4 is greater than the second threshold, and the similarity between vector cluster A2 and vector cluster A1 is less than the third threshold; then vector cluster A2 satisfies the preset conditions and serves as a supplementary cluster for a4. The designated clusters now include vector cluster A1 and vector cluster A2. Suppose further that the similarity between vector cluster A3 and the vector a4 is higher than the second threshold and the similarity between vector cluster A3 and vector cluster A1 is lower than the third threshold, but the similarity between vector cluster A3 and vector cluster A2 is higher than the third threshold. Because vector clusters A3 and A2 are similar, vector cluster A3 fails Condition 3 and is not used as a supplementary cluster for a4. The reason is that, during recall, the similar clusters A3 and A2 would most likely be recalled together, so placing the vector a4 in one of them (i.e., adding it to vector cluster A2) suffices; it need not be added to vector cluster A3 as well.
Thus, in the embodiments of the present disclosure, this supplementary preset condition reduces the number of supplementary clusters: the vector to be stored is added to suitable vector clusters, the recall rate is improved, and at the same time vector redundancy is reduced and storage resources are saved.
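Condition 3 can be read as a greedy filter: each candidate is accepted only if it is not too similar to any already-designated cluster, and accepted candidates immediately join the designated set, as in the fig. 1 (c) walk-through. The names and the caller-supplied cluster-similarity function are illustrative assumptions:

```python
def select_supplementary(candidates, designated, third_threshold, cluster_sim):
    """Greedily keep candidates whose similarity to every designated
    cluster (the cluster to be added plus previously accepted
    supplementary clusters) is not higher than the third threshold."""
    chosen = []
    for cand in candidates:
        if all(cluster_sim(cand, d) <= third_threshold
               for d in designated + chosen):
            chosen.append(cand)
    return chosen
```

Because accepted clusters extend the comparison set, two mutually similar candidates can never both be chosen, which is exactly the redundancy reduction described above.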
In some embodiments, as vector clusters keep absorbing new vectors, the number of vectors in a vector cluster grows larger and larger. To prevent the cluster representative vector from becoming less representative, the embodiment of the disclosure splits such a vector cluster so that each cluster representative vector can still represent one class of vectors well. This may be implemented as shown in fig. 5:
S501, for any one target cluster among the cluster to be added and the supplementary clusters, splitting the target cluster into a plurality of vector clusters under the condition that the target cluster meets the splitting condition.
The cluster to be added can be used as a target cluster, a supplementary cluster can be used as a target cluster, and both can respectively serve as target clusters. Any target cluster that meets the splitting condition can be split.
Wherein the splitting conditions may comprise at least one of:
1) The number of vectors within the target cluster is above a number threshold.
The number threshold may be set according to a requirement, which is not limited in this disclosure.
2) The distance between the vectors within the target cluster is greater than a distance threshold.
For example, the distance between each vector in the target cluster and the cluster representative vector is determined, and the maximum distance is selected. If the maximum distance is greater than the distance threshold, the cluster representative vector cannot well represent the vector farthest from it. Therefore, the target cluster needs to be split and subdivided into a plurality of vector clusters.
When the target cluster is split, a clustering method can be adopted to re-cluster the target cluster. For example, the K-means clustering algorithm may be employed: it divides the data set into n clusters, and each cluster uses the mean of all samples within the cluster as its cluster representation; this mean may also be referred to as the "centroid".
In addition, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) or a hierarchical clustering method can be adopted to split the target cluster into a plurality of vector clusters.
Of course, it should be noted that any clustering method is applicable to the embodiments of the present disclosure, and the present disclosure does not specifically limit this.
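As one illustration of the re-clustering step, a bare-bones Lloyd's K-means over a target cluster's vectors might look like the following sketch (pure Python, hypothetical names; a production system would use an optimized library implementation):

```python
import random
from math import dist

def kmeans_split(vectors, n_clusters, iters=50, seed=0):
    """Re-cluster a target cluster's vectors into n_clusters sub-clusters
    using plain Lloyd's iterations; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)   # simple random init
    assign = []
    for _ in range(iters):
        # assign every vector to its nearest centroid
        assign = [min(range(n_clusters), key=lambda j: dist(v, centroids[j]))
                  for v in vectors]
        new_centroids = []
        for j in range(n_clusters):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                # the mean of the members serves as the sub-cluster "centroid"
                new_centroids.append(tuple(sum(x) / len(members)
                                           for x in zip(*members)))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:            # converged
            break
        centroids = new_centroids
    return centroids, assign
```

The returned assignments tell which sub-cluster each original vector falls into, which is exactly what S502 needs to rebuild the inverted indexes.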
When splitting a target cluster, the number of vectors in each vector cluster obtained by splitting can be required to be less than or equal to n. In practice, the value of n may be determined as desired, for example by comprehensively considering the QPS (Queries Per Second) requirement.
QPS, the number of queries that can be executed per second, measures how much traffic a query service processes within a specified time. Assuming that k vector clusters need to be recalled for retrieval and that the QPS budget allows m vectors to be scanned per second, then each vector cluster can hold at most (m/k) vectors. For example, if the budget is 10000 vectors per second and k is 10, each vector cluster can hold 1000 vectors. Assuming that the target cluster has 5000 vectors, the target cluster needs to be split into 5 vector clusters.
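The QPS-driven sizing arithmetic above can be captured in two small helpers (hypothetical names; the disclosure only states the m/k relationship):

```python
from math import ceil

def max_cluster_size(m, k):
    """Largest n (vectors per cluster) so that scanning the k recalled
    clusters stays within a budget of m vector comparisons per second."""
    return m // k

def num_splits(cluster_size, n):
    """How many sub-clusters a target cluster must be split into so that
    each sub-cluster holds at most n vectors."""
    return ceil(cluster_size / n)
```

With the numbers from the example: a 10000-vector budget and k = 10 give n = 1000, so a 5000-vector target cluster is split into 5 sub-clusters.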
And S502, respectively constructing an inverted index for each vector cluster obtained by splitting.
That is, for each vector cluster obtained by splitting, an inverted index is constructed from the vectors contained in that vector cluster. The inverted index contains all vectors of the vector cluster.
In the embodiment of the disclosure, the large cluster is split into the small clusters, so that the number of times of traversing the vector during retrieval is reduced, and the retrieval efficiency is improved.
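S502 amounts to building one posting list per sub-cluster; a minimal sketch, assuming vectors are referred to by ids and sub-clusters by the assignment labels produced when splitting (names are hypothetical):

```python
def build_inverted_indexes(vector_ids, assignments):
    """Construct one inverted index (posting list of vector ids) per
    sub-cluster obtained from a split; assignments[i] is the sub-cluster
    label of vector_ids[i]."""
    index = {}
    for vid, label in zip(vector_ids, assignments):
        index.setdefault(label, []).append(vid)
    return index
```

Each posting list then serves the recall step: once a sub-cluster is selected, its list enumerates exactly the vectors to compare against the query.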
In summary, in the embodiments of the present disclosure, the same vector to be put in storage is redundantly added to a plurality of vector clusters to increase the recall rate of edge vectors.
In addition, in order to keep this redundancy under control and reduce the waste of storage resources, the embodiment of the disclosure restricts the vector to be put in storage to being added to only one cluster among mutually similar vector clusters, thereby improving the utilization rate of storage resources.
By splitting vector clusters, the cluster representative vectors become more representative and can reasonably represent the intra-cluster vectors, which improves retrieval efficiency.
Based on the same technical concept, the embodiment of the disclosure also provides a retrieval method, which is implemented based on the database index construction method. As shown in fig. 6, the method includes the following:
s601, obtaining a query vector.
S602, based on the similarity between the query vector and the vector clusters, screening out the vector clusters with a first specified number from the database as the clusters to be queried.
In implementation, for example, a specified number (e.g., k) of vector clusters may be screened out as clusters to be queried in an order from high similarity to low similarity.
S603, determining the vector contained in each cluster to be queried based on the inverted index of each cluster to be queried.
For example, if the cluster 1 to be queried contains p1 vectors and the cluster 2 to be queried contains p2 vectors, the vectors in each cluster to be queried can be found according to the inverted index of each cluster to be queried.
S604, after the vectors contained in each cluster to be queried are subjected to de-duplication, a set of vectors to be queried is obtained.
For example, as explained above, the same vector may exist in a plurality of vector clusters; therefore, the vectors gathered from the clusters to be queried need to be de-duplicated, which reduces the number of vectors that must be matched against the query vector and improves retrieval efficiency.
S605, screening out the vector matched with the query vector from the vector set to be queried.
Continuing with the case shown in fig. 1 (c), assume that the edge vector a4 has been added to both the vector cluster A1 and the vector cluster A2 by the database index construction method provided by the embodiment of the present disclosure. For the query vector B, since the vector cluster A2 closest to it contains the edge vector a4, the edge vector a4 can be recalled from the vector cluster A2 even if the vector cluster A1 is not recalled, so that the vector a4 closest to the query vector B is found and returned.
Therefore, in the embodiment of the present disclosure, adding a vector to a plurality of vector clusters can increase the recall rate of the vector, thereby improving retrieval accuracy. In addition, de-duplication reduces the number of vectors matched against the query vector, improving retrieval efficiency.
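Putting S601-S605 together, a toy retrieval pass over the structures above might look like this sketch (cosine similarity and dictionary-based storage are assumptions; names are hypothetical):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def search(query, reps, inverted, vectors, k, top):
    """S602-S605: recall the k most similar clusters, merge and
    de-duplicate their posting lists, then rank the survivors."""
    # S602: screen k clusters to be queried, most similar first
    clusters = sorted(reps, key=lambda c: cosine(query, reps[c]),
                      reverse=True)[:k]
    # S603/S604: gather vector ids from each inverted index; the set
    # removes vectors that were redundantly stored in several clusters
    candidates = set()
    for c in clusters:
        candidates.update(inverted[c])
    # S605: screen out the vectors best matching the query
    ranked = sorted(candidates, key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:top]
```

In a setup mirroring fig. 1 (c), where the edge vector a4 is posted to both A1 and A2, recalling only the single closest cluster still returns a4.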
Based on the same technical concept, the present disclosure also provides an apparatus for constructing a database index, as shown in fig. 7, the apparatus including:
a first obtaining module 701, configured to obtain a vector to be put into a storage;
the first screening module 702 is configured to screen a vector cluster with the highest similarity to a vector to be put into a storage from a database, so as to obtain a cluster to be added;
a first adding module 703, configured to add the vector to be put into storage to the inverted index of the cluster to be added;
the second screening module 704 is configured to screen at least one supplementary cluster from candidate vector clusters other than the cluster to be added under the condition that the similarity between the cluster to be added and the vector to be put in storage is lower than a first threshold;
a second adding module 705, configured to add the vectors to be put into storage to the inverted index of each supplementary cluster respectively.
In some embodiments, the second filtering module 704 is configured to:
screening candidate vector clusters meeting preset conditions from the candidate vector clusters except the cluster to be added to serve as supplementary clusters;
the preset condition includes at least one of the following:
the similarity between the vector to be put in storage and the candidate vector cluster is greater than a second threshold value; the second threshold is greater than the first threshold;
and the vector to be put into the database is in the cluster radius range of the candidate vector cluster.
In some embodiments, the preset conditions further include: the similarity between the candidate vector cluster and the designated cluster is not higher than a third threshold; the designated cluster includes a cluster to be added and a supplementary cluster that has satisfied a preset condition.
In some embodiments, the first filtering module 702 is further configured to:
obtaining cluster representative vectors of the candidate vector clusters;
and determining the similarity between the vector to be warehoused and the cluster representative vector to obtain the similarity between the vector to be warehoused and the candidate vector cluster.
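If the cluster representative vector is taken to be the mean of the member vectors (one common choice; the disclosure does not mandate it), the similarity between a vector and a candidate vector cluster reduces to a single vector comparison:

```python
from math import sqrt

def cluster_representative(members):
    """Mean of the member vectors, used here as the cluster representative
    (an assumption; any representative vector works the same way)."""
    return tuple(sum(x) / len(members) for x in zip(*members))

def similarity_to_cluster(vec, members):
    """Similarity between a to-be-binned vector and a candidate vector
    cluster, computed against the cluster representative vector."""
    rep = cluster_representative(members)
    dot = sum(a * b for a, b in zip(vec, rep))
    return dot / (sqrt(sum(a * a for a in vec)) * sqrt(sum(b * b for b in rep)))
```

This is why the screening modules never iterate over a cluster's members: one comparison against the representative stands in for the whole cluster.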
In some embodiments, a cluster radius generation module is further included for generating a cluster radius for a candidate vector cluster based on:
determining candidate radii of the candidate vector clusters based on learnable parameters of a radius learning network;
determining a loss value based on the candidate radius, the vector quantity in the candidate vector cluster and the cluster representative vector of the candidate vector cluster;
and adjusting the learnable parameters based on the loss value, and determining the cluster radius of the candidate vector cluster based on the adjusted learnable parameters under the condition that the radius learning network meets the training convergence condition.
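The disclosure does not specify the loss used by the radius learning network, so the following is only an illustrative sketch with an assumed hinge-style loss that trades a small radius against leaving member vectors uncovered; the loss form, the `penalty` weight, and plain gradient descent are all assumptions.

```python
def learn_cluster_radius(dists, penalty=10.0, lr=0.01, steps=2000):
    """Learn a scalar cluster radius r by gradient descent on an ASSUMED
    loss L(r) = r + penalty * mean(max(0, d_i - r)), where d_i are the
    distances from member vectors to the cluster representative vector.
    A small radius is preferred, but every member left outside the
    radius is penalized; the actual loss is not disclosed."""
    r = 0.0
    for _ in range(steps):
        # subgradient of L: 1 from the radius term, minus penalty times
        # the fraction of members still outside the radius
        uncovered = sum(1 for d in dists if d > r) / len(dists)
        grad = 1.0 - penalty * uncovered
        r -= lr * grad
    return r
```

Under this assumed loss, the radius settles where it covers most members but ignores far outliers, which matches the intent of the cluster-radius range check in the preset conditions.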
In some embodiments, further comprising:
the splitting unit is used for splitting the target cluster into a plurality of vector clusters under the condition that the target cluster meets the splitting condition, for any one target cluster among the cluster to be added and the supplementary clusters;
and the construction unit is used for respectively constructing inverted indexes for each vector cluster obtained by splitting.
Based on the same technical concept, an embodiment of the present disclosure further provides a retrieval apparatus, which is applied to an index constructed by the apparatus shown in fig. 7, and as shown in fig. 8, the retrieval apparatus includes:
a second obtaining module 801, configured to obtain a query vector;
a third screening module 802, configured to screen, based on similarity between the query vector and the vector clusters, a first specified number of vector clusters from the database as clusters to be queried;
a determining module 803, configured to determine, based on the inverted index of each cluster to be queried, a vector included in each cluster to be queried;
the duplication elimination module 804 is used for eliminating duplication of vectors contained in each cluster to be queried to obtain a vector set to be queried;
and a matching module 805, configured to filter out a vector matched with the query vector from the set of vectors to be queried.
For a description of specific functions and examples of each module and each sub-module of the apparatus in the embodiment of the present disclosure, reference may be made to the related description of the corresponding steps in the foregoing method embodiments, and details are not repeated here.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the various methods and processes described above, such as the method of building a database index and/or the retrieval method. For example, in some embodiments, the method of constructing a database index and/or the retrieval method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described method of constructing a database index and/or the retrieval method may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the method of building a database index and/or the retrieval method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method of constructing a database index, comprising:
acquiring a vector to be put in storage;
screening out a vector cluster with the highest similarity with the vector to be put in storage from a database to obtain a cluster to be added;
adding the vector to be put into a warehouse into the inverted index of the cluster to be added;
screening at least one supplementary cluster from candidate vector clusters except the cluster to be added under the condition that the similarity between the cluster to be added and the vector to be put in storage is lower than a first threshold value;
and respectively adding the vectors to be put into storage into the inverted index of each supplementary cluster.
2. The method of claim 1, wherein screening at least one complementary cluster from the candidate vector clusters other than the to-be-added cluster comprises:
screening candidate vector clusters meeting preset conditions from the candidate vector clusters except the cluster to be added to serve as supplementary clusters;
the preset condition comprises at least one of the following:
the similarity between the vector to be put into the warehouse and the candidate vector cluster is larger than a second threshold value; the second threshold is greater than the first threshold;
and the vector to be put in storage is within the cluster radius range of the candidate vector cluster.
3. The method of claim 2, the preset condition further comprising: the similarity between the candidate vector cluster and the designated cluster is not higher than a third threshold; the designated cluster comprises the cluster to be added and a supplementary cluster which meets the preset condition.
4. The method of claim 2 or 3, further comprising determining a similarity between the vector to be binned and the cluster of candidate vectors based on:
obtaining cluster representative vectors of the candidate vector clusters;
and determining the similarity between the vector to be put in storage and the representative vector of the cluster to obtain the similarity between the vector to be put in storage and the candidate vector cluster.
5. The method of claim 2 or 3, further comprising generating a cluster radius for the candidate vector cluster based on:
determining candidate radii of the candidate vector cluster based on learnable parameters of a radius learning network;
determining a loss value based on the candidate radius, the number of vectors in the candidate vector cluster, and a cluster representative vector of the candidate vector cluster;
and adjusting the learnable parameters based on the loss values, and determining the cluster radius of the candidate vector cluster based on the adjusted learnable parameters under the condition that the radius learning network meets the training convergence condition.
6. The method of any of claims 1-5, further comprising:
for any target cluster of the cluster to be added and the supplementary cluster, splitting the target cluster into a plurality of vector clusters under the condition that the target cluster meets the splitting condition;
and respectively constructing an inverted index for each vector cluster obtained by splitting.
7. A retrieval method applied to an index constructed by the method of any one of claims 1-6, comprising:
acquiring a query vector;
screening a first specified number of vector clusters from a database as clusters to be queried based on the similarity between the query vectors and the vector clusters;
determining vectors contained in each cluster to be queried based on the inverted index of each cluster to be queried;
removing duplication of vectors contained in each cluster to be queried to obtain a set of vectors to be queried;
and screening out the vectors matched with the query vectors from the vector set to be queried.
8. An apparatus for constructing a database index, comprising:
the first acquisition module is used for acquiring a vector to be put into a warehouse;
the first screening module is used for screening out a vector cluster with the highest similarity with the vector to be warehoused from a database to obtain a cluster to be added;
the first adding module is used for adding the vector to be put into the storage into the inverted index of the cluster to be added;
the second screening module is used for screening at least one supplementary cluster from the candidate vector clusters except the cluster to be added under the condition that the similarity between the cluster to be added and the vector to be put in storage is lower than a first threshold value;
and the second adding module is used for respectively adding the vectors to be put into storage into the inverted indexes of each supplementary cluster.
9. The apparatus of claim 8, wherein the second screening module is to:
screening candidate vector clusters meeting preset conditions from the candidate vector clusters except the cluster to be added to serve as supplementary clusters;
the preset condition comprises at least one of the following:
the similarity between the vector to be put in storage and the candidate vector cluster is greater than a second threshold value; the second threshold is greater than the first threshold;
and the vector to be put in storage is within the cluster radius range of the candidate vector cluster.
10. The apparatus of claim 9, the preset condition further comprising: the similarity between the candidate vector cluster and the designated cluster is not higher than a third threshold; the designated cluster comprises the cluster to be added and a supplementary cluster which meets the preset condition.
11. The apparatus of claim 9 or 10, the first screening module further to:
obtaining cluster representative vectors of the candidate vector clusters;
and determining the similarity between the vector to be warehoused and the cluster representative vector to obtain the similarity between the vector to be warehoused and the candidate vector cluster.
12. The apparatus according to claim 9 or 10, further comprising a cluster radius generation module for generating a cluster radius of the candidate vector cluster based on:
determining candidate radii of the candidate vector cluster based on learnable parameters of a radius learning network;
determining a loss value based on the candidate radius, the number of vectors in the candidate vector cluster, and a cluster representative vector of the candidate vector cluster;
and adjusting the learnable parameters based on the loss values, and determining the cluster radius of the candidate vector cluster based on the adjusted learnable parameters under the condition that the radius learning network meets the training convergence condition.
13. The apparatus of any of claims 8-12, further comprising:
the splitting unit is used for splitting the target cluster into a plurality of vector clusters under the condition that the target cluster meets the splitting condition, for any one target cluster among the cluster to be added and the supplementary clusters;
and the construction unit is used for respectively constructing inverted indexes for each vector cluster obtained by splitting.
14. A retrieval apparatus applied to an index constructed by the apparatus of any one of claims 8 to 13, comprising:
the second acquisition module is used for acquiring the query vector;
the third screening module is used for screening the vector clusters with the first specified number from the database as clusters to be queried based on the similarity between the query vectors and the vector clusters;
the determining module is used for determining vectors contained in each cluster to be queried based on the inverted index of each cluster to be queried;
the duplication removing module is used for removing duplication of vectors contained in each cluster to be queried to obtain a set of vectors to be queried;
and the matching module is used for screening out the vectors matched with the query vectors from the vector set to be queried.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.