CN111309984B - Method and device for retrieving node vector from database by index

Info

Publication number: CN111309984B
Application number: CN202010163626.XA
Authority: CN
Inventors: 杨文�; 李涛; 方概; 魏宏
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-03-10
Filing date: 2020-03-10
Publication date: 2023-09-05
Anticipated expiration: 2040-03-10
Also published as: CN111309984A

Abstract

Embodiments of this specification provide a method and a device for retrieving a node vector from a database by using an index. A PostgreSQL database includes vectors of a plurality of nodes, and an index divides the nodes into a plurality of clusters, each cluster corresponding to a center point. During retrieval, based on the index, vector matching is performed between the first node to be retrieved and the center point of each cluster, and a target center point whose vector best matches the first node is determined from the plurality of center points. Vector matching is then performed between the first node and each node in the first cluster in which the target center point is located, each node is added to a matching queue according to the matching result, and the node retrieved for the first node is determined based on the ordering of the nodes in the matching queue.

Description

Method and device for retrieving node vector from database by index

Technical Field

One or more embodiments of the present disclosure relate to the field of data retrieval, and in particular, to a method and apparatus for retrieving a node vector from a database using an index.

Background

With the development of computer technology, data carries more and more information, and the demand for searching that data keeps growing. Data that carries rich information, such as images and user features, can generally be represented by high-dimensional vectors. In some application scenarios, high-dimensional vectors need to be retrieved. For example, in face payment, an input face image may be looked up among a large number of face images in a database; on a shopping site, an input commodity image may be looked up among a large number of commodity images in a database. PostgreSQL is an open source database with high availability and high extensibility that supports vector retrieval. As the amount of data and the vector dimensions increase, the efficiency of database-based vector retrieval has become an important direction of technical improvement.

Accordingly, an improved scheme is desired that can improve the search efficiency when performing high-dimensional vector search based on the PostgreSQL database.

Disclosure of Invention

One or more embodiments of the present specification describe methods and apparatus for node vector retrieval from a database using an index, to improve retrieval efficiency when performing high-dimensional vector retrieval based on a PostgreSQL database. The specific technical solution is as follows.

In a first aspect, an embodiment provides a method for retrieving a node vector from a PostgreSQL database using an index, executed by a computer; the database comprises vectors of a plurality of nodes, the index divides the plurality of nodes into a plurality of clusters, and each cluster corresponds to a central point; the method comprises the following steps:

acquiring a first node to be retrieved;

based on the index, performing vector matching between the first node and the central point corresponding to each cluster, and determining, from the plurality of central points, a target central point whose vector best matches the first node;

performing vector matching between the first node and each of a plurality of nodes in a first cluster where the target central point is located, and adding each node to a matching queue according to the matching result;

determining the nodes retrieved for the first node based on the ordering of the nodes in the matching queue.
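The steps of the first aspect can be sketched as a two-stage search. The following is a minimal illustration, not the patented implementation: the in-memory `index` layout, the function name `search`, the parameters `n_probe` and `top_k`, and the choice of Euclidean distance as the matching measure are all assumptions made for the example.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(index, query, n_probe=1, top_k=3):
    """Two-stage retrieval over a hypothetical in-memory index.

    index["centers"][c]  : vector of the center point of cluster c
    index["clusters"][c] : list of (node_id, vector) pairs in cluster c
    """
    # Stage 1: match the query against every center point and keep the
    # n_probe best-matching ones as target center points.
    ranked = sorted(range(len(index["centers"])),
                    key=lambda c: euclidean(index["centers"][c], query))
    targets = ranked[:n_probe]

    # Stage 2: match the query against the nodes of each target cluster and
    # add them all to one shared matching queue, ordered by distance.
    queue = []
    for c in targets:
        for node_id, vec in index["clusters"][c]:
            heapq.heappush(queue, (euclidean(vec, query), node_id))

    # The retrieved nodes are the best-ordered entries of the queue.
    return [node_id for _, node_id in heapq.nsmallest(top_k, queue)]
```

Only the center points and the nodes of the selected clusters are compared with the query, which is where the reduction in the number of matching operations comes from.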

In one embodiment, the index includes a plurality of center point data pages for storing vectors of respective center points and a start node data page identification of a cluster in which each center point is located, and a plurality of node data pages belonging to different clusters for storing vectors of respective nodes, the nodes in one node data page corresponding to one cluster.

In one embodiment, the step of vector matching the center points corresponding to the clusters with the first node based on the index includes:

acquiring a plurality of center point data pages from the index, acquiring vectors of center points corresponding to the clusters from the plurality of center point data pages, and respectively matching the vectors of the center points corresponding to the clusters with the vectors of the first node;

the step of performing vector matching on the plurality of nodes in the first cluster where the target center point is located and the first node respectively includes:

and acquiring a starting node data page identifier of the first cluster from a center point data page corresponding to the target center point, acquiring vectors of a plurality of nodes corresponding to the first cluster from a plurality of node data pages of the index based on the starting node data page identifier, and respectively matching the vectors of the plurality of nodes with the vector of the first node.

In one embodiment, multiple node data pages belonging to the same cluster are contiguous.

In one embodiment, a plurality of node data pages belonging to the same cluster are discontinuous; and the node data page is also used for storing the node data page identification before the node data page and the node data page identification after the node data page in the same cluster.

In one embodiment, the target center point is one or more;

the step of respectively carrying out vector matching on a plurality of nodes in a first cluster where the target center point is located and the first node, and adding each node into a matching queue according to a matching result comprises the following steps:

and carrying out vector matching on a plurality of nodes in the first cluster and the first nodes respectively aiming at the first cluster where each target center point is located, and adding the nodes in the first cluster where each target center point is located into the same matching queue according to a matching result.

In one embodiment, the database further includes a first field other than the vector field of each node; when the first node to be retrieved is acquired, the method further comprises:

acquiring a limiting field value condition for the first field;

the step of determining the node retrieved for the first node based on the ordering of the nodes in the matching queue comprises:

for a first number of nodes with the highest matching degree in the matching queue, acquiring first field values of the first number of nodes from the database;

and screening out, from the first number of nodes and based on the acquired first field values, nodes meeting the limiting field value condition, to obtain the nodes retrieved for the first node.
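The field-value screening described in this embodiment can be sketched as a post-filter on the matching queue. This is only an illustration under assumptions: `filter_by_field`, the `rows` lookup standing in for the database, and the callable `condition` are hypothetical names, and a smaller distance is taken to mean a higher matching degree.

```python
def filter_by_field(matching_queue, rows, first_number, condition):
    """Keep the best `first_number` matches, then screen them by a field value.

    matching_queue : list of (distance, node_id) pairs, in any order
    rows           : mapping node_id -> value of the first (non-vector) field,
                     standing in for a lookup in the database
    condition      : callable returning True when a field value is acceptable
    """
    # Take the first number of nodes with the highest matching degree.
    best = sorted(matching_queue)[:first_number]
    # Fetch the first field value for each of them and apply the condition.
    return [node_id for _, node_id in best if condition(rows[node_id])]
```

Because the condition is applied after the vector match, fewer than `first_number` nodes may be returned.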

In a second aspect, an embodiment provides an index creation method for node vector retrieval from a PostgreSQL database, the database including vectors of a plurality of nodes, the method comprising:

acquiring a plurality of nodes from the database;

performing node clustering based on vectors of at least part of the plurality of nodes to obtain a plurality of clusters and respective center points of the plurality of clusters;

determining clusters to which the plurality of nodes belong respectively;

and recording the central points of the clusters and the division of the clusters on the nodes by using the index for node vector retrieval.

In one embodiment, the step of clustering nodes based on vectors of at least some of the plurality of nodes includes:

sampling a first portion of nodes from the plurality of nodes;

based on vectors of all nodes in the first part of nodes, carrying out first clustering on the first part of nodes to obtain the clusters and respective center points of the clusters;

the step of determining clusters to which the plurality of nodes belong respectively at least comprises: and determining clusters corresponding to all the nodes in the first part of nodes.

In one embodiment, the step of recording the center points of the clusters and the partitioning of the nodes by the clusters using the index includes at least:

generating a center point data page, and storing vectors of center points of a plurality of clusters into the center point data page;

generating a node data page corresponding to each cluster, respectively storing the vectors of all the nodes in the first part of nodes into the node data page of the corresponding cluster, and storing the initial node data page identification of each cluster into the corresponding central point data page.

In one embodiment, the step of generating a node data page corresponding to each cluster includes:

generating the node data pages of each cluster at least based on the number of nodes corresponding to the cluster, and, for any node data page in each cluster, storing the identifier of the next node data page in that cluster into the node data page.

In one embodiment, the plurality of nodes includes any one second node other than the first partial node; the step of determining clusters to which the plurality of nodes belong respectively further includes:

vector matching is carried out on the second nodes and the center points respectively, and a second cluster corresponding to the second nodes is determined according to a matching result;

the step of recording the center points of the clusters and the division of the clusters into the nodes by using the index further includes: and adding the node vector of the second node to the existing node data page of the second cluster when the existing node data page of the second cluster is not full.

In one embodiment, the step of recording the center points of the clusters and the partitioning of the nodes by the clusters using the index further includes:

and when the existing node data pages of the second cluster are full, adding a new node data page to the second cluster, storing the vector of the second node and the current starting node data page identifier of the second cluster into the added node data page, making the added node data page the new starting node data page of the second cluster, and updating the starting node data page identifier stored in the center point data page of the second cluster accordingly.
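The two insertion cases for a second node (an existing page with free space, or all pages full) can be sketched as follows. This is a simplified model under assumptions: pages are plain dictionaries, `PAGE_CAPACITY` is an arbitrary small number, `insert_node` is a hypothetical helper, and the cluster of the second node is assumed to have been determined already by matching against the center points.

```python
PAGE_CAPACITY = 2  # hypothetical number of vectors that fit in one node data page


def insert_node(pages, center, vector):
    """Insert a vector into the cluster described by `center`.

    pages  : dict page_id -> {"vectors": [...], "next": page_id or None}
    center : record from the center point data page, holding the identifier
             of the starting node data page of the cluster ("start_page")
    Returns the identifier of the page that received the vector.
    """
    # Case 1: some existing page of the cluster still has room.
    page_id = center["start_page"]
    while page_id is not None:
        page = pages[page_id]
        if len(page["vectors"]) < PAGE_CAPACITY:
            page["vectors"].append(vector)
            return page_id
        page_id = page["next"]

    # Case 2: every existing page is full. Add a page that stores the vector
    # and the old starting page identifier, then make it the new starting page.
    new_id = max(pages, default=-1) + 1
    pages[new_id] = {"vectors": [vector], "next": center["start_page"]}
    center["start_page"] = new_id
    return new_id
```

Prepending the new page means only the new page and the center point record are written; the existing pages of the chain are left untouched.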

In a third aspect, an embodiment provides an apparatus for performing node vector retrieval from a PostgreSQL database using an index, deployed in a computer; the database comprises vectors of a plurality of nodes, the index divides the plurality of nodes into a plurality of clusters, and each cluster corresponds to a central point; the device comprises:

the retrieval acquisition module is configured to acquire a first node to be retrieved;

the central point retrieval module is configured to perform, based on the index, vector matching between the first node and the central point corresponding to each cluster, and determine, from the plurality of central points, a target central point whose vector best matches the first node;

the node retrieval module is configured to perform vector matching between the first node and each of a plurality of nodes in a first cluster where the target central point is located, and add each node to a matching queue according to the matching result;

a node determination module configured to determine a node retrieved for the first node based on a node ordering in a matching queue.

In one embodiment, the center point search module, when performing vector matching on the center points corresponding to the clusters and the first node respectively based on the indexes, includes:

the node searching module, when performing vector matching on a plurality of nodes in a first cluster where the target center point is located and the first node, includes:

In one embodiment, the target center point is one or more; the node retrieval module is specifically configured to perform vector matching on a plurality of nodes in a first cluster where each target center point is located and the first nodes respectively, and add the nodes in the first cluster where each target center point is located into the same matching queue according to a matching result.

In one embodiment, the database further includes a first field other than the vector field of each node; the retrieval obtaining module is further configured to obtain a limiting field value condition for the first field when obtaining the first node to be retrieved;

the node determining module is specifically configured to obtain, for a first number of nodes with highest matching degree in a matching queue, a first field value of the first number of nodes from the database; and screening out nodes meeting the condition of the limiting field values from the first number of nodes based on the acquired first field values, and obtaining the nodes retrieved for the first nodes.

In a fourth aspect, an embodiment provides an index creation apparatus for node vector retrieval from a PostgreSQL database, deployed in a computer, the database comprising vectors of a plurality of nodes, the apparatus comprising:

the node acquisition module is configured to acquire a plurality of nodes from the database;

the node clustering module is configured to perform node clustering based on vectors of at least part of the plurality of nodes to obtain a plurality of clusters and respective center points of the plurality of clusters;

a node attribution module configured to determine clusters to which the plurality of nodes individually belong;

and the index recording module is configured to record the central points of the clusters and the division of the clusters into the nodes by using the index, so as to be used for node vector retrieval.

In one embodiment, the node clustering module, when performing node clustering based on vectors of at least some of the plurality of nodes, includes:

sampling a first portion of nodes from the plurality of nodes;

the node attribution module is at least configured to determine clusters corresponding to all nodes in the first part of nodes.

In one embodiment, the index recording module is configured to at least:

In one embodiment, the index recording module generates the node data page corresponding to each cluster, at least generates the node data page of each cluster based on the number of the nodes corresponding to each cluster, and stores the node data page identifier after the node data page in each cluster to the node data page for any node data page in each cluster.

In one embodiment, the plurality of nodes includes any one second node other than the first partial node; the node attribution module is further configured to:

the index recording module is further configured to add a node vector of the second node to an existing node data page of the second cluster when the existing node data page of the second cluster is not full.

In one embodiment, the index recording module is further configured to:

In a fifth aspect, embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of the first to second aspects.

In a sixth aspect, an embodiment provides a computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any one of the first to second aspects.

In the method and the device for performing node vector retrieval from a database by using an index provided by the embodiments of this specification, the index divides a plurality of nodes in the database into a plurality of clusters, and each cluster corresponds to a central point. When performing node retrieval, vector matching is performed between the first node and the central point of each cluster, and a first cluster that better matches the first node is determined from the plurality of clusters according to the matching result, so that the group of nodes most similar to the first node is identified at a coarse level. The nodes of the first cluster are then each matched with the first node, and the node retrieved for the first node is determined from the nodes of the first cluster according to the matching result. In this way the search range can be narrowed quickly and the number of matching operations reduced, which improves retrieval efficiency when performing high-dimensional vector retrieval based on the PostgreSQL database.

Drawings

In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.

FIG. 1 is a schematic diagram of a database and index disclosed herein;

FIG. 2 is a schematic flow chart of an index creating method according to an embodiment;

FIG. 3 is another flow chart of an index creation method based on the method of FIG. 2;

FIG. 4 is a schematic diagram of a format of a center point data page and a node data page according to an embodiment;

FIG. 5 is a flowchart illustrating a method for performing node vector retrieval using indexes according to an embodiment;

FIG. 6 is a schematic block diagram of an apparatus for node vector retrieval from a PostgreSQL database using indexes, as provided by an embodiment;

FIG. 7 is a schematic block diagram of an index creation apparatus for node vector retrieval from a PostgreSQL database provided by an embodiment.

Detailed Description

The following describes the scheme provided in the present specification with reference to the drawings.

PostgreSQL is an open source database with high availability and high extensibility that supports vector retrieval. The database may be used to store data such as images, user data, or behavioral event data. Each piece of data in the database may be called a node, and the data of a node may comprise a plurality of fields, such as a vector field representing the features of the node. User data may further include a region field, an age field, and the like; a face image may further include a name field and an age field. The vector of a node may be composed of multidimensional data. For example, the vector corresponding to an image may be multidimensional data formed from the pixels of the image; the vector corresponding to user data may be obtained by some calculation on the user information. The vector field is an important feature of the node and also an important retrieval field. When the dimension of the vector is very high, the vector is also referred to as a high-dimensional vector.

The PostgreSQL database itself supports retrieval of high-dimensional vectors. Vector-based retrieval is the process of finding one or more vectors from the database that best match the vector to be retrieved. For example, the database may be used to store various merchandise information sold by a merchant, which may include images of the merchandise, as well as other information, which may itself be stored in a high-dimensional vector form. When the client receives an article image input by a user, commodity information similar to the article image can be retrieved from the database, and the article image can be matched with the image in each commodity information in the database during the retrieval. When the image resolution is high or very high, a one-to-one match between a large number of high-dimensional vectors would be very time consuming.

In order to improve the retrieval efficiency when performing high-dimensional vector retrieval based on the PostgreSQL database, the embodiments of this specification provide a node vector retrieval method that performs retrieval based on a previously constructed index. The index may divide a plurality of nodes in the database into a plurality of clusters, each cluster corresponding to a center point. During retrieval, vector matching is performed between the node to be retrieved and the center point of each cluster in the index, the cluster closest to the node to be retrieved is determined from the clusters, and vector matching is then performed between the node to be retrieved and each node in that cluster. The above process includes two phases, namely an index construction phase and a vector retrieval phase. The index construction phase is described first.

FIG. 1 is a schematic diagram of a PostgreSQL database and an index for node vector retrieval from the database in an embodiment of the present specification. The database on the left comprises a plurality of nodes, represented by black dots. The index of the database is on the right; it divides the plurality of nodes in the database into a plurality of clusters, each of which has a center point, represented by an open circle.

FIG. 2 is a schematic flow chart of an index creating method according to an embodiment. The method may be performed by a computer. In particular, the method may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities. The creation method includes the following steps S210 to S240.

Step S210, obtaining a plurality of nodes from the database. The plurality of nodes acquired may be, but are not limited to, all nodes in the database. Acquiring a plurality of nodes may be understood as acquiring vectors of the plurality of nodes. For example, vectors of n nodes are obtained from the database, n is a positive integer, and the obtained n nodes may be represented by X1, X2, X3, … …, and Xn.

Step S220, performing node clustering based on vectors of at least some of the plurality of nodes to obtain a plurality of clusters and their respective center points. Step S230, determining the cluster to which each of the plurality of nodes belongs.

In the clustering, the clustering can be performed based on all nodes in the database, or based on part of the nodes in the database. In performing clustering, a variety of clustering algorithms may be employed. For example, the clustering may be performed using a k-means clustering algorithm. The implementation of this step S220 will be described below by taking the k-means clustering algorithm as an example.

Specifically, the nodes X1, X2, X3, …, Xm among the plurality of nodes X1, X2, X3, …, Xn may be clustered, where m is a positive integer and m is less than or equal to n. During clustering, the number of clusters k is set first, where k is a positive integer and k is smaller than m. Then k nodes are randomly selected from the nodes X1, X2, X3, …, Xm and used as the initial center points of the k clusters. The vector matching degree between each node and the center points of the k clusters is calculated, and each node is assigned to the cluster whose center point matches it best. After the cluster of every node has been determined, the center points of the k clusters are recalculated based on the vectors of the nodes assigned to each cluster, and the process returns to the step of calculating the vector matching degree between each node and the center points of the k clusters, so that the center points are updated iteratively. When the vectors of the center points of the k clusters no longer change, or the number of iterations reaches a preset number, the clustering is determined to be complete.

When the center points of the k clusters are recalculated based on the vectors of the nodes assigned to the k clusters, for each of the k clusters the average of the vectors of the nodes assigned to that cluster may be taken as the new center point of the cluster. Thus, when clustering is complete, the center point of a cluster may coincide with one of the nodes assigned to the cluster, or may be a point calculated from the nodes assigned to the cluster.
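The k-means procedure described above (random initial center points, assignment to the best-matching center, recomputation of each center as the mean, and termination on convergence or after a preset number of iterations) can be sketched as follows. This is a minimal pure-Python illustration, assuming Euclidean distance as the matching measure; the names `kmeans`, `assign`, `max_iters` and `seed` are chosen for the example and do not come from the specification.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centers):
    """Index of the best-matching (nearest) center for every vector."""
    return [min(range(len(centers)), key=lambda c: euclidean(v, centers[c]))
            for v in vectors]


def kmeans(vectors, k, max_iters=100, seed=0):
    """Cluster `vectors` into k clusters; returns (centers, assignment)."""
    rng = random.Random(seed)
    # Randomly select k nodes as the initial center points.
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(max_iters):
        labels = assign(vectors, centers)
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                # New center point = mean of the vectors assigned to the cluster.
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        if new_centers == centers:  # center points no longer change
            break
        centers = new_centers
    return centers, assign(vectors, centers)
```

For example, `kmeans([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], k=2)` separates the three points near the origin from the three points near (10, 10).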

When calculating the vector matching degree between each node and the center points of the k clusters, the matching degree between the vector of each node and the vector of each of the k center points may be calculated. The matching degree between vectors may be determined using a measure such as the Pearson correlation coefficient, the Euclidean distance, or the cosine similarity.
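For reference, the three measures named above can be written in a few lines of Python. This is a plain sketch with hypothetical function names; note that for the Euclidean distance a smaller value means a better match, whereas for the cosine similarity and the Pearson correlation coefficient a larger value does.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; 0 means identical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; assumes neither is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Cosine similarity of the mean-centered vectors; assumes non-constant inputs."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])
```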

Upon completion of the clustering in step S220, a vector of center points of k clusters may be determined, and clusters to which the respective nodes participating in the clustering belong may be determined. In step S230, a cluster to which each of the plurality of nodes belongs is determined, and at least it can be understood that the cluster to which each of the nodes participating in the clustering belongs is obtained from the result of step S220. There is a similarity in vectors between nodes belonging to the same cluster.

Step S240, recording the center points of the clusters and the division of the nodes into the clusters by using the index, for use in node vector retrieval. In this step, the center points of the k clusters and the division of the n or m nodes into the k clusters may be recorded in various ways, and the recorded contents are used as the index contents. The contents of the index may be used to perform node vector retrieval.

As can be seen from the foregoing, when the index is created in this embodiment, a plurality of nodes in the database are divided into a plurality of clusters by a clustering algorithm, and the center point of each cluster and the cluster to which each node belongs are determined. Because the nodes in the database are divided into a plurality of clusters, node retrieval can first match the node to be retrieved against the center points of the clusters and, according to the matching result, determine the cluster that best matches it, which identifies the group of nodes most similar to the node to be retrieved at a coarse level. The nodes of that cluster are then each matched with the node to be retrieved. In this way the search range can be narrowed quickly and the number of matching operations reduced, which improves retrieval efficiency when performing high-dimensional vector retrieval based on the PostgreSQL database.

Referring back to steps S210 to S240, when the number of nodes in the database is very large, only some of the n nodes acquired from the database may be clustered. Referring to the flow chart shown in FIG. 3, m nodes sampled from the acquired n nodes are clustered to obtain cluster 1, cluster 2, …, cluster k and the corresponding center point 1, center point 2, …, center point k. Vector matching is then performed between each of the remaining n-m nodes and each center point to determine the cluster to which each of the n-m nodes belongs. Each center point is stored in a center point data page, and the nodes belonging to each cluster are stored in the node data pages corresponding to that cluster.
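The sample-then-assign flow of FIG. 3 can be sketched as below. The sketch makes several assumptions: `build_partition` is a hypothetical name, the clustering algorithm is passed in as a callable `cluster_fn` that returns center vectors, and squared Euclidean distance is used for matching. For simplicity every node, sampled or not, is assigned to its nearest center; for a converged k-means run this reproduces the cluster membership of the sampled nodes.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance (sufficient for nearest-center comparison)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_partition(nodes, m, cluster_fn, seed=0):
    """Cluster a sample of m nodes, then assign all n nodes to the centers.

    nodes      : dict node_id -> vector (the n nodes read from the database)
    cluster_fn : callable taking a list of vectors and returning center vectors
    Returns (centers, clusters) where clusters maps cluster index -> node ids.
    """
    rng = random.Random(seed)
    # Sample the first part of nodes (m out of n).
    sampled_ids = rng.sample(sorted(nodes), m)
    centers = cluster_fn([nodes[i] for i in sampled_ids])

    # Match each node against every center point and record the best match.
    clusters = {c: [] for c in range(len(centers))}
    for node_id, vec in nodes.items():
        nearest = min(range(len(centers)), key=lambda c: sq_dist(vec, centers[c]))
        clusters[nearest].append(node_id)
    return centers, clusters
```

The expensive iterative clustering runs on m vectors only, while each of the remaining n-m nodes costs a single pass over the k center points.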

In this embodiment, step S220 may specifically sample a first part of nodes from a plurality of nodes, and perform a first clustering on the first part of nodes based on vectors of the nodes in the first part of nodes, to obtain a plurality of clusters and respective center points of the clusters. The first clustering may be performed using the k-means clustering algorithm described in the foregoing.

For example, m nodes are sampled from the n nodes as the first part of nodes, and first clustering is performed on the vectors of the m nodes to obtain k clusters and the vectors of the center points of the k clusters. Here m < n, and when n is large, m can be much smaller than n. For example, when n is on the order of one billion, m may be in the tens of millions.

The purpose of the clustering in step S220 includes determining the k clusters and the center points of the k clusters. Sampling m nodes from the n nodes and clustering based on only those m nodes is sufficient to determine the k clusters and their center points, and can greatly improve the computational efficiency of the clustering.

In step S230, when determining the clusters to which the plurality of nodes belong, the clusters corresponding to the nodes in the first part of nodes may be determined first. For example, it may be determined which cluster of k clusters the m nodes respectively belong to based on the clustering result of step S220. Typically, a node can only be assigned to one cluster.

When determining the central points of the plurality of clusters and the clusters corresponding to the nodes in the first part of nodes, the data pages provided by the PostgreSQL database can be used for recording the plurality of central points and the plurality of nodes. Step S240, when the index is used for recording the central points of the clusters and the nodes are divided by the clusters, a central point data page can be specifically generated, and the vectors of the central points of the clusters are stored in the central point data page; generating a node data page corresponding to each cluster, respectively storing vectors of all nodes in the first part of nodes into the node data page of the corresponding cluster, and storing the initial node data page identification of each cluster into the corresponding central point data page.

When generating the node data pages corresponding to each cluster, a number of node data pages may be generated for each cluster based on the number of nodes in the cluster, the data amount of each node vector, and the capacity of each node data page. For any node data page in a cluster, the identifier of the next node data page in the same cluster is stored in that node data page. The identifier of the previous node data page may also be stored in the node data page.
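The page count for a cluster follows from simple arithmetic on the three quantities just named. The helper below is a sketch with hypothetical names; the usable payload per page is an assumption and would in practice be the page size minus whatever the header and footer areas occupy.

```python
def pages_needed(node_count, vector_bytes, page_payload_bytes):
    """Number of node data pages needed to hold `node_count` vectors.

    vector_bytes       : size of one stored node vector
    page_payload_bytes : usable space in one node data page (assumed value)
    """
    per_page = page_payload_bytes // vector_bytes  # whole vectors per page
    if per_page == 0:
        raise ValueError("a single vector does not fit into one page")
    return -(-node_count // per_page)  # ceiling division
```

For instance, with 512-dimensional 4-byte vectors (2048 bytes each) and an assumed payload of 8000 bytes, three vectors fit on a page, so a cluster of 1000 nodes needs 334 pages.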

See, for example, the second row in the data page diagram shown in FIG. 4. Cluster 1 corresponds to node data page 3, node data page 5, node data page 6, and so on. In each node data page, the storage area between the header area and the footer area is used for storing the vectors of the nodes. The footer area of node data page 3 stores the identifier p5 of node data page 5; the header area of node data page 5 stores the identifier p3 of node data page 3, and its footer area stores the identifier p6 of node data page 6; the header area of node data page 6 stores the identifier p5 of node data page 5. Thus, each node data page in cluster 1 stores the identifiers of the previous and the next node data page, and the node data pages corresponding to cluster 1 are connected into a chain. When the nodes in the cluster are acquired, the adjacent node data pages can be conveniently found based on the identification information stored in each node data page.
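The chain of node data pages in this example behaves like a doubly linked list. The following sketch models it in memory; the `NodePage` class and the helpers `link_pages` and `read_cluster` are hypothetical and say nothing about the actual on-disk page format.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class NodePage:
    page_id: int
    prev_id: Optional[int] = None  # identifier kept in the header area
    next_id: Optional[int] = None  # identifier kept in the footer area
    vectors: List[List[float]] = field(default_factory=list)


def link_pages(page_ids: List[int]) -> Dict[int, NodePage]:
    """Create pages for one cluster and chain them in the given order."""
    pages = {pid: NodePage(pid) for pid in page_ids}
    for before, after in zip(page_ids, page_ids[1:]):
        pages[before].next_id = after
        pages[after].prev_id = before
    return pages


def read_cluster(pages: Dict[int, NodePage], start_id: int) -> Iterator[List[float]]:
    """Yield every vector of the cluster by following the next-page identifiers."""
    page_id: Optional[int] = start_id
    while page_id is not None:
        page = pages[page_id]
        yield from page.vectors
        page_id = page.next_id
```

Calling `link_pages([3, 5, 6])` reproduces the chain of cluster 1: page 5 ends up with `prev_id` 3 and `next_id` 6, and `read_cluster(pages, 3)` visits pages 3, 5 and 6 in order even though their identifiers are not contiguous.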

Wherein the capacity of each node data page may be the same. The node data page identification may be a number or other sequence number of the node data page. The node data page corresponding to each cluster may be continuous or discontinuous.

The center point data page may have the same format and capacity as the node data page. Because the total number of center points is typically much smaller than the total number of nodes in the database, the number of center point data pages is much smaller than the number of node data pages.

When generating the center point data pages, the number of pages required may be determined from the total data amount of the center point vectors and the capacity of each center point data page, and that number of center point data pages may be generated. For each center point data page, the identifier of the next center point data page may be stored in it, and the identifier of the previous center point data page may also be stored in it.

For example, see the first row in the data page diagram in fig. 4. The footer area of center point data page 0 stores the identifier p1 of center point data page 1; the header area of center point data page 1 stores the identifier p0 of center point data page 0 and its footer area stores the identifier p2 of center point data page 2; and the header area of center point data page 2 stores the identifier p1 of center point data page 1. In this way, the identifiers of the previous and next center point data pages are stored in each center point data page, and when a plurality of center points are acquired, the adjacent center point data pages can be found conveniently from the identification information stored in each center point data page.

In addition, the center point data page of fig. 4 also stores therein a start node data page identification of each cluster. For example, in the center point data page 0, the center point 1 and the start node data page identification p3 of the corresponding cluster 1, and the center point 2 and the start node data page identification p9 of the corresponding cluster 2 are stored. In this way, each center point can be associated with a corresponding cluster. In addition, the respective center point data page identifications may also be stored in a designated area for quickly acquiring the vectors of the respective center points upon retrieval.

In this way, the center points corresponding to the plurality of clusters and the respective nodes corresponding to the plurality of clusters can be recorded in the index.
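The recording step can be sketched as follows. This is an illustrative assumption rather than the actual implementation: plain dictionaries stand in for data pages, the function name `build_index` and the page numbering scheme are hypothetical, and the center point entries are kept in a single list instead of being spread over several center point data pages.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def build_index(
    centers: List[Vector],
    members: List[List[Vector]],
    vectors_per_page: int,
) -> Tuple[List[dict], Dict[int, dict]]:
    """Record center points and per-cluster node data pages."""
    node_pages: Dict[int, dict] = {}
    center_entries: List[dict] = []
    next_page_id = 1  # assumed numbering; page 0 is left for center points
    for center, vecs in zip(centers, members):
        # Split the cluster's vectors into page-sized chunks
        # (an empty cluster still gets one empty page).
        chunks = [
            vecs[i:i + vectors_per_page]
            for i in range(0, len(vecs), vectors_per_page)
        ] or [[]]
        ids = list(range(next_page_id, next_page_id + len(chunks)))
        next_page_id += len(chunks)
        for k, (pid, chunk) in enumerate(zip(ids, chunks)):
            node_pages[pid] = {
                "prev": ids[k - 1] if k > 0 else None,
                "next": ids[k + 1] if k + 1 < len(ids) else None,
                "vectors": chunk,
            }
        # The center point entry keeps the start node data page identifier.
        center_entries.append({"center": center, "start_page": ids[0]})
    return center_entries, node_pages


entries, pages = build_index(
    centers=[[0.0, 0.0], [10.0, 10.0]],
    members=[[[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], [[9.0, 9.0]]],
    vectors_per_page=2,
)
print(entries[0]["start_page"], entries[1]["start_page"])
```

Here the first cluster occupies two chained pages and the second occupies one, and each center point entry points at the first page of its cluster.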

Returning to the schematic diagram shown in fig. 3, after the n nodes have been divided into m nodes and n-m nodes, the m nodes have been clustered, and the cluster membership of each center point and of the m nodes has been recorded, the clusters to which the n-m nodes belong may also be determined.

For any second node X2 in the n-m nodes, vector matching can be performed on the second node X2 and each center point, and a second cluster C2 corresponding to the second node X2 is determined according to a matching result. When the existing node data page of the second cluster C2 is not full, the node vector of the second node X2 is added to the existing node data page of the second cluster C2. An existing node data page may be understood as an already generated, existing node data page.

When the existing node data page of the second cluster C2 is full, a new node data page is added for the second cluster C2. The vector of the second node X2 and the current start node data page identifier of the second cluster C2 are stored in the added node data page, the added node data page becomes the start node data page of the second cluster C2, and the start node data page identifier recorded in the center point data page of the second cluster C2 is updated accordingly. That is, the newly added node data page is inserted in front of the existing node data pages as the new start node data page.
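The handling of a second node X2 can be sketched as below. Several points are assumptions made for illustration: squared Euclidean distance is used as the matching measure, fullness is checked only on the current start node data page, the page capacity is set to two vectors, and the names `Cluster`, `NodePage` and `insert_node` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Vector = List[float]


@dataclass
class NodePage:
    page_id: int
    prev_id: Optional[int] = None
    next_id: Optional[int] = None
    vectors: List[Vector] = field(default_factory=list)


@dataclass
class Cluster:
    center: Vector
    start_page_id: int  # as recorded in the center point data page


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def insert_node(vec: Vector, clusters: List[Cluster],
                pages: Dict[int, NodePage], capacity: int = 2) -> int:
    """Assign vec to its best-matching cluster and store it; return the cluster index."""
    cid = min(range(len(clusters)), key=lambda i: sq_dist(vec, clusters[i].center))
    cluster = clusters[cid]
    head = pages[cluster.start_page_id]
    if len(head.vectors) < capacity:
        # Existing page not full: append the vector to it.
        head.vectors.append(vec)
    else:
        # Existing page full: add a page in front of the chain.
        new_id = max(pages) + 1
        pages[new_id] = NodePage(new_id, next_id=head.page_id, vectors=[vec])
        head.prev_id = new_id
        cluster.start_page_id = new_id  # update the center point record
    return cid


clusters = [Cluster([0.0, 0.0], 0), Cluster([10.0, 10.0], 1)]
pages = {
    0: NodePage(0, vectors=[[0.0, 1.0], [1.0, 0.0]]),  # already full
    1: NodePage(1, vectors=[[9.0, 9.0]]),
}
first = insert_node([10.0, 9.0], clusters, pages)   # fits in page 1
second = insert_node([0.5, 0.5], clusters, pages)   # forces a new page
print(first, second, clusters[0].start_page_id)
```

Inserting at the front means only the new page and the center point record are written; the rest of the chain is untouched apart from the previous-page identifier of the old start page.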

After storing each of the centerpoints to the centerpoint data pages and each of the nodes to the node data pages of the corresponding clusters, the centerpoint data pages and the node data pages may be saved in disk.

After the index is constructed by the above embodiment, the retrieval process of the node vector can be performed based on the index. An embodiment of the vector retrieval stage is described below.

Fig. 5 is a flowchart of a method for retrieving a node vector from a PostgreSQL database using an index according to an embodiment. The method may be performed by a computer; in particular, it may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities. The index is created using the method shown in fig. 2. The index divides the plurality of nodes into a plurality of clusters, each cluster corresponding to a center point. The retrieval method includes the following steps S510 to S540.

In step S510, the first node X1 to be retrieved is obtained. The first node X1 to be retrieved may be understood as the query node: nodes whose vectors are similar to its vector need to be retrieved from the database. The acquired first node X1 may be any node.

In step S520, based on the index, the center points corresponding to the clusters are each vector-matched with the first node X1, and the target center point Xm whose vector best matches that of the first node X1 is determined from the center points. There may be one or more target center points Xm. When determining the target center point Xm, each center point may be added to a center point matching queue according to its matching result, and the one or more best-matching target center points Xm may be determined from the center point matching queue.

In step S530, vector matching is performed on the plurality of nodes in the first cluster C1 where the target center point Xm is located and the first node X1, and each node is added to the matching queue according to the matching result.

When there are multiple target center points Xm, for the first cluster C1 where each target center point Xm is located, the nodes in that first cluster C1 are each vector-matched with the first node X1, and the nodes in the first clusters C1 of all target center points Xm are added to the same matching queue according to the matching results. That is, nodes from different clusters are added to one matching queue, which makes it easier to compare how well nodes in different clusters match the first node X1.

In step S540, the nodes retrieved for the first node X1 are determined based on the node ordering in the matching queue. In this step, with the nodes of the matching queue arranged in descending order of matching degree, the first preset number of nodes may be taken as the nodes retrieved for the first node X1, that is, as the retrieval result.

As can be seen from the foregoing, in the present embodiment, when performing node retrieval, the center points of the plurality of clusters are first vector-matched with the first node, and the first cluster that best matches the first node is determined from the plurality of clusters according to the matching result, so that the group of nodes most similar to the first node is identified at a coarse level. The nodes of the first cluster are then each matched with the first node, and the nodes retrieved for the first node are determined from the nodes of the first cluster according to the matching result. This quickly narrows the search range, reduces the number of matching operations, and improves retrieval efficiency when high-dimensional vector retrieval is performed based on the PostgreSQL database.
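Steps S510 to S540 can be sketched in a few lines. This is a minimal in-memory sketch under stated assumptions: squared Euclidean distance stands in for the matching measure (a smaller distance means a higher matching degree), and the names `search`, `n_probe` and `top_k` are hypothetical parameters for the number of target center points and the preset number of results.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(
    query: Vector,
    centers: List[Vector],
    clusters: Dict[int, List[Tuple[int, Vector]]],
    n_probe: int = 1,
    top_k: int = 3,
) -> List[int]:
    """Return ids of the top_k best-matching nodes for the query."""
    # S520: order the center points by matching degree and keep the
    # n_probe best ones as target center points.
    center_queue = sorted(
        range(len(centers)), key=lambda c: sq_dist(query, centers[c])
    )
    targets = center_queue[:n_probe]
    # S530: match every node of each target cluster and put all of
    # them into one shared matching queue.
    match_queue: List[Tuple[float, int]] = []
    for c in targets:
        for node_id, vec in clusters[c]:
            match_queue.append((sq_dist(query, vec), node_id))
    # S540: the first top_k entries of the ordered queue are the result.
    return [node_id for _, node_id in heapq.nsmallest(top_k, match_queue)]


centers = [[0, 0], [10, 10]]
clusters = {
    0: [(1, [0, 1]), (2, [1, 1]), (3, [2, 2])],
    1: [(4, [9, 9]), (5, [10, 11])],
}
one_cluster = search([1, 0], centers, clusters, n_probe=1, top_k=2)
two_clusters = search([6, 6], centers, clusters, n_probe=2, top_k=2)
print(one_cluster, two_clusters)
```

With one target center point only cluster 0 is scanned; with two, nodes from both clusters compete in the same queue, so a node from the second-best cluster can still appear in the result.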

In one embodiment, for ease of implementation and to increase retrieval efficiency, the index may record the center points and the nodes using center point data pages and node data pages, respectively. In particular, the index may include a plurality of center point data pages and a plurality of node data pages belonging to different clusters. A center point data page stores the vector of each center point and the start node data page identifier of the cluster where that center point is located. A node data page stores the vectors of nodes, and the nodes in one node data page correspond to one cluster. The format of the data pages can be seen in the schematic diagram shown in fig. 4.

Wherein, a plurality of node data pages belonging to the same cluster can be continuous or discontinuous. When a plurality of node data pages belonging to the same cluster are discontinuous, the node data pages are also used for storing the node data page identification before the node data page and the node data page identification after the node data page in the same cluster.

In step S520, when the central points corresponding to the clusters are respectively vector-matched with the first node X1 based on the index, a plurality of central point data pages may be specifically obtained from the index, the vectors of the central points corresponding to the clusters are obtained from the plurality of central point data pages, and the vectors of the central points corresponding to the clusters are respectively matched with the vector of the first node X1.

In step S530, when the nodes in the first cluster C1 where the target center point Xm is located are each vector-matched with the first node X1, the start node data page identifier of the first cluster C1 may be obtained from the center point data page corresponding to the target center point Xm; based on the start node data page identifier, the vectors of the nodes of the first cluster C1 are obtained from the node data pages of the index; and the vectors of these nodes are each matched with the vector of the first node X1.
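The page-based form of steps S520 and S530 can be sketched as follows. The page classes, the function names and the vector values are hypothetical; the page identifiers follow fig. 4 (center point 1 points at node data page 3, center point 2 at node data page 9), and squared Euclidean distance is assumed as the matching measure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vector = List[float]


@dataclass
class CenterPage:
    entries: List[Tuple[Vector, int]]  # (center vector, start node page id)
    next_id: Optional[int] = None


@dataclass
class NodePage:
    vectors: List[Tuple[int, Vector]]  # (node id, node vector)
    next_id: Optional[int] = None


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def read_centers(center_pages: Dict[int, CenterPage], first_id: int):
    """Collect all center entries by following the center page chain."""
    out, pid = [], first_id
    while pid is not None:
        out.extend(center_pages[pid].entries)
        pid = center_pages[pid].next_id
    return out


def read_cluster(node_pages: Dict[int, NodePage], start_id: int):
    """Collect all node vectors of a cluster from its start page onward."""
    out, pid = [], start_id
    while pid is not None:
        out.extend(node_pages[pid].vectors)
        pid = node_pages[pid].next_id
    return out


def match_in_target_cluster(query, center_pages, node_pages, first_center_page=0):
    centers = read_centers(center_pages, first_center_page)
    # Target center point: the best match among all center vectors.
    _, start_id = min(centers, key=lambda e: sq_dist(query, e[0]))
    nodes = read_cluster(node_pages, start_id)
    # Order the cluster's nodes by matching degree (best first).
    return sorted(nodes, key=lambda n: sq_dist(query, n[1]))


center_pages = {0: CenterPage([([0.0, 0.0], 3), ([10.0, 10.0], 9)])}
node_pages = {
    3: NodePage([(1, [0.0, 1.0])], next_id=5),
    5: NodePage([(2, [1.0, 1.0])], next_id=6),
    6: NodePage([(3, [0.5, 0.0])]),
    9: NodePage([(4, [9.0, 9.0])]),
}
ranked = match_in_target_cluster([0.4, 0.0], center_pages, node_pages)
print([node_id for node_id, _ in ranked])
```

Only the center point data pages and the pages of the selected cluster are read; node data page 9 is never touched for this query.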

In the actual search, the search condition may also include a search for a field other than the vector field. The embodiment of the specification also provides a retrieval method for retrieving the vector field and other fields.

The database may further include, for each node, a first field other than the vector field. For example, the first field may be a non-vector field such as a region field, an age field, or a year field, and such fields may also be text fields. When the first node X1 to be retrieved is acquired in step S510, a limiting field value condition for the first field may also be acquired. The limiting field value condition may require that the value of the first field lie within a certain range or equal a preset value. For example, when the first field is a city, the field value may be limited to Beijing; when the first field is a year, the field value may be limited to 2015-2020.

In step S540, when determining the nodes retrieved for the first node X1 based on the node ordering in the matching queue, the first field values of a first number of nodes with the highest matching degree in the matching queue may be obtained from the database, and the nodes satisfying the limiting field value condition may be selected from the first number of nodes based on the obtained first field values, giving the nodes retrieved for the first node X1. This embodiment can therefore search not only on the vector field alone but also on a combination of the vector field and other fields, such as text fields, which enriches the search function.
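The combined retrieval can be sketched as a filter over the head of the matching queue. In this sketch the dictionary `years` stands in for reading the first field values from the database, the queue entries are (distance, node id) pairs with smaller distance meaning a higher matching degree, and the function name `filter_by_field` is hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def filter_by_field(
    match_queue: List[Tuple[float, int]],
    field_values: Dict[int, object],
    condition: Callable[[object], bool],
    first_number: int,
) -> List[int]:
    """Keep, among the first_number best matches, the nodes whose
    first field value satisfies the limiting condition."""
    # Take the first_number entries with the highest matching degree.
    candidates = sorted(match_queue)[:first_number]
    # Look up each candidate's field value and apply the condition.
    return [nid for _, nid in candidates if condition(field_values[nid])]


match_queue = [(0.2, 7), (0.1, 3), (0.5, 9), (0.3, 4)]
years = {3: 2014, 4: 2018, 7: 2016, 9: 2019}  # assumed field values
result = filter_by_field(
    match_queue, years, lambda y: 2015 <= y <= 2020, first_number=3
)
print(result)
```

Node 9 satisfies the year condition but is not returned, because it falls outside the first three entries of the matching queue; the filter is applied only after the vector ranking.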

The foregoing describes certain embodiments of the present disclosure, other embodiments being within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. Furthermore, the processes depicted in the accompanying figures are not necessarily required to achieve the desired result in the particular order shown, or in a sequential order. In some embodiments, multitasking and parallel processing are also possible, or may be advantageous.

Fig. 6 is a schematic block diagram of an apparatus for node vector retrieval from a PostgreSQL database using an index, as provided by an embodiment. The apparatus 600 is deployed in a computer, the apparatus embodiment corresponding to the method embodiment shown in fig. 5. The database comprises vectors of a plurality of nodes, the index divides the plurality of nodes into a plurality of clusters, and each cluster corresponds to one central point. The apparatus 600 includes:

a retrieval obtaining module 610 configured to obtain a first node to be retrieved;

a center point retrieval module 620 configured to vector-match the center points corresponding to the plurality of clusters with the first node, respectively, based on the index, and determine, from the plurality of center points, a target center point whose vector best matches that of the first node;

a node retrieval module 630 configured to vector-match a plurality of nodes in a first cluster where the target center point is located with the first node, respectively, and add each node to a matching queue according to the matching result;

the node determination module 640 is configured to determine the node retrieved for the first node based on the ordering of the nodes in the matching queue.

In one embodiment, the index may include a plurality of center point data pages for storing vectors of respective center points and a start node data page identification of a cluster in which each center point is located, and a plurality of node data pages belonging to different clusters, the node data pages for storing vectors of respective nodes, the nodes in one node data page corresponding to one cluster.

In a specific embodiment, the central point searching module 620, when performing vector matching on the central points corresponding to the clusters and the first node respectively based on the indexes, includes:

acquiring a plurality of center point data pages from the index, acquiring vectors of center points corresponding to a plurality of clusters from the plurality of center point data pages, and respectively matching the vectors of the center points corresponding to the plurality of clusters with the vector of the first node;

the node retrieval module 630, when performing vector matching on a plurality of nodes in the first cluster where the target center point is located and the first node, includes:

In one embodiment, the target center point is one or more; the node retrieval module 630 is specifically configured to:

for the first cluster where each target center point is located, vector-match a plurality of nodes in that first cluster with the first node, respectively, and add the nodes in the first clusters of all target center points to the same matching queue according to the matching results.

In one embodiment, the database further includes a first field other than the vector field of each node; the retrieval obtaining module 610 is further configured to obtain a limiting field value condition for the first field when obtaining the first node to be retrieved;

The node determining module 640 is specifically configured to:

aiming at a first number of nodes with highest matching degree in a matching queue, acquiring first field values of the first number of nodes from a database;

and screening nodes meeting the condition of limiting the field values from the first number of nodes based on the acquired first field values, and obtaining the nodes retrieved for the first nodes.

Fig. 7 is a schematic block diagram of an index creation apparatus for node vector retrieval from a PostgreSQL database provided by an embodiment. The apparatus 700 is deployed in a computer, the apparatus embodiment corresponding to the method embodiment shown in fig. 2. Wherein the database comprises vectors of a plurality of nodes. The apparatus 700 comprises:

a node acquisition module 710 configured to acquire a plurality of nodes from a database;

the node clustering module 720 is configured to perform node clustering based on vectors of at least some of the plurality of nodes to obtain a plurality of clusters and respective center points of the plurality of clusters;

a node attribution module 730 configured to determine the clusters to which the plurality of nodes respectively belong;

an index recording module 740 configured to record the center points of the plurality of clusters and the division of the plurality of nodes by the plurality of clusters by using the index for node vector retrieval.

In a specific embodiment, the node clustering module 720, when performing node clustering based on vectors of at least some of the plurality of nodes, includes:

sampling a first portion of nodes from the plurality of nodes;

based on vectors of all nodes in the first part of nodes, carrying out first clustering on the first part of nodes to obtain a plurality of clusters and respective center points of the clusters;

the node attribution module 730 is at least configured to determine clusters corresponding to the nodes in the first part of nodes.

In one embodiment, the index recording module 740 is configured to at least:

generating a node data page corresponding to each cluster, respectively storing vectors of all nodes in the first part of nodes into the node data page of the corresponding cluster, and storing the initial node data page identification of each cluster into the corresponding central point data page.

In a specific embodiment, the index recording module 740, when generating the node data page corresponding to each cluster, includes:

In one embodiment, the plurality of nodes includes any second node other than the first part of nodes; the node attribution module 730 is further configured to:

vector-match the second node with each center point, and determine a second cluster corresponding to the second node according to the matching result;

the index recording module 740 is further configured to add the node vector of the second node to the existing node data page of the second cluster when the existing node data page of the second cluster is not full.

In one embodiment, the index recording module 740 is further configured to:

when the existing node data page of the second cluster is full, add a new node data page for the second cluster, store the vector of the second node and the current start node data page identifier of the second cluster in the added node data page, make the added node data page the start node data page of the second cluster, and update the start node data page identifier recorded in the center point data page of the second cluster accordingly.

The foregoing apparatus embodiments correspond to the method embodiments and are obtained based on them, so they have the same technical effects as the corresponding method embodiments. For specific descriptions, refer to the method embodiment section; they are not repeated herein.

The present description also provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of fig. 1 to 5.

Embodiments of the present disclosure also provide a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any one of fig. 1-5.

In this specification, the embodiments are described in a progressive manner; for parts that are identical or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the storage medium and computing device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant details, refer to the description of the method embodiments.

Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The foregoing detailed description of the embodiments of the present invention further details the objects, technical solutions and advantageous effects of the embodiments of the present invention. It should be understood that the foregoing description is only specific to the embodiments of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements, etc. made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims

1. A method for retrieving node vectors from a PostgreSQL database using an index, executed by a computer; the database comprises vectors of a plurality of nodes, the index divides the plurality of nodes into a plurality of clusters, and each cluster corresponds to a central point; the index comprises a plurality of center point data pages and a plurality of node data pages belonging to different clusters, wherein the center point data pages are used for storing vectors of center points and initial node data page identifiers of the clusters where the center points are located; the node data page is used for storing the vector of the node; nodes in a node data page correspond to a cluster; the method comprises the following steps:

acquiring a first node to be retrieved;

2. The method of claim 1, wherein the step of vector matching the center points corresponding to the clusters with the first node based on the index comprises:

3. The method of claim 1, wherein a plurality of node data pages belonging to the same cluster are contiguous.

4. The method of claim 1, wherein a plurality of node data pages belonging to the same cluster are discontinuous; and the node data page is also used for storing the node data page identification before the node data page and the node data page identification after the node data page in the same cluster.

5. The method of claim 1, the target center point being one or more;

6. The method of claim 1, the database further comprising a first field other than a vector field of each node; when the first node to be retrieved is acquired, the method further comprises:

acquiring a limiting field value condition for the first field;

7. An index creation method for node vector retrieval from a PostgreSQL database, the database comprising vectors of a plurality of nodes, the method comprising:

acquiring a plurality of nodes from the database;

determining clusters to which the plurality of nodes belong respectively;

recording the center points of the plurality of clusters and the division of the plurality of nodes by the plurality of clusters by using the index for node vector retrieval; wherein the step of recording the center points of the plurality of clusters and the division of the plurality of nodes by the plurality of clusters by using the index at least includes: generating a center point data page, and storing the vectors of the center points of the plurality of clusters into the center point data page; generating a node data page corresponding to each cluster, respectively storing the vectors of the nodes into the node data pages of the corresponding clusters, and storing the start node data page identifier of each cluster into the corresponding center point data page.

8. The method of claim 7, the step of clustering nodes based on vectors of at least some of the plurality of nodes, comprising:

sampling a first portion of nodes from the plurality of nodes;

9. The method of claim 8, the step of recording the center points of the plurality of clusters and the partitioning of the plurality of nodes by the plurality of clusters using the index, at least comprising:

and generating a node data page corresponding to each cluster, and respectively storing the vectors of all the nodes in the first part of nodes into the node data page of the corresponding cluster.

10. The method of claim 9, the step of generating a node data page corresponding to each cluster, comprising:

11. The method of claim 8, the plurality of nodes comprising any one second node other than the first partial node; the step of determining clusters to which the plurality of nodes belong respectively further includes:

12. The method of claim 11, the step of recording the center points of the plurality of clusters and the partitioning of the plurality of nodes by the plurality of clusters using the index, further comprising:

13. An apparatus for retrieving node vectors from a PostgreSQL database using an index, deployed in a computer; the database comprises vectors of a plurality of nodes, the index divides the plurality of nodes into a plurality of clusters, and each cluster corresponds to a central point; the index comprises a plurality of center point data pages and a plurality of node data pages belonging to different clusters, wherein the center point data pages are used for storing vectors of center points and start node data page identifiers of the clusters where the center points are located; the node data page is used for storing the vector of the node; nodes in a node data page correspond to a cluster; the apparatus comprises:

14. The apparatus of claim 13, the center point retrieval module, when vector matching the center points corresponding to the plurality of clusters with the first node, respectively, based on the index, comprises:

15. The apparatus of claim 13, the target center point being one or more; the node retrieval module is specifically configured to:

16. The apparatus of claim 13, the database further comprising a first field other than a vector field of each node; the retrieval obtaining module is further configured to obtain a limiting field value condition for the first field when obtaining the first node to be retrieved;

the node determining module is specifically configured to:

17. An index creation apparatus for node vector retrieval from a PostgreSQL database deployed in a computer, the database comprising vectors of a plurality of nodes, the apparatus comprising:

an index recording module configured to record center points of the plurality of clusters and division of the plurality of nodes by the plurality of clusters by using the index for node vector retrieval; the index recording module is at least configured to generate a central point data page, and the vectors of the central points of a plurality of clusters are stored in the central point data page; generating a node data page corresponding to each cluster, respectively storing vectors of a plurality of nodes into the node data page of the corresponding cluster, and storing the initial node data page identification of each cluster into the corresponding central point data page.

18. The apparatus of claim 17, the node clustering module, when clustering nodes based on vectors of at least some of the plurality of nodes, comprising:

sampling a first portion of nodes from the plurality of nodes;

19. The apparatus of claim 18, the index recording module configured at least to:

20. The apparatus of claim 19, wherein the index recording module, when generating the node data page corresponding to each cluster, comprises:

21. The apparatus of claim 18, the plurality of nodes comprising any one second node other than the first partial node; the node attribution module is further configured to:

22. The apparatus of claim 21, the index recording module further configured to:

23. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-12.

24. A computing device comprising a memory having executable code stored therein and a processor, which when executing the executable code, implements the method of any of claims 1-12.